smile.nlp.relevance.BM25

All Implemented Interfaces: RelevanceRanker

public class BM25 extends Object implements RelevanceRanker

The BM25 weighting scheme, often called Okapi weighting, after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to term frequency and document length while not introducing too many additional parameters into the model. It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters.

At the extreme values of the coefficient b, BM25 turns into ranking functions known as BM11 (for b = 1) and BM15 (for b = 0). BM25F is a modification of BM25 in which the document is considered to be composed of several fields (such as headlines, main text, anchor text) with possibly different degrees of importance.
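The effect of b at its two extremes can be illustrated with a small standalone sketch. It assumes the commonly cited form of the BM25 term-frequency component; the class LengthNormalizationSketch and its method tfComponent are hypothetical names for illustration and are not part of smile.

```java
// Sketch only: a standalone helper, not part of the smile API.
final class LengthNormalizationSketch {
    /** Saturated term-frequency component of BM25 (no IDF factor, no delta). */
    static double tfComponent(double tf, double k1, double b,
                              double docSize, double avgDocSize) {
        // norm is exactly 1 when b = 0 (BM15: document length is ignored)
        // and docSize / avgDocSize when b = 1 (BM11: full length scaling).
        double norm = 1.0 - b + b * docSize / avgDocSize;
        return tf * (k1 + 1.0) / (tf + k1 * norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2, tf = 3.0, avg = 100.0;
        for (double b : new double[] {0.0, 0.75, 1.0}) {
            System.out.printf("b=%.2f short=%.4f long=%.4f%n", b,
                    tfComponent(tf, k1, b, 50.0, avg),
                    tfComponent(tf, k1, b, 400.0, avg));
        }
    }
}
```

With b = 0 the short and the long document receive the same value; with b = 1 the short document is favored most strongly.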

BM25 and its newer variants represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval, such as web search.

Constructor Summary

BM25() - Default constructor with k1 = 1.2, b = 0.75, delta = 1.0.

BM25(double k1, double b, double delta) - Constructor.

Method Summary

double rank(Corpus corpus, TextTerms doc, String[] terms, int[] tf, int n) - Returns the relevance score between a set of terms and a document based on a corpus.

double rank(Corpus corpus, TextTerms doc, String term, int tf, int n) - Returns the relevance score between a term and a document based on a corpus.

double score(double freq, int docSize, double avgDocSize, long N, long n) - Returns the relevance score between a term and a document based on a corpus, given the term frequency and the document size.

double score(double freq, long N, long n) - Returns the relevance score between a term and a document based on a corpus, given a normalized term frequency.

double score(int termFreq, int docSize, double avgDocSize, int titleTermFreq, int titleSize, double avgTitleSize, int anchorTermFreq, int anchorSize, double avgAnchorSize, long N, long n) - Returns the relevance score between a term and a document based on a corpus, using the term frequencies in the body, title, and anchor.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- BM25
  
  public BM25()
  
  Default constructor with k1 = 1.2, b = 0.75, delta = 1.0.
- BM25
  
  public BM25(double k1, double b, double delta)
  
  Constructor.
  
  Parameters:
  
  k1 - a positive tuning parameter that calibrates the document term frequency scaling. A k1 value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency.
  
  b - a tuning parameter (0 <= b <= 1) that determines the scaling by document length: b = 1 corresponds to fully scaling the term weight by the document length, while b = 0 corresponds to no length normalization.
  
  delta - the control parameter in BM25+. In standard BM25, the component of term frequency normalization by document length is not properly lower-bounded. As a result, long documents that do match the query term can often be scored unfairly as having a similar relevance to shorter documents that do not contain the query term at all. BM25+ adds delta to the term frequency component to enforce a lower bound.
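How the three constructor parameters interact can be sketched as follows. This is a minimal illustration, not smile's source: the class Bm25PlusSketch is hypothetical, and the IDF form log((N - n + 0.5) / (n + 0.5)) is an assumption (the classic Robertson/Sparck Jones weight); other BM25 variants use different IDF formulas.

```java
// Sketch only: reuses the parameter names documented above, not smile's implementation.
final class Bm25PlusSketch {
    final double k1, b, delta;

    Bm25PlusSketch(double k1, double b, double delta) {
        if (k1 < 0) throw new IllegalArgumentException("k1 must be non-negative");
        if (b < 0 || b > 1) throw new IllegalArgumentException("b must be in [0, 1]");
        if (delta < 0) throw new IllegalArgumentException("delta must be non-negative");
        this.k1 = k1;
        this.b = b;
        this.delta = delta;
    }

    /** Scores one term against one document; N = corpus size, n = documents containing the term. */
    double score(double freq, int docSize, double avgDocSize, long N, long n) {
        if (freq <= 0) return 0.0; // a term that does not occur contributes nothing
        double norm = 1.0 - b + b * docSize / avgDocSize;   // length normalization
        double tf = freq * (k1 + 1.0) / (freq + k1 * norm); // saturating term frequency
        double idf = Math.log((N - n + 0.5) / (n + 0.5));   // assumed IDF form
        return (tf + delta) * idf;                          // delta lower-bounds the tf part
    }

    public static void main(String[] args) {
        Bm25PlusSketch plain = new Bm25PlusSketch(1.2, 0.75, 0.0);
        Bm25PlusSketch plus = new Bm25PlusSketch(1.2, 0.75, 1.0);
        // A very long document that matches the term once.
        System.out.println(plain.score(1, 1_000_000, 100.0, 1000, 10));
        System.out.println(plus.score(1, 1_000_000, 100.0, 1000, 10));
    }
}
```

With delta = 0 the score of a matching but very long document approaches 0, the same score as a non-matching document; with delta > 0 it stays above delta times the IDF. Setting k1 = 0 makes the tf part equal to 1 for any positive frequency, which is the binary model described above.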
Method Details
- score
  
  public double score(int termFreq, int docSize, double avgDocSize, int titleTermFreq, int titleSize, double avgTitleSize, int anchorTermFreq, int anchorSize, double avgAnchorSize, long N, long n)
  
  Returns the relevance score between a term and a document based on a corpus, using the term frequencies in the body, title, and anchor.
  
  Parameters:
  
  termFreq - the term frequency in the text body.
  
  docSize - the text length.
  
  avgDocSize - the average text length in the corpus.
  
  titleTermFreq - the term frequency in the title.
  
  titleSize - the title length.
  
  avgTitleSize - the average title length in the corpus.
  
  anchorTermFreq - the term frequency in the anchor.
  
  anchorSize - the anchor length.
  
  avgAnchorSize - the average anchor length in the corpus.
  
  N - the number of documents in the corpus.
  
  n - the number of documents containing the given term in the corpus.
  
  Returns:
  
  the relevance score.
- score
  
  public double score(double freq, long N, long n)
  
  Returns the relevance score between a term and a document based on a corpus, given a normalized term frequency.
  
  Parameters:
  
  freq - the normalized term frequency of the search term in the document to rank.
  
  N - the number of documents in the corpus.
  
  n - the number of documents containing the given term in the corpus.
  
  Returns:
  
  the relevance score.
- score
  
  public double score(double freq, int docSize, double avgDocSize, long N, long n)
  
  Returns the relevance score between a term and a document based on a corpus, given the term frequency and the document size.
  
  Parameters:
  
  freq - the frequency of the search term in the document to rank.
  
  docSize - the size of document to rank.
  
  avgDocSize - the average size of documents in the corpus.
  
  N - the number of documents in the corpus.
  
  n - the number of documents containing the given term in the corpus.
  
  Returns:
  
  the relevance score.
- rank
  
  public double rank(Corpus corpus, TextTerms doc, String term, int tf, int n)
  
  Description copied from interface: RelevanceRanker
  
  Returns the relevance score between a term and a document based on a corpus.
  
  Specified by:
  
  rank in interface RelevanceRanker
  
  Parameters:
  
  corpus - the corpus.
  
  doc - the document to rank.
  
  term - the search term.
  
  tf - the term frequency in the document.
  
  n - the number of documents containing the given term in the corpus.
  
  Returns:
  
  the relevance score.
- rank
  
  public double rank(Corpus corpus, TextTerms doc, String[] terms, int[] tf, int n)
  
  Description copied from interface: RelevanceRanker
  
  Returns the relevance score between a set of terms and a document based on a corpus.
  
  Specified by:
  
  rank in interface RelevanceRanker
  
  Parameters:
  
  corpus - the corpus.
  
  doc - the document to rank.
  
  terms - the search terms.
  
  tf - the term frequencies in the document.
  
  n - the number of documents containing the given term in the corpus.
  
  Returns:
  
  the relevance score.
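The eleven-argument score overload above takes separate statistics for the body, title, and anchor, in the spirit of BM25F. A minimal sketch of one common BM25F formulation follows. Everything in it is an assumption for illustration: the class Bm25fSketch, the field weights, the per-field b values, and the IDF form are hypothetical and are not taken from smile, and the delta term is left out to keep the focus on field combination.

```java
// Sketch only: field weights and b values below are hypothetical, not smile's defaults.
final class Bm25fSketch {
    static final double W_BODY = 1.0, W_TITLE = 2.0, W_ANCHOR = 1.5; // field importance
    static final double B_BODY = 0.75, B_TITLE = 0.5, B_ANCHOR = 0.5; // per-field length scaling
    static final double K1 = 1.2;

    /** Weighted, length-normalized frequency of the term in one field. */
    static double fieldTf(double weight, double b, int tf, int size, double avgSize) {
        if (tf <= 0) return 0.0; // field does not contain the term
        return weight * tf / (1.0 - b + b * size / avgSize);
    }

    static double score(int termFreq, int docSize, double avgDocSize,
                        int titleTermFreq, int titleSize, double avgTitleSize,
                        int anchorTermFreq, int anchorSize, double avgAnchorSize,
                        long N, long n) {
        // Combine the fields into one pseudo-frequency before saturation.
        double tf = fieldTf(W_BODY, B_BODY, termFreq, docSize, avgDocSize)
                  + fieldTf(W_TITLE, B_TITLE, titleTermFreq, titleSize, avgTitleSize)
                  + fieldTf(W_ANCHOR, B_ANCHOR, anchorTermFreq, anchorSize, avgAnchorSize);
        if (tf <= 0) return 0.0;
        double saturated = tf / (K1 + tf);                // saturate once, over all fields
        double idf = Math.log((N - n + 0.5) / (n + 0.5)); // assumed IDF form
        return saturated * idf;
    }

    public static void main(String[] args) {
        // Same body statistics, with and without a title match.
        System.out.println(score(2, 100, 100.0, 0, 5, 5.0, 0, 0, 1.0, 1000, 10));
        System.out.println(score(2, 100, 100.0, 1, 5, 5.0, 0, 0, 1.0, 1000, 10));
    }
}
```

Saturating once over the combined pseudo-frequency, rather than once per field, keeps a term that appears in several fields from being counted as several independent matches.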
