Class BM25

java.lang.Object
smile.nlp.relevance.BM25
All Implemented Interfaces:
RelevanceRanker

public class BM25 extends Object implements RelevanceRanker
The BM25 weighting scheme, often called Okapi weighting after the system in which it was first implemented, was developed to build a probabilistic model that is sensitive to term frequency and document length without introducing too many additional parameters. It is not a single function but a family of scoring functions with slightly different components and parameters.

At the extreme values of the coefficient b, BM25 turns into the ranking functions known as BM11 (for b = 1) and BM15 (for b = 0). BM25F is a modification of BM25 in which the document is considered to be composed of several fields (such as headlines, main text, and anchor text) with possibly different degrees of importance.
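
For orientation, a commonly cited textbook form of the per-term BM25 weight and its two limiting cases is sketched below. It illustrates the general family and is not necessarily the exact expression this class evaluates. The symbols f (term frequency in the document), |d| (document length), avgdl (average document length in the corpus), and idf(t) are notation introduced here for illustration.

```latex
\begin{align*}
w(t, d) &= \mathrm{idf}(t)\,
  \frac{(k_1 + 1)\, f}{f + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)} \\[4pt]
b = 0 \ (\text{BM15}):\quad
w(t, d) &= \mathrm{idf}(t)\,\frac{(k_1 + 1)\, f}{f + k_1} \\[4pt]
b = 1 \ (\text{BM11}):\quad
w(t, d) &= \mathrm{idf}(t)\,
  \frac{(k_1 + 1)\, f}{f + k_1\,\frac{|d|}{\mathrm{avgdl}}}
\end{align*}
```

With b = 0 the document length drops out entirely; with b = 1 the saturation constant k1 is scaled by the full ratio of document length to average length.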

BM25 and its newer variants represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval, such as web search.

  • Constructor Summary

    Constructors
    Constructor
    Description
    BM25()
    Default constructor with k1 = 1.2, b = 0.75, delta = 1.0.
    BM25(double k1, double b, double delta)
    Constructor with explicit k1, b, and delta parameters.
  • Method Summary

    Modifier and Type
    Method
    Description
    double
    rank(Corpus corpus, TextTerms doc, String[] terms, int[] tf, int n)
    Returns the relevance score between a set of terms and a document based on a corpus.
    double
    rank(Corpus corpus, TextTerms doc, String term, int tf, int n)
    Returns the relevance score between a term and a document based on a corpus.
    double
    score(double freq, int docSize, double avgDocSize, long N, long n)
    Returns the relevance score between a term and a document based on a corpus, given the raw term frequency and the document size.
    double
    score(double freq, long N, long n)
    Returns the relevance score between a term and a document based on a corpus, given an already normalized term frequency.
    double
    score(int termFreq, int docSize, double avgDocSize, int titleTermFreq, int titleSize, double avgTitleSize, int anchorTermFreq, int anchorSize, double avgAnchorSize, long N, long n)
    Returns the relevance score between a term and a document based on a corpus, where the document consists of body text, title, and anchor fields.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • BM25

      public BM25()
      Default constructor with k1 = 1.2, b = 0.75, delta = 1.0.
    • BM25

      public BM25(double k1, double b, double delta)
      Constructor.
      Parameters:
      k1 - a non-negative tuning parameter that calibrates the document term frequency scaling. A k1 value of 0 corresponds to a binary model (term frequency is ignored), and a large value corresponds to using raw term frequency.
      b - a tuning parameter (0 <= b <= 1) that determines the scaling by document length: b = 1 corresponds to fully scaling the term weight by the document length, while b = 0 corresponds to no length normalization.
      delta - the control parameter in BM25+. In standard BM25, the term frequency normalization by document length is not properly lower-bounded. As a result, long documents that do match the query term can be scored unfairly, receiving a relevance similar to that of shorter documents that do not contain the query term at all. BM25+ adds delta to the term frequency component to enforce a lower bound.
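
The effect of k1 and delta can be seen in isolation by looking only at the term frequency component, with length normalization left out. The following is a minimal sketch of the textbook saturation formula, not this class's source; the class name `SaturationSketch` and its methods are hypothetical.

```java
/** Illustrative only: textbook BM25 term frequency saturation, not the library's code. */
final class SaturationSketch {

    /** Saturating term frequency component without length normalization. */
    static double tfComponent(double k1, double freq) {
        if (freq <= 0) return 0.0;
        // k1 = 0 gives 1 for any match (binary model);
        // a very large k1 approaches the raw frequency.
        return (k1 + 1.0) * freq / (freq + k1);
    }

    /** BM25+ style component: delta is added for any document that matches. */
    static double tfComponentPlus(double k1, double delta, double freq) {
        if (freq <= 0) return 0.0;
        return tfComponent(k1, freq) + delta;
    }

    public static void main(String[] args) {
        for (double k1 : new double[] {0.0, 1.2, 1e9}) {
            System.out.println("k1=" + k1
                    + " f=1: " + tfComponent(k1, 1)
                    + " f=10: " + tfComponent(k1, 10));
        }
        System.out.println("with delta: " + tfComponentPlus(1.2, 1.0, 3));
    }
}
```

Because delta is added only when the term occurs, a matching document always scores at least delta times the IDF, however long it is.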
  • Method Details

    • score

      public double score(int termFreq, int docSize, double avgDocSize, int titleTermFreq, int titleSize, double avgTitleSize, int anchorTermFreq, int anchorSize, double avgAnchorSize, long N, long n)
      Returns the relevance score between a term and a document based on a corpus. The document is described by three fields: body text, title, and anchor.
      Parameters:
      termFreq - the term frequency in the text body.
      docSize - the text length.
      avgDocSize - the average text length in the corpus.
      titleTermFreq - the term frequency in the title.
      titleSize - the title length.
      avgTitleSize - the average title length in the corpus.
      anchorTermFreq - the term frequency in the anchor.
      anchorSize - the anchor length.
      avgAnchorSize - the average anchor length in the corpus.
      N - the number of documents in the corpus.
      n - the number of documents containing the given term in the corpus.
      Returns:
      the relevance score.
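
This overload accepts per-field statistics but no field weights, so any weighting is presumably fixed inside the class. One common BM25F-style approach is sketched below: combine weighted, length-normalized field frequencies into a single pseudo-frequency, then saturate it. The class name `FieldedBm25Sketch` is hypothetical, and the field weights, b, and k1 are placeholder values chosen for illustration, not the library's.

```java
/** Illustrative BM25F-style scoring; all constants are placeholders, not the library's. */
final class FieldedBm25Sketch {
    static final double K1 = 1.2;       // placeholder saturation constant
    static final double B = 0.75;       // placeholder length normalization, same for all fields
    static final double W_BODY = 1.0;   // placeholder field weights
    static final double W_TITLE = 2.0;
    static final double W_ANCHOR = 1.5;

    /** Weighted, length-normalized term frequency for one field. */
    static double fieldFreq(double weight, int termFreq, int size, double avgSize) {
        if (termFreq <= 0) return 0.0;
        return weight * termFreq / (1.0 + B * (size / avgSize - 1.0));
    }

    static double score(int termFreq, int docSize, double avgDocSize,
                        int titleTermFreq, int titleSize, double avgTitleSize,
                        int anchorTermFreq, int anchorSize, double avgAnchorSize,
                        long N, long n) {
        // Sum the per-field contributions into one pseudo-frequency.
        double tf = fieldFreq(W_BODY, termFreq, docSize, avgDocSize)
                  + fieldFreq(W_TITLE, titleTermFreq, titleSize, avgTitleSize)
                  + fieldFreq(W_ANCHOR, anchorTermFreq, anchorSize, avgAnchorSize);
        if (tf <= 0.0) return 0.0;

        // Saturate once over the combined frequency, then weight by IDF.
        double idf = Math.log((N - n + 0.5) / (n + 0.5));
        return tf / (K1 + tf) * idf;
    }

    public static void main(String[] args) {
        System.out.println(score(2, 100, 100.0, 0, 5, 5.0, 0, 3, 3.0, 1000, 10));
        System.out.println(score(2, 100, 100.0, 1, 5, 5.0, 0, 3, 3.0, 1000, 10));
    }
}
```

Saturating after the fields are combined, rather than per field, keeps a term that appears in several fields from being counted as several independent matches.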
    • score

      public double score(double freq, long N, long n)
      Returns the relevance score between a term and a document based on a corpus. The term frequency is expected to be normalized already.
      Parameters:
      freq - the normalized frequency of the search term in the document to rank.
      N - the number of documents in the corpus.
      n - the number of documents containing the given term in the corpus.
      Returns:
      the relevance score.
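
The N and n parameters feed the inverse document frequency (IDF) part of the score. Below is a minimal sketch of the classic Robertson/Sparck Jones IDF used in textbook BM25. Whether this class uses exactly this form is an assumption, and `IdfSketch` is a hypothetical name.

```java
/** Illustrative textbook IDF; the library's exact form may differ. */
final class IdfSketch {

    /**
     * @param N number of documents in the corpus
     * @param n number of documents containing the term
     */
    static double idf(long N, long n) {
        return Math.log((N - n + 0.5) / (n + 0.5));
    }

    public static void main(String[] args) {
        System.out.println("rare term:   " + idf(1000, 10));
        System.out.println("common term: " + idf(1000, 100));
        // A term found in more than half of the documents gets a negative weight.
        System.out.println("very common: " + idf(1000, 600));
    }
}
```

In this form, rarer terms receive larger weights, and the weight turns negative once a term appears in more than half of the corpus.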
    • score

      public double score(double freq, int docSize, double avgDocSize, long N, long n)
      Returns the relevance score between a term and a document based on a corpus. The term frequency is given unnormalized, together with the document size.
      Parameters:
      freq - the frequency of the search term in the document to rank.
      docSize - the size of document to rank.
      avgDocSize - the average size of documents in the corpus.
      N - the number of documents in the corpus.
      n - the number of documents containing the given term in the corpus.
      Returns:
      the relevance score.
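
To show how these five arguments typically combine, here is a self-contained sketch of a textbook BM25+ per-term score using the default parameter values stated for this class (k1 = 1.2, b = 0.75, delta = 1.0). It is an approximation under those assumptions, not the library's source; `Bm25PlusSketch` is a hypothetical name.

```java
/** Illustrative textbook BM25+ per-term score; not the library's implementation. */
final class Bm25PlusSketch {
    static final double K1 = 1.2;
    static final double B = 0.75;
    static final double DELTA = 1.0;

    static double score(double freq, int docSize, double avgDocSize, long N, long n) {
        if (freq <= 0) return 0.0;

        // Length normalization: documents longer than average need more
        // occurrences to reach the same term frequency weight.
        double norm = 1.0 - B + B * docSize / avgDocSize;
        double tf = (K1 + 1.0) * freq / (freq + K1 * norm);

        double idf = Math.log((N - n + 0.5) / (n + 0.5));

        // BM25+: delta lower-bounds the contribution of any matching document.
        return (tf + DELTA) * idf;
    }

    public static void main(String[] args) {
        // Same term frequency in a short, an average, and a long document.
        System.out.println(score(3, 50, 100.0, 1000, 10));
        System.out.println(score(3, 100, 100.0, 1000, 10));
        System.out.println(score(3, 200, 100.0, 1000, 10));
    }
}
```

For a fixed term frequency and a positive IDF, the score decreases as the document grows longer relative to the corpus average.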
    • rank

      public double rank(Corpus corpus, TextTerms doc, String term, int tf, int n)
      Description copied from interface: RelevanceRanker
      Returns the relevance score between a term and a document based on a corpus.
      Specified by:
      rank in interface RelevanceRanker
      Parameters:
      corpus - the corpus.
      doc - the document to rank.
      term - the search term.
      tf - the term frequency in the document.
      n - the number of documents containing the given term in the corpus.
      Returns:
      the relevance score.
    • rank

      public double rank(Corpus corpus, TextTerms doc, String[] terms, int[] tf, int n)
      Description copied from interface: RelevanceRanker
      Returns the relevance score between a set of terms and a document based on a corpus.
      Specified by:
      rank in interface RelevanceRanker
      Parameters:
      corpus - the corpus.
      doc - the document to rank.
      terms - the search terms.
      tf - the term frequencies in the document.
      n - the number of documents containing the given term in the corpus.
      Returns:
      the relevance score.
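
By the usual BM25 convention, a multi-term query is scored by summing the per-term scores. The sketch below assumes that convention and works on plain arrays instead of `Corpus` and `TextTerms`, so it stands alone. Unlike the signature above, which takes a single n, it uses a hypothetical per-term document frequency array `df`; the class and method names are also hypothetical.

```java
/** Illustrative multi-term ranking as a sum of per-term textbook BM25+ scores. */
final class MultiTermSketch {
    static final double K1 = 1.2;
    static final double B = 0.75;
    static final double DELTA = 1.0;

    /** Per-term score in the textbook BM25+ form. */
    static double termScore(double freq, int docSize, double avgDocSize, long N, long n) {
        if (freq <= 0) return 0.0;
        double tf = (K1 + 1.0) * freq / (freq + K1 * (1.0 - B + B * docSize / avgDocSize));
        double idf = Math.log((N - n + 0.5) / (n + 0.5));
        return (tf + DELTA) * idf;
    }

    /**
     * @param tf term frequencies in the document, one per query term
     * @param df document frequencies in the corpus, one per query term (hypothetical input)
     */
    static double rank(int[] tf, long[] df, int docSize, double avgDocSize, long N) {
        if (tf.length != df.length) {
            throw new IllegalArgumentException("tf and df must have the same length");
        }
        double total = 0.0;
        for (int i = 0; i < tf.length; i++) {
            // Terms absent from the document contribute nothing.
            total += termScore(tf[i], docSize, avgDocSize, N, df[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] tf = {3, 0, 1};
        long[] df = {10, 20, 50};
        System.out.println(rank(tf, df, 100, 100.0, 1000));
    }
}
```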