Class TFIDF

java.lang.Object
smile.nlp.relevance.TFIDF
All Implemented Interfaces:
RelevanceRanker

public class TFIDF extends Object implements RelevanceRanker
The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.

One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document d, let tfmax(d) be the maximum tf over all terms in d. Then, we compute a normalized term frequency for each term t in document d by

tf = a + (1 - a) * tf_{t,d} / tfmax(d)

where a is a value between 0 and 1, generally set to 0.4, although some early work used 0.5. The term a is a smoothing term whose role is to damp the contribution of the second term, which may be viewed as a scaling down of tf by the largest tf value in d. The main idea of maximum tf normalization is to mitigate the following anomaly: longer documents show higher term frequencies merely because they tend to repeat the same words over and over again. Maximum tf normalization does suffer from the following issues:

  1. The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.
  2. A document may contain an outlier term that occurs an unusually large number of times and is not representative of the content of that document.
  3. More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.
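
The normalization formula above can be sketched in a few lines of Java. This is a minimal illustration only, not the library's implementation; the class name MaxTfNormalization and the method normalize are hypothetical names introduced for this example.

```java
/** Hypothetical helper illustrating maximum tf normalization. */
final class MaxTfNormalization {
    private MaxTfNormalization() {}

    /**
     * Computes a + (1 - a) * tf / maxtf.
     *
     * @param tf    raw frequency of the term in the document
     * @param maxtf maximum raw frequency over all terms in the document
     * @param a     smoothing parameter between 0 and 1 (0.4 is a common choice)
     */
    static double normalize(int tf, int maxtf, double a) {
        if (a < 0.0 || a > 1.0) {
            throw new IllegalArgumentException("a must be in [0, 1]: " + a);
        }
        if (maxtf <= 0 || tf < 0 || tf > maxtf) {
            throw new IllegalArgumentException("require 0 <= tf <= maxtf and maxtf > 0");
        }
        // The first operand is a double, so the division is floating point.
        return a + (1.0 - a) * tf / maxtf;
    }
}
```

With a = 0.4, the most frequent term in a document (tf equal to maxtf) gets weight 1, and a term occurring half as often gets a weight of about 0.7, so the weights of all terms fall in the range from a to 1 regardless of document length.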
  • Constructor Summary

    Constructors
    Constructor
    Description
    TFIDF()
    Constructor.
    TFIDF(double smoothing)
    Constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    double
    rank(int tf, int maxtf, long N, long n)
    Returns the relevance score between a term and a document based on a corpus.
    double
    rank(Corpus corpus, TextTerms doc, String[] terms, int[] tf, int n)
    Returns the relevance score between a set of terms and a document based on a corpus.
    double
    rank(Corpus corpus, TextTerms doc, String term, int tf, int n)
    Returns the relevance score between a term and a document based on a corpus.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • TFIDF

      public TFIDF()
      Constructor.
    • TFIDF

      public TFIDF(double smoothing)
      Constructor.
      Parameters:
      smoothing - the smoothing parameter in maximum tf normalization.
  • Method Details

    • rank

      public double rank(int tf, int maxtf, long N, long n)
      Returns the relevance score between a term and a document based on a corpus.
      Parameters:
      tf - the frequency of the search term in the document to rank.
      maxtf - the maximum frequency over all terms in the document.
      N - the number of documents in the corpus.
      n - the number of documents in the corpus that contain the given term.
      Returns:
      the relevance score.
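
      As a rough illustration of how the four arguments could combine into a score, the sketch below multiplies the maximum-tf-normalized term frequency by an inverse document frequency. It assumes idf = ln(N / n) and a score of zero for a term absent from the document; both are assumptions of this example rather than statements about the actual implementation, and the class TfIdfSketch is a hypothetical name.

```java
/** Hypothetical stand-alone sketch of a tf-idf score; not the library class. */
final class TfIdfSketch {
    /** Smoothing parameter a of maximum tf normalization. */
    private final double a;

    TfIdfSketch(double a) {
        if (a < 0.0 || a > 1.0) {
            throw new IllegalArgumentException("a must be in [0, 1]: " + a);
        }
        this.a = a;
    }

    /**
     * @param tf    frequency of the search term in the document
     * @param maxtf maximum frequency over all terms in the document
     * @param N     number of documents in the corpus
     * @param n     number of documents containing the term
     */
    double score(int tf, int maxtf, long N, long n) {
        // Assumption: a term that is absent contributes nothing.
        if (tf <= 0 || maxtf <= 0 || n <= 0) {
            return 0.0;
        }
        double normalizedTf = a + (1.0 - a) * tf / maxtf;
        // Assumption: natural-log idf; other variants use different bases or smoothing.
        double idf = Math.log((double) N / n);
        return normalizedTf * idf;
    }
}
```

      Under these assumptions, a term that appears in every document (n equal to N) scores zero however often it occurs, which is the "offset by the frequency of the word in the corpus" behavior described in the class overview.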
    • rank

      public double rank(Corpus corpus, TextTerms doc, String term, int tf, int n)
      Description copied from interface: RelevanceRanker
      Returns the relevance score between a term and a document based on a corpus.
      Specified by:
      rank in interface RelevanceRanker
      Parameters:
      corpus - the corpus.
      doc - the document to rank.
      term - the search term.
      tf - the term frequency in the document.
      n - the number of documents in the corpus that contain the given term.
      Returns:
      the relevance score.
    • rank

      public double rank(Corpus corpus, TextTerms doc, String[] terms, int[] tf, int n)
      Description copied from interface: RelevanceRanker
      Returns the relevance score between a set of terms and a document based on a corpus.
      Specified by:
      rank in interface RelevanceRanker
      Parameters:
      corpus - the corpus.
      doc - the document to rank.
      terms - the search terms.
      tf - the term frequencies in the document.
      n - the number of documents in the corpus that contain the given term.
      Returns:
      the relevance score.
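
      The class overview notes that one of the simplest ranking functions sums the tf-idf weight of each query term. The sketch below illustrates that idea in isolation. It is not the implementation of the method above: it takes a per-term document-frequency array df, which is a hypothetical parameter, assumes idf = ln(N / df), and uses the invented name QueryScoreSketch.

```java
/** Hypothetical sketch: score a document by summing per-term tf-idf weights. */
final class QueryScoreSketch {
    private QueryScoreSketch() {}

    /**
     * @param tf    frequency of each query term in the document
     * @param maxtf maximum frequency over all terms in the document
     * @param N     number of documents in the corpus
     * @param df    number of documents containing each query term (hypothetical input)
     * @param a     smoothing parameter of maximum tf normalization
     */
    static double score(int[] tf, int maxtf, long N, long[] df, double a) {
        if (tf.length != df.length) {
            throw new IllegalArgumentException("tf and df must have the same length");
        }
        if (maxtf <= 0) {
            throw new IllegalArgumentException("maxtf must be positive");
        }
        double sum = 0.0;
        for (int i = 0; i < tf.length; i++) {
            // Assumption: terms missing from the document or corpus add nothing.
            if (tf[i] <= 0 || df[i] <= 0) {
                continue;
            }
            double normalizedTf = a + (1.0 - a) * tf[i] / maxtf;
            sum += normalizedTf * Math.log((double) N / df[i]);
        }
        return sum;
    }
}
```

      Because the contributions are simply added, a rare query term that occurs in the document dominates the total, while terms found in nearly every document add almost nothing.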