smile.nlp.relevance.TFIDF

All Implemented Interfaces: RelevanceRanker

public class TFIDF extends Object implements RelevanceRanker

The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
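The simple ranking function mentioned above, summing the tf-idf of each query term, can be sketched as follows. This is only an illustration and not the code of this class: the class name `TfIdfSumSketch`, its methods, the map-based inputs, and the choice of the natural logarithm for idf are all assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the Smile API.
class TfIdfSumSketch {
    // Plain tf-idf of one term: tf * log(N / n), using the natural logarithm (an assumption).
    // tf = occurrences of the term in the document, N = documents in the corpus,
    // n = documents containing the term.
    static double tfidf(int tf, long N, long n) {
        if (tf <= 0 || n <= 0) {
            return 0.0; // a term absent from the document or corpus contributes nothing
        }
        return tf * Math.log((double) N / n);
    }

    // Scores one document against a query by summing tf-idf over the query terms.
    static double score(List<String> query, Map<String, Integer> docTf,
                        Map<String, Long> docFreq, long N) {
        double sum = 0.0;
        for (String term : query) {
            int tf = docTf.getOrDefault(term, 0);
            long n = docFreq.getOrDefault(term, 0L);
            sum += tfidf(tf, N, n);
        }
        return sum;
    }

    public static void main(String[] args) {
        long N = 1000; // made-up corpus size
        Map<String, Long> docFreq = Map.of("ranking", 10L, "query", 100L, "the", 1000L);
        Map<String, Integer> docTf = Map.of("ranking", 3, "the", 20);
        // "the" occurs in every document, so its idf is log(1) = 0 and it adds nothing.
        System.out.println(score(List.of("ranking", "query", "the"), docTf, docFreq, N));
    }
}
```

In this made-up corpus, only the rare term "ranking" contributes to the score: "query" does not occur in the document, and "the" occurs in every document of the corpus.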

One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document d, let tfmax(d) be the maximum tf over all terms in d. Then, we compute a normalized term frequency for each term t in document d by

tf = a + (1 - a) tf_t,d / tfmax(d)

where a is a value between 0 and 1 and is generally set to 0.4, although some early work used the value 0.5. The term a is a smoothing term whose role is to damp the contribution of the second term, which may be viewed as a scaling down of tf by the largest tf value in d. The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents, merely because longer documents tend to repeat the same words over and over again. However, maximum tf normalization suffers from the following issues:

- The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.
- A document may contain an outlier term with an unusually large number of occurrences, not representative of the content of that document.
- More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.
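Maximum tf normalization and the outlier issue can be illustrated with a small sketch. It is written under assumptions and is not taken from this class's source: the class name `MaxTfNormalizationSketch`, the method names, the natural-log idf factor, and the convention of returning 0 for a term that does not occur are all choices made for the example.

```java
// Hypothetical illustration; not part of the Smile API.
class MaxTfNormalizationSketch {
    // Normalized term frequency: a + (1 - a) * tf / maxtf.
    // Returning 0 when the term does not occur is an assumption of this sketch.
    static double normalizedTf(int tf, int maxtf, double a) {
        if (tf <= 0 || maxtf <= 0) {
            return 0.0;
        }
        return a + (1.0 - a) * tf / maxtf;
    }

    // Normalized tf multiplied by idf = log(N / n), using the natural logarithm (an assumption).
    static double score(int tf, int maxtf, long N, long n, double a) {
        if (tf <= 0 || n <= 0) {
            return 0.0;
        }
        return normalizedTf(tf, maxtf, a) * Math.log((double) N / n);
    }

    public static void main(String[] args) {
        double a = 0.4; // the commonly used smoothing value
        // The same term with tf = 5 in two documents that differ only in their maximum tf.
        System.out.println(normalizedTf(5, 10, a));  // most frequent term occurs 10 times
        System.out.println(normalizedTf(5, 100, a)); // an outlier term occurs 100 times
        System.out.println(score(5, 10, 1000, 10, a));
    }
}
```

With a = 0.4, a term with tf = 5 gets a normalized weight of about 0.7 when the maximum tf is 10, but only about 0.43 when a single outlier term pushes the maximum tf to 100. This is the second issue in the list above.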

Constructor Summary

| Constructor | Description |
| --- | --- |
| TFIDF() | Constructor. |
| TFIDF(double smoothing) | Constructor. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| double | rank(int tf, int maxtf, long N, long n) | Returns the relevance score between a term and a document based on a corpus. |
| double | rank(Corpus corpus, TextTerms doc, String[] terms, int[] tf, int n) | Returns the relevance score between a set of terms and a document based on a corpus. |
| double | rank(Corpus corpus, TextTerms doc, String term, int tf, int n) | Returns the relevance score between a term and a document based on a corpus. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- TFIDF
  
  public TFIDF()
  
  Constructor.
- TFIDF
  
  public TFIDF(double smoothing)
  
  Constructor.
  
  Parameters:
  
  smoothing - the smoothing parameter in maximum tf normalization.

Method Details
- rank
  
  public double rank(int tf, int maxtf, long N, long n)
  
  Returns the relevance score between a term and a document based on a corpus.
  
  Parameters:
  
  tf - the frequency of the search term in the document to rank.
  
  maxtf - the maximum frequency over all terms in the document.
  
  N - the number of documents in the corpus.
  
  n - the number of documents containing the given term in the corpus.
  
  Returns:
  
  the relevance score.
- rank
  
  public double rank(Corpus corpus, TextTerms doc, String term, int tf, int n)
  
  Description copied from interface: RelevanceRanker
  
  Returns the relevance score between a term and a document based on a corpus.
  
  Specified by:
  
  rank in interface RelevanceRanker
  
  Parameters:
  
  corpus - the corpus.
  
  doc - the document to rank.
  
  term - the search term.
  
  tf - the term frequency in the document.
  
  n - the number of documents containing the given term in the corpus.
  
  Returns:
  
  the relevance score.
- rank
  
  public double rank(Corpus corpus, TextTerms doc, String[] terms, int[] tf, int n)
  
  Description copied from interface: RelevanceRanker
  
  Returns the relevance score between a set of terms and a document based on a corpus.
  
  Specified by:
  
  rank in interface RelevanceRanker
  
  Parameters:
  
  corpus - the corpus.
  
  doc - the document to rank.
  
  terms - the search terms.
  
  tf - the term frequencies in the document.
  
  n - the number of documents containing the given term in the corpus.
  
  Returns:
  
  the relevance score.
