Class TFIDF
- All Implemented Interfaces:
- RelevanceRanker
One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document d, let tfmax(d) be the maximum tf over all terms in d. Then, we compute a normalized term frequency for each term t in document d by
tf = a + (1 - a) * tf_{t,d} / tfmax(d)
where a is a value between 0 and 1 and is generally set to 0.4, although some early work used the value 0.5. The term a is a smoothing term whose role is to damp the contribution of the second term - which may be viewed as a scaling down of tf by the largest tf value in d. The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents, merely because longer documents tend to repeat the same words over and over again. Maximum tf normalization does suffer from the following issues:
- The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.
- A document may contain an outlier term that occurs unusually often and is not representative of the content of that document.
- More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.
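The normalization described above can be sketched in a few lines of Java. This is only an illustration of the formula: the class name `MaxTfNormalization` and the method `normalize` are hypothetical and are not part of the documented API, and the argument checks are an assumption about sensible input.

```java
/** Hypothetical helper illustrating maximum tf normalization; not part of the TFIDF class. */
final class MaxTfNormalization {
    private MaxTfNormalization() {
    }

    /**
     * Computes a + (1 - a) * tf / maxtf.
     *
     * @param tf    the raw frequency of the term in the document.
     * @param maxtf the maximum raw frequency over all terms in the document.
     * @param a     the smoothing parameter, between 0 and 1.
     * @return the normalized term frequency, which lies between a and 1.
     */
    static double normalize(int tf, int maxtf, double a) {
        // Assumed input checks; the source does not specify error handling.
        if (a < 0.0 || a > 1.0) {
            throw new IllegalArgumentException("a must be in [0, 1]: " + a);
        }
        if (maxtf <= 0 || tf < 0 || tf > maxtf) {
            throw new IllegalArgumentException("require 0 <= tf <= maxtf and maxtf > 0");
        }
        // The int operands are promoted to double, so no integer division occurs.
        return a + (1.0 - a) * tf / maxtf;
    }
}
```

With a = 0.4, the most frequent term in a document (tf equal to maxtf) gets a weight of 1, and a term occurring half as often gets a weight of roughly 0.7, so the gap between frequent and infrequent terms is compressed into the range from a to 1.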
Constructor Summary
- TFIDF(): Constructor.
- TFIDF(double smoothing): Constructor.
Method Summary
- double rank(int tf, int maxtf, long N, long n): Returns the relevance score between a term and a document based on a corpus.
- double rank(corpus, doc, terms, tf, n): Returns the relevance score between a set of terms and a document based on a corpus.
- double rank(corpus, doc, term, tf, n): Returns the relevance score between a term and a document based on a corpus.
Constructor Details
- TFIDF
  public TFIDF()
  Constructor.
- TFIDF
  public TFIDF(double smoothing)
  Constructor.
  - Parameters:
    - smoothing - the smoothing parameter in maximum tf normalization.
Method Details
- rank
  public double rank(int tf, int maxtf, long N, long n)
  Returns the relevance score between a term and a document based on a corpus.
  - Parameters:
    - tf - the frequency of the search term in the document to rank.
    - maxtf - the maximum frequency over all terms in the document.
    - N - the number of documents in the corpus.
    - n - the number of documents in the corpus that contain the given term.
  - Returns:
    - the relevance score.
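To show how the four parameters could combine into a score, here is a minimal sketch. It assumes the score is the maximum-tf-normalized term frequency multiplied by an inverse document frequency of log(N / n); that idf variant, the class name `TfIdfSketch`, and the handling of zero counts are assumptions made for illustration, and the documented class may compute its score differently.

```java
/** Hypothetical stand-in for illustration only; not the documented TFIDF implementation. */
final class TfIdfSketch {
    /** The smoothing parameter a used in maximum tf normalization. */
    private final double a;

    TfIdfSketch(double smoothing) {
        if (smoothing < 0.0 || smoothing > 1.0) {
            throw new IllegalArgumentException("smoothing must be in [0, 1]: " + smoothing);
        }
        this.a = smoothing;
    }

    /**
     * Scores a term against a document.
     *
     * @param tf    the frequency of the search term in the document.
     * @param maxtf the maximum frequency over all terms in the document.
     * @param N     the number of documents in the corpus.
     * @param n     the number of documents in the corpus that contain the term.
     */
    double rank(int tf, int maxtf, long N, long n) {
        // Assumption: a term missing from the document or the corpus scores zero.
        if (tf <= 0 || maxtf <= 0 || n <= 0) {
            return 0.0;
        }
        // Maximum tf normalization: a + (1 - a) * tf / maxtf.
        double ntf = a + (1.0 - a) * tf / maxtf;
        // Assumed idf variant: natural log of N / n, computed in floating point.
        double idf = Math.log((double) N / n);
        return ntf * idf;
    }
}
```

Under these assumptions, a term that appears in every document (n equal to N) scores zero regardless of its frequency, while the same tf and maxtf yield a higher score as n shrinks relative to N.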
 
- rank
  Description copied from interface: RelevanceRanker
  Returns the relevance score between a term and a document based on a corpus.
  - Specified by:
    - rank in interface RelevanceRanker
  - Parameters:
    - corpus - the corpus.
    - doc - the document to rank.
    - term - the search term.
    - tf - the term frequency in the document.
    - n - the number of documents in the corpus that contain the given term.
  - Returns:
    - the relevance score.
 
- rank
  Description copied from interface: RelevanceRanker
  Returns the relevance score between a set of terms and a document based on a corpus.
  - Specified by:
    - rank in interface RelevanceRanker
  - Parameters:
    - corpus - the corpus.
    - doc - the document to rank.
    - terms - the search terms.
    - tf - the term frequencies in the document.
    - n - the number of documents in the corpus that contain the given term.
  - Returns:
    - the relevance score.