Constructor and Description |
---|
SimpleCorpus() - Constructor. |
SimpleCorpus(SentenceSplitter splitter, Tokenizer tokenizer, StopWords stopWords, Punctuations punctuations) - Constructor. |

Modifier and Type | Method and Description |
---|---|
Text | add(Text text) - Add a document to the corpus. |
int | getAverageDocumentSize() - Returns the average size of documents in the corpus. |
int | getBigramFrequency(Bigram bigram) - Returns the total frequency of the bigram in the corpus. |
java.util.Iterator<Bigram> | getBigrams() - Returns an iterator over the bigrams in the corpus. |
long | getNumBigrams() - Returns the number of bigrams in the corpus. |
int | getNumDocuments() - Returns the number of documents in the corpus. |
int | getNumTerms() - Returns the number of unique terms in the corpus. |
int | getTermFrequency(java.lang.String term) - Returns the total frequency of the term in the corpus. |
java.util.Iterator<java.lang.String> | getTerms() - Returns an iterator over the terms in the corpus. |
java.util.Iterator<Relevance> | search(RelevanceRanker ranker, java.lang.String term) - Returns an iterator over the set of documents containing the given term in descending order of relevance. |
java.util.Iterator<Relevance> | search(RelevanceRanker ranker, java.lang.String[] terms) - Returns an iterator over the set of documents containing (at least one of) the given terms in descending order of relevance. |
java.util.Iterator<Text> | search(java.lang.String term) - Returns an iterator over the set of documents containing the given term. |
java.util.Iterator<Text> | size() - Returns the number of words in the corpus. |
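
The counting methods in the summary above can be illustrated with a small stand-alone sketch. `TinyCorpus` is a hypothetical class written for this example, not the library's `SimpleCorpus`: it takes plain strings instead of `Text` objects, assumes a naive lower-casing tokenizer, and does not filter stop words or punctuation sets, so its counts may differ from what the real class reports.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in that mirrors only the counting methods listed above.
class TinyCorpus {
    private final List<String[]> docs = new ArrayList<>();
    private final Map<String, Integer> termFreq = new HashMap<>();
    private long size = 0;

    // Assumed tokenization: lower-case, then split on runs of non-letters.
    void add(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) words.add(w);
        }
        docs.add(words.toArray(new String[0]));
        for (String w : words) termFreq.merge(w, 1, Integer::sum);
        size += words.size();
    }

    // Total number of words over all documents.
    long size() { return size; }

    int getNumDocuments() { return docs.size(); }

    // Number of distinct terms.
    int getNumTerms() { return termFreq.size(); }

    // Total occurrences of the term, or 0 if it never appears.
    int getTermFrequency(String term) { return termFreq.getOrDefault(term, 0); }

    // Integer average, truncated toward zero; 0 for an empty corpus.
    int getAverageDocumentSize() {
        return docs.isEmpty() ? 0 : (int) (size / docs.size());
    }
}
```

Keeping a running word count and a term-frequency map makes every query above a constant-time lookup, at the cost of a little extra work in `add`.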
public SimpleCorpus()
public SimpleCorpus(SentenceSplitter splitter, Tokenizer tokenizer, StopWords stopWords, Punctuations punctuations)
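Parameters:
splitter - the sentence splitter.
tokenizer - the word tokenizer.
stopWords - the set of stop words to exclude.
punctuations - the set of punctuation marks to exclude. Set to null to keep all punctuation marks.

public long size()
Description copied from interface: Corpus
Returns the number of words in the corpus.
Specified by: size in interface Corpus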
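
public int getNumDocuments()
Description copied from interface: Corpus
Returns the number of documents in the corpus.
Specified by: getNumDocuments in interface Corpus

public int getNumTerms()
Description copied from interface: Corpus
Returns the number of unique terms in the corpus.
Specified by: getNumTerms in interface Corpus

public long getNumBigrams()
Description copied from interface: Corpus
Returns the number of bigrams in the corpus.
Specified by: getNumBigrams in interface Corpus

public int getAverageDocumentSize()
Description copied from interface: Corpus
Returns the average size of documents in the corpus.
Specified by: getAverageDocumentSize in interface Corpus

public int getTermFrequency(java.lang.String term)
Description copied from interface: Corpus
Returns the total frequency of the term in the corpus.
Specified by: getTermFrequency in interface Corpus

public int getBigramFrequency(Bigram bigram)
Description copied from interface: Corpus
Returns the total frequency of the bigram in the corpus.
Specified by: getBigramFrequency in interface Corpus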
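
As a rough illustration of what a bigram frequency is, the sketch below counts adjacent word pairs across already-tokenized documents. `BigramCounter` is a hypothetical helper, and keying bigrams as `"w1 w2"` strings is an assumption made for brevity; the library uses its own `Bigram` type, and how it treats stop words or sentence boundaries is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: counts adjacent word pairs in tokenized documents.
class BigramCounter {
    // Each inner array is one document's tokens, in order.
    static Map<String, Integer> count(String[][] docs) {
        Map<String, Integer> freq = new HashMap<>();
        for (String[] words : docs) {
            // Pairs never span two documents.
            for (int i = 0; i + 1 < words.length; i++) {
                freq.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
            }
        }
        return freq;
    }
}
```

With this representation, the size of the returned map plays the role of `getNumBigrams()` and a map lookup plays the role of `getBigramFrequency(...)`.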
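
public java.util.Iterator<java.lang.String> getTerms()
Description copied from interface: Corpus
Returns an iterator over the terms in the corpus.
Specified by: getTerms in interface Corpus

public java.util.Iterator<Bigram> getBigrams()
Description copied from interface: Corpus
Returns an iterator over the bigrams in the corpus.
Specified by: getBigrams in interface Corpus

public java.util.Iterator<Text> search(java.lang.String term)
Description copied from interface: Corpus
Returns an iterator over the set of documents containing the given term.
Specified by: search in interface Corpus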
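
One common way to answer "which documents contain this term" is an inverted index that maps each term to the documents it occurs in. The sketch below is a minimal version under stated assumptions: `InvertedIndex` is a hypothetical class, documents are plain strings rather than `Text` objects, tokenization is a naive lower-case split, and nothing here is claimed to match the library's internal data structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical inverted index: term -> ids of the documents containing it.
class InvertedIndex {
    private final List<String> docs = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(String text) {
        int id = docs.size();
        docs.add(text);
        // Assumed tokenization: lower-case, then split on runs of non-letters.
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) {
                postings.computeIfAbsent(w, k -> new LinkedHashSet<>()).add(id);
            }
        }
    }

    // Documents containing the term, in insertion order; empty if none match.
    Iterator<String> search(String term) {
        List<String> result = new ArrayList<>();
        for (int id : postings.getOrDefault(term, Collections.emptySet())) {
            result.add(docs.get(id));
        }
        return result.iterator();
    }
}
```

Because terms are lower-cased at indexing time, queries in this sketch must be lower-case as well.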
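
public java.util.Iterator<Relevance> search(RelevanceRanker ranker, java.lang.String term)
Description copied from interface: Corpus
Returns an iterator over the set of documents containing the given term in descending order of relevance.
Specified by: search in interface Corpus

public java.util.Iterator<Relevance> search(RelevanceRanker ranker, java.lang.String[] terms)
Description copied from interface: Corpus
Returns an iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
Specified by: search in interface Corpus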
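
The ranked variants of `search` return matching documents in descending order of relevance. The sketch below shows one possible way to do that, using a simple TF-IDF score as the ranking function. Everything in it is an assumption for illustration: `RankedSearch` and `Hit` are hypothetical names standing in for the ranker and for `Relevance`, the idf formula `log(1 + N/df)` is one of several common variants, and the scoring actually applied depends on the `RelevanceRanker` passed to the real method.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical ranked search over tokenized documents using a TF-IDF score.
class RankedSearch {
    // Hypothetical result type: a document index and its relevance score.
    static final class Hit {
        final int doc;
        final double score;
        Hit(int doc, double score) { this.doc = doc; this.score = score; }
    }

    // Number of times the term occurs in one document.
    static int tf(String[] doc, String term) {
        int n = 0;
        for (String w : doc) if (w.equals(term)) n++;
        return n;
    }

    // Returns documents containing at least one query term, best score first.
    static List<Hit> search(String[][] docs, String... terms) {
        double[] scores = new double[docs.length];
        for (String term : terms) {
            int df = 0; // number of documents containing the term
            for (String[] d : docs) if (tf(d, term) > 0) df++;
            if (df == 0) continue; // term absent from the corpus
            double idf = Math.log(1.0 + (double) docs.length / df);
            for (int i = 0; i < docs.length; i++) {
                scores[i] += tf(docs[i], term) * idf;
            }
        }
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            if (scores[i] > 0) hits.add(new Hit(i, scores[i]));
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }
}
```

Since the idf term is always positive in this variant, a document scores above zero exactly when it contains at least one query term, which gives the "at least one of the given terms" behavior described above.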