Class SimpleCorpus
java.lang.Object
smile.nlp.SimpleCorpus
- All Implemented Interfaces:
Corpus
-
Constructor Summary
Constructors
Constructor - Description
SimpleCorpus() - Constructor.
SimpleCorpus(SentenceSplitter splitter, Tokenizer tokenizer, StopWords stopWords, Punctuations punctuations) - Constructor.
-
Method Summary
Modifier and Type, Method - Description
add - Adds a document to the corpus.
int avgDocSize() - Returns the average size of documents in the corpus.
long bigramCount() - Returns the number of bigrams in the corpus.
bigrams() - Returns the iterator over the bigrams in the corpus.
int count - Returns the total frequency of the term in the corpus.
int count - Returns the total frequency of the bigram in the corpus.
int docCount() - Returns the number of documents in the corpus.
search - Returns the iterator over the set of documents containing the given term.
search(RelevanceRanker ranker, String term) - Returns the iterator over the set of documents containing the given term in descending order of relevance.
search(RelevanceRanker ranker, String[] terms) - Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
long size() - Returns the number of words in the corpus.
int termCount - Returns the number of unique terms in the corpus.
terms() - Returns the iterator over the terms in the corpus.
-
Constructor Details
-
SimpleCorpus
public SimpleCorpus()
Constructor.
-
SimpleCorpus
public SimpleCorpus(SentenceSplitter splitter, Tokenizer tokenizer, StopWords stopWords, Punctuations punctuations)
Constructor.
- Parameters:
splitter - the sentence splitter.
tokenizer - the word tokenizer.
stopWords - the set of stop words to exclude.
punctuations - the set of punctuation marks to exclude. Set to null to keep all punctuation marks.
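The parameter descriptions above imply a filtering step: stop words are dropped, and punctuation marks are dropped only when a punctuation set is supplied. The following is a minimal sketch of that behavior, not Smile's implementation. The class `TokenFilterSketch`, its `filter` method, the whitespace tokenization, and the exact-match comparison are all assumptions made for illustration; plain `Set<String>` values stand in for `StopWords` and `Punctuations`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the filtering step; not part of Smile. */
final class TokenFilterSketch {
    /**
     * Splits a sentence on whitespace, then drops stop words and, if a
     * punctuation set is given, punctuation marks. Passing null for
     * punctuations keeps all punctuation marks.
     */
    static List<String> filter(String sentence, Set<String> stopWords, Set<String> punctuations) {
        List<String> kept = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (token.isEmpty()) continue;                    // empty input yields one empty token
            if (stopWords.contains(token)) continue;          // assumption: exact, case-sensitive match
            if (punctuations != null && punctuations.contains(token)) continue;
            kept.add(token);
        }
        return kept;
    }
}
```

With stop words `the` and `on`, the sentence `the cat sat on the mat .` keeps `cat`, `sat`, `mat` when `.` is in the punctuation set, and additionally keeps `.` when the punctuation set is null.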
-
-
Method Details
-
add
-
size
-
docCount
-
termCount
-
bigramCount
public long bigramCount()
Description copied from interface: Corpus
Returns the number of bigrams in the corpus.
- Specified by:
bigramCount in interface Corpus
- Returns:
- the number of bigrams in the corpus.
-
avgDocSize
public int avgDocSize()
Description copied from interface: Corpus
Returns the average size of documents in the corpus.
- Specified by:
avgDocSize in interface Corpus
- Returns:
- the average size of documents in the corpus.
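To make the counting methods in this section concrete, here is a small self-contained sketch of how such statistics could relate to one another. It is not Smile's implementation. The class `CorpusStatsSketch` is hypothetical, documents are assumed to arrive already tokenized, a bigram is assumed to be a distinct pair of adjacent words, and the truncating integer division in `avgDocSize` is an assumption suggested only by the `int` return type.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of corpus statistics; not Smile's implementation. */
final class CorpusStatsSketch {
    private long size;      // total number of words added so far
    private int docCount;   // number of documents added so far
    private final Map<String, Integer> termFreq = new HashMap<>();
    private final Set<String> bigrams = new HashSet<>();

    /** Adds one already-tokenized document. */
    void add(List<String> words) {
        docCount++;
        size += words.size();
        for (int i = 0; i < words.size(); i++) {
            termFreq.merge(words.get(i), 1, Integer::sum);
            if (i + 1 < words.size()) {
                // Assumption: each distinct pair of adjacent words counts once.
                bigrams.add(words.get(i) + " " + words.get(i + 1));
            }
        }
    }

    long size() { return size; }
    int docCount() { return docCount; }
    int termCount() { return termFreq.size(); }
    long bigramCount() { return bigrams.size(); }
    int count(String term) { return termFreq.getOrDefault(term, 0); }

    /** Assumption: integer division, so the average is truncated. */
    int avgDocSize() { return docCount == 0 ? 0 : (int) (size / docCount); }
}
```

Adding the documents `[a, b, a]` and `[b, c]` gives a size of 5 words, 2 documents, 3 unique terms, 3 distinct bigrams, and a truncated average document size of 2.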
-
count
-
count
-
terms
-
bigrams
-
search
-
search
Description copied from interface: Corpus
Returns the iterator over the set of documents containing the given term in descending order of relevance.
-
search
Description copied from interface: Corpus
Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
-
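The multi-term search described above returns documents that contain at least one of the terms, most relevant first. A minimal sketch of that contract follows. It is not Smile's implementation: the class `SearchSketch` is hypothetical, documents are plain word lists, and the relevance score is simply the number of occurrences of the query terms, a stand-in for whatever a `RelevanceRanker` would compute.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of ranked multi-term search; not part of Smile. */
final class SearchSketch {
    /** Returns indices of documents containing at least one query term, most relevant first. */
    static List<Integer> search(List<List<String>> docs, String... terms) {
        int[] score = new int[docs.size()];
        List<Integer> hits = new ArrayList<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String word : docs.get(d)) {
                for (String term : terms) {
                    // Stand-in relevance: raw count of query-term occurrences.
                    if (word.equals(term)) score[d]++;
                }
            }
            if (score[d] > 0) hits.add(d);   // keep only documents that match some term
        }
        // List.sort is stable, so tied documents keep their original order.
        hits.sort(Comparator.comparingInt((Integer d) -> score[d]).reversed());
        return hits;
    }
}
```

For the documents `[apple, pie]`, `[banana]`, `[apple, apple, tart]`, and `[pie, apple, pie]`, searching for `apple` and `pie` skips the second document and ranks the fourth first, because it has three matching words against two in each of the others.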