Package smile.nlp

Class SimpleCorpus

java.lang.Object
smile.nlp.SimpleCorpus
All Implemented Interfaces:
Corpus

public class SimpleCorpus extends Object implements Corpus
An in-memory text corpus. Useful for text feature engineering.
  • Constructor Details

    • SimpleCorpus

      public SimpleCorpus()
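      Creates an empty corpus that uses default settings for sentence splitting, tokenization, stop words, and punctuation.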
    • SimpleCorpus

      public SimpleCorpus(SentenceSplitter splitter, Tokenizer tokenizer, StopWords stopWords, Punctuations punctuations)
      Creates an empty corpus that processes added documents with the given components.
      Parameters:
      splitter - the sentence splitter.
      tokenizer - the word tokenizer.
      stopWords - the set of stop words to exclude.
      punctuations - the set of punctuation marks to exclude. Set to null to keep all punctuation marks.
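
The four constructor parameters describe a preprocessing pipeline: split text into sentences, split sentences into tokens, then drop stop words and punctuation. The following is a minimal sketch of that idea using only the JDK. `ToyPipeline`, its word sets, and its regular expressions are hypothetical stand-ins for illustration, not the Smile classes or their behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a splitter/tokenizer/filter pipeline; not Smile code.
class ToyPipeline {
    static final Set<String> STOP_WORDS = Set.of("the", "a", "is", "on");
    static final Set<String> PUNCTUATIONS = Set.of(".", ",", "!", "?");

    // Splits text into sentences, then tokens, and drops stop words and punctuation.
    // Passing null for punctuations keeps all punctuation marks.
    static List<String> process(String text, Set<String> stopWords, Set<String> punctuations) {
        List<String> kept = new ArrayList<>();
        // Naive sentence split: break on whitespace that follows '.', '!' or '?'.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            // Naive tokenization: split on whitespace and around punctuation marks.
            for (String token : sentence.split("\\s+|(?=[.,!?])|(?<=[.,!?])")) {
                if (token.isEmpty()) continue;
                String word = token.toLowerCase();
                if (stopWords.contains(word)) continue;
                if (punctuations != null && punctuations.contains(word)) continue;
                kept.add(word);
            }
        }
        return kept;
    }
}
```

Calling `ToyPipeline.process(text, ToyPipeline.STOP_WORDS, null)` keeps the punctuation tokens, which mirrors the documented meaning of a null `punctuations` argument.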
  • Method Details

    • add

      public Text add(Text text)
      Adds a document to the corpus.
      Parameters:
      text - the document text.
      Returns:
      the document.
    • size

      public long size()
      Description copied from interface: Corpus
      Returns the number of words in the corpus.
      Specified by:
      size in interface Corpus
      Returns:
      the number of words in the corpus.
    • ndoc

      public int ndoc()
      Description copied from interface: Corpus
      Returns the number of documents in the corpus.
      Specified by:
      ndoc in interface Corpus
      Returns:
      the number of documents in the corpus.
    • nterm

      public int nterm()
      Description copied from interface: Corpus
      Returns the number of unique terms in the corpus.
      Specified by:
      nterm in interface Corpus
      Returns:
      the number of unique terms in the corpus.
    • nbigram

      public long nbigram()
      Description copied from interface: Corpus
      Returns the number of bigrams in the corpus.
      Specified by:
      nbigram in interface Corpus
      Returns:
      the number of bigrams in the corpus.
    • avgDocSize

      public int avgDocSize()
      Description copied from interface: Corpus
      Returns the average size of documents in the corpus.
      Specified by:
      avgDocSize in interface Corpus
      Returns:
      the average size of documents in the corpus.
    • count

      public int count(String term)
      Description copied from interface: Corpus
      Returns the total frequency of the term in the corpus.
      Specified by:
      count in interface Corpus
      Parameters:
      term - the term.
      Returns:
      the total frequency of the term in the corpus.
    • count

      public int count(Bigram bigram)
      Description copied from interface: Corpus
      Returns the total frequency of the bigram in the corpus.
      Specified by:
      count in interface Corpus
      Parameters:
      bigram - the bigram.
      Returns:
      the total frequency of the bigram in the corpus.
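
To make the statistics above concrete, here is a small sketch of how an in-memory corpus could maintain them with two frequency maps. `ToyCorpusStats` is a hypothetical class written for this illustration. Counting `nbigram` as the number of distinct bigrams and computing `avgDocSize` with integer division are assumptions of the sketch, not statements about the Smile implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy corpus that tracks word, document, term and bigram counts.
class ToyCorpusStats {
    private long size = 0;   // total number of words
    private int ndoc = 0;    // number of documents
    private final Map<String, Integer> termFreq = new HashMap<>();
    private final Map<String, Integer> bigramFreq = new HashMap<>();

    // Adds one document given as an already tokenized list of words.
    void add(List<String> words) {
        ndoc++;
        size += words.size();
        for (int i = 0; i < words.size(); i++) {
            termFreq.merge(words.get(i), 1, Integer::sum);
            if (i + 1 < words.size()) {
                // A bigram is a pair of adjacent words within the same document.
                bigramFreq.merge(words.get(i) + " " + words.get(i + 1), 1, Integer::sum);
            }
        }
    }

    long size() { return size; }
    int ndoc() { return ndoc; }
    int nterm() { return termFreq.size(); }
    // Assumption: counts distinct bigrams rather than bigram occurrences.
    long nbigram() { return bigramFreq.size(); }
    // Assumption: integer division, chosen to match the int return type.
    int avgDocSize() { return ndoc == 0 ? 0 : (int) (size / ndoc); }
    int count(String term) { return termFreq.getOrDefault(term, 0); }
    int count(String first, String second) { return bigramFreq.getOrDefault(first + " " + second, 0); }
}
```

Keeping the counts in hash maps makes each frequency lookup a constant-time operation on average, at the cost of holding every distinct term and bigram in memory.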
    • terms

      public Iterator<String> terms()
      Description copied from interface: Corpus
      Returns the iterator over the terms in the corpus.
      Specified by:
      terms in interface Corpus
      Returns:
      the iterator of terms.
    • bigrams

      public Iterator<Bigram> bigrams()
      Description copied from interface: Corpus
      Returns the iterator over the bigrams in the corpus.
      Specified by:
      bigrams in interface Corpus
      Returns:
      the iterator of bigrams.
    • search

      public Iterator<Text> search(String term)
      Description copied from interface: Corpus
      Returns the iterator over the set of documents containing the given term.
      Specified by:
      search in interface Corpus
      Parameters:
      term - the search term.
      Returns:
      the iterator of documents containing the term.
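
A term lookup like this is commonly backed by an inverted index, which maps each term to the documents that contain it. The sketch below shows that technique with plain JDK collections. `ToyInvertedIndex` is hypothetical, documents are plain strings rather than `Text` objects, and returning an empty iterator for an unknown term is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical inverted index from a term to the documents containing it.
class ToyInvertedIndex {
    private final Map<String, List<String>> index = new HashMap<>();

    // Registers the document once under each distinct word it contains.
    void add(String doc) {
        Set<String> words = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
        for (String word : words) {
            index.computeIfAbsent(word, k -> new ArrayList<>()).add(doc);
        }
    }

    // Returns an iterator over matching documents, in insertion order.
    // Assumption: an unknown term yields an empty iterator rather than null.
    Iterator<String> search(String term) {
        return index.getOrDefault(term, Collections.emptyList()).iterator();
    }
}
```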
    • search

      public Iterator<Relevance> search(RelevanceRanker ranker, String term)
      Description copied from interface: Corpus
      Returns the iterator over the set of documents containing the given term in descending order of relevance.
      Specified by:
      search in interface Corpus
      Parameters:
      ranker - the relevance ranker.
      term - the search term.
      Returns:
      the iterator of documents in descending order of relevance.
    • search

      public Iterator<Relevance> search(RelevanceRanker ranker, String[] terms)
      Description copied from interface: Corpus
      Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
      Specified by:
      search in interface Corpus
      Parameters:
      ranker - the relevance ranker.
      terms - the search terms.
      Returns:
      the iterator of documents in descending order of relevance.
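
The ranked searches can be pictured as two steps: score every document that contains at least one query term, then sort by score in descending order. The sketch below uses a simple TF-IDF sum as the score. `ToyRankedSearch`, its `Hit` class, and the smoothed formula `tf * log((1 + N) / df)` are assumptions chosen for illustration; they are not the `RelevanceRanker` or `Relevance` types from Smile.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical ranked search over plain strings using a summed TF-IDF score.
class ToyRankedSearch {
    // Hypothetical result type pairing a document with its relevance score.
    static final class Hit {
        final String doc;
        final double score;
        Hit(String doc, double score) { this.doc = doc; this.score = score; }
    }

    // Returns documents containing at least one term, most relevant first.
    static List<Hit> search(List<String> docs, String[] terms) {
        List<Hit> hits = new ArrayList<>();
        for (String doc : docs) {
            double score = 0.0;
            boolean matched = false;
            for (String term : terms) {
                int tf = count(doc, term);
                if (tf == 0) continue;
                matched = true;
                // Document frequency: number of documents containing the term.
                int df = 0;
                for (String d : docs) {
                    if (count(d, term) > 0) df++;
                }
                score += tf * Math.log((1.0 + docs.size()) / df);
            }
            if (matched) hits.add(new Hit(doc, score));
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }

    // Counts occurrences of the term among the whitespace-separated words.
    static int count(String doc, String term) {
        int n = 0;
        for (String word : doc.toLowerCase().split("\\s+")) {
            if (word.equals(term)) n++;
        }
        return n;
    }
}
```

Documents that match none of the terms are omitted entirely, which corresponds to the "at least one of" wording in the method description.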