Class SimpleCorpus

java.lang.Object
smile.nlp.SimpleCorpus
All Implemented Interfaces:
Corpus

public class SimpleCorpus extends Object implements Corpus
An in-memory document corpus. Useful for text feature engineering.
  • Constructor Details

    • SimpleCorpus

      public SimpleCorpus()
      Creates an empty corpus that uses default text-processing components.
    • SimpleCorpus

      public SimpleCorpus(SentenceSplitter splitter, Tokenizer tokenizer, StopWords stopWords, Punctuations punctuations)
      Creates an empty corpus that uses the given text-processing components.
      Parameters:
      splitter - the sentence splitter.
      tokenizer - the word tokenizer.
      stopWords - the set of stop words to exclude.
      punctuations - the set of punctuation marks to exclude. Set to null to keep all punctuation marks.
  • Method Details

    • doc

      public Document doc(String text)
      Creates a document from the given text using the corpus's sentence splitter, tokenizer, stop word filter, and punctuation filter.
      Parameters:
      text - the text content.
      Returns:
      the document with a unique id.
    • doc

      public Document doc(Text text)
      Creates a document from the given text using the corpus's sentence splitter, tokenizer, stop word filter, and punctuation filter.
      Parameters:
      text - the text.
      Returns:
      the document with a unique id.
    • add

      public void add(Document doc)
      Adds a document to the corpus.
      Parameters:
      doc - the document.
    • size

      public long size()
      Description copied from interface: Corpus
      Returns the number of words in the corpus.
      Specified by:
      size in interface Corpus
      Returns:
      the number of words in the corpus.
    • docCount

      public int docCount()
      Description copied from interface: Corpus
      Returns the number of documents in the corpus.
      Specified by:
      docCount in interface Corpus
      Returns:
      the number of documents in the corpus.
    • termCount

      public int termCount()
      Description copied from interface: Corpus
      Returns the number of unique terms in the corpus.
      Specified by:
      termCount in interface Corpus
      Returns:
      the number of unique terms in the corpus.
    • bigramCount

      public long bigramCount()
      Description copied from interface: Corpus
      Returns the number of bigrams in the corpus.
      Specified by:
      bigramCount in interface Corpus
      Returns:
      the number of bigrams in the corpus.
    • avgDocSize

      public int avgDocSize()
      Description copied from interface: Corpus
      Returns the average size of documents in the corpus.
      Specified by:
      avgDocSize in interface Corpus
      Returns:
      the average size of documents in the corpus.
    • count

      public int count(String term)
      Description copied from interface: Corpus
      Returns the total frequency of the term in the corpus.
      Specified by:
      count in interface Corpus
      Parameters:
      term - the term.
      Returns:
      the total frequency of the term in the corpus.
    • count

      public int count(Bigram bigram)
      Description copied from interface: Corpus
      Returns the total frequency of the bigram in the corpus.
      Specified by:
      count in interface Corpus
      Parameters:
      bigram - the bigram.
      Returns:
      the total frequency of the bigram in the corpus.
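
The statistics above (size, docCount, termCount, avgDocSize, count) can be illustrated with a small stand-alone model. This is a minimal sketch, not the SimpleCorpus implementation: the class ToyCorpusStats is hypothetical, and it assumes lowercase whitespace tokenization, no stop word or punctuation filtering, and an average document size truncated to an integer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model of the corpus statistics; not Smile's SimpleCorpus. */
class ToyCorpusStats {
    private final List<String[]> docs = new ArrayList<>();
    private final Map<String, Integer> termFreq = new HashMap<>();
    private long size = 0; // total number of word tokens over all documents

    /** Adds a document. Assumption: lowercase whitespace tokenization, no filtering. */
    void add(String text) {
        String trimmed = text.trim().toLowerCase();
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        docs.add(words);
        size += words.length;
        for (String w : words) {
            termFreq.merge(w, 1, Integer::sum);
        }
    }

    /** Total number of words in the corpus. */
    long size() { return size; }

    /** Number of documents in the corpus. */
    int docCount() { return docs.size(); }

    /** Number of unique terms in the corpus. */
    int termCount() { return termFreq.size(); }

    /** Total frequency of the term over all documents (0 if unseen). */
    int count(String term) { return termFreq.getOrDefault(term, 0); }

    /** Average document size. Assumption: truncated integer division. */
    int avgDocSize() { return docs.isEmpty() ? 0 : (int) (size / docs.size()); }
}
```

With the two documents "the cat sat" and "the dog sat down quietly", this model holds 8 words, 2 documents, and 6 unique terms, and the term "the" has a total frequency of 2.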
    • terms

      public Iterator<String> terms()
      Description copied from interface: Corpus
      Returns the iterator over the terms in the corpus.
      Specified by:
      terms in interface Corpus
      Returns:
      the iterator of terms.
    • bigrams

      public Iterator<Bigram> bigrams()
      Description copied from interface: Corpus
      Returns the iterator over the bigrams in the corpus.
      Specified by:
      bigrams in interface Corpus
      Returns:
      the iterator of bigrams.
    • search

      public Iterator<Text> search(String term)
      Description copied from interface: Corpus
      Returns the iterator over the set of documents containing the given term.
      Specified by:
      search in interface Corpus
      Parameters:
      term - the search term.
      Returns:
      the iterator of documents containing the term.
    • search

      public Iterator<Relevance> search(RelevanceRanker ranker, String term)
      Description copied from interface: Corpus
      Returns the iterator over the set of documents containing the given term in descending order of relevance.
      Specified by:
      search in interface Corpus
      Parameters:
      ranker - the relevance ranker.
      term - the search term.
      Returns:
      the iterator of documents in descending order of relevance.
    • search

      public Iterator<Relevance> search(RelevanceRanker ranker, String[] terms)
      Description copied from interface: Corpus
      Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
      Specified by:
      search in interface Corpus
      Parameters:
      ranker - the relevance ranker.
      terms - the search terms.
      Returns:
      the iterator of documents in descending order of relevance.
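
The search methods above can be pictured as lookups in an inverted index followed by a sort on a relevance score. The sketch below is illustrative only: ToySearch is a hypothetical class, and the score (the summed raw term frequency of the query terms) is a stand-in assumption for whatever a RelevanceRanker would compute.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical inverted-index search; not Smile's implementation. */
class ToySearch {
    // term -> (document id -> frequency of the term in that document)
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();
    private int nextId = 0;

    /** Indexes a document and returns its id. Assumption: lowercase whitespace tokens. */
    int add(String text) {
        int id = nextId++;
        for (String w : text.toLowerCase().split("\\s+")) {
            if (w.isEmpty()) continue;
            index.computeIfAbsent(w, k -> new HashMap<>()).merge(id, 1, Integer::sum);
        }
        return id;
    }

    /** Ids of documents containing at least one of the terms, highest score first. */
    List<Integer> search(String... terms) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String t : terms) {
            Map<Integer, Integer> postings =
                index.getOrDefault(t, Collections.emptyMap());
            // Stand-in score: sum of raw term frequencies of the matched terms.
            postings.forEach((id, tf) -> scores.merge(id, tf, Integer::sum));
        }
        List<Integer> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a)); // higher score first
            return byScore != 0 ? byScore : Integer.compare(a, b);       // ties: lower id first
        });
        return ids;
    }
}
```

A document that matches none of the terms never enters the score map, which mirrors the "containing (at least one of) the given terms" wording above.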
    • bigrams

      public List<Bigram> bigrams(int k, int minFrequency)
      Finds the top k bigram collocations in the corpus.
      Parameters:
      k - the number of top bigrams to return.
      minFrequency - the minimum frequency of a bigram in the corpus.
      Returns:
      the significant bigram collocations in descending order of likelihood ratio.
    • bigrams

      public List<Bigram> bigrams(double p, int minFrequency)
      Finds bigram collocations in the corpus whose p-value is less than the given threshold.
      Parameters:
      p - the p-value threshold.
      minFrequency - the minimum frequency of a bigram in the corpus.
      Returns:
      the significant bigram collocations in descending order of likelihood ratio.
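
The likelihood ratio used to order collocations can be sketched with the standard binomial likelihood ratio test for bigrams. This is a minimal sketch under stated assumptions, not necessarily the exact computation in SimpleCorpus: the class ToyLikelihoodRatio and its parameter names are hypothetical, c1 and c2 are the counts of the first and second word, c12 is the count of the bigram, n is the total number of words, and 0 < c1 < n is required. Under the independence hypothesis the returned statistic is asymptotically chi-square distributed with one degree of freedom, which is one way a p-value threshold can be related to a score cutoff.

```java
/** Hypothetical helper computing -2 log(lambda) for a bigram; not Smile's code. */
class ToyLikelihoodRatio {
    /** k * log(x), with the convention 0 * log(0) = 0. */
    private static double xlogy(long k, double x) {
        return k == 0 ? 0.0 : k * Math.log(x);
    }

    /** Binomial log-likelihood of k successes in n trials (constant term dropped). */
    private static double logL(long k, long n, double x) {
        return xlogy(k, x) + xlogy(n - k, 1.0 - x);
    }

    /**
     * Returns -2 log(lambda) for the bigram (w1, w2).
     * c1, c2: counts of w1 and w2; c12: count of the bigram; n: total words.
     * Precondition (assumption of this sketch): 0 < c1 < n.
     */
    static double score(long c1, long c2, long c12, long n) {
        double p = (double) c2 / n;                  // P(w2) under independence
        double p1 = (double) c12 / c1;               // P(w2 | w1)
        double p2 = (double) (c2 - c12) / (n - c1);  // P(w2 | not w1)
        double logLambda = logL(c12, c1, p) + logL(c2 - c12, n - c1, p)
                         - logL(c12, c1, p1) - logL(c2 - c12, n - c1, p2);
        return -2.0 * logLambda;
    }
}
```

When the bigram occurs exactly as often as independence predicts (for example c1 = c2 = 100, c12 = 10, n = 1000), the three probabilities coincide and the score is zero up to rounding; the score grows as c12 moves away from that expectation.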