Class NGram

java.lang.Object
smile.nlp.NGram
smile.nlp.collocation.NGram
All Implemented Interfaces:
Comparable<NGram>

public class NGram extends NGram implements Comparable<NGram>
An n-gram is a contiguous sequence of n words from a given sequence of text. An n-gram of size 1 is referred to as a unigram; size 2 is a bigram; size 3 is a trigram.
  • Field Details

    • count

      public final int count
      The frequency of the n-gram in the corpus.
  • Constructor Details

    • NGram

      public NGram(String[] words, int count)
      Constructor.
      Parameters:
      words - the n-gram word sequence.
      count - the frequency of the n-gram in the corpus.
  • Method Details

    • toString

      public String toString()
      Overrides:
      toString in class NGram
    • compareTo

      public int compareTo(NGram o)
      Specified by:
      compareTo in interface Comparable<NGram>
    • of

      public static NGram[][] of(Collection<String[]> sentences, int maxNGramSize, int minFrequency)
      Extracts n-gram phrases by an Apriori-like algorithm. The algorithm was proposed in "A Study Using n-gram Features for Text Categorization" by Johannes Fürnkranz.

      The algorithm takes a collection of sentences and generates all n-grams of length at most maxNGramSize that occur at least minFrequency times in the sentences.

      Parameters:
      sentences - A collection of sentences, each already split into words.
      maxNGramSize - The maximum length of an n-gram.
      minFrequency - The minimum frequency of an n-gram in the sentences.
      Returns:
      An array of n-gram sets. The i-th entry is the set of i-grams.
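
      The Apriori-like idea described for of can be illustrated with a minimal, self-contained sketch. It is not Smile's implementation: the class NGramSketch, its extract and join methods, and the choice to represent each n-gram as a space-joined String in a Map are all assumptions made for illustration. The sketch counts n-grams level by level and only counts an n-gram when both of its (n-1)-gram sub-sequences were frequent at the previous level.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Smile.
class NGramSketch {
    /**
     * Returns a list in which entry n maps each frequent n-gram (words joined
     * by a single space) to its count. Entry 0 is an unused empty map so that
     * the index equals the n-gram size.
     */
    static List<Map<String, Integer>> extract(Collection<String[]> sentences,
                                              int maxNGramSize, int minFrequency) {
        List<Map<String, Integer>> levels = new ArrayList<>();
        levels.add(new HashMap<>());

        for (int n = 1; n <= maxNGramSize; n++) {
            Map<String, Integer> previous = levels.get(n - 1);
            Map<String, Integer> counts = new HashMap<>();

            for (String[] sentence : sentences) {
                for (int i = 0; i + n <= sentence.length; i++) {
                    // Apriori pruning: skip an n-gram unless both its leading and
                    // trailing (n-1)-grams were frequent at the previous level.
                    if (n > 1
                            && !(previous.containsKey(join(sentence, i, n - 1))
                                 && previous.containsKey(join(sentence, i + 1, n - 1)))) {
                        continue;
                    }
                    counts.merge(join(sentence, i, n), 1, Integer::sum);
                }
            }

            // Keep only the n-grams that reach the frequency threshold.
            counts.values().removeIf(c -> c < minFrequency);
            levels.add(counts);
        }
        return levels;
    }

    // Joins words[start .. start + length) with single spaces.
    // Assumes individual words contain no spaces.
    private static String join(String[] words, int start, int length) {
        return String.join(" ", Arrays.copyOfRange(words, start, start + length));
    }

    public static void main(String[] args) {
        List<String[]> sentences = Arrays.asList(
                new String[] {"the", "quick", "fox"},
                new String[] {"the", "quick", "dog"},
                new String[] {"a", "quick", "fox"});
        List<Map<String, Integer>> levels = extract(sentences, 3, 2);
        for (int n = 1; n < levels.size(); n++) {
            System.out.println(n + "-grams: " + levels.get(n));
        }
    }
}
```

      With Smile on the classpath, the corresponding library call for the same input would be NGram.of(sentences, 3, 2), which returns NGram[][] rather than the maps used in this sketch.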