Class BagOfWords

java.lang.Object
smile.feature.extraction.BagOfWords
All Implemented Interfaces:
Serializable, Function<Tuple,Tuple>, Transform

public class BagOfWords extends Object implements Transform
The bag-of-words feature of text used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.
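The order-insensitivity described above can be illustrated with a small, self-contained sketch in plain Java. It does not use this class; `WordBagDemo` and `bagOf` are hypothetical names introduced only for illustration, and the whitespace tokenization is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

class WordBagDemo {
    /** Counts how often each lowercase, whitespace-separated token occurs in the text. */
    static Map<String, Integer> bagOf(String text) {
        Map<String, Integer> bag = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            bag.merge(token, 1, Integer::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = bagOf("the dog bit the man");
        Map<String, Integer> b = bagOf("the man bit the dog");
        System.out.println(a);           // {bit=1, dog=1, man=1, the=2}
        System.out.println(a.equals(b)); // true: word order is discarded
    }
}
```

The two sentences mean different things, yet their bags are equal, which is exactly the information the model gives up.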
  • Constructor Details

    • BagOfWords

      public BagOfWords(Function<String,String[]> tokenizer, String[] words)
      Constructor.
      Parameters:
      tokenizer - the tokenizer of text, which may include additional processing such as filtering stop words, converting to lowercase, stemming, etc.
      words - the list of feature words.
    • BagOfWords

      public BagOfWords(String[] columns, Function<String,String[]> tokenizer, String[] words, boolean binary)
      Constructor.
      Parameters:
      columns - the input text fields in a data frame.
      tokenizer - the tokenizer of text, which may include additional processing such as filtering stop words, converting to lowercase, stemming, etc.
      words - the list of feature words. Each word should appear only once in the list. Note that this constructor does not learn the feature words from data; it uses them as given.
      binary - if true, each feature indicates whether the word appears in the text rather than how often it appears.
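As a rough illustration of how the tokenizer, words, and binary parameters interact, here is a minimal sketch in plain Java. It is not this class's implementation: `BagVectorSketch`, `tokenize`, `bag`, and the stop-word list are hypothetical, and the assumption is that tokens outside the feature list are simply ignored.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class BagVectorSketch {
    // A tiny, hypothetical stop-word list used only for this example.
    static final Set<String> STOP_WORDS = Set.of("the", "and", "a", "of");

    /** Example tokenizer: lowercases, splits on non-letters, and drops stop words. */
    static String[] tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z]+"))
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
                .toArray(String[]::new);
    }

    /** Returns a vector aligned with words: counts, or 0/1 flags when binary is true. */
    static int[] bag(Function<String, String[]> tokenizer, String[] words,
                     boolean binary, String text) {
        // Map each feature word to its position in the output vector.
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            index.put(words[i], i);
        }

        int[] x = new int[words.length];
        for (String token : tokenizer.apply(text)) {
            Integer i = index.get(token);
            if (i == null) continue; // assumption: unknown tokens are ignored
            x[i] = binary ? 1 : x[i] + 1;
        }
        return x;
    }

    public static void main(String[] args) {
        String[] words = {"cat", "dog", "fish"};
        String text = "The cat chased the other cat and the dog";
        System.out.println(Arrays.toString(bag(BagVectorSketch::tokenize, words, false, text))); // [2, 1, 0]
        System.out.println(Arrays.toString(bag(BagVectorSketch::tokenize, words, true, text)));  // [1, 1, 0]
    }
}
```

The sketch also shows why the feature words should be unique: a duplicate would overwrite the earlier index entry and leave one vector position permanently zero.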
  • Method Details

    • features

      public String[] features()
      Returns the feature words.
      Returns:
      the feature words.
    • fit

      public static BagOfWords fit(DataFrame data, Function<String,String[]> tokenizer, int k, String... columns)
      Learns a vocabulary of the top-k most frequent tokens in the raw documents.
      Parameters:
      data - training data.
      tokenizer - the tokenizer of text, which may include additional processing such as filtering stop words, converting to lowercase, stemming, etc.
      k - the maximum vocabulary size.
      columns - the text columns.
      Returns:
      the model.
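The idea of learning a top-k vocabulary can be sketched in plain Java as follows. This is only an illustration, not the library's implementation: it works on a list of strings rather than a DataFrame, `VocabularySketch` and `topK` are hypothetical names, and the alphabetical tie-breaking is an assumption made to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class VocabularySketch {
    /** Returns up to k tokens, most frequent first (ties broken alphabetically). */
    static String[] topK(List<String> documents, Function<String, String[]> tokenizer, int k) {
        // Count token occurrences across all documents.
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (String token : tokenizer.apply(doc)) {
                counts.merge(token, 1, Integer::sum);
            }
        }

        // Sort by descending count, then by token (an assumption of this sketch).
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Keep at most k tokens; fewer if the corpus has a smaller vocabulary.
        int n = Math.min(k, entries.size());
        String[] vocabulary = new String[n];
        for (int i = 0; i < n; i++) {
            vocabulary[i] = entries.get(i).getKey();
        }
        return vocabulary;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat sat", "the cat ran", "a dog ran");
        System.out.println(Arrays.toString(topK(docs, s -> s.split("\\s+"), 3))); // [cat, ran, the]
    }
}
```

Because k is only an upper bound, a small corpus yields a shorter vocabulary than requested.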
    • apply

      public Tuple apply(Tuple x)
      Specified by:
      apply in interface Function<Tuple,Tuple>
    • apply

      public int[] apply(String text)
      Returns the bag-of-words features of a document.
      Parameters:
      text - a document.
      Returns:
      the feature vector.