Package smile.feature.extraction
Class BagOfWords
java.lang.Object
smile.feature.extraction.BagOfWords
- All Implemented Interfaces:
Serializable, Function<Tuple,Tuple>, Transform
The bag-of-words feature of text used in natural language
processing and information retrieval. In this model, a text
(such as a sentence or a document) is represented as an
unordered collection of words, disregarding grammar and
even word order.
-
Constructor Summary
-
Method Summary
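Modifier and Type | Method | Description
int[] | apply(String text) | Returns the bag-of-words features of a document.
String[] | features() | Returns the feature words.
static BagOfWords | fit(DataFrame data, Function<String, String[]> tokenizer, int k, String... columns) | Learns a vocabulary dictionary of top-k frequent tokens in the raw documents.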
-
Constructor Details
-
BagOfWords
Constructor.
- Parameters:
tokenizer - the tokenizer of text, which may include additional processing such as filtering stop words, converting to lowercase, stemming, etc.
words - the list of feature words.
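The tokenizer parameter is a plain Function<String, String[]>, so any lambda with that shape can be passed. The following is a minimal sketch of such a tokenizer, assuming a hand-picked stop-word list and a simple split on non-letter characters. The class name TokenizerSketch and the stop words are hypothetical and are not part of Smile.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

public class TokenizerSketch {
    // Hypothetical stop-word list, chosen only for this illustration.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and");

    // Lowercases the text, splits on runs of non-letter characters,
    // and drops empty tokens and stop words.
    static final Function<String, String[]> TOKENIZER = text ->
        Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z]+"))
              .filter(token -> !token.isEmpty())
              .filter(token -> !STOP_WORDS.contains(token))
              .toArray(String[]::new);

    public static void main(String[] args) {
        String[] tokens = TOKENIZER.apply("The cat and the hat");
        System.out.println(Arrays.toString(tokens)); // prints [cat, hat]
    }
}
```

Stemming is left out of the sketch; it could be added as a further map step on the token stream.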
-
BagOfWords
public BagOfWords(String[] columns, Function<String, String[]> tokenizer, String[] words, boolean binary)
Constructor.
- Parameters:
columns - the input text fields in a data frame.
tokenizer - the tokenizer of text, which may include additional processing such as filtering stop words, converting to lowercase, stemming, etc.
words - the list of feature words. The feature words should be unique in the list. Note that this class does not learn the features; it only uses them as attributes.
binary - if true, each feature indicates whether the word appears in the document rather than how often it occurs.
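The difference between frequency and binary features can be illustrated with a small standalone sketch. It is not Smile's implementation: the class BagOfWordsSketch and the method vectorize are hypothetical names, and the sketch assumes that tokens outside the feature word list are simply ignored.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class BagOfWordsSketch {
    // Maps a document to a vector aligned with `words`: entry i is the count
    // of words[i] in the document, or a 0/1 presence flag when `binary` is true.
    static int[] vectorize(String text, Function<String, String[]> tokenizer,
                           String[] words, boolean binary) {
        // Index each feature word by its position in the list.
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            index.put(words[i], i);
        }

        int[] bag = new int[words.length];
        for (String token : tokenizer.apply(text)) {
            Integer i = index.get(token);
            if (i == null) continue; // assumption: unknown tokens are ignored
            if (binary) bag[i] = 1; else bag[i]++;
        }
        return bag;
    }

    public static void main(String[] args) {
        Function<String, String[]> tokenizer =
            s -> s.toLowerCase(Locale.ROOT).split("\\s+");
        String[] words = {"cat", "hat", "dog"};
        String text = "the cat saw the other cat in a hat";

        System.out.println(Arrays.toString(vectorize(text, tokenizer, words, false))); // prints [2, 1, 0]
        System.out.println(Arrays.toString(vectorize(text, tokenizer, words, true)));  // prints [1, 1, 0]
    }
}
```

The position-based index is also why the feature words should be unique: a duplicated word would map to a single position and leave the other one permanently zero.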
-
-
Method Details
-
features
Returns the feature words.
- Returns:
the feature words.
-
fit
public static BagOfWords fit(DataFrame data, Function<String, String[]> tokenizer, int k, String... columns)
Learns a vocabulary dictionary of top-k frequent tokens in the raw documents.
- Parameters:
data - training data.
tokenizer - the tokenizer of text, which may include additional processing such as filtering stop words, converting to lowercase, stemming, etc.
k - the limit of vocabulary size.
columns - the text columns.
- Returns:
the model.
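Learning a top-k vocabulary amounts to counting token frequencies across all documents and keeping the k most frequent tokens. The sketch below shows that idea on a plain list of strings rather than a DataFrame. The names VocabularySketch and topK are hypothetical, and the alphabetical tie-breaking is an assumption of this sketch; Smile's fit may order ties differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class VocabularySketch {
    // Returns up to k tokens with the highest total frequency across documents.
    static String[] topK(List<String> documents,
                         Function<String, String[]> tokenizer, int k) {
        // Count how often each token occurs over the whole corpus.
        Map<String, Integer> freq = new HashMap<>();
        for (String doc : documents) {
            for (String token : tokenizer.apply(doc)) {
                freq.merge(token, 1, Integer::sum);
            }
        }

        // Sort by descending count; ties are broken alphabetically
        // (an assumption made here to keep the result deterministic).
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        int size = Math.min(k, entries.size());
        String[] vocabulary = new String[size];
        for (int i = 0; i < size; i++) {
            vocabulary[i] = entries.get(i).getKey();
        }
        return vocabulary;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("red green blue", "red green", "red");
        String[] vocabulary = topK(docs, s -> s.split("\\s+"), 2);
        System.out.println(Arrays.toString(vocabulary)); // prints [red, green]
    }
}
```

The resulting array plays the role of the words argument of the constructor: a fixed, duplicate-free list of feature words.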
-
apply
-
apply
Returns the bag-of-words features of a document.
- Parameters:
text - a document.
- Returns:
the feature vector.
-