Package smile.feature.extraction
Class HashEncoder
java.lang.Object
smile.feature.extraction.HashEncoder
All Implemented Interfaces:
Function<String, SparseArray>
Feature hashing, also known as the hashing trick, is a fast and
space-efficient way of vectorizing features, i.e. turning arbitrary
features (mostly text) into indices in a vector. It works by applying
a hash function to the features and using their hash values as indices
directly, rather than looking the indices up in an associative array.
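The idea can be sketched in a few lines of plain Java. This is a minimal illustration, not the HashEncoder implementation: the class name HashingTrickSketch, the encode method, and the use of String.hashCode as the hash function are all assumptions made for the example, and a sorted map stands in for SparseArray.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative feature hashing; hypothetical, not the Smile implementation. */
class HashingTrickSketch {
    /**
     * Maps each token to a bucket in [0, numFeatures) by hashing it and
     * counts how many tokens land in each bucket.
     */
    static Map<Integer, Double> encode(String[] tokens, int numFeatures) {
        Map<Integer, Double> features = new TreeMap<>();
        for (String token : tokens) {
            // Assumption: String.hashCode stands in for the real hash function.
            // floorMod keeps the index non-negative even for negative hashes.
            int index = Math.floorMod(token.hashCode(), numFeatures);
            features.merge(index, 1.0, Double::sum);
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = "the quick brown fox jumps over the lazy dog".split(" ");
        // Prints the non-zero buckets and their counts.
        System.out.println(encode(tokens, 16));
    }
}
```

No vocabulary is stored at any point. Distinct tokens that hash to the same bucket simply share a count, which is the collision cost of the approach.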
Constructor Summary
Constructor and Description
HashEncoder(Function<String, String[]> tokenizer, int numFeatures)
    Constructor.
HashEncoder(Function<String, String[]> tokenizer, int numFeatures, boolean alternateSign)
    Constructor.
Method Summary
Modifier and Type, Method, and Description
SparseArray apply(String text)
    Returns the bag-of-words features of a document.
Constructor Details
HashEncoder(Function<String, String[]> tokenizer, int numFeatures)
Constructor.
Parameters:
tokenizer - the text tokenizer, which may include additional processing such as filtering stop words, converting to lowercase, stemming, etc.
numFeatures - the number of features in the output space. A small number of features is likely to cause hash collisions, while a large number increases the coefficient dimension in linear learners.
HashEncoder(Function<String, String[]> tokenizer, int numFeatures, boolean alternateSign)
Constructor.
Parameters:
tokenizer - the text tokenizer, which may include additional processing such as filtering stop words, converting to lowercase, stemming, etc.
numFeatures - the number of features in the output space. A small number of features is likely to cause hash collisions, while a large number increases the coefficient dimension in linear learners.
alternateSign - if true, an alternating sign is applied to the features so as to approximately preserve the inner product in the hashed space, even for a small number of features. This approach is similar to sparse random projection.
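One common way to realize an alternating sign is to let one bit of the hash decide whether a token contributes +1 or -1, so that colliding tokens tend to cancel rather than accumulate. The sketch below assumes that scheme. The class SignedHashingSketch, its encode method, and the use of the sign of String.hashCode are hypothetical and may differ from what HashEncoder does internally.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative signed feature hashing; hypothetical, not the Smile implementation. */
class SignedHashingSketch {
    /**
     * Hashes each token to a bucket in [0, numFeatures). When alternateSign
     * is true, tokens with a negative hash contribute -1 instead of +1.
     */
    static Map<Integer, Double> encode(String[] tokens, int numFeatures,
                                       boolean alternateSign) {
        Map<Integer, Double> features = new TreeMap<>();
        for (String token : tokens) {
            // Assumption: String.hashCode stands in for the real hash function.
            int hash = token.hashCode();
            int index = Math.floorMod(hash, numFeatures);
            // Assumption: the sign bit of the hash selects the sign of the value.
            double value = (alternateSign && hash < 0) ? -1.0 : 1.0;
            features.merge(index, value, Double::sum);
        }
        return features;
    }
}
```

With alternateSign set to false, the sketch reduces to plain bucket counts. With it set to true, the bucket indices are unchanged and only the contributions differ in sign.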
Method Details
apply
Returns the bag-of-words features of a document.
Specified by:
apply in interface Function<String, SparseArray>
Parameters:
text - a document.
Returns:
the sparse feature vector.