Class HashEncoder

java.lang.Object
smile.feature.extraction.HashEncoder
All Implemented Interfaces:
Function<String,SparseArray>

public class HashEncoder extends Object implements Function<String,SparseArray>
Feature hashing, also known as the hashing trick, is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features (mostly text) into indices in a vector. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array.
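The idea can be illustrated with a minimal, self-contained sketch. It does not use or reproduce this class's implementation: the class name HashingTrickSketch, the whitespace tokenizer, and the use of String.hashCode() as the hash function are all assumptions made for illustration only.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch of feature hashing; not Smile's implementation. */
final class HashingTrickSketch {
    /**
     * Maps each whitespace-separated token to a bucket in [0, numFeatures)
     * and counts how many tokens land in each bucket.
     */
    static Map<Integer, Double> encode(String text, int numFeatures) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            // Assumption: String.hashCode() stands in for the real hash function.
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(token.hashCode(), numFeatures);
            // The hash value is the index; no vocabulary lookup table is needed.
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Prints the sparse vector as index=count pairs.
        System.out.println(encode("the quick brown fox jumps over the lazy dog", 16));
    }
}
```

Because no dictionary is built, the sketch needs no training pass over the corpus, at the cost that distinct tokens may share an index.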
  • Constructor Details

    • HashEncoder

      public HashEncoder(Function<String,String[]> tokenizer, int numFeatures)
      Constructor.
      Parameters:
      tokenizer - the tokenizer of text, which may include additional processing such as filtering stop words, converting to lowercase, stemming, etc.
      numFeatures - the number of features in the output space. A small value makes hash collisions more likely, while a large value increases the dimension of the coefficient vectors in linear learners.
    • HashEncoder

      public HashEncoder(Function<String,String[]> tokenizer, int numFeatures, boolean alternateSign)
      Constructor.
      Parameters:
      tokenizer - the tokenizer of text, which may include additional processing such as filtering stop words, converting to lowercase, stemming, etc.
      numFeatures - the number of features in the output space. A small value makes hash collisions more likely, while a large value increases the dimension of the coefficient vectors in linear learners.
      alternateSign - if true, an alternating sign is applied to the features so as to approximately conserve the inner product in the hashed space, even for a small number of features. This approach is similar to sparse random projection.
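
The effect of an alternating sign can be sketched as follows. This is an illustration under stated assumptions, not this class's implementation: the class name SignedHashingSketch is hypothetical, the hash is String.hashCode() passed through the MurmurHash3 32-bit finalizer as a stand-in, and taking the sign from the top bit of the hash is one common choice rather than a documented behavior of HashEncoder.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch of signed feature hashing; not Smile's implementation. */
final class SignedHashingSketch {
    /** Bit mixer (the MurmurHash3 32-bit finalizer), used here as a stand-in hash. */
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    static Map<Integer, Double> encode(String[] tokens, int numFeatures, boolean alternateSign) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : tokens) {
            int h = mix(token.hashCode());
            // The low 31 bits choose the bucket.
            int index = (h & 0x7fffffff) % numFeatures;
            // Assumption: the top bit of the hash chooses the sign.
            double value = (alternateSign && h < 0) ? -1.0 : 1.0;
            // Colliding tokens with opposite signs partially cancel instead of piling up.
            vector.merge(index, value, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] tokens = {"to", "be", "or", "not", "to", "be"};
        System.out.println(encode(tokens, 4, false)); // plain counts per bucket
        System.out.println(encode(tokens, 4, true));  // signed sums per bucket
    }
}
```

With signs enabled, the contribution of a collision between two unrelated tokens is positive or negative with roughly equal probability, so collisions tend to average out in inner products rather than inflate them.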
  • Method Details