Package smile.hash

Interface SimHash<T>

Type Parameters:
T - the data type of input objects.

public interface SimHash<T>
SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near duplicate pages.
  • Method Summary

    Modifier and Type
    Method
    Description
    long
    hash(T x)
    Return the hash code.
    static SimHash<int[]>
    of(byte[][] features)
    Returns the SimHash for a set of generic features (represented as byte[]).
    static SimHash<String[]>
    Returns the SimHash for string tokens.
  • Method Details

    • hash

      long hash(T x)
      Return the hash code.
      Parameters:
      x - the object.
      Returns:
      the hash code.
    • of

      static SimHash<int[]> of(byte[][] features)
      Returns the SimHash for a set of generic features (represented as byte[]).
      Parameters:
      features - the generic features.
      Returns:
      the SimHash.
    • text

      static SimHash<String[]> text()
      Returns the SimHash for string tokens.
      Returns:
      the SimHash for string tokens.