Class SNLSH<K,V>

All Implemented Interfaces:
Serializable, RNNSearch<K,V>

public class SNLSH<K,V> extends Object implements RNNSearch<K,V>, Serializable
Locality-Sensitive Hashing for Signatures. LSH is an efficient algorithm for approximate nearest neighbor search in high dimensional spaces by performing probabilistic dimension reduction of data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). If we are given signatures for the sets, we may divide them into bands, and only measure the similarity of a pair of sets if they are identical in at least one band. By choosing the size of bands appropriately, we can eliminate most of the pairs that do not meet our threshold of similarity.

By default, the query object (reference equality) is excluded from the neighborhood. Note that you may observe weird behavior with String objects. JVM will pool the string literal objects. So the below variables String a = "ABC"; String b = "ABC"; String c = "AB" + "C"; are actually equal in reference test a == b == c. With toy data that you type explicitly in the code, this will cause problems. Fortunately, the data would be generally read from secondary storage in production.


  1. Moses S. Charikar. Similarity Estimation Techniques from Rounding Algorithms
See Also:
  • Constructor Details

    • SNLSH

      public SNLSH(int L, SimHash<K> hash)
      L - the number of bands/hash tables.
      hash - simhash function.
  • Method Details

    • put

      public void put(K key, V value)
      Adds a new item.
      key - the key.
      value - the value.
    • search

      public void search(K q, double radius, List<Neighbor<K,V>> neighbors)
      Description copied from interface: RNNSearch
      Retrieves the neighbors in a fixed radius of query object, i.e. d(q, v) <= radius.
      Specified by:
      search in interface RNNSearch<K,V>
      q - the query key.
      radius - the radius of search range from target.
      neighbors - the list to store found neighbors on output.