Class PerfectHash

java.lang.Object
smile.hash.PerfectHash
All Implemented Interfaces:
Serializable

public class PerfectHash extends Object implements Serializable
A perfect hash of an array of strings to their index in the array.

What is a perfect hash?

A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers with no collisions. In mathematical terms, it is an injective function. Perfect hash functions may be used to implement a lookup table with constant worst-case access time.

Hash function

Given a keyword w of length L, the hash value is computed as:
  h(w) = (L + sum_i kvals[w[i] - min]) mod tsize
where
  • w[i] is the i-th character of w (or only the characters at the user-supplied select positions when provided),
  • min is the smallest character value that appears in any keyword,
  • kvals is an integer array indexed by (character - min), and
  • tsize is the size of the hash table.
Lookup is O(L) in the length of the query string and O(1) in the number of keywords.

Construction algorithm

The construction uses a randomized character-value approach:
  1. Character frequency analysis. The frequency of every character across all keywords is counted. Keywords are sorted in descending order of their aggregate character frequency so that the most-constrained keys are placed first.
  2. Table-size selection. The initial table size tsize is set to ceil(1.5 * n), where n is the number of keywords, giving a load factor of about 2/3.
  3. Random assignment. Each entry of kvals is assigned a uniformly random integer in [0, tsize). All n hash values h(w) mod tsize are then checked for collisions.
  4. Retry. If any two keywords collide, a new random kvals assignment is drawn and step 3 is repeated. At load factor 2/3 the probability that a random assignment is collision-free is at least e^{-n^2/(2*tsize)} ≈ e^{-3n/4}, so the expected number of trials is O(1) per construction.
  5. Table growth. If no collision-free assignment is found within 1 000 attempts, tsize is incremented by one and the random search restarts from scratch. This ensures guaranteed termination.
  6. Table construction. Once a collision-free kvals is found, the final lookup table of size tsize is filled with keyword indices. Slots that correspond to no keyword are left as -1.
Overall expected construction time is O(n * |alphabet|).
See Also:
  • Constructor Details

    • PerfectHash

      public PerfectHash(String... keywords)
      Constructor.
      Parameters:
      keywords - the fixed set of keywords.
    • PerfectHash

      public PerfectHash(int[] select, String... keywords)
      Constructor.
      Parameters:
      select - The character positions in keywords used to calculate the hash.
      keywords - the fixed set of keywords.
  • Method Details

    • get

      public int get(String key)
      Returns the index of a keyword. If the string is not in the set, returns -1.
      Parameters:
      key - the keyword.
      Returns:
      the index of keyword or -1.