smile.neighbor

package smile.neighbor

Nearest neighbor search. Nearest neighbor search is an optimization problem for finding the closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q.

The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(Nd) where N is the cardinality of S and d is the dimensionality of M. There are no search data structures to maintain, so linear search has no space complexity beyond the storage of the database. Surprisingly, naive search outperforms space partitioning approaches on higher dimensional spaces.

Space partitioning, a branch and bound methodology, has been applied to the problem. The simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension query time, average complexity is O(log N) in the case of randomly distributed points. Alternatively the R-tree data structure was designed to support nearest neighbor search in dynamic context, as it has efficient algorithms for insertions and deletions.

In case of general metric space branch and bound approach is known under the name of metric trees. Particular examples include vp-tree and BK-tree.

Locality sensitive hashing (LSH) is a technique for grouping points in space into 'buckets' based on some distance metric operating on the points. Points that are close to each other under the chosen metric are mapped to the same bucket with high probability.

The cover tree has a theoretical bound that is based on the dataset's doubling constant. The bound on search time is O(c12 log n) where c is the expansion constant of the dataset.

Related Packages

Package

Description

smile.neighbor.lsh

LSH internal classes.
Class

Description

BKTree<K,V>

A BK-tree is a metric tree specifically adapted to discrete metric spaces.

CoverTree<K,V>

Cover tree is a data structure for generic nearest neighbor search, which is especially efficient in spaces with small intrinsic dimension.

KDTree<E>

A KD-tree (short for k-dimensional tree) is a space-partitioning dataset structure for organizing points in a k-dimensional space.

KNNSearch<K,V>

Retrieves the top k nearest neighbors to the query.

LinearSearch<K,V>

Brute force linear nearest neighbor search.

LSH<E>

Locality-Sensitive Hashing.

MPLSH<E>

Multi-Probe Locality-Sensitive Hashing.

MutableLSH<E>

Mutable LSH.

Neighbor<K,V>

The immutable object encapsulates the results of nearest neighbor search.

RandomProjectionForest

A set of random projection trees.

RandomProjectionTree

Random projection trees.

RNNSearch<K,V>

Retrieves the nearest neighbors to a query in a radius.

SNLSH<K,V>

Locality-Sensitive Hashing for Signatures.

Package smile.neighbor