Randomized Algorithms, Lecture 12: Nearest Neighbor Search and Locality Sensitive Hashing
Nearest Neighbor Search
2D scenario: Voronoi Diagram
3D or higher dimensions: not efficient; the complexity of the Voronoi diagram grows exponentially with the dimension.
Fine-grained complexity: there is evidence that exact high-dimensional nearest neighbor search cannot be solved much faster than brute force.
Approximate nearest neighbor search
Assume the distance between the query point $q$ and its nearest neighbor is $R$. A $c$-approximate $R$-near neighbor is any point within distance $cR$ of $q$.
- input: query point $q$, constant $c > 1$, radius $R$.
- if there is a point $p^*$ with $dist(q, p^*) \le R$, then return some point $p$ with $dist(q, p) \le cR$, with probability at least $1 - \delta$.
Locality Sensitive Hashing
abbr. LSH
Distance metric $dist(\cdot,\cdot)$: Euclidean, Manhattan, Hamming, …
A hash family $\mathcal H$ is $(R, cR, P_1, P_2)$-sensitive w.r.t. $dist$ if, for $h$ drawn uniformly at random from $\mathcal H$:
- if $dist(p, q) \le R$ then $\Pr[h(p) = h(q)] \ge P_1$.
- if $dist(p, q) \ge cR$ then $\Pr[h(p) = h(q)] \le P_2$.
- $P_1 > P_2$.
Store all points $p \in P$ in a hash table with key $h(p)$, where $h$ is $(R, cR, P_1, P_2)$-sensitive.
Each bucket $hm[v]$ is a linked list of all points $p$ with $h(p) = v$:
```java
HashMap<Integer, LinkedList<Point>> hm = new HashMap<>();  // bucket v -> points p with h(p) = v
```
Query:
```java
Point Query(Point q) { ... }  // scan bucket hm.get(h(q)); return any p with dist(p, q) <= cR
```
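A minimal runnable sketch of this single-table scheme, completing the snippet above (all names and constants are illustrative assumptions; bit sampling over $\{0,1\}^d$ serves as the sensitive hash):

```java
import java.util.*;

// Single hash table keyed by an (R, cR, P1, P2)-sensitive hash h;
// query scans only the bucket of h(q). Names/constants are hypothetical.
class SingleTableLSH {
    static final int D = 16;                 // assumed dimension of the bit vectors
    static final double C = 2.0, R = 2.0;    // assumed approximation factor and radius
    final int bit = new Random().nextInt(D); // bit sampling: h(p) = p[bit]
    final Map<Integer, LinkedList<int[]>> hm = new HashMap<>();

    int h(int[] p) { return p[bit]; }

    static int dist(int[] p, int[] q) {      // Hamming distance
        int d = 0;
        for (int i = 0; i < p.length; i++) if (p[i] != q[i]) d++;
        return d;
    }

    void insert(int[] p) {
        hm.computeIfAbsent(h(p), k -> new LinkedList<>()).add(p);
    }

    int[] query(int[] q) {                   // return some cR-near neighbor, or null
        for (int[] p : hm.getOrDefault(h(q), new LinkedList<>()))
            if (dist(p, q) <= C * R) return p;
        return null;
    }
}
```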
Bit sampling for Hamming distance: let $h_i(p)$ be $1$ if $p_i = 1$ and $0$ otherwise, i.e. $h_i(p) = p_i$ for a uniformly random coordinate $i$; then $\Pr[h_i(p) = h_i(q)] = 1 - dist(p,q)/d$.
Parameter $k$: $g(p) = (h_1(p), \dots, h_k(p))$ is the concatenation of $k$ bits of $p$, where $h_1, \dots, h_k$ are independent.
For $dist(p,q) \le R$: $\Pr[g(p) = g(q)] \ge P_1^k$.
For $dist(p,q) \ge cR$: $\Pr[g(p) = g(q)] \le P_2^k$.
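A hypothetical numeric illustration of the gap amplification: with $P_1 = 0.9$, $P_2 = 0.5$, and $k = 10$,

$$
P_1^k = 0.9^{10} \approx 0.35, \qquad P_2^k = 0.5^{10} \approx 0.001,
$$

so the ratio of collision probabilities grows from $1.8$ to roughly $357$, at the price of a much smaller success probability for near points.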
Choosing $k = \log_{1/P_2} n$ makes $P_2^k = 1/n$, so in expectation at most one far point collides with $q$; but then the success probability for a near point is only $P_1^k = n^{-\rho}$, where $\rho = \frac{\ln(1/P_1)}{\ln(1/P_2)}$.
Better probability:
For parameters $k$ and $L$: for each $j = 1, \dots, L$, $g_j(p) = (h_{1,j}(p), \dots, h_{k,j}(p))$ is the concatenation of $k$ bits of $p$; all the $h_{i,j}$ are independent.
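A sketch of how the $g_j$ might be built for bit vectors (hypothetical names; each $g_j$ concatenates $k$ independently sampled coordinates):

```java
import java.util.*;

// g_j(p) concatenates the k sampled bits h_{1,j}(p), ..., h_{k,j}(p);
// all sampled coordinates are chosen independently and uniformly at random.
class AmplifiedHash {
    final int[][] coords;  // coords[j] = the k coordinates used by g_j

    AmplifiedHash(int d, int k, int L, Random rnd) {
        coords = new int[L][k];
        for (int j = 0; j < L; j++)
            for (int i = 0; i < k; i++)
                coords[j][i] = rnd.nextInt(d);  // h_{i,j}: one random coordinate
    }

    // g_j(p): pack the k sampled bits into a single int key (assumes k <= 31).
    int g(int j, int[] p) {
        int key = 0;
        for (int c : coords[j]) key = (key << 1) | p[c];
        return key;
    }
}
```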
Distance metric: Hamming distance.
For $i = 1, \dots, L$: store each point $p$ in hash table $T_i$ under key $g_i(p)$.
Query(q):
For $i = 1, \dots, L$:
scan list $T_i[g_i(q)]$:
if we see $p$ with $dist(p, q) \le cR$
- return p
if we have scanned $3L$ points
- abort
Space: $O(nL)$ (pointers in $L$ tables), plus the $O(nd)$ points themselves.
Query time: $O(L(k\,t_h + d))$, where $t_h$ is the time to evaluate a single hash function $h$: $L$ evaluations of $g_i$ at cost $k\,t_h$ each, plus up to $3L$ distance computations at $O(d)$ each.
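Putting the pieces together, a self-contained sketch of the whole data structure under the same assumptions (bit vectors, hypothetical names; the $3L$ abort rule is the one from the pseudocode above):

```java
import java.util.*;

// L hash tables; table i is keyed by g_i (k sampled bits). Query scans
// buckets T_i[g_i(q)] and gives up after inspecting 3L points.
class HammingLSH {
    final int d, k, L; final double cR;
    final int[][] coords;                          // coords[i] = bits used by g_i
    final List<Map<Integer, List<int[]>>> tables = new ArrayList<>();

    HammingLSH(int d, int k, int L, double cR, Random rnd) {
        this.d = d; this.k = k; this.L = L; this.cR = cR;
        coords = new int[L][k];
        for (int i = 0; i < L; i++) {
            tables.add(new HashMap<>());
            for (int j = 0; j < k; j++) coords[i][j] = rnd.nextInt(d);
        }
    }

    int g(int i, int[] p) {                        // concatenation of k sampled bits
        int key = 0;
        for (int c : coords[i]) key = (key << 1) | p[c];
        return key;
    }

    void insert(int[] p) {
        for (int i = 0; i < L; i++)
            tables.get(i).computeIfAbsent(g(i, p), x -> new ArrayList<>()).add(p);
    }

    int[] query(int[] q) {                         // some cR-near neighbor, or null
        int scanned = 0;
        for (int i = 0; i < L; i++)
            for (int[] p : tables.get(i).getOrDefault(g(i, q), List.of())) {
                if (hamming(p, q) <= cR) return p;
                if (++scanned >= 3 * L) return null;  // interrupt after 3L points
            }
        return null;
    }

    int hamming(int[] p, int[] q) {
        int dist = 0;
        for (int i = 0; i < d; i++) if (p[i] != q[i]) dist++;
        return dist;
    }
}
```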
Probability Bound for Failure
Two types of failure, where $p^*$ is an $R$-near neighbor of $q$ (assuming one exists):
- $E_1 = \{\forall i: g_i(q) \neq g_i(p^*)\}$: the near neighbor $p^*$ collides with $q$ in no table.
- $E_2 = \left\{\sum_{i=1}^L \left|\{p : dist(p,q) > cR \wedge g_i(q) = g_i(p)\}\right| > 3L\right\}$: more than $3L$ far points collide with $q$.
If neither event happens, the result is correct.
We want $\Pr[E_1] + \Pr[E_2]$ to be a constant bounded away from $1$.
$$
\begin{aligned}
\Pr[E_2] &\le \frac{\mathbb E\left[\sum_{i=1}^L \left|\{p : dist(p,q) > cR \wedge g_i(q) = g_i(p)\}\right|\right]}{3L}\\
&= \frac{\sum_{i=1}^L \sum_{p:\, dist(p,q) > cR} \Pr[g_i(q) = g_i(p)]}{3L}
\le \frac{\sum_{i=1}^L n P_2^k}{3L} \le \frac{L}{3L} = \frac13\\
\Pr[E_1] &= \Pr[\forall i: g_i(q) \neq g_i(p^*)]
= \prod_{i=1}^L \Pr[g_i(q) \neq g_i(p^*)]\\
&= \prod_{i=1}^L \left(1 - \Pr[g_i(q) = g_i(p^*)]\right) \le (1 - P_1^k)^L
\end{aligned}
$$
(The first bound is Markov's inequality; the middle step uses $k = \log_{1/P_2} n$, so $nP_2^k \le 1$.)
Choosing $L = n^\rho$ with $\rho = \frac{\ln(1/P_1)}{\ln(1/P_2)}$, we get a constant bound on the failure probability.
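Sketch of the arithmetic behind this choice (assuming $k = \log_{1/P_2} n$ and $L = n^\rho$):

$$
P_1^k = P_1^{\log_{1/P_2} n} = n^{-\frac{\ln(1/P_1)}{\ln(1/P_2)}} = n^{-\rho},
\qquad
\Pr[E_1] \le \left(1 - n^{-\rho}\right)^{n^\rho} \le e^{-1},
$$

so the total failure probability is at most $1/3 + 1/e \approx 0.70 < 1$, and it can be made arbitrarily small by independent repetitions of the whole structure.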
Upper bound for L
For Hamming distance: $P_1 = 1 - R/d$ and $P_2 = 1 - cR/d$, which gives $\rho \le 1/c$ and hence $L = O(n^{1/c})$; see the derivation below.
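The bound $\rho \le 1/c$ follows from Bernoulli's inequality $(1-x)^c \ge 1 - cx$:

$$
\rho = \frac{\ln(1/P_1)}{\ln(1/P_2)} = \frac{\ln(1 - R/d)}{\ln(1 - cR/d)} \le \frac1c
\iff c \ln\!\left(1 - \frac Rd\right) \ge \ln\!\left(1 - \frac{cR}d\right)
\iff \left(1 - \frac Rd\right)^{c} \ge 1 - \frac{cR}d,
$$

where the direction of the first inequality flips because $\ln(1 - cR/d) < 0$; hence $L = n^\rho \le n^{1/c}$.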
Digest from Reference Paper
Efficient algorithms for the approximate and exact nearest neighbor problem.
Goal: Preprocess a dataset of objects. Later, given a new query object, one can quickly return the dataset object that is most similar to the query.
Nearest Neighbor Problem
Given a collection of points, build a data structure which, given any query point, reports the data point that is closest to the query.
The points live in $\mathbb R^d$ with some distance metric; the problem is hard when the dimension $d$ is large.
In the approximate version, the algorithm is allowed to return a point whose distance from the query is at most $c$ times the distance from the query to its nearest point.
$c$ is called the approximation factor.
One of the most popular algorithms for approximate search in high dimensions is based on locality-sensitive hashing (LSH).
Key idea: hash the points using several hash functions to ensure that, for each function, the probability of collision is much higher for objects that are close to each other than for those that are far apart. Then one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point.
Definitions
We say a point $p$ is an $R$-near neighbor of a point $q$ if the distance between $p$ and $q$ is at most $R$.
Definition of randomized $c$-approximate $R$-near neighbor (abbr. $(c, R)$-NN). Given a set $P$ of points in a $d$-dimensional space $\mathbb R^d$, and parameters $R > 0$, $\delta > 0$, construct a data structure s.t., given any query point $q$, if there exists an $R$-near neighbor of $q$ in $P$, it reports some $cR$-near neighbor of $q$ in $P$ with probability $1 - \delta$.
- We can assume $R = 1$, simply by dividing all coordinates by $R$. Thus $(c, R)$-NN becomes $c$-NN.
Definition of randomized $R$-near neighbor reporting. Given a set $P$ of points in a $d$-dimensional space $\mathbb R^d$, and parameters $R > 0$, $\delta > 0$, construct a data structure that, given any query point $q$, reports each $R$-near neighbor of $q$ in $P$ with probability $1 - \delta$.
Locality-Sensitive Hash Function
The LSH algorithm relies on the existence of locality-sensitive hash functions.
Let $\mathcal H$ be a family of hash functions mapping $\mathbb R^d$ to some universe $U$. For any two points $p$ and $q$, consider a process in which we choose a function $h$ from $\mathcal H$ uniformly at random, and analyze the probability that $h(p) = h(q)$.
The family $\mathcal H$ is called $(R, cR, P_1, P_2)$-sensitive if for any two points $p, q \in \mathbb R^d$:
- if $\|p - q\| \le R$ then $\Pr_{\mathcal H}[h(q) = h(p)] \ge P_1$
- if $\|p - q\| \ge cR$ then $\Pr_{\mathcal H}[h(q) = h(p)] \le P_2$
If two points are close, they collide under the hash with high probability.
If two points are far apart, they collide with low probability.
In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy $P_1 > P_2$.
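For instance (the standard bit-sampling family, for points in $\{0,1\}^d$ under Hamming distance):

$$
h_i(p) = p_i, \quad i \in \{1, \dots, d\} \text{ uniform}, \qquad
\Pr_{\mathcal H}[h_i(p) = h_i(q)] = 1 - \frac{dist(p,q)}{d},
$$

so this family is $(R, cR, 1 - R/d, 1 - cR/d)$-sensitive, and $P_1 > P_2$ whenever $c > 1$.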
LSH Algorithm
An LSH family can be used to design an efficient algorithm for approximate near neighbor search.
We need to amplify the gap between $P_1$ and $P_2$.
For parameters $k$ and $L$, we choose $L$ functions $g_j(q) = (h_{1,j}(q), \dots, h_{k,j}(q))$, where the $h_{i,j}$ ($1 \le i \le k$, $1 \le j \le L$) are chosen independently and uniformly at random from $\mathcal H$. We are also given a distance function $dist: \mathbb R^d \times \mathbb R^d \to \mathbb R$. These are the actual functions that we use to hash the data points.
The data structure is constructed by placing each point $p$ from the input set into the buckets $g_j(p)$, for $j = 1, \dots, L$. This way the data structure uses only $O(nL)$ memory; it suffices that buckets store pointers to data points, not the points themselves.
To process a query $q$, we scan through the buckets $g_1(q), \dots, g_L(q)$, and retrieve the points stored in them. If there is an $R$-near neighbor of $q$ in $P$, the structure must report some point that is a $cR$-near neighbor of $q$.
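In terms of the hypothetical HammingLSH sketch from the lecture part above, usage would look like:

```java
Random rnd = new Random(42);
// d = 64, k = 28, L = 18, cR = 4.0 (c = 2, R = 2); all values illustrative
HammingLSH lsh = new HammingLSH(64, 28, 18, 4.0, rnd);
int[] p = new int[64];            // the all-zeros point
lsh.insert(p);
int[] q = p.clone(); q[0] = 1;    // q is at Hamming distance 1 <= R from p
int[] ans = lsh.query(q);         // returns p with high probability
```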
Two concrete scanning strategies:
1. Interrupt the search after finding the first $L'$ points (including duplicates), for some parameter $L'$.
$c$-approximate $R$-near neighbor problem
With $L' = 3L$, this yields a solution to the randomized $c$-approximate $R$-near neighbor problem, with parameters $R$ and $\delta$, for some constant failure probability $\delta < 1$. To obtain this guarantee, it suffices to set $L$ to $\Theta(n^\rho)$, where $\rho = \frac{\ln(1/P_1)}{\ln(1/P_2)}$.
2. Continue the search until all points from all buckets are retrieved.
$R$-near neighbor reporting problem
The value of the failure probability $\delta$ depends on the choice of the parameters $k$ and $L$. For each $\delta$, one can provide parameters $k$ and $L$ so that the error probability is smaller than $\delta$. The query time is also dependent on $k$ and $L$: the worst case could be as high as $\Theta(n)$, but for many natural datasets a proper choice of parameters results in a sublinear query time.
The details of the analysis are as follows. Let $p^*$ be any $R$-near neighbor of $q$, and consider any parameter $k$. For any function $g_i$, the probability that $g_i(p^*) = g_i(q)$ is at least $P_1^k$; hence the probability that $g_i(p^*) = g_i(q)$ for some $i = 1, \dots, L$ is at least $1 - (1 - P_1^k)^L$. If we set $L$ so that $(1 - P_1^k)^L \le \delta$, then any $R$-near neighbor of $q$ is returned by the algorithm with probability at least $1 - \delta$.
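Solving for $L$ explicitly (a one-line calculation, using $1 - x \le e^{-x}$):

$$
(1 - P_1^k)^L \le e^{-L P_1^k} \le \delta
\quad\Longleftarrow\quad
L \ge P_1^{-k} \ln(1/\delta),
$$

so $L = \lceil P_1^{-k} \ln(1/\delta) \rceil$ tables suffice.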
How to choose $k$? Larger values of $k$ lead to a larger gap between the collision probabilities for close and far points ($P_1^k$ vs. $P_2^k$), but since $P_1^k$ shrinks as well, $L$ must grow to compensate.
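A hypothetical numeric instance: with $P_1 = 0.9$, $P_2 = 0.6$, and $n = 10^6$ points,

$$
k = \left\lceil \log_{1/P_2} n \right\rceil = \left\lceil \frac{\ln 10^6}{\ln(1/0.6)} \right\rceil = 28,
\qquad
\rho = \frac{\ln(1/0.9)}{\ln(1/0.6)} \approx 0.21,
\qquad
L = \left\lceil n^{\rho} \right\rceil \approx 18,
$$

far fewer than the $n$ candidates a linear scan would examine.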
LSH Library
Several distance metrics are covered by known LSH families:
Hamming distance, $\ell_1$ distance, $\ell_2$ distance, Jaccard similarity, Arccos (angular distance), etc.
Near-Optimal LSH Function for Euclidean Distance
A new LSH family, yielding an algorithm with query time exponent $\rho(c) = 1/c^2 + o(1)$. For large enough $n$, the value of $\rho(c)$ tends to $1/c^2$.