Nearest Neighbor Search

2D scenario: Voronoi Diagram

3D or higher dimensions: the Voronoi approach is no longer efficient.

Fine-grained complexity

Assume the distance between the query point and its nearest neighbor is $R$; a $c$-approximate $R$-near neighbor is any point within range $cR$ ($c\ge1$).

  • input: point $q$, constant $c$, radius $R$.
  • if there is a point $p$ with $dist(p,q)\le R$, then return a point $p'$ with $dist(p',q)\le cR$ with probability $\ge 1-\delta$.

Locality Sensitive Hashing

abbr. LSH

Distance: $dist:\mathbb R^d\times\mathbb R^d\to\mathbb R$. Examples: $\ell_1$, $\ell_2$, Manhattan, Hamming, …

A hash function $h$ is $(R,cR,P_1,P_2)$-sensitive w.r.t. $dist$ if

  1. $\forall p,q$ with $dist(p,q)\le R$: $\Pr[h(p)=h(q)]\ge P_1$.
  2. $\forall p,q$ with $dist(p,q)\ge cR$: $\Pr[h(p)=h(q)]\le P_2$.
  3. $P_1\ge P_2$.
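
For example (a standard family, matching the Hamming-distance parameters used later in these notes): for Hamming distance on $\{0,1\}^d$, bit sampling $h(p)=p_i$ with coordinate $i$ chosen uniformly at random satisfies

$$\Pr[h(p)=h(q)]=1-\frac{dist(p,q)}{d},$$

so this family is $(R,cR,\,1-R/d,\,1-cR/d)$-sensitive.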

Store all points $p\in S$ in a hash table keyed by $h(p)$, where $h$ is $(R,cR,P_1,P_2)$-sensitive.

$A[y]$ is the linked list of all points $p'\in S$ with $h(p')=y$:

```java
HashMap<Integer, LinkedList<Point>> hm = new HashMap<>();
// Build the table: bucket every point by its hash value h(p).
for (Point p : points) {
    Integer key = h(p);
    if (hm.containsKey(key)) {
        hm.get(key).add(p);
    } else {
        LinkedList<Point> l = new LinkedList<>();
        l.add(p);
        hm.put(key, l);
    }
}
```

Query:

```java
Point query(Point q) {
    Integer key = h(q);
    LinkedList<Point> l = hm.get(key);
    if (l == null) return null;      // empty bucket: no candidates at all
    for (Point p : l) {
        if (dist(p, q) <= c * R) {   // any point within cR is a valid answer
            return p;
        }
    }
    return null;
}
```

Let $X_p$ be $1$ if $h(p)=h(q)$ and $0$ otherwise.

$$\mathbb E\Big[\sum_p X_p\Big]=\sum_p\mathbb E[X_p]=\sum_p\Pr_h[h(p)=h(q)],$$

the expected number of points colliding with $q$, which governs the time spent scanning the bucket.

Parameter $k$: let $g(p)$ be the concatenation of the $0$-$1$ bits $h_1(p),\cdots,h_k(p)$, where the $h_i$ are chosen independently.

For $dist(p,q)\le R$: $\Pr_g[g(p)=g(q)]\ge P_1^k$.

For $dist(p,q)\ge cR$: $\Pr_g[g(p)=g(q)]\le P_2^k\le 1/n$ (the last inequality holds by the choice of $k$ below).

Let $X_p$ be $1$ if $g(p)=g(q)$ and $0$ otherwise.

$$\mathbb E\Big[\sum_p X_p\Big]=\sum_{dist(p,q)\ge cR}\Pr[g(p)=g(q)]+\sum_{dist(p,q)\le R}\Pr[g(p)=g(q)]$$

Choosing $k=\log_{P_2}(1/n)$ gives $P_2^k=1/n$, so the first sum (collisions with far points) is at most $n\cdot P_2^k=1$ in expectation.
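
A minimal Java sketch of this concatenation step, under illustrative assumptions: `BitHash` is a hypothetical interface for one $0$-$1$ hash bit, and `Point` is the same placeholder type used in the snippets above.

```java
import java.util.List;

// Hypothetical single-bit LSH function (illustrative, not from the source);
// any (R, cR, P1, P2)-sensitive family emitting 0-1 bits would do.
interface BitHash {
    int hash(Point p); // returns 0 or 1
}

class Amplify {
    // g(p) concatenates the k bits h_1(p), ..., h_k(p) into one bucket key
    // (assumes k <= 31 so the key fits into an int).
    // For independent h_i: Pr[g(p) = g(q)] = Pr[h(p) = h(q)]^k.
    static int g(Point p, List<BitHash> hs) { // hs holds the k independent h_i
        int key = 0;
        for (BitHash h : hs) {
            key = (key << 1) | h.hash(p);
        }
        return key;
    }
}
```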

Boosting the success probability:

For parameters $L,k$: for $i=1,\cdots,L$, let $g_i(p)$ be the concatenation of the $0$-$1$ bits $h_{i1}(p),\cdots,h_{ik}(p)$, where all the $h_{ij}$ are chosen independently.

Distance metric: Hamming distance.

For $i=1,\cdots,L$: store $S$ in hash table $H_i$, keyed by $g_i$.
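
A sketch of this construction, reusing the hypothetical `BitHash` and `Amplify.g` from the previous snippet (all names are illustrative, not from the source):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;

class LshIndex {
    // tables.get(i) is H_i; gs.get(i) holds the k functions h_{i1}, ..., h_{ik}.
    final List<HashMap<Integer, LinkedList<Point>>> tables = new ArrayList<>();
    final List<List<BitHash>> gs;

    LshIndex(List<Point> points, List<List<BitHash>> gs) {
        this.gs = gs;
        for (List<BitHash> gi : gs) { // i = 1, ..., L
            HashMap<Integer, LinkedList<Point>> hi = new HashMap<>();
            for (Point p : points) {
                // bucket p under its key g_i(p), creating the bucket on demand
                hi.computeIfAbsent(Amplify.g(p, gi), key -> new LinkedList<>()).add(p);
            }
            tables.add(hi);
        }
    }
}
```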

Query(q):

  • For $i=1,\cdots,L$:

    • scan the list $H_i[g_i(q)]$:

      • if we see a $p$ with $dist(p,q)\le cR$, return $p$
    • if more than $3L$ points have been scanned in total, abort
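
The query procedure above as a Java sketch, written as a method of the illustrative `LshIndex` class; `dist`, `c`, and `R` are assumed to be available, as in the earlier snippets:

```java
// Member of the illustrative LshIndex class above.
// Scans the bucket g_i(q) in each of the L tables; returns the first point
// within distance c*R, and aborts after inspecting more than 3L points.
Point query(Point q, double c, double R) {
    int scanned = 0;
    final int limit = 3 * gs.size(); // 3L
    for (int i = 0; i < gs.size(); i++) {
        LinkedList<Point> bucket = tables.get(i).get(Amplify.g(q, gs.get(i)));
        if (bucket == null) continue; // nothing hashed to this key
        for (Point p : bucket) {
            if (dist(p, q) <= c * R) {
                return p; // a cR-near neighbor: a valid answer
            }
            if (++scanned > limit) {
                return null; // abort: too many far points collided
            }
        }
    }
    return null; // no cR-near neighbor found
}
```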

Space: $O(Ln+nd)$.

Query time: $O(Lkt+Ld)$, where $t$ is the time to evaluate a single $h$.

Probability Bound for Failure

Two types of failure:

$E_1=\{\forall i: g_i(q)\ne g_i(p^*)\}$, where $p^*$ denotes an $R$-near neighbor of $q$.

$E_2=\Big\{\sum_{i=1}^L|\{p: dist(p,q)> cR\ \text{and}\ g_i(q)=g_i(p)\}|>3L\Big\}$

If neither happens, the result is correct.

We want $\Pr[E_1\cup E_2]\le\Pr[E_1]+\Pr[E_2]\le 1/2$.

$$\Pr[E_2]\le\mathbb E\Big[\sum_{i=1}^L|\{p: dist(p,q)> cR\ \text{and}\ g_i(q)=g_i(p)\}|\Big]\big/\,3L\quad\text{(Markov)}\\
=\sum_{i=1}^L\mathbb E\big[|\{p: dist(p,q)> cR\ \text{and}\ g_i(q)=g_i(p)\}|\big]\big/\,3L\le L/3L=1/3,$$

since each table contributes at most $1$ far collision in expectation, by the choice of $k$.

$$\Pr[E_1]=\Pr[\forall i: g_i(q)\neq g_i(p^*)]=\prod_{i=1}^L\Pr[g_i(q)\neq g_i(p^*)]\\
=\prod_{i=1}^L\big(1-\Pr[g_i(q)=g_i(p^*)]\big)\le(1-P_1^k)^L$$

Choosing $L=\log_{1-P_1^k}\frac{1}{6}$ makes $(1-P_1^k)^L=1/6$, so $\Pr[E_1]\le 1/6$ and $\Pr[E_1\cup E_2]\le 1/6+1/3\le 1/2$.

Upper bound for L

$$L=\frac{\ln(1/6)}{\ln(1-P_1^k)}\le\frac{\ln(1/6)}{\ln(\exp(-P_1^k))}=\frac{\ln(1/6)}{-P_1^k}=\frac{\ln 6}{P_1^k}=O(P_1^{-k})$$

$$P_1^{-k}=\frac{1}{P_1^{\log_{P_2}(1/n)}}=\exp\big(k\ln(1/P_1)\big)=\exp\Big(\frac{\ln n}{\ln(1/P_2)}\,\ln(1/P_1)\Big)=n^{\ln P_1/\ln P_2}=n^{\ln(1/P_1)/\ln(1/P_2)}$$

For Hamming distance:

$$P_1=1-R/d,\qquad P_2=1-cR/d,\qquad \sup L=n^{1/c}$$

(for these values one can show $\ln(1/P_1)/\ln(1/P_2)\le 1/c$, hence $L\le n^{1/c}$).
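
A sketch of this Hamming-distance family in Java, implementing the hypothetical `BitHash` interface from above; the `bits` field of `Point` is an assumption made for illustration:

```java
import java.util.Random;

// Bit sampling for Hamming distance: h(p) = p[i] for a uniformly random
// coordinate i. Then Pr[h(p) = h(q)] = 1 - dist(p, q)/d, which yields
// exactly the P1 = 1 - R/d and P2 = 1 - cR/d above.
class BitSample implements BitHash {
    final int i; // the sampled coordinate

    BitSample(int d, Random rng) {
        this.i = rng.nextInt(d);
    }

    public int hash(Point p) {
        return p.bits[i]; // assumes Point stores a 0-1 vector in bits[]
    }
}
```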

Digest from Reference Paper

Efficient algorithms for the approximate and exact nearest neighbor problem.

Goal: Preprocess a dataset of objects. Later, given a new query object, one can quickly return the dataset object that is most similar to the query.

Nearest Neighbor Problem

Given a collection of $n$ points, build a data structure which, given any query point, reports the data point that is closest to the query.

Points live in $\mathbb R^d$ under some distance metric. The problem is hard when $d$ is large.

In the approximate version, the algorithm is allowed to return a point whose distance from the query is at most $c$ times the distance from the query to its nearest point.

$c>1$ is called the approximation factor.

One of the most popular algorithms for approximate search in high dimensions is based on locality-sensitive hashing.

Key idea: hash the points using several hash functions so that, for each function, the probability of collision is much higher for objects that are close to each other than for those that are far apart. Then one can determine near neighbors by hashing the query point and retrieving the elements stored in buckets containing that point.

Definitions

We say a point $p$ is an $R$-near neighbor of a point $q$ if the distance between $p$ and $q$ is at most $R$.

Definition of randomized $c$-approximate $R$-near neighbor (abbr. $(c,R)$-NN): given a set $P$ of points in a $d$-dimensional space $\mathbb R^d$ and parameters $R>0$, $\delta>0$, construct a data structure such that, given any query point $q$, if there exists an $R$-near neighbor of $q$ in $P$, it reports some $cR$-near neighbor of $q$ in $P$ with probability $1-\delta$.

  • We can assume $R=1$, simply by dividing all coordinates by $R$. Thus $c$-NN.

Definition of randomized $R$-near neighbor reporting: given a set $P$ of points in a $d$-dimensional space $\mathbb R^d$ and parameters $R>0$, $\delta>0$, construct a data structure that, given any query point $q$, reports each $R$-near neighbor of $q$ in $P$ with probability $1-\delta$.

Locality-Sensitive Hash Function

The LSH algorithm relies on the existence of locality-sensitive hash functions.

Let $\mathcal H$ be a family of hash functions mapping $\mathbb R^d$ to some universe $U$. For any two points $p$ and $q$, consider a process in which we choose a function $h$ from $\mathcal H$ uniformly at random, and analyze the probability that $h(p)=h(q)$.

The family $\mathcal H$ is called $(R,cR,P_1,P_2)$-sensitive if, for any two points $p,q\in\mathbb R^d$:

  • if $\Vert p-q\Vert\le R$ then $\Pr_{\mathcal H}[h(q)=h(p)]\ge P_1$
  • if $\Vert p-q\Vert\ge cR$ then $\Pr_{\mathcal H}[h(q)=h(p)]\le P_2$

If two points are close to each other, they collide under the hash with high probability.

If two points are far apart, they collide with low probability.

In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy $P_1>P_2$.

LSH Algorithm

An LSH family $\mathcal H$ can be used to build an efficient algorithm for approximate near neighbor search.

We need to amplify the gap between $P_1$ and $P_2$.

For parameters $k$ and $L$, we choose $L$ functions $g_j(q)=(h_{1,j}(q),\cdots,h_{k,j}(q))$, where the $h_{t,j}$ ($1\le t\le k$, $1\le j\le L$) are chosen independently and uniformly at random from $\mathcal H$. We are also given a distance function $dist:\mathbb R^d\times\mathbb R^d\to\mathbb R$. These are the actual functions used to hash the data points.

The data structure is constructed by placing each point $p$ from the input set into the buckets $g_j(p)$, for $j=1,\cdots,L$. Since it suffices for buckets to store pointers to the data points rather than the points themselves, the data structure uses only $O(nL)$ memory.

To process a query $q$, we scan through the buckets $g_1(q),\cdots,g_L(q)$ and retrieve the points stored in them. If there is an $R$-near neighbor $p$ of $q$, i.e. $dist(p,q)\le R$, we must report some point $p'$ that is a $cR$-near neighbor of $q$.

Two concrete scanning strategies:

  1. Interrupt the search after finding the first $L'$ points, for some parameter $L'$.

    $(c,R)$-near neighbor problem

    With $L'=3L$, this yields a solution to the randomized $c$-approximate $R$-near neighbor problem with parameters $R$ and $\delta$, for some constant failure probability $\delta<1$. To obtain this guarantee, it suffices to set $L$ to $\Theta(n^\rho)$, where $\rho=\frac{\ln 1/P_1}{\ln 1/P_2}$ (a worked instance follows this list).

  2. Continue the search until all points from all buckets are retrieved

    $R$-near neighbor reporting problem

    The value of the failure probability $\delta$ depends on the choice of the parameters $k$ and $L$. For each $\delta$, one can choose $k$ and $L$ so that the error probability is smaller than $\delta$. The query time also depends on $k$ and $L$; in the worst case it can be as high as $\Theta(n)$, but for many natural datasets a proper choice of parameters results in a sublinear query time.

    The details of the analysis are as follows. Let $p$ be any $R$-near neighbor of $q$, and consider any parameter $k$. The probability that $g_i(p)=g_i(q)$ for some $i=1,\cdots,L$ is at least $1-(1-P_1^k)^L$. If we set $L=\log_{1-P_1^k}\delta$ so that $(1-P_1^k)^L\le\delta$, then any $R$-near neighbor of $q$ is returned by the algorithm with probability at least $1-\delta$.

    How to choose $k$? Larger $k$ leads to a larger gap between the collision probabilities of close and far points; at the same time it lowers $P_1^k$, so $L$ must grow to keep the success probability.
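
As a worked instance of these formulas (an illustration, not taken from the paper): for bit sampling under Hamming distance with approximation factor $c=2$,

$$\rho=\frac{\ln 1/P_1}{\ln 1/P_2}=\frac{\ln\frac{1}{1-R/d}}{\ln\frac{1}{1-2R/d}}\le\frac{1}{c}=\frac12,$$

so $L=\Theta(n^\rho)=O(\sqrt n)$ tables suffice, matching the $n^{1/c}$ bound derived earlier.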

LSH Library

Several distance metrics are considered:

Hamming distance, $\ell_1$ distance, $\ell_s$ distance, Jaccard, Arccos, etc.

Near-Optimal LSH Function for Euclidean Distance

A new LSH family yields an algorithm with query time exponent $\rho(c)=1/c^2+O(\log\log n/\log^{1/3}n)$. For large enough $n$, the value of $\rho(c)$ tends to $1/c^2$.