Recall

Recall the problem of nearest neighbor search: Randomized cc-approximate RR-near neighbor search. Recall that given a set PP of nn points in Rd\mathbb R^d and parameters R>0,δ>0R\gt0,\delta\gt0. Also given a distance function dist:Rd×RdRdist:\mathbb R^d\times\mathbb R^d\rarr\mathbb R. For a query point qq, if there is an RR-near neighbor pp of qq (dist(p,q)Rdist(p,q)\le R), we must report some point pp' that is a cRcR-near neighbor of qq (dist(p,q)cRdist(p',q)\le cR) with probability 1δ1-\delta.

Sensitive Hash Functions. Family of hash functions H\mathcal H to be (R,cR,P1,P2)(R,cR,P_1,P_2)-sensitive for distdist if

  1. for all p,qRdp,q\in\mathbb R^d with dist(p,q)R:PrhH[h(p)=h(q)]P1dist(p,q)\le R:\Pr_{h\sim\mathcal H}[h(p)=h(q)]\ge P_1.
  2. for all p,qRdp,q\in\mathbb R^d with dist(p,q)cR:PrhH[h(p)=h(q)]P2dist(p,q)\ge cR:\Pr_{h\sim\mathcal H}[h(p)=h(q)]\le P_2.

Solutions. With query time O(L(d+tk))O(L(d+tk)) and space O(nd+Ln)O(nd+Ln), where L=O(nlnP1/lnP2)L=O(n^{\ln P_1/\ln P_2}), tt is the time to evaluate a hash function from HH and k=logP2(1/n)k=\log_{P_2}(1/n).

Hamming Distance. Hash functions for Hamming distance: a uniform hash function hh in H\mathcal H chooses a uniform random bit position and returns the corresponding bit as the hash value. H\mathcal H was (R,cR,1R/d,1cR/d)(R,cR,1-R/d,1-cR/d)-sensitive and we proved that exponent ρ=lnP1/lnP2\rho=\ln P_1/\ln P_2 satisfies rho1/crho\le 1/c.

Other Distance Measures

l1-distance

For 2 points p,qRdp,q\in\mathbb R^d, we have 1\ell_1-distance i=1dpiqi\sum_{i=1}^d|p_i-q_i|. Given the radius RR and approximation factor cc, we can design a sensitive family of hash functions as follows: For a radius ww to be determined, consider the family:

H={hi,o(x)=xiowi{1,,d},o[0,w)}\mathcal H=\left\{h_{i,o}(x)=\left\lfloor\frac{x_i-o}{w}\right\rfloor| i\in\{1,\cdots,d\},o\in[0,w)\right\}

To get a hash function from H\mathcal H, a coordinate ii and offset oo are drawn uniformly and independently.

The hash value of the point xx is (xio)/w\lfloor(x_i-o)/w\rfloor.

Consider ii and hi,oh_{i,o} has been chosen, but oo is still uniform random in [0,w)[0,w). For points p,qp,q, hi,o(p)h_{i,o}(p) and hi,o(q)h_{i,o}(q) equal precisely when (pio)/w=(qio)/w\lfloor(p_i-o)/w\rfloor=\lfloor(q_i-o)/w\rfloor. The expression (xoi)/w\lfloor(x-o_i)/w\rfloor changes value precisely at points x=oi+jwx=o_i+jw for jZj\in\mathbb Z.

What’s the probability that a point of the form oi+jwo_i+jw lies between pip_i and qiq_i?

  • If piqiw|p_i-q_i|\ge w, then this happens with probability 11.
  • Otherwise, the probability is piqi/w|p_i-q_i|/w.

Thus we have Pr[(pioi)/w=(qioi)/w]=max{0,1piqi/w}\Pr[\lfloor(p_i-o_i)/w\rfloor=\lfloor(q_i-o_i)/w\rfloor]=\max\{0,1-|p_i-q_i|/w\}.

Since ii is chosen uniformly, we get

PrhH[h(p)=h(q)]=i=1d1dmax{0,1piqi/w}.\Pr_{h\sim\mathcal H}[h(p)=h(q)]=\sum_{i=1}^d\frac{1}{d}\max\{0,1-|p_i-q_i|/w\}.

Let us first assume dist(p,q)Rdist(p,q)\le R. If we choose wRw\ge R, then the above simplifies to

PrhH[h(p)=h(q)]=i=1d1d(1piqi/w)=1i=1dpiqidw=1dist(p,q)dw1Rdw\Pr_{h\sim\mathcal H}[h(p)=h(q)]=\sum_{i=1}^d\frac1d\cdot(1-|p_i-q_i|/w)\\ =1-\frac{\sum_{i=1}^d|p_i-q_i|}{dw}=1-\frac{dist(p,q)}{dw}\ge1-\frac{R}{dw}

On the other hand, assume dist(p,q)>cRdist(p,q)\gt cR and assume we choose wcRw\ge cR. Then we get:

PrhH[h(p)=h(q)]=i=1d1dmax{0,1piqi/w}=i=1d1d(1min{w,piqi}w)=1i=1dmin{w,piqi}dw.\Pr_{h\sim\mathcal H}[h(p)=h(q)]=\sum_{i=1}^d\frac{1}{d}\max\{0,1-|p_i-q_i|/w\}\\ =\sum_{i=1}^d\frac{1}{d}\cdot(1-\frac{\min\{w,|p_i-q_i|\}}{w})\\ =1-\frac{\sum_{i=1}^d\min\{w,|p_i-q_i|\}}{dw}.

Assume first that i:piqiw\forall i:|p_i-q_i|\le w. Then the above equals

1i=1dpiqidw=1dist(p,q)dw1cRdw1-\frac{\sum_{i=1}^d|p_i-q_i|}{dw}=1-\frac{dist(p,q)}{dw}\le1-\frac{cR}{dw}

On the other hand, assume i:piqi>w\exist i:|p_i-q_i|\gt w, then the above has upper bound:

PrhH[h(p)=h(q)]=1i=1dmin{w,piqi}dw1wdw1cRdw\Pr_{h\sim\mathcal H}[h(p)=h(q)]=1-\frac{\sum_{i=1}^d\min\{w,|p_i-q_i|\}}{dw}\\ \le 1-\frac{w}{dw}\le1-\frac{cR}{dw}

Thus H\mathcal H is (R,cR.1R/(dw),1cR/(dw))(R,cR.1-R/(dw),1-cR/(dw))-sensitive if we choose wcRw\ge cR. This results in L=O(nρ)L=O(n^\rho) for

ρ=lnP1lnP2=ln(1R/(dw))ln(1cR/(dw))ln(1R/(dw))ln(1R/(dw))c=1c\rho=\frac{\ln P_1}{\ln P_2}=\frac{\ln(1-R/(dw))}{\ln(1-cR/(dw))}\le\frac{\ln(1-R/(dw))}{\ln(1-R/(dw))^c}=\frac{1}{c}

We also have k=logP2(1/n)=ln(1/n)ln(1cR/(dw))k=\log_{P_2}(1/n)=\frac{\ln(1/n)}{\ln(1-cR/(dw))}. By 1xexp(x)1-x\le\exp(-x), we have

k=logP2(1/n)=ln(1/n)ln(1cR/(dw))ln(1/n)cR/(dw)=dwlnn/(cR).k=\log_{P_2}(1/n)=\frac{\ln(1/n)}{\ln(1-cR/(dw))}\le\frac{\ln(1/n)}{-cR/(dw)}=dw\ln n/(cR).

Consider the data consists of sets rather than points.

Input consist of nn sets S1,,SnS_1,\cdots,S_n, all subsets of a universe UU.

Measure for set similarity: Jaccard coefficient

J(A,B)=ABABJ(A,B)=\frac{|A\cap B|}{|A\cup B|}

The coefficient lies between 00 and 11 with J(A,B)=1J(A,B)=1 when AA and BB are identical and J(A,B)=0J(A,B)=0 when AA and BB are disjoint.

The Nearest Neighbor Queries on sets: Given a set CC, find the set SiS_i in the data that is most similar to CC.

Goal: Use the LSH. For LSH we need to distance metric. Which is the opposite of Jaccard coefficient.

distJ(A,B)=1J(A,B)dist_J(A,B)=1-J(A,B)

How to define LSH function for this distance metric? Use min-wise independent families of hash function. A family of hash functions H:UR\mathcal H:U\rarr\mathbb R is min-wise independent if xU,SU\forall x\in U,S\subset U with xSx\notin S, it holds that

PrhH[h(x)<minySh(y)]=1S+1.\Pr_{h\sim\mathcal H}[h(x)\lt\min_{y\in S}h(y)]=\frac{1}{|S|+1}.

The above definition means that in any set, each element has exactly the same probability of receiving the smallest hash value. We assume that any hHh\in\mathcal H returns distinct values for all xUx\in U i.e. h(x)h(y)h(x)\neq h(y) for any xyx\neq y and any hHh\in\mathcal H.

Assume that we have a min-wise independent family of hash functions H\mathcal H and define from it a new family of hash function GH2UU:\mathcal G_\mathcal H\subset 2^U\rarr U:

GH={gh(A)=argminxAh(x)hH}.\mathcal G_\mathcal H=\{g_h(A)=\arg\min_{x\in A}h(x)|h\in\mathcal H\}.

This definition needs a bit of explanation. The notation 2U2^U refers to all subsets of the universe UU. So GH\mathcal G_\mathcal H consists of functions that map sets AUA\subseteq U to hash values in UU.

The family GH\mathcal G_\mathcal H has one function ghg_h for each function hHh\in\mathcal H. To evaluate gh(A)g_h(A), we compute the element xAx\in A receiving the smallest hash value when using hh, and we return that element xx as the hash value of AA. Simply speaking, gh(A)g_h(A) returns the element in AA with smallest hash value according to hh,

Is GH\mathcal G_\mathcal H sensitive w.r.t. distJdist_J? Consider 2 sets A,BA,B and let x=argminxABh(x)x^*=\arg\min_{x\in A\cup B}h(x) be the element with smallest hash value in the union ABA\cup B. Observe that gh(A)=gh(B)g_h(A)=g_h(B) iff xABx^*\in A\cap B. Now define for every xABx\in A\cap B the event ExE_x s.t. h(x)<ming(AB){x}h(y)h(x)\lt\min_{g\in(A\cup B)\diagdown\{x\}}h(y), i.e., ExE_x is the event that xx has the smallest hash value among all elements in ABA\cup B, we have

PrghGH[gh(A)=gh(B)]=PrhH[argminxABh(x)AB]=PrhH[xABEx]\Pr_{g_h\sim\mathcal G_\mathcal H}[g_h(A)=g_h(B)]=\Pr_{h\sim H}[\arg\min_{x\in A\cup B}h(x)\in A\cap B]\\ =\Pr_{h\sim\mathcal H}[\bigcup_{x\in A\cap B}E_x]

The events ExE_x are disjoint. Therefore

PrhH[xABEx]=xABPrhH[Ex]=xAB1(AB){x}+1=ABAB=J(A,B)\Pr_{h\sim\mathcal H}[\bigcup_{x\in A\cap B}E_x]=\sum_{x\in A\cap B}\Pr_{h\sim\mathcal H}[E_x]\\ =\sum_{x\in A\cap B}\frac{1}{|(A\cup B)\diagdown\{x\}|+1}\\ =\frac{|A\cap B|}{|A\cup B|}=J(A,B)

Now we bound P1,P2P_1,P_2.

  • For P1P_1. Let A,BA,B have distJ(A,B)Rdist_J(A,B)\le R. We see that RdistJ(A,B)=1J(A,B)=1PrghGH[gh(A)=gh(B)]R\ge dist_J(A,B)=1-J(A,B)=1-\Pr_{g_h\sim\mathcal G_\mathcal H}[g_h(A)=g_h(B)], implying PrghGH[gh(A)=gh(B)]1R\Pr_{g_h\sim \mathcal G_\mathcal H}[g_h(A)=g_h(B)]\ge1-R.

  • For P2P_2. Next let A,BA,B have distJ(A,B)>cRdist_J(A,B)\gt cR. Then we have cR<distJ(A,B)=1J(A,B)=1PrghGH[gh(A)=gh(B)]cR\lt dist_J(A,B)=1-J(A,B)=1-\Pr_{g_h\sim \mathcal G_\mathcal H}[g_h(A)=g_h(B)] implying PrghGH[gh(A)=gh(B)]<1cR\Pr_{g_h\sim \mathcal G_\mathcal H}[g_h(A)=g_h(B)]\lt 1-cR.

Thus the family GH\mathcal G_\mathcal H is thus (R,cR,1R,1cR)(R,cR,1-R,1-cR)-sensitive.

Then we bound ρ\rho in L=O(nρ)L=O(n^\rho):

ρ=lnP1/lnP2=ln(1R)/ln(1cR)ln(1R)/ln(1R)c=1/c.\rho=\ln P_1/\ln P_2=\ln(1-R)/\ln(1-cR)\le\ln(1-R)/\ln(1-R)^c=1/c.

Then we bound kk as k=lnP2(1/n)=ln(1/n)/ln(1cR)k=\ln_{P_2}(1/n)=\ln(1/n)/\ln(1-cR). Analogously,

kln(1/n)/ln(exp(cR))=lnn/(cR).k\le\ln(1/n)/\ln(\exp(-cR))=\ln n/(cR).

Constructing Min-Wise Independent Families

All of above assumed that we had a min-wise independent family of hash functions H\mathcal H. Unfortunately it can be proved that such families of hash functions are impossible to construct using a small random seed. However, we can still find efficient families that are only approximately min-wise independent.

It suffices to get reasonable nearest neighbor search structures.

We say that a family of hash function H\mathcal H is ϵ\epsilon-approximate min-wise independent, if for any xUx\in U and any SUS\subset U with xSx\notin S, we have

PrhH[h(x)<minySh(y)]1±ϵS+1\Pr_{h\sim\mathcal H}[h(x)\lt\min_{y\in S}h(y)]\in\frac{1\pm\epsilon}{|S|+1}

Indyk showed that any O(ln(1/ϵ))O(\ln(1/\epsilon))-wise independent family of hash functions is also ϵ\epsilon-approximate min-wise independent. A classic example of kk-wise independent family of hash function is the following for any prime p>Up\gt U:

H={hak1,,a0(x)=i=1k1aixi mod pa0,,ak1[p]}.\mathcal H=\left\{ h_{a_{k-1},\cdots,a_0}(x)=\sum_{i=1}^{k-1}a_ix^i\text{ mod }p|a_0,\cdots,a_{k-1}\in[p] \right\}.

Someone proved that the simple tabulation hash function is ϵ\epsilon-approximate min-wise independent with ϵ=O(lg2n/n1/c)\epsilon=O(\lg^2n/n^{1/c}) where cc is number of tables used for the hash function and nn is the cardinality of the set SS. So the quality depends on the set cardinalities and therefore simple tabulation hashing works best for rather large sets.

How the approximation affects the sensitivity of the hash function:

PrghGH[gh(A)=gh(B)]=xABPrhH[Ex]xAB1+ϵ(AB){x}+1=(1+ϵ)ABAB=(1+ϵ)J(A,B).\Pr_{g_h\sim\mathcal G_\mathcal H}[g_h(A)=g_h(B)]=\sum_{x\in A\cap B}\Pr_{h\sim\mathcal H}[E_x]\\ \le\sum_{x\in A\cap B}\frac{1+\epsilon}{|(A\cup B)\diagdown\{x\}|+1}\\ =(1+\epsilon)\cdot\frac{|A\cap B|}{|A\cup B|}\\ =(1+\epsilon)\cdot J(A,B).

Therefore, for distJ(A,B)>cRdist_J(A,B)\gt cR, we have

PrghGH[gh(A)=gh(B)](1+ϵ)J(A,B)(1+ϵ)(1distJ(A,B))(1+ϵ)(1cR).\Pr_{g_h\sim\mathcal G_\mathcal H}[g_h(A)=g_h(B)]\\ \le (1+\epsilon)J(A,B)\\ \le(1+\epsilon)(1-dist_J(A,B))\\ \le(1+\epsilon)(1-cR).

Analogously, we also have

PrghGH[gh(A)=gh(B)]=xABPrhH[Ex]xAB1ϵ(AB){x}+1=(1ϵ)ABAB=(1ϵ)J(A,B).\Pr_{g_h\sim\mathcal G_\mathcal H}[g_h(A)=g_h(B)]=\sum_{x\in A\cap B}\Pr_{h\sim\mathcal H}[E_x]\\ \ge\sum_{x\in A\cap B}\frac{1-\epsilon}{|(A\cup B)\diagdown\{x\}|+1}\\ =(1-\epsilon)\cdot\frac{|A\cap B|}{|A\cup B|}\\ =(1-\epsilon)\cdot J(A,B).

This implies that for distJ(A,B)Rdist_J(A,B)\le R, we have

PrghGH[gh(A)=gh(B)](1ϵ)J(A,B)=(1ϵ)(1distJ(A,B))(1ϵ)(1R).\Pr_{g_h\sim\mathcal G_\mathcal H}[g_h(A)=g_h(B)]\\ \ge(1-\epsilon)J(A,B)\\ =(1-\epsilon)(1-dist_J(A,B))\\ \ge(1-\epsilon)(1-R).

Thus GH\mathcal G_\mathcal H is (R,cR,(1R)(1ϵ),(1cR)(1+ϵ))(R,cR,(1-R)(1-\epsilon),(1-cR)(1+\epsilon))-sensitive.

l2-distance

dist(p,q)=i=1d(piqi)2dist(p,q)=\sqrt{\sum_{i=1}^d(p_i-q_i)^2}

Math Math Math!🤯

Let ww be a parameter to be fixed and consider the following family of hash functions

H={ho,u(x)=x,uowo[0,w),uRd}.\mathcal H=\left\{h_{o,u}(x)=\lfloor\frac{\langle x,u\rangle-o}{w}\rfloor|o\in[0,w),u\in\mathbb R^d\right\}.

We simply pick oo uniformly in [0,w)[0,w) and pick vector uu s.t. each coordinate is independently N(0,1)\mathcal N(0,1) distributed.

Consider two points p,qp,q and two fixed values of p,u\langle p,u\rangle and q,u\langle q,u\rangle (Once uu is chosen but oo is still uniform random). Similar to 1\ell_1-distance, we get

p,uow=q,uow\lfloor\frac{\langle p,u\rangle-o}{w}\rfloor=\lfloor\frac{\langle q,u\rangle-o}{w}\rfloor

with probability max{0,1pq,u/w}\max\{0,1-|\langle p-q,u\rangle|/w\}.

Then consider the distribution of pq,u\langle p-q,u\rangle:

pq,u=i=1d(piqi)uiN(0,i(piqi)2)N(0,dist(p,q)2)dist(p,q)N(0,1)pq,u/w(dist(p,q)/w)N(0,1)\langle p-q,u\rangle=\sum_{i=1}^d(p_i-q_i)u_i\sim\mathcal N(0,\sum_i(p_i-q_i)^2)\\ \sim\mathcal N(0,dist(p,q)^2)\sim dist(p,q)\cdot\mathcal N(0,1)\\ \Rarr \langle p-q,u\rangle/w\sim (dist(p,q)/w)\cdot\mathcal N(0,1)\\

By probability density function ff or N(0,1)\mathcal N(0,1):

Pr[h(p)=h(q)]=12πexp(x2/2)max{0,1(dist(p,q)/w)x}dxPr[h(p)=h(q)]=12πexp(x2/2)(1(dist(p,q)/w)x)dx=12πexp(x2/2)dx12πexp(x2/2)(dist(p,q)/w)xdx.\Pr[h(p)=h(q)]=\int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)\cdot\max\{0,1-(dist(p,q)/w)|x|\}dx\\ \ge\Pr[h(p)=h(q)]=\int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)\cdot(1-(dist(p,q)/w)|x|)dx\\ =\int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)dx-\int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)(dist(p,q)/w)|x|dx.

By symmetry and integral of ff is 11, the upper quantity

12(dist(p,q)/w)12π0exp(x2/2)xdx12(dist(p,q)/w)12π\ge1-2\cdot(dist(p,q)/w)\cdot\frac{1}{\sqrt{2\pi}}\cdot\int_0^\infty\exp(-x^2/2)xdx\\ \ge1-2\cdot(dist(p,q)/w)\cdot\frac{1}{\sqrt{2\pi}}

For p,qp,q with dist less than RR, this is at least 12/π(R/w)1-\sqrt{2/\pi}(R/w).

For upper bound on Pr[h(p)=h(q)]\Pr[h(p)=h(q)]:

Pr[h(p)=h(q)]=12πexp(x2/2)max{0,1(dist(p,q)/w)x}dx12πexp(x2/2)(1(dist(p,q)/w)x)dx+2w/dist(p,q)12πexp(x2/2)(dist(p,q)/w)xdx==12/π(dist(p,q)/w)(1e(w/dist(p,q))2/2).\Pr[h(p)=h(q)]=\int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)\cdot\max\{0,1-(dist(p,q)/w)|x|\}dx\\ \le\int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)\cdot(1-(dist(p,q)/w)|x|)dx\\ +2\cdot\int_{w/dist(p,q)}^\infty\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)\cdot(dist(p,q)/w)xdx\\ =\cdots\cdots\\ =1-\sqrt{2/\pi}(dist(p,q)/w)(1-e^{-(w/dist(p,q))^2/2}).

Assume wdist(p,q)w\gg dist(p,q) and ignore (1e(w/dist(p,q))2/2)(1-e^{-(w/dist(p,q))^2/2}). Under this simplification, we get for p,qp,q with dist(p,q)>cRdist(p,q)\gt cR that Pr[h(p)=h(q)]12/π(cR/w)\Pr[h(p)=h(q)]\le1-\sqrt{2/\pi}(cR/w).

The we bound ρ\rho in L=O(nρ)L=O(n^\rho):

ρ=lnP1/lnP2=ln(12/π(R/w))/ln(12/π(cR/w))ln(12/π(R/w))/ln(12/π(R/w))c=1/c\rho=\ln P_1/\ln P_2=\ln(1-\sqrt{2/\pi}(R/w))/\ln(1-\sqrt{2/\pi}(cR/w))\\ \le\ln(1-\sqrt{2/\pi}(R/w))/\ln(1-\sqrt{2/\pi}(R/w))^c=1/c

If we don’t ignore (1e(w/dist(p,q))2/2)(1-e^{-(w/dist(p,q))^2/2}), then

ρ1c(1e(w/dist(p,q))2/2)\rho\le\frac{1}{c(1-e^{-(w/dist(p,q))^2/2})}

By choosing ww, we need to estimate on the largest value of dist(p,q)dist(p,q).

Then we bound kk.

P2=12/π(cR/w)P_2=1-\sqrt{2/\pi}(cR/w), ignoring (1e(w/dist(p,q))2/2)(1-e^{-(w/dist(p,q))^2/2}).

Be have k=logP2(1/n)=ln(1/n)/lnP2=ln(1/n)/ln(12/π(cR/w))k=\log_{P_2}(1/n)=\ln(1/n)/\ln P_2=\ln(1/n)/\ln(1-\sqrt{2/\pi}(cR/w)).

kln(1/n)/ln(e(w/dist(p,q))2/2)=lnn/(2/πcR/w)=O(wlnn/(cR))k\le\ln(1/n)/\ln(e^{-(w/dist(p,q))^2/2})=\ln n/(\sqrt{2/\pi}cR/w)=O(w\ln n/(cR))

This means we should be careful not to choose ww too large.