Recall the problem of randomized c-approximate R-near neighbor search: we are given a set P of n points in R^d, parameters R>0 and δ>0, and a distance function dist: R^d × R^d → R. For a query point q, if there is an R-near neighbor p of q (dist(p,q) ≤ R), we must report some point p′ that is a cR-near neighbor of q (dist(p′,q) ≤ cR) with probability at least 1−δ.
Sensitive Hash Functions. We say a family of hash functions H is (R, cR, P1, P2)-sensitive for dist if
for all p,q ∈ R^d with dist(p,q) ≤ R: Pr_{h∼H}[h(p) = h(q)] ≥ P1;
for all p,q ∈ R^d with dist(p,q) ≥ cR: Pr_{h∼H}[h(p) = h(q)] ≤ P2.
Solutions. Such a family yields a data structure with query time O(L(d + tk)) and space O(nd + Ln), where L = O(n^ρ) with ρ = ln P1/ln P2, t is the time to evaluate a hash function from H, and k = log_{P2}(1/n).
Hamming Distance. Hash functions for Hamming distance: a uniform hash function h ∈ H chooses a uniform random bit position and returns the corresponding bit as the hash value. H is (R, cR, 1 − R/d, 1 − cR/d)-sensitive, and we proved that the exponent ρ = ln P1/ln P2 satisfies ρ ≤ 1/c.
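As a quick sanity check, here is a minimal Python sketch of this bit-sampling family (the name `sample_hamming_hash` is ours, not from the notes): drawing many hash functions and counting collisions for two bit strings at Hamming distance r should give a collision rate close to 1 − r/d.

```python
import random

def sample_hamming_hash(d):
    """Draw h from H: pick a uniform random bit position i; h(p) = p[i]."""
    i = random.randrange(d)
    return lambda p: p[i]

random.seed(0)
d = 100
p = [0] * d
q = p.copy()
for j in range(10):           # make the Hamming distance r = 10
    q[j] = 1

hs = [sample_hamming_hash(d) for _ in range(20000)]
collision_rate = sum(h(p) == h(q) for h in hs) / len(hs)
# collision_rate should be close to 1 - r/d = 0.9
```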
Other Distance Measures
ℓ1-distance
For two points p, q ∈ R^d, the ℓ1-distance is dist(p,q) = ∑_{i=1}^d |p_i − q_i|. Given the radius R and approximation factor c, we can design a sensitive family of hash functions as follows. For a bucket width w to be determined, consider the family:
H = { h_{i,o}(x) = ⌊(x_i − o)/w⌋ | i ∈ {1, …, d}, o ∈ [0, w) }
To get a hash function from H, a coordinate i and an offset o are drawn uniformly and independently.
The hash value of the point x is ⌊(x_i − o)/w⌋.
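A minimal sketch of sampling from this family (the function name is illustrative):

```python
import math
import random

def sample_l1_hash(d, w):
    """Draw h_{i,o} from H: uniform coordinate i and uniform offset o in [0, w)."""
    i = random.randrange(d)
    o = random.uniform(0.0, w)
    return lambda x: math.floor((x[i] - o) / w)

random.seed(1)
h = sample_l1_hash(d=3, w=4.0)
bucket = h((0.5, 2.0, 7.0))   # an integer bucket index along the sampled coordinate
```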
Suppose the coordinate i has been chosen, so h_{i,o} is determined up to the offset o, which is still uniformly random in [0, w). For points p and q, h_{i,o}(p) and h_{i,o}(q) are equal precisely when ⌊(p_i − o)/w⌋ = ⌊(q_i − o)/w⌋. The expression ⌊(x − o)/w⌋ changes value precisely at the points x = o + jw for j ∈ Z.
What is the probability that a point of the form o + jw lies between p_i and q_i?
If |p_i − q_i| ≥ w, then this happens with probability 1.
Otherwise, the probability is |p_i − q_i|/w.
Thus we have Pr[⌊(p_i − o)/w⌋ = ⌊(q_i − o)/w⌋] = max{0, 1 − |p_i − q_i|/w}.
Since i is chosen uniformly, we get
Pr_{h∼H}[h(p) = h(q)] = (1/d) ∑_{i=1}^d max{0, 1 − |p_i − q_i|/w}.
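This collision probability is easy to verify empirically. The sketch below compares the closed form above against simulation over random choices of i and o (helper names are ours):

```python
import math
import random

def collision_prob(p, q, w):
    """Closed form: (1/d) * sum_i max(0, 1 - |p_i - q_i| / w)."""
    d = len(p)
    return sum(max(0.0, 1.0 - abs(pi - qi) / w) for pi, qi in zip(p, q)) / d

def empirical_collision_prob(p, q, w, trials=100000):
    """Estimate Pr[h(p) = h(q)] by sampling i and o directly."""
    d = len(p)
    hits = 0
    for _ in range(trials):
        i = random.randrange(d)
        o = random.uniform(0.0, w)
        hits += math.floor((p[i] - o) / w) == math.floor((q[i] - o) / w)
    return hits / trials

random.seed(2)
p, q, w = (0.0, 1.0, 5.0), (0.5, 1.0, 2.0), 2.0
exact = collision_prob(p, q, w)        # (0.75 + 1 + 0) / 3
approx = empirical_collision_prob(p, q, w)
```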
Let us first assume dist(p,q) ≤ R. If we choose w ≥ R, then |p_i − q_i| ≤ dist(p,q) ≤ R ≤ w for every i, so the max disappears and the above simplifies to Pr_{h∼H}[h(p) = h(q)] = 1 − dist(p,q)/(dw) ≥ 1 − R/(dw).
Set Similarity
Now consider data that consists of sets rather than points.
The input consists of n sets S_1, …, S_n, all subsets of a universe U.
A natural measure of set similarity is the Jaccard coefficient:
J(A,B) = |A ∩ B| / |A ∪ B|
The coefficient lies between 0 and 1 with J(A,B)=1 when A and B are identical and J(A,B)=0 when A and B are disjoint.
Nearest neighbor queries on sets: given a set C, find the set S_i in the data that is most similar to C.
Goal: use LSH. For LSH we need a distance measure, which we obtain by flipping the Jaccard coefficient:
dist_J(A,B) = 1 − J(A,B)
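With sets represented as Python `set` objects, both quantities are one-liners (the convention J = 1 for two empty sets is our own choice to avoid division by zero):

```python
def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|; we define J of two empty sets as 1."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dist_j(a, b):
    """dist_J(A, B) = 1 - J(A, B)."""
    return 1.0 - jaccard(a, b)

j = jaccard({1, 2, 3}, {2, 3, 4})   # |{2,3}| / |{1,2,3,4}| = 0.5
```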
How do we define an LSH function for this distance measure? We use min-wise independent families of hash functions. A family H of hash functions h: U → R is min-wise independent if for all x ∈ U and S ⊂ U with x ∉ S, it holds that
Pr_{h∼H}[h(x) < min_{y∈S} h(y)] = 1/(|S| + 1).
The above definition means that in any set, each element has exactly the same probability of receiving the smallest hash value. We assume that any h ∈ H returns distinct values for all x ∈ U, i.e., h(x) ≠ h(y) for any x ≠ y and any h ∈ H.
Assume that we have a min-wise independent family of hash functions H and define from it a new family of hash functions G_H ⊆ {g : 2^U → U}:
G_H = { g_h(A) = argmin_{x∈A} h(x) | h ∈ H }.
This definition needs a bit of explanation. The notation 2^U refers to all subsets of the universe U. So G_H consists of functions that map sets A ⊆ U to hash values in U.
The family G_H has one function g_h for each function h ∈ H. To evaluate g_h(A), we compute the element x ∈ A receiving the smallest hash value when using h, and we return that element x as the hash value of A. Simply speaking, g_h(A) returns the element in A with the smallest hash value according to h.
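A small sketch of g_h in Python, using a uniformly random permutation of the universe as the hash function h (a random permutation is exactly min-wise independent, though it needs a large random seed; the helper names are ours):

```python
import random

def sample_gh(universe, rng):
    """Sample g_h: assign each element a distinct random rank (playing the
    role of h) and return the min-rank element of the input set."""
    elems = sorted(universe)
    perm = rng.sample(elems, len(elems))          # random permutation of U
    rank = {x: r for r, x in enumerate(perm)}     # rank[x] = h(x)
    return lambda a: min(a, key=rank.__getitem__)

rng = random.Random(3)
g = sample_gh(range(10), rng)
m = g({2, 5, 7})   # the element of {2, 5, 7} with the smallest hash value
```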
Is G_H sensitive w.r.t. dist_J? Consider two sets A, B and let x* = argmin_{x∈A∪B} h(x) be the element with the smallest hash value in the union A ∪ B. Observe that g_h(A) = g_h(B) iff x* ∈ A ∩ B. Now define for every x ∈ A ∩ B the event E_x that h(x) < min_{y∈(A∪B)\{x}} h(y), i.e., E_x is the event that x has the smallest hash value among all elements of A ∪ B. By min-wise independence (applied with S = (A∪B)\{x}), Pr[E_x] = 1/|A ∪ B|, and since the events E_x are disjoint, we have
Pr_{h∼H}[g_h(A) = g_h(B)] = ∑_{x∈A∩B} Pr[E_x] = |A ∩ B|/|A ∪ B| = J(A,B).
Hence G_H is (R, cR, 1 − R, 1 − cR)-sensitive w.r.t. dist_J.
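The identity Pr[g_h(A) = g_h(B)] = J(A,B) can be checked empirically, again using random permutations in place of a min-wise independent h (all names here are illustrative):

```python
import random

def minhash_collision_rate(a, b, universe, trials=30000, seed=4):
    """Estimate Pr[g_h(A) = g_h(B)] over random permutations h of U."""
    rng = random.Random(seed)
    elems = sorted(universe)
    hits = 0
    for _ in range(trials):
        perm = rng.sample(elems, len(elems))
        rank = {x: r for r, x in enumerate(perm)}
        if min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__):
            hits += 1
    return hits / trials

a, b = {1, 2, 3, 4}, {3, 4, 5, 6}
# J(A, B) = |{3, 4}| / |{1, 2, 3, 4, 5, 6}| = 2/6
rate = minhash_collision_rate(a, b, range(8))
```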
Then we bound k as k = log_{P2}(1/n) = ln(1/n)/ln(1 − cR). Since 1 − cR ≤ exp(−cR), we get
k ≤ ln(1/n)/ln(exp(−cR)) = ln(n)/(cR).
Constructing Min-Wise Independent Families
All of the above assumed that we had a min-wise independent family of hash functions H. Unfortunately, it can be proved that such families of hash functions are impossible to construct using a small random seed. However, we can still find efficient families that are only approximately min-wise independent, which suffices to get reasonable nearest neighbor search structures.
We say that a family of hash functions H is ϵ-approximate min-wise independent if for any x ∈ U and any S ⊂ U with x ∉ S, we have
Pr_{h∼H}[h(x) < min_{y∈S} h(y)] ∈ (1 ± ϵ)/(|S| + 1).
Indyk showed that any O(ln(1/ϵ))-wise independent family of hash functions is also ϵ-approximate min-wise independent. A classic example of a k-wise independent family of hash functions is the following, for any prime p > |U|:
H = { h_{a_{k−1},…,a_0}(x) = (∑_{i=0}^{k−1} a_i x^i) mod p | a_0, …, a_{k−1} ∈ [p] }.
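A sketch of sampling from this polynomial family, using the Mersenne prime p = 2^61 − 1 (any prime p > |U| works; the function name is ours):

```python
import random

def sample_kwise_hash(k, p, rng):
    """Draw h(x) = (a_{k-1} x^{k-1} + ... + a_1 x + a_0) mod p with
    coefficients a_0, ..., a_{k-1} uniform in [p]."""
    coeffs = [rng.randrange(p) for _ in range(k)]   # a_0, ..., a_{k-1}
    def h(x):
        v = 0
        for a in reversed(coeffs):                  # Horner's rule
            v = (v * x + a) % p
        return v
    return h

rng = random.Random(5)
p = 2**61 - 1                  # a Mersenne prime, so p > |U| for any modest U
h = sample_kwise_hash(k=4, p=p, rng=rng)
v = h(42)                      # a hash value in [0, p)
```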
Pătraşcu and Thorup proved that the simple tabulation hash function is ϵ-approximate min-wise independent with ϵ = O(lg² n / n^{1/c}), where c is the number of tables used for the hash function and n is the cardinality of the set S. So the quality depends on the set cardinalities, and therefore simple tabulation hashing works best for rather large sets.
How the approximation affects the sensitivity of the hash function: