Lecturer: Kasper

Lecture notes is based on High Speed Hashing for Integers and Strings by Mikkel Thorup

Hash Functions

Universe UU. Mapping randomly to a range [m]={0,,m1}[m]=\{0,\cdots,m-1\}.

Truly random hash function h:U[m]h:U\rarr[m].

hh is idealized, can not be implemented. To represent a truly random hash functions, we need to store at least Ulog2m|U|\log_2m bits. (too large, impossible!)

Idea: hash functions contain only a small element or seed of randomness so that the hash function is sufficiently random for the desired application. e.g. Prime number pp, for random hash function h:[p][0,,p1]h:[p]\rarr[0,\cdots,p-1] could be h(x)=(ax+b)modph(x)=(ax+b)\mod p where a,ba,b are random numbers that together form the random seed describing the function.

Definition and Properties

Definition 1: A hash function h:U[m]h:U\rarr[m] is a random variable in the class of all functions U[m]U\rarr[m], i.e., it consists of a random variable h(x)h(x) for each xUx\in U.

3 things we care about hh:

  1. Space. The size of the random seed that is necessary to calculate h(x)h(x) given xx.
  2. Speed. The time it takes to calculate h(x)h(x) given xx.
  3. Properties of random variables.

Universal Hashing

To generate a random hash function h:U[m]h:U\rarr[m] from a key universe UU to a set of hash values [m]={0,,m1}[m]=\{0,\cdots,m-1\}. We think of hh as a random variable following some distribution over functions U[m]U\rarr[m]. We want hh to be universal which means that for any given distinct keys x,yUx,y\in U, when hh is picked at random, we have low collision probability:

Prh[h(x)=h(y)]1m\Pr_h[h(x)=h(y)]\le{1\over m}

For many applications, it suffices if for some c=O(1)c=O(1), we have

Prh[h(x)=h(y)]cm\Pr_h[h(x)=h(y)]\le{c\over m}

Then hh is called c-approximately universal.

Exercise 2.1 Is the truly independent hash function h:U[m]h : U \rarr [m] universal?

Yes.

Exercise 2.2 If a hash function h:U[m]h : U \rarr [m] has collision probability 0, how large must mm be?

mUm\ge|U|.

Exercise 2.3 Let umu \le m. Is the identity function f(x)=xf (x) = x a universal hash function [u][m][u] → [m]?

Yes.

Applications

hash tables with chaining. (detail see in Lecture 2)

Multiply-mod-prime

Assume m<um\lt u and m>1m\gt1.

The classic universal hash function is based on a prime number pup\ge u. We pick a uniformly random a[p]+={1,,p1}a\in[p]_+=\{1,\cdots,p-1\} and b[p]={1,,p1}b\in[p]=\{1,\cdots,p-1\}, and define ha,b:[u][m]h_{a,b}:[u]\rarr[m] by

ha,b(x)=((ax+b) mod p) mod mh_{a,b}(x)=((ax+b)\text{ mod }p)\text{ mod }m

Given any distinct x,y[u][p]x,y\in[u]\subseteq[p], we want for random a,ba,b:

Pra[p]+,b[p][ha,b(x)=ha,b(y)]<1m\Pr_{a\in[p]_+,b\in[p]}[h_{a,b}(x)=h_{a,b}(y)]\lt{1\over m}

⚠️Strict inequality! Strictly better.

Fact: If pp is prime and α,β[p]+\alpha,\beta\in[p]_+ then αβ≢0 mod p\alpha\beta\not\equiv0\text{ mod } p.

Let x,y[p],xyx,y\in[p],x\neq y be given. For given pair (a,b)[p]2(a,b)\in[p]^2, define (q,r)[p]2(q,r)\in[p]^2 by

(ax+b) mod p=q(ay+b) mod p=r(ax+b)\text{ mod }p=q\\ (ay+b)\text{ mod }p=r

Lemma: These two equations define a 1-1 correspondence between pairs (a,b)(a,b) and (q,r)(q,r).

(Use Fact to prove)

Assume that there exist two different pairs a,ba,b and a,ba',b' which corresponds to same q,rq,r.

(ax+b) mod p=q (1)(ay+b) mod p=r (2)(ax+b) mod p=q (3)(ay+b) mod p=r (4)(ax+b)\text{ mod }p=q\ (1)\\ (ay+b)\text{ mod }p=r\ (2)\\ (a'x+b')\text{ mod }p=q\ (3)\\ (a'y+b')\text{ mod }p=r\ (4)

For ((1)(2))((3)(4))((1)-(2))-((3)-(4)), by the Fact above, it yields aa0modpa-a'\equiv0\mod p.

For ((1)(3))((2)(4))((1)-(3))-((2)-(4)), it yields bb0modpb-b'\equiv0\mod p.

Thus a=a,b=ba=a',b=b' contradicts assumption.

Thus its bijection.

Since xyx\neq y, by Fact above, r=qa=0r=q\Lrarr a=0. Thus when we pick (a,b)[p]+×[p](a,b)\in[p]_+\times[p], we get rqr\neq q.

Why the inequality is strict?

We get a collision ha,b(x)=ha,b(y)h_{a,b}(x)=h_{a,b}(y) iff q mod m=r mod mq\text{ mod }m=r\text{ mod }m.

Let us fix qq and set i=q mod mi=q\text{ mod }m. There are at most p/m\lceil p/m\rceil values r[p]r\in[p] with r mod m=ir\text{ mod }m=i and one of them is r=qr=q. Therefore, the number of r[p]{q}r\in[p]\diagdown\{q\} with r mod m=i=q mod mr\text{ mod }m=i=q\text{ mod }m is at most p/m1(p+m1)/m1=(p1)/m\lceil p/m\rceil-1\le(p+m-1)/m-1=(p-1)/m. However, there are values of i[m]i'\in[m] with only p/m\lfloor p/m\rfloor values of q[p]q'\in[p] with q mod m=iq'\text{ mod }m=i', and then the number of r[p]{q}r'\in[p]\diagdown\{q'\} with r mod m=i=q mod mr'\text{ mod }m=i'=q'\text{ mod }m is p/m1<p/m1\lfloor p/m\rfloor-1\lt\lceil p/m\rceil-1. Summing over all pp values of q[p]q\in[p], we get that the number of r[p]{q}r\in[p]\diagdown\{q\} with r mod m=i=q mod mr\text{ mod }m=i=q\text{ mod }m is strictly less than p(p/m1)p(p1)/mp(\lceil p/m\rceil-1)\le p(p-1)/m. Then 1-1 correspondence implies that there are strictly less than p(p1)/mp(p-1)/m collision pairs (a,b)[p]+×[p](a,b)\in[p]_+\times[p]. Since each of the p(p1)p(p-1) pairs from [p]+×[p][p]_+\times[p] are equally likely, we conclude that the collision probability is strictly below 1/m1/m.

Implementation for 64-bit Keys

Hashing scheme

h(x)=((ax+b) mod p) mod mh(x)=((ax+b)\text{ mod }p)\text{ mod }m

u=264,m=220u=2^{64},m=2^{20}.

Since p>u=264p\gt u=2^{64}, we generally need to reserve more than 64 bits for a[p]+a\in[p]_+, so the product axax has more than 128 bits. We can get the product of 32-bit numbers, representing them as 64-bit numbers, and getting the full 64-bit result. We need at least 6 such 64-bit multiplications to compute axax.

How do we compute ax mod pax\text{ mod }p? For 64-bit numbers, we have a general mod-operation. Idea: Let pp be a Mersenne prime p=2q1p=2^q-1, e.g. 2611,28912^{61}-1,2^{89}-1.

x(x mod 2q+x/2q)modpx\equiv(x\text{ mod }2^q+\lfloor x/2^q\rfloor)\mod p

To prove this, it suffices to prove: x mod 2q+x/2q=xx/2qpx\text{ mod }2^q+\lfloor x/2^q\rfloor=x-\lfloor x/2^q\rfloor p. (Obviously)

1
2
3
//p=pow(2,q)-1
y=(x&p)+(x>>q);
if(y>=p) y-=p;

Multiply-shift

Hashing from ww-bit integers to \ell-bit integers.

We pick a uniformly random odd ww-bit integer aa, and then we compute ha:[2w][2]h_a:[2^w]\rarr[2^\ell], as

ha(x)=(ax mod 2w)/2w.h_a(x)=\lfloor(ax\text{ mod }2^w)/2^{w-\ell}\rfloor.

1
2
3
4
//(a*x%pow(2,w))>>(w-l)
return ((a%pow(2,w)*(x%pow(2,w)))%pow(2,w))>>(w-l);
//for w=64, variable are uint64_t
return (a*x)>>(64-l);

This scheme is faster than ha,b(x)=((ax+b) mod p) mod nh_{a,b}(x)=((ax+b)\text{ mod }p)\text{ mod }n. Numbers are stored as bit strings, with the least significant bit to the right. Integer division by a power of 2 is accomplished by a right shift. For hashing 64-bit integers, we further exploit that 64-bit multiplication automatically discards overflow, which is the same as multiplying modulo 2642^{64}.

Multiply-shift is 2-approximately universal, i.e., for xyx\neq y,

Prodd a[2w][ha(x)=ha(y)]2/2=2/m\Pr_{\text{odd }a\in[2^w]}[h_a(x)=h_a(y)]\le2/2^\ell=2/m

To prove this, it suffices to prove that the probability that bis w,,w1w-\ell,\cdots,w-1 of a(yx)a(y-x) are all 0s or all 1s is at most 2/22/2^\ell.

Fact: If α\alpha is odd and β[2q]+\beta\in[2^q]_+, then αβ≢0( mod 2q)\alpha\beta\not\equiv0(\text{ mod }2^q).

Proof omitted.

证明的大致思路:假设 x<y,ha(x)=ha(y)x\lt y,h_a(x)=h_a(y)。显然,如果要碰撞,那么 ay=ax+a(yx)ay=ax+a(y-x) 中,a(yx)a(y-x) 对应第 ww-\ell 到第 w1w-1\ell 位比特必须全 0 或者全 1。再证全 0 和全 1 的概率是 2/22/2^\ell.

再用上面的 Fact “任何奇数和 2 的多次幂互质”证明 a(yx)a(y-x) 对应第 ww-\ell 到第 w1w-1\ell 位比特是均匀分布即可。

Strong Universality

Strong universality: For h:[u][m]h:[u]\rarr[m], we consider pairwise events of the form that for given distinct keys x,y[u]x,y\in[u] and possibly non-distinct hash values q,r[m]q,r\in[m], we have h(x)=qh(x)=q and h(y)=rh(y)=r. We say a random hash function h:[u][m]h:[u]\rarr[m] is strongly universal if the probability of every pairwise event is 1/m21/m^2.

We note that if hh is strongly universal, it is also universal since

\Pr[h(x)=h(y)]=\sum_{q\in[m]}\Pr[h(x)=q\and h(y)=q]=m/m^2=1/m

Observation: An equivalent definition of strong universality is that each key is hashed uniformly into [m][m], and that every two distinct keys are hashed independently.

Proof omitted.

Emphasizing the independence, strong universality is also called 2-independence, as it concerns a pair of two events. As for universality, we may accept some relaxed notion of strong universality e.g. 3-independence, k-independence

c-approximately strongly universality: h:[U][m]h:[U]\rarr[m] is c-approximately strongly universal if

  1. hh is c-approximately uniform, meaning for every xUx\in U and for every hash value q[m]q\in[m], we have Pr[h(x)=q]c/m\Pr[h(x)=q]\le c/m;
  2. Every pair of distinct keys hash independently.

Applications

Coordinated sampling: allocation of strongly universal hashing.

Basic idea: small samples can reason about the similarity of huge sets.

First we consider sampling from a single set AUA\subseteq U using a strongly universal hash function h:U[m]h:U\rarr[m] and a threshold t{0,,m}t\in\{0,\cdots,m\}. We now sample xx if h(x)<th(x)\lt t, which by uniformity happens with probability t/mt/m for any xx.

Let Sh,t(A)={xAh(x)<t}S_{h,t}(A)=\{x\in A|h(x)\lt t\} denote the resulting sample from AA. Then by linearity of expectation, E[Sh,t(A)]=At/m\mathbb E[|S_{h,t}(A)|]=|A|\cdot t/m. Conversely, this means that if we have Sh,t(A)S_{h,t}(A), then we can estimate A|A| as Sh,t(A)m/t|S_{h,t}(A)|\cdot m/t.

The important application is not just sampling from a single set, but rather the sampling from different sets BB and CC so that we can later reason about the similarity, estimating the sizes of their union BCB\cup C and intersection BCB\cap C.

Sh,t(BC)=Sh,t(B)Sh,t(C)Sh,t(BC)=Sh,t(B)Sh,t(C)S_{h,t}(B\cup C)=S_{h,t}(B)\cup S_{h,t}(C)\\ S_{h,t}(B\cap C)=S_{h,t}(B)\cap S_{h,t}(C)

We can then estimate the size of the union and intersection multiplying the corresponding sample sizes by m/tm/t.

Sampling from different sets can be done in a distributed way.

Application: In machine learning, we can store the samples of many different large sets; On the Internet, routers can store samples of the packets passing through.

To get a fixed sampling probability t/mt/m for each xUx\in U, we only need that h:U[m]h:U\rarr[m] is uniform. This ensures that the estimate Sh,t(A)m/t|S_{h,t}(A)|\cdot m/t of AA is unbiased.

For aAa\in A, let Xa=[h(a)<t],X=aAXaX_a=[h(a)\lt t],X=\sum_{a\in A}X_a. Then X=Sh,t(A)X=|S_{h,t}(A)|, but the reasoning below applies when XaX_a is any 0-1 indicator variable that depends only h(a)h(a).

Because hh is strongly universal, for any distinct a,bAa,b\in A, we have that h(a),h(b)h(a),h(b) are independent, and hence so are Xa,XbX_a,X_b. Therefore X=aAXaX=\sum_{a\in A}X_a is the sum of pairwise independent 0-1 variables. Now the following concentration bound applies to XX.

Lemma: Let X=aAXaX=\sum_{a\in A}X_a where XaX_a are pairwise independent 0-1 variables. Let μ=E[X]\mu=\mathbb E[X]. Then Var(X)μ\text{Var}(X)\le\mu and for any q>0q\gt0,

Pr[Xμqμ]1/q2\Pr[|X-\mu|\ge q\sqrt\mu]\le1/q^2

Proof by Chebyshev’s inequality

Multiply-mod-prime

The classic strongly universal hash scheme is a multiply-mod-prime scheme. For some prime pp, uniformly at random we pick (a,b)[p]2(a,b)\in[p]^2 and define ha,b:[p][p]h_{a,b}:[p]\rarr[p] by

ha,b(x)=(ax+b) mod p.h_{a,b}(x)=(ax+b)\text{ mod }p.

Each pair ha,b(x)=qh_{a,b}(x)=q and ha,b(y)=rh_{a,b}(y)=r happens with probability 1/p21/p^2.

Multiply-shift

For any bit-string zz and integers j>i0,z[i,j)=z[i,j1]j\gt i\ge0,z[i,j)=z[i,j-1] denotes the number represented by bits i,,j1i,\cdots,j-1, so

z[i,j)=(z mod 2j)/2i.z[i,j)=\lfloor(z\text{ mod }2^j)/2^i\rfloor.

To get strongly universal hashing [2w][2][2^w]\rarr[2^\ell], we may pick any ww+1\overline w\ge w+\ell-1. For any pair (a,b)[w]2(a,b)\in[\overline w]^2, we define ha,b:[2w][2]h_{a,b}:[2^w]\rarr[2^\ell] by

ha,b(x)=(ax+b)[w,w).h_{a,b}(x)=(ax+b)[\overline w-\ell,\overline w).

It is strongly universal, proof omitted.

Beyond Strong Universality

There are more advanced algorithmic applications that need more powerful hash functions.

k-independent hash functions: A hash random function H:U[m]H:U\rarr[m] is k-independent if for any distinct keys x1,,xk[u]x_1,\cdots,x_k\in[u], the hash values H(x1),,H(xk)H(x_1),\cdots,H(x_k) are independent random variables, each uniformly distributed in [m][m].

From prime pp, we can implement a k-independent H:[p][p]H:[p]\rarr[p] using kk random coefficients a0,,ak1[p]a_0,\cdots,a_{k-1}\in[p], defining

H(x)=i=0k1aixi mod p.H(x)=\sum_{i=0}^{k-1}a_ix^i\text{ mod }p.

However, there is no efficient implementation of k-independent hashing on complex objects such as variable length strings. What we can do instead is to use the signature idea from Section 2, and first hash the complex keys to unique hash values from a limited integer domain [u][u].