Randomized Algorithms 3: Hash Functions
Lecturer: Kasper
These lecture notes are based on High Speed Hashing for Integers and Strings by Mikkel Thorup.
Hash Functions
Universe $[u] = \{0, 1, \dots, u-1\}$. Mapping randomly to a range $[m] = \{0, 1, \dots, m-1\}$.
Truly random hash function $h: [u] \to [m]$.
$h$ is idealized and cannot be implemented: to represent a truly random hash function, we need to store at least $u \log_2 m$ bits. (Too large, impossible!)
Idea: hash functions contain only a small element or seed of randomness so that the hash function is sufficiently random for the desired application. E.g., for a prime number $p \ge u$, the random hash function could be $h(x) = ((ax + b) \bmod p) \bmod m$, where $a, b \in [p]$ are random numbers that together form the random seed describing the function.
Definition and Properties
Definition 1: A hash function $h: U \to R$ is a random variable in the class of all functions $U \to R$, i.e., it consists of a random variable $h(x)$ for each $x \in U$.
3 things we care about:
- Space: the size of the random seed that is necessary to calculate $h(x)$ given $x$.
- Speed: the time it takes to calculate $h(x)$ given $x$.
- Properties of the random variables $h(x)$, $x \in U$.
Universal Hashing
We want to generate a random hash function $h$ from a key universe $U$ to a set of hash values $[m] = \{0, \dots, m-1\}$. We think of $h$ as a random variable following some distribution over functions $U \to [m]$. We want $h$ to be universal, which means that for any given distinct keys $x, y \in U$, when $h$ is picked at random, we have low collision probability:
$$\Pr_h[h(x) = h(y)] \le 1/m.$$
For many applications, it suffices if for some $c = O(1)$, we have
$$\Pr_h[h(x) = h(y)] \le c/m.$$
Then $h$ is called c-approximately universal.
Exercise 2.1 Is the truly random hash function $h: [u] \to [m]$ universal?
Yes: for distinct $x, y$, the values $h(x)$ and $h(y)$ are independent and uniform in $[m]$, so $\Pr[h(x) = h(y)] = 1/m$.
Exercise 2.2 If a hash function has collision probability 0, how large must $m$ be?
$m \ge u$, since the function must be injective.
Exercise 2.3 Let $m = u$. Is the identity function $f(x) = x$ a universal hash function $[u] \to [m]$?
Yes: there are no collisions at all, so the collision probability is $0 \le 1/m$.
Applications
Hash tables with chaining (see details in Lecture 2).
Multiply-mod-prime
Assume $u > m > 1$ (keys in $[u]$, hash values in $[m]$).
The classic universal hash function is based on a prime number $p \ge u$. We pick a uniformly random $a \in [p]_+ = \{1, \dots, p-1\}$ and $b \in [p]$, and define $h_{a,b}: [u] \to [m]$ by
$$h_{a,b}(x) = ((ax + b) \bmod p) \bmod m.$$
Given any distinct $x, y \in [u]$, we want for random $(a, b) \in [p]_+ \times [p]$:
$$\Pr[h_{a,b}(x) = h_{a,b}(y)] < 1/m.$$
⚠️ Strict inequality! Strictly better than required for universality.
Fact: If $p$ is prime and $\alpha, \beta \in [p]_+$, then $\alpha\beta \not\equiv 0 \pmod p$.
Let distinct $x, y \in [u]$ be given. For a given pair $(a, b) \in [p]_+ \times [p]$, define $(q, r) \in [p]^2$ by
$$q = (ax + b) \bmod p, \qquad r = (ay + b) \bmod p.$$
Lemma: These two equations define a 1-1 correspondence between pairs $(a, b) \in [p]_+ \times [p]$ and pairs $(q, r) \in [p]^2$ with $q \ne r$.
(Use the Fact to prove.)
Assume that there exist two different pairs $(a, b) \ne (a', b')$ which correspond to the same $(q, r)$.
For $q$: $ax + b \equiv a'x + b' \pmod p$, i.e., $(a - a')x \equiv b' - b \pmod p$.
For $r$: likewise $(a - a')y \equiv b' - b \pmod p$. Subtracting, $(a - a')(x - y) \equiv 0 \pmod p$, so by the Fact above $a = a'$, and then $b = b'$.
This contradicts the assumption, so the map is injective.
Since both sets have size $p(p-1)$, it is a bijection.
Finally, $q - r \equiv a(x - y) \pmod p$ with $a \ne 0$ and $x \ne y$, so by the Fact above $q \ne r$. Thus when we pick a random $(a, b) \in [p]_+ \times [p]$, we get a uniformly random $(q, r)$ with $q \ne r$.
Why is the inequality strict?
We get a collision iff $q \equiv r \pmod m$ (with $q \ne r$).
Let us fix $q$ and set $i = q \bmod m$. There are at most $\lceil p/m \rceil$ values $r \in [p]$ with $r \equiv i \pmod m$, and one of them is $q$ itself. Therefore, the number of $r \ne q$ with $r \equiv q \pmod m$ is at most $\lceil p/m \rceil - 1 \le (p + m - 1)/m - 1 = (p-1)/m$. However, since $p$ is prime and $1 < m < p$, $m$ does not divide $p$, so there are values of $q$ with only $\lfloor p/m \rfloor$ values of $r \equiv q \pmod m$, and then the number of $r \ne q$ with $r \equiv q \pmod m$ is $\lfloor p/m \rfloor - 1 < (p-1)/m$. Summing over all $p$ values of $q$, we get that the number of pairs $(q, r)$, $q \ne r$, with $q \equiv r \pmod m$ is strictly less than $p(p-1)/m$. The 1-1 correspondence then implies that there are strictly fewer than $p(p-1)/m$ collision pairs $(a, b)$. Since each of the $p(p-1)$ pairs from $[p]_+ \times [p]$ is equally likely, we conclude that the collision probability is strictly below $1/m$.
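As a sanity check, the bound can be verified exhaustively for small parameters. The following sketch uses illustrative values $p = 17$, $m = 4$ and two arbitrary distinct keys (all my own choices, not from the notes), enumerating every seed $(a, b) \in [p]_+ \times [p]$:

```python
# Exhaustive check of multiply-mod-prime universality; p = 17, m = 4 and the
# keys below are small illustrative choices, not values from the notes.
def hash_mmp(a, b, x, p, m):
    # h_{a,b}(x) = ((a*x + b) mod p) mod m
    return ((a * x + b) % p) % m

def collision_probability(x, y, p, m):
    # Enumerate all seeds a in [p]_+ = {1, ..., p-1}, b in [p].
    collisions = sum(1 for a in range(1, p) for b in range(p)
                     if hash_mmp(a, b, x, p, m) == hash_mmp(a, b, y, p, m))
    return collisions / ((p - 1) * p)

p, m = 17, 4
assert collision_probability(3, 5, p, m) < 1 / m   # strictly below 1/m
```

By the 1-1 correspondence above, the number of colliding seeds is the same for every pair of distinct keys, so checking one pair per parameter set already exercises the counting argument.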
Implementation for 64-bit Keys
Hashing scheme
$h_{a,b}(x) = ((ax + b) \bmod p) \bmod m$ with $p \ge u = 2^{64}$.
Since $p > 2^{64}$, we generally need to reserve more than 64 bits for $a$, so the product $ax$ has more than 128 bits. We can multiply two 32-bit numbers, representing them as 64-bit numbers, and get the full 64-bit result; splitting $a$ and $x$ into 32-bit limbs, we need at least 6 such 64-bit multiplications to compute $ax$.
How do we compute $(ax + b) \bmod p$? For 64-bit numbers we have a general mod-operation, but it is slow. Idea: let $p$ be a Mersenne prime $p = 2^q - 1$, e.g. $p = 2^{89} - 1$; then mod $p$ can be computed with a shift, a mask, and an addition.
To prove this, it suffices to prove: $x = (x \gg q) \cdot 2^q + (x \mathbin{\&} p)$ and $2^q \equiv 1 \pmod p$, hence $x \equiv (x \gg q) + (x \mathbin{\&} p) \pmod p$. (Obvious.)
```
// p = pow(2,q) - 1, a Mersenne prime
x = (x & p) + (x >> q);   // congruent to x mod p, and now x < 2p
if (x >= p) x -= p;
```
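A sketch of the same trick in Python (the helper name `mersenne_mod` is mine); a single conditional subtraction suffices whenever $x < p^2$, which covers the products arising in the hash computation:

```python
# Mersenne-prime modular reduction without division (mersenne_mod is a
# hypothetical helper name). Valid for any 0 <= x < p^2.
def mersenne_mod(x, q):
    # p = 2^q - 1, so 2^q ≡ 1 (mod p) and
    # x = (x >> q)*2^q + (x & p) ≡ (x >> q) + (x & p) (mod p).
    p = (1 << q) - 1
    x = (x & p) + (x >> q)        # congruent to x mod p, and < 2p
    if x >= p:
        x -= p
    return x

q = 89                            # p = 2^89 - 1, the Mersenne prime from the notes
p = (1 << q) - 1
assert mersenne_mod(123456789 * p + 42, q) == 42
```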
Multiply-shift
Hashing from $w$-bit integers to $l$-bit integers, i.e., from $[2^w]$ to $[2^l]$.
We pick a uniformly random odd $w$-bit integer $a$, and then we compute $h_a: [2^w] \to [2^l]$, as
$$h_a(x) = \lfloor (ax \bmod 2^w) / 2^{w-l} \rfloor.$$
```
// (a*x % pow(2,w)) >> (w-l); for w = 64 this is simply (a*x) >> (64-l)
```
This scheme is faster than multiply-mod-prime. Numbers are stored as bit strings, with the least significant bit to the right. Integer division by a power of 2 is accomplished by a right shift. For hashing 64-bit integers, we further exploit that 64-bit multiplication automatically discards overflow, which is the same as multiplying modulo $2^{64}$.
Multiply-shift is 2-approximately universal, i.e., for $x \ne y$,
$$\Pr_a[h_a(x) = h_a(y)] \le 2/2^l = 2/m.$$
To prove this, it suffices to prove that the probability that bits $w-l, \dots, w-1$ of $a(y-x)$ are all 0s or all 1s is at most $2/2^l$.
Fact: If $\alpha$ is odd and $\beta \not\equiv 0 \pmod{2^w}$, then $\alpha\beta \not\equiv 0 \pmod{2^w}$.
Proof omitted. Rough idea of the proof: suppose $x \ne y$. Clearly, for a collision to occur, bits $w-l$ through $w-1$ (an $l$-bit window) of $a(y-x)$ must be all 0s or all 1s; one then shows that the probability of all 0s or all 1s is at most $2/2^l$.
This uses the Fact above (any odd number is coprime to any power of 2) to show that the relevant bits $w-l$ through $w-1$ of $a(y-x)$ are essentially uniformly distributed.
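The $2/2^l$ bound can again be checked exhaustively for toy parameters ($w = 8$, $l = 3$ and the keys are illustrative choices of mine) by enumerating every odd multiplier $a$:

```python
# Multiply-shift with toy parameters w = 8, l = 3 (illustrative choices),
# checking the 2-approximate universality bound over all odd multipliers a.
def multiply_shift(a, x, w, l):
    # h_a(x) = (a*x mod 2^w) >> (w - l)
    return ((a * x) % (1 << w)) >> (w - l)

w, l = 8, 3
x, y = 5, 12                       # any two distinct w-bit keys
odd = list(range(1, 1 << w, 2))    # all 2^(w-1) odd w-bit multipliers
collisions = sum(1 for a in odd
                 if multiply_shift(a, x, w, l) == multiply_shift(a, y, w, l))
assert collisions / len(odd) <= 2 / (1 << l)   # Pr[collision] <= 2/m
```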
Strong Universality
Strong universality: For $h: U \to [m]$, we consider pairwise events of the form that for given distinct keys $x, y \in U$ and possibly non-distinct hash values $q, r \in [m]$, we have $h(x) = q$ and $h(y) = r$. We say a random hash function $h$ is strongly universal if the probability of every pairwise event is $1/m^2$.
We note that if $h$ is strongly universal, it is also universal, since
$$\Pr[h(x)=h(y)] = \sum_{q\in[m]} \Pr[h(x)=q \wedge h(y)=q] = m/m^2 = 1/m.$$
Observation: An equivalent definition of strong universality is that each key is hashed uniformly into , and that every two distinct keys are hashed independently.
Proof omitted.
Emphasizing the independence, strong universality is also called 2-independence, as it concerns a pair of two events. As for universality, we may accept some relaxed notion of strong universality, e.g., 3-independence and, in general, k-independence.
c-approximate strong universality: $h$ is c-approximately strongly universal if
- $h$ is c-approximately uniform, meaning that for every $x \in U$ and every hash value $q \in [m]$, we have $\Pr[h(x) = q] \le c/m$;
- every pair of distinct keys hashes independently.
Applications
Coordinated sampling: an application of strongly universal hashing.
Basic idea: small samples can reason about the similarity of huge sets.
First we consider sampling from a single set $A \subseteq U$ using a strongly universal hash function $h: U \to [m]$ and a threshold $t \in [m]$. We sample $x$ if $h(x) < t$, which by uniformity happens with probability $t/m$ for any $x$.
Let $S_{h,t}(A) = \{x \in A \mid h(x) < t\}$ denote the resulting sample from $A$. Then by linearity of expectation, $E[|S_{h,t}(A)|] = |A| \cdot t/m$. Conversely, this means that if we have $S_{h,t}(A)$, then we can estimate $|A|$ as $|S_{h,t}(A)| \cdot m/t$.
The important application is not just sampling from a single set, but rather sampling from different sets $A$ and $B$ so that we can later reason about their similarity, estimating the sizes of their union $A \cup B$ and intersection $A \cap B$. The crucial point is that when both sets are sampled with the same $h$ and $t$, membership in the sample depends only on the hash value, so $S_{h,t}(A) \cup S_{h,t}(B) = S_{h,t}(A \cup B)$ and $S_{h,t}(A) \cap S_{h,t}(B) = S_{h,t}(A \cap B)$.
We can then estimate the sizes of the union and intersection by multiplying the corresponding sample sizes by $m/t$.
Sampling from different sets can be done in a distributed way.
Application: In machine learning, we can store the samples of many different large sets; On the Internet, routers can store samples of the packets passing through.
To get a fixed sampling probability $t/m$ for each $x \in U$, we only need that $h(x)$ is uniform. This ensures that the estimate of $|A|$ is unbiased.
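The combining property of coordinated samples can be illustrated as follows; the concrete prime, sets, and threshold are my own illustrative choices. Because membership of $x$ in a sample depends only on $h(x)$, the set identities hold deterministically, not just in expectation:

```python
import random

# Illustrative coordinated-sampling sketch; the prime, sets, and threshold
# below are assumptions for the demo, not values from the notes.
random.seed(1)
p = 2**61 - 1                      # prime range; h(x) = (a*x + b) mod p
a, b = random.randrange(p), random.randrange(p)

def h(x):
    return (a * x + b) % p

def sample(S, t):
    # Coordinated sample S_{h,t}(S): keep x iff h(x) < t (probability t/p).
    return {x for x in S if h(x) < t}

A = set(range(0, 3000))
B = set(range(2000, 5000))
t = p // 10                        # sampling probability roughly 1/10

# Membership of x in the sample depends only on h(x), so samples combine:
assert sample(A, t) | sample(B, t) == sample(A | B, t)
assert sample(A, t) & sample(B, t) == sample(A & B, t)

# Unbiased estimate of |A|: |S_{h,t}(A)| * p/t.
estimate = len(sample(A, t)) * p / t
```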
For $x \in U$, let $X_x = [h(x) < t]$. Then $|S_{h,t}(A)| = \sum_{x \in A} X_x$, but the reasoning below applies when $X_x$ is any 0-1 indicator variable that depends only on $h(x)$.
Because $h$ is strongly universal, for any distinct keys $x, y \in U$, the hash values $h(x), h(y)$ are independent, and hence so are $X_x, X_y$. Therefore $X = \sum_{x \in A} X_x$ is a sum of pairwise independent 0-1 variables. Now the following concentration bound applies to $X$.
Lemma: Let $X = \sum_i X_i$ where the $X_i$ are pairwise independent 0-1 variables, and let $\mu = E[X]$. Then $\mathrm{Var}[X] \le \mu$ and, for any $d > 0$,
$$\Pr[|X - \mu| \ge d\sqrt{\mu}] \le 1/d^2.$$
Proof by Chebyshev’s inequality
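The Chebyshev step, filled in: pairwise independence makes the covariance terms vanish, and $X_i^2 = X_i$ for 0-1 variables, so

```latex
\mathrm{Var}[X] = \sum_i \mathrm{Var}[X_i]
  \le \sum_i E\big[X_i^2\big] = \sum_i E[X_i] = \mu,
\qquad
\Pr\big[\,|X - \mu| \ge d\sqrt{\mu}\,\big]
  \le \frac{\mathrm{Var}[X]}{(d\sqrt{\mu})^2} \le \frac{1}{d^2}.
```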
Multiply-mod-prime
The classic strongly universal hashing scheme is a multiply-mod-prime scheme. For some prime $p$, we pick $a, b \in [p]$ uniformly at random and define $h_{a,b}: [p] \to [p]$ by
$$h_{a,b}(x) = (ax + b) \bmod p.$$
For distinct $x, y \in [p]$ and any $q, r \in [p]$, the pairwise event $h_{a,b}(x) = q$ and $h_{a,b}(y) = r$ happens with probability $1/p^2$, since the two congruences determine $(a, b)$ uniquely.
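For a small prime this can be verified by direct enumeration ($p = 7$ and the keys are illustrative choices of mine): each value pair $(q, r)$ arises from exactly one seed $(a, b)$.

```python
from collections import Counter

# Strong universality of (a*x + b) mod p checked by enumeration over a
# small illustrative prime p = 7 and a fixed pair of distinct keys.
p = 7
x, y = 2, 5
counts = Counter(((a * x + b) % p, (a * y + b) % p)
                 for a in range(p) for b in range(p))
# Every (q, r) in [p]^2 occurs for exactly one of the p^2 seeds (a, b),
# so Pr[h(x) = q and h(y) = r] = 1/p^2.
assert all(counts[(q, r)] == 1 for q in range(p) for r in range(p))
```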
Multiply-shift
For any bit-string $z$ and integers $j > i \ge 0$, $z[i, j)$ denotes the number represented by bits $i, \dots, j-1$ of $z$, so
$$z[i, j) = \lfloor (z \bmod 2^j) / 2^i \rfloor.$$
To get strongly universal hashing from $[2^w]$ to $[2^l]$, we may pick any $\bar w \ge w + l - 1$ and uniformly random $a, b \in [2^{\bar w}]$. For any pair $(a, b)$, we define $h_{a,b}: [2^w] \to [2^l]$ by
$$h_{a,b}(x) = ((ax + b) \bmod 2^{\bar w})[\bar w - l, \bar w).$$
It is strongly universal, proof omitted.
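A sketch with toy parameters ($w = 4$, $l = 3$, $\bar w = w + l - 1 = 6$, and the keys, all my own illustrative choices): enumerating every seed shows each pair of hash values occurring equally often, which is exactly the strong universality that the omitted proof establishes.

```python
from collections import Counter

# Toy parameters (my own choices): hash 4-bit keys to 3-bit values with
# wbar = w + l - 1 = 6, the smallest seed length allowed by the scheme.
w, l = 4, 3
wbar = w + l - 1
M = 1 << wbar

def ms_strong(a, b, x):
    # h_{a,b}(x) = ((a*x + b) mod 2^wbar)[wbar-l, wbar)
    return ((a * x + b) % M) >> (wbar - l)

x, y = 3, 13                      # distinct w-bit keys
pairs = Counter((ms_strong(a, b, x), ms_strong(a, b, y))
                for a in range(M) for b in range(M))
# Strong universality: every (q, r) in [2^l]^2 arises from the same number
# of seeds (a, b), i.e., with probability exactly 1/2^(2l).
assert set(pairs.values()) == {M * M // (1 << (2 * l))}
```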
Beyond Strong Universality
There are more advanced algorithmic applications that need more powerful hash functions.
k-independent hash functions: A random hash function $h$ is k-independent if for any $k$ distinct keys $x_1, \dots, x_k$, the hash values $h(x_1), \dots, h(x_k)$ are independent random variables, each uniformly distributed in $[m]$.
From a prime $p$, we can implement a k-independent hash function $h: [p] \to [p]$ using $k$ uniformly random coefficients $a_0, \dots, a_{k-1} \in [p]$, defining
$$h(x) = \Big(\sum_{i=0}^{k-1} a_i x^i\Big) \bmod p.$$
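A polynomial-hashing sketch (the small prime $p = 5$ and $k = 2$ are illustrative choices of mine): for any $k$ distinct evaluation points, the map from coefficient vectors to value vectors is a bijection by polynomial interpolation, which is what gives exact k-independence.

```python
from itertools import product
from collections import Counter

# Degree-(k-1) polynomial hashing over a small illustrative prime p = 5, k = 2.
def poly_hash(coeffs, x, p):
    # h(x) = (a_0 + a_1*x + ... + a_{k-1}*x^(k-1)) mod p
    return sum(a * pow(x, i, p) for i, a in enumerate(coeffs)) % p

p, k = 5, 2
x, y = 1, 4                       # distinct keys in [p]
counts = Counter((poly_hash(c, x, p), poly_hash(c, y, p))
                 for c in product(range(p), repeat=k))
# Interpolation: each value vector (q, r) comes from exactly one coefficient
# vector, so the hash values at k distinct points are independent and uniform.
assert set(counts.values()) == {1}
```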
However, there is no efficient implementation of k-independent hashing on complex objects such as variable-length strings. What we can do instead is to use the signature idea from Section 2: first hash the complex keys to unique hash values from a limited integer domain.