Streaming setting: a data stream $x_1, x_2, \dots, x_n$ with $x_i \in [m]$. The available memory is $O(\log^c n)$, i.e. sub-linear space.
Algorithm: Approximates frequencies for the top k items.
Deterministic Algorithm
The goal is to estimate item frequencies $f_j$ within an additive error of $n/k$ using $O(k \log n)$ memory. (Here $\log n$ refers to the number of bits of memory per counter.)
Maintain a set $S$ of $k$ counters, initialized to 0. For each element $x_i$ in the stream:
If $x_i \in S$, increment the counter of $x_i$.
If $x_i \notin S$, add $x_i$ to $S$ if space is available; otherwise decrement all counters in $S$.
An item in $S$ whose count falls to 0 can be removed. The space required for storing $k$ counters is $O(k \log n)$ and the update time per item is $O(k)$. The algorithm estimates the count of an item as the value of its counter, or zero if it has no counter.
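Below is a minimal Python sketch of this deterministic algorithm (the Misra–Gries summary); the class name and interface are illustrative rather than taken from the notes.

```python
class MisraGries:
    """Deterministic frequency estimation with k counters (Misra-Gries summary)."""

    def __init__(self, k):
        self.k = k            # number of counters; additive error is at most n/k
        self.counters = {}    # item -> counter value, at most k entries

    def update(self, x):
        if x in self.counters:
            self.counters[x] += 1          # x already has a counter: increment it
        elif len(self.counters) < self.k:
            self.counters[x] = 1           # free slot: start a counter for x
        else:
            # no free slot: decrement every counter, drop those that reach 0
            for item in list(self.counters):
                self.counters[item] -= 1
                if self.counters[item] == 0:
                    del self.counters[item]

    def estimate(self, x):
        # the counter value, or 0 if x currently has no counter
        return self.counters.get(x, 0)
```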
🤔Claim 1: The frequency estimate $n_j$ produced by the algorithm satisfies $f_j - n/k \le n_j \le f_j$.
Clearly, $n_j$ is at most the true frequency $f_j$. The difference between $f_j$ and the estimate is caused by one of two scenarios:
The item $j \notin S$ and $S$ is full, so each counter in $S$ gets decremented; this is the case where $j$ occurs in the stream but the counter for $j$ is not incremented.
The counter for $j$ gets decremented due to the arrival of an element $j'$ that is not contained in $S$ (while $S$ is full).
Both scenarios correspond to a decrement step in which all $k$ counters are decremented, hence they can occur at most $n/k$ times in total, showing that $n_j \ge f_j - n/k$.
Before any decrement can happen, at least $k$ increments must have been performed. Each decrement operation reduces the sum of all counters by $k$, and the stream contributes at most $n$ increments in total, so the sum of all counters can be decremented at most $n/k$ times; therefore any single counter is also decremented at most $n/k$ times.
In other words, equality is attained when all items occur with the same frequency (each exactly $n/k$).
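As a quick sanity check of Claim 1 (reusing the illustrative MisraGries class above) on a skewed stream:

```python
import random
from collections import Counter

# a skewed stream: one heavy item (0) plus 400 items drawn from 1..49
stream = [0] * 600 + [random.randrange(1, 50) for _ in range(400)]
random.shuffle(stream)
n, k = len(stream), 10

mg = MisraGries(k)
for x in stream:
    mg.update(x)

true_counts = Counter(stream)
for item, f in true_counts.items():
    est = mg.estimate(item)
    assert f - n / k <= est <= f   # Claim 1: estimates are within n/k of the truth
print(mg.estimate(0))              # at least 600 - 1000/10 = 500
```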
Count Min Sketch
The turnstile model allows both additions and deletions of items in the stream.
The stream consists of pairs $(j_i, c_i)$, where $j_i \in [m]$ is an item and $c_i$ is the number of copies to be added or deleted. The count of an item cannot be negative at any stage, and the frequency $f_j$ of item $j$ is $f_j = \sum_{i : j_i = j} c_i$.
The following algorithm estimates the frequency of every item up to an additive error of $\epsilon |f|_1$ with probability $1 - \delta$, where the $\ell_1$ norm $|f|_1 = \sum_j f_j$ is the total number of items present in the data set. The two parameters $k$ and $t$ in the algorithm are chosen to be $k = 2/\epsilon$ and $t = \log(1/\delta)$.
Space usage: $O\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\log n\right)$.
The algorithm works as follows:
Maintain $t$ arrays $A[i]$, each having $k$ counters. A hash function $h_i : U \to [k]$, drawn from a 2-wise independent family $H$, is associated with array $A[i]$.
For element $(j, c_j)$ in the stream, update the counters as follows:
$$A[i, h_i(j)] \leftarrow A[i, h_i(j)] + c_j \quad \forall i \in [t]$$
The frequency estimate for item $j$ is $\min_{i \in [t]} A[i, h_i(j)]$.
The output estimate is always at least the true value $f_j$, as the counts of all items in the stream are non-negative.
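A minimal Python sketch of the count min sketch as described; items are assumed to be non-negative integers (as in $[m]$), and the modular hash is a standard 2-universal family used as a stand-in for the family $H$.

```python
import random

class CountMinSketch:
    """Count-min sketch: t arrays of k counters, one hash function per array."""

    PRIME = (1 << 61) - 1  # large prime for the ((a*x + b) mod p) mod k hash family

    def __init__(self, k, t):
        self.k, self.t = k, t
        self.A = [[0] * k for _ in range(t)]
        # one (a, b) pair per row, giving h_i(x) = ((a*x + b) mod p) mod k
        self.hashes = [(random.randrange(1, self.PRIME), random.randrange(self.PRIME))
                       for _ in range(t)]

    def _h(self, i, x):
        a, b = self.hashes[i]
        return ((a * x + b) % self.PRIME) % self.k

    def update(self, j, c):
        # process a stream pair (j, c): add c to one counter in every row
        for i in range(self.t):
            self.A[i][self._h(i, j)] += c

    def estimate(self, j):
        # smallest counter over the t rows; never underestimates f_j
        return min(self.A[i][self._h(i, j)] for i in range(self.t))
```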
Analysis
To bound the error in the estimate for $f_j$ we need to analyze the excess $X$, where $A[1, h_1(j)] = f_j + X$. The excess $X$ can be written as a sum of random variables $X = \sum_{i \neq j} Y_i$, where $Y_i = f_i$ if $h_1(i) = h_1(j)$ and $0$ otherwise. As $h_1 \in H$ is chosen uniformly at random from a pair-wise independent hash function family, $E[Y_i] = f_i/k$, so $E[X] \le |f|_1/k$. By Markov's inequality, $\Pr[X > \epsilon |f|_1] \le \frac{E[X]}{\epsilon |f|_1} \le \frac{1}{\epsilon k} = \frac{1}{2}$ for $k = 2/\epsilon$, and the same bound holds independently for each of the $t$ arrays.
The probability that all the excesses at $A[i, h_i(j)]$ exceed $\epsilon |f|_1$ is therefore at most $\frac{1}{2^t} \le \delta$, as $t$ was chosen to be $\log(1/\delta)$. The algorithm thus estimates the frequency of item $j$ up to an additive error of $\epsilon |f|_1$ with probability $1 - \delta$.
The memory required for the algorithm is the sum of the space for the arrays and the hash functions, $O(kt \log n + t \log m) = O\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\log n\right)$. The update time per item in the stream is $O(\log\frac{1}{\delta})$.
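For instance, with the parameter choices from the analysis, $k = \lceil 2/\epsilon \rceil$ and $t = \lceil \log_2(1/\delta) \rceil$ (reusing the illustrative CountMinSketch class above):

```python
import math

eps, delta = 0.01, 0.01
cms = CountMinSketch(k=math.ceil(2 / eps), t=math.ceil(math.log2(1 / delta)))
for j, c in [(3, 5), (7, 2), (3, 1), (7, -1)]:   # turnstile updates (item, count)
    cms.update(j, c)
print(cms.estimate(3))  # at least the true frequency 6, at most 6 + eps*|f|_1 w.p. >= 1 - delta
```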
Count Sketch
Another sketch algorithm, with error in terms of the $\ell_2$ norm $|f|_2 = \sqrt{\sum_j f_j^2}$.
The relation between the $\ell_1$ and $\ell_2$ norms is $\frac{1}{\sqrt{n}}|f|_1 \le |f|_2 \le |f|_1$; since the $\ell_2$ norm is at most the $\ell_1$ norm, the guarantee for this algorithm is better than that for the previous one.
The algorithm works as follows:
Maintain $t$ arrays $A[i]$, each having $k$ counters. Hash functions $g_i : U \to \{-1, 1\}$ and $h_i : U \to [k]$, drawn uniformly at random from pair-wise independent families, are associated with array $A[i]$.
For element $(j, c_j)$ in the stream, update the counters as follows:
$$A[i, h_i(j)] \leftarrow A[i, h_i(j)] + g_i(j)\, c_j \quad \forall i \in [t]$$
The frequency estimate for item $j$ is the median over the $t$ arrays of $g_i(j)\, A[i, h_i(j)]$.
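Below is a minimal Python sketch of the count sketch just described, mirroring the count-min code above; the parity-based sign hash $g_i$ is a simple stand-in for a pair-wise independent $\pm 1$ family, and all names are illustrative.

```python
import random

class CountSketch:
    """Count sketch: t arrays of k counters with sign hashes g_i and bucket hashes h_i."""

    PRIME = (1 << 61) - 1

    def __init__(self, k, t):
        self.k, self.t = k, t
        self.A = [[0] * k for _ in range(t)]
        # independent (a, b) pairs for the bucket hash h_i and the sign hash g_i of each row
        self.h_params = [(random.randrange(1, self.PRIME), random.randrange(self.PRIME))
                         for _ in range(t)]
        self.g_params = [(random.randrange(1, self.PRIME), random.randrange(self.PRIME))
                         for _ in range(t)]

    def _h(self, i, x):
        a, b = self.h_params[i]
        return ((a * x + b) % self.PRIME) % self.k

    def _g(self, i, x):
        a, b = self.g_params[i]
        return 1 if ((a * x + b) % self.PRIME) % 2 == 0 else -1

    def update(self, j, c):
        # add the signed count g_i(j) * c to one counter in every row
        for i in range(self.t):
            self.A[i][self._h(i, j)] += self._g(i, j) * c

    def estimate(self, j):
        # median over the t rows of g_i(j) * A[i, h_i(j)]
        vals = sorted(self._g(i, j) * self.A[i][self._h(i, j)] for i in range(self.t))
        return vals[len(vals) // 2]
```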
Analysis
The entry $A[1, h_1(j)] = g_1(j) f_j + X$. We examine the contribution $X$ from the other items by writing $X = \sum_{i \neq j} Y_i$, where $Y_i = \pm f_i$ if $h_1(i) = h_1(j)$ and $0$ otherwise. Note that $E[Y_i] = 0$, so the expected value of $g_1(j)\, A[1, h_1(j)]$ is $f_j$.
The random variables $Y_i$ are pairwise independent as $h_1$ is drawn from a pair-wise independent hash family, so the variance of $X$ is:
$$\mathrm{Var}(X) = \sum_{i \in [m]} \mathrm{Var}(Y_i) = \sum_{i \in [m]} \frac{f_i^2}{k} = \frac{|f|_2^2}{k}.$$
By Chebyshev’s inequality,
$$\Pr[|X - \mu| > \Delta] \le \frac{\mathrm{Var}(X)}{\Delta^2}.$$
The mean $\mu = 0$ and the variance is $\frac{|f|_2^2}{k}$; choosing $\Delta = \epsilon |f|_2$ and $k = 4/\epsilon^2$, we have
$$\Pr[|X - \mu| > \epsilon |f|_2] \le \frac{1}{\epsilon^2 k} \le \frac{1}{4}.$$
For $t = \Theta(\log(1/\delta))$, the probability that the median value deviates from $\mu$ by more than $\epsilon |f|_2$ is less than $\delta$ by a Chernoff bound. That is, the probability that there are fewer than $t/2$ successes in a series of $t$ tosses of a coin with success probability $3/4$ is smaller than $\delta$ for $t = O(\log\frac{1}{\delta})$.
Arguing as in the count min sketch, the space required is $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\log n\right)$ and the update time per item is $O(\log\frac{1}{\delta})$.
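A quick usage sketch with the parameters from this analysis, $k = \lceil 4/\epsilon^2 \rceil$ and $t = \Theta(\log(1/\delta))$ (reusing the illustrative CountSketch class above; the constant factor in $t$ is an arbitrary choice):

```python
import math

eps, delta = 0.1, 0.01
cs = CountSketch(k=math.ceil(4 / eps**2), t=4 * math.ceil(math.log2(1 / delta)))
for j, c in [(3, 5), (7, 2), (3, 1)]:    # turnstile updates (item, count)
    cs.update(j, c)
print(cs.estimate(3))  # within eps * |f|_2 of the true frequency 6 w.p. >= 1 - delta
```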
Remarks
The count sketch approximates $f_j$ within $\epsilon |f|_2$ but requires $O(\frac{1}{\epsilon^2})$ space, while the count min sketch approximates $f_j$ within $\epsilon |f|_1$ and requires only $O(\frac{1}{\epsilon})$.
The approximation provided by the sketch algorithm is meaningful only for items that occur with high frequency.