Lecture notes adapted from UCB CS270

Streaming Algorithms: Frequent Items

Streaming setting: Data stream $x_1, x_2, \cdots, x_n$ with $x_i \in [m]$. The available memory is $O(\log^c n)$, i.e., sub-linear space.

Algorithm: approximates frequencies for the top $k$ items.

Deterministic Algorithm

To estimate item frequencies $f_j$ within an additive error of $n/k$ using $O(k \log n)$ memory. (Here $\log n$ refers to the number of bits per counter.)

  • Maintain a set $S$ of $k$ counters, initialized to $0$. For each element $x_i$ in the stream:
    • If $x_i \in S$, increment the counter of $x_i$.
    • If $x_i \notin S$, add $x_i$ to $S$ if space is available, else decrement all counters in $S$.

An item in $S$ whose count falls to $0$ can be removed. The space requirement for storing $k$ counters is $O(k \log n)$ and the update time per item is $O(k)$. The algorithm estimates the count of an item as the value of its counter, or zero if it has no counter.
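The procedure above (known as the Misra-Gries algorithm) can be sketched in Python. This is a minimal illustration, not from the notes: the stream is assumed to be any iterable of hashable items, and the function name and dict-based counter store are choices made here.

```python
def misra_gries(stream, k):
    """Maintain at most k counters; the estimate for each item is within
    an additive n/k of its true frequency (n = stream length)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1          # item already tracked: increment
        elif len(counters) < k:
            counters[x] = 1           # free slot: start tracking x
        else:
            # no slot: decrement every counter, drop those that hit 0
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters                   # estimate for j is counters.get(j, 0)


# example: n = 10 items, k = 2 counters, so additive error at most n/k = 5
est = misra_gries(['a'] * 5 + ['b'] * 3 + ['c', 'd'], k=2)
```

On this stream the returned counters are `{'a': 3, 'b': 1}`: each estimate undercounts its item's true frequency (5 and 3) by the two decrement rounds triggered by `'c'` and `'d'`, consistent with the $f_j - n/k \le n_j \le f_j$ guarantee shown below.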

🤔 Claim 1: The frequency estimate $n_j$ produced by the algorithm satisfies $f_j - n/k \le n_j \le f_j$.

Clearly, $n_j$ is at most the true frequency $f_j$. Differences between $f_j$ and the estimate are caused by one of two scenarios:

  1. The item $j \notin S$ and each counter in $S$ gets decremented; this is the case when $j$ occurs in the stream but the counter for $j$ is not incremented.
  2. The counter for $j$ gets decremented due to an element $j' \notin S$.

Both scenarios result in $k$ counters getting decremented, hence they can occur at most $n/k$ times, showing that $n_j \ge f_j - n/k$.

Before any counter can be decremented, at least $k$ increments must have occurred. Each decrement operation reduces the sum of all counters by $k$, and the sum of all counters increases by at most $n$ in total, so there can be at most $n/k$ decrement operations. Hence a single counter is also decremented at most $n/k$ times.

That is, the bound is tight when all items occur with the same frequency $n/k$.

Count Min Sketch

The turnstile model allows both additions and deletions of items in the stream.

The stream consists of pairs $(i, c_i)$, where $i \in [m]$ is an item and $c_i$ is the number of copies to be added or deleted. The count of an item cannot be negative at any stage; the frequency $f_j$ of item $j$ is the sum of the updates $c_i$ over all pairs in the stream with item $j$.

The following algorithm estimates the frequencies of all items up to an additive error of $\epsilon |f|_1$ with probability $1 - \delta$, where the $\ell_1$ norm $|f|_1 = \sum_j f_j$ is the total number of items present in the data set. The two parameters $k$ and $t$ in the algorithm are chosen to be $\frac{2}{\epsilon}$ and $\log(1/\delta)$.

Space usage: $O(\frac{1}{\epsilon} \log \frac{1}{\delta} \log n)$

The algorithm works as follows:

  1. Maintain $t$ arrays $A[i]$, each having $k$ counters. A hash function $h_i : U \to [k]$ drawn from a $2$-wise independent family $\mathcal{H}$ is associated with array $A[i]$.

  2. For element $(j, c_j)$ in the stream, update counters as follows:

    $A[i, h_i(j)] \gets A[i, h_i(j)] + c_j \quad \forall i \in [t]$

  3. The frequency estimate for item $j$ is $\min_{i \in [t]} A[i, h_i(j)]$.

The output estimate is always at least the true value $f_j$, as the counts of all items in the stream are non-negative.
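The three steps can be sketched in Python. This is an illustrative implementation, with assumptions beyond the notes: the universe is taken to be integers, and each $h_i$ is drawn from the standard $((a x + b) \bmod p) \bmod k$ family with $p$ a large prime.

```python
import random


class CountMin:
    """Count-min sketch: t rows of k counters, one 2-wise independent
    hash function per row (the (a*x + b) mod p family, a demo assumption)."""
    P = 2_147_483_647  # Mersenne prime 2^31 - 1, assumed larger than the universe

    def __init__(self, k, t, seed=0):
        rng = random.Random(seed)
        self.k, self.t = k, t
        self.A = [[0] * k for _ in range(t)]        # t arrays of k counters
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(t)]

    def _h(self, i, j):
        a, b = self.hashes[i]
        return ((a * j + b) % self.P) % self.k

    def update(self, j, c=1):
        # add c copies of item j (c may be negative in the turnstile model)
        for i in range(self.t):
            self.A[i][self._h(i, j)] += c

    def estimate(self, j):
        # minimum over rows: never below the true count, excess only from collisions
        return min(self.A[i][self._h(i, j)] for i in range(self.t))


cm = CountMin(k=200, t=5)
for j in range(100):
    cm.update(j, j)   # item j appears j times
```

Because all counts are non-negative, `estimate` can only overcount; the minimum over the $t$ rows keeps the expected excess in each query small.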

Analysis

To bound the error in the estimate for $f_j$ we need to analyze the excess $X$, where $A[1, h_1(j)] = f_j + X$. The excess $X$ can be written as a sum of random variables $X = \sum_{i \ne j} Y_i$, where $Y_i = f_i$ if $h_1(j) = h_1(i)$ and $0$ otherwise. As $h_1 \in \mathcal{H}$ is chosen uniformly at random from a pairwise independent hash function family, $\mathbb{E}[Y_i] = f_i / k$.

$$\mathbb{E}[X] \le \frac{|f|_1}{k} = \frac{\epsilon |f|_1}{2}.$$

$$\mathbb{E}[X] = \mathbb{E}_{h_1}\Big[\sum_{i \ne j} \mathbf{1}_{h_1(i) = h_1(j)} \, f_i\Big] = \sum_{i \ne j} \Pr[h_1(i) = h_1(j)] \, f_i = \sum_{i \ne j} \frac{f_i}{k} \le \frac{\sum_i f_i}{k} = \frac{|f|_1}{k}$$

Applying Markov’s inequality, we have

$$\Pr[X > \epsilon |f|_1] \le \frac{1}{2}.$$

The probability that all the excesses at $A[i, h_i(j)]$ are greater than $\epsilon |f|_1$ is at most $\frac{1}{2^t} \le \delta$, as $t$ was chosen to be $\log(1/\delta)$. Hence the algorithm estimates the frequency of item $j$ up to an additive error of $\epsilon |f|_1$ with probability $1 - \delta$.

The memory required for the algorithm is the sum of the space for the arrays and the hash functions, $O(kt \log n + t \log m) = O(\frac{1}{\epsilon} \log \frac{1}{\delta} \log n)$. The update time per item in the stream is $O(\log \frac{1}{\delta})$.

Count Sketch

Another sketch algorithm, with error in terms of the $\ell_2$ norm $|f|_2 = \sqrt{\sum_j f_j^2}$.

The relation between the $\ell_1$ and $\ell_2$ norms is $\frac{1}{\sqrt{m}} |f|_1 \le |f|_2 \le |f|_1$. The $\ell_2$ norm is at most the $\ell_1$ norm, so the guarantee for this algorithm is better than that for the previous one.
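A quick numeric check of this norm relation, using a small hypothetical frequency vector (the vector chosen here is purely illustrative):

```python
import math

# hypothetical frequency vector over m = 4 items
f = [5, 3, 1, 1]
m = len(f)
l1 = sum(f)                            # |f|_1 = 10
l2 = math.sqrt(sum(x * x for x in f))  # |f|_2 = sqrt(25 + 9 + 1 + 1) = 6.0

# the relation (1/sqrt(m)) * |f|_1 <= |f|_2 <= |f|_1 holds: 5.0 <= 6.0 <= 10
```

The two norms coincide exactly when a single item carries all the mass, and the $\sqrt{m}$ gap is attained when all $m$ items occur equally often.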

The algorithm works as follows:

  1. Maintain $t$ arrays $A[i]$, each having $k$ counters. Hash functions $g_i : U \to \{-1, +1\}$ and $h_i : U \to [k]$, drawn uniformly at random from pairwise independent families, are associated with array $A[i]$.

  2. For element $(j, c_j)$ in the stream, update counters as follows:

    $A[i, h_i(j)] \gets A[i, h_i(j)] + g_i(j) \, c_j \quad \forall i \in [t]$

  3. The frequency estimate for item $j$ is the median over the $t$ arrays of $g_i(j) \, A[i, h_i(j)]$.
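A Python sketch of these steps, under the same illustrative assumptions as before: integer universe, and both $h_i$ and the sign hash $g_i$ derived from the $((a x + b) \bmod p)$ family (using the parity of the hash value for the $\pm 1$ sign is a demo choice, not from the notes).

```python
import random
import statistics


class CountSketch:
    """Count sketch: t rows of k counters; each row has a bucket hash
    h_i : U -> [k] and a sign hash g_i : U -> {-1, +1}."""
    P = 2_147_483_647  # prime assumed larger than the universe

    def __init__(self, k, t, seed=0):
        rng = random.Random(seed)
        self.k, self.t = k, t
        self.A = [[0] * k for _ in range(t)]
        self.hs = [(rng.randrange(1, self.P), rng.randrange(self.P))
                   for _ in range(t)]
        self.gs = [(rng.randrange(1, self.P), rng.randrange(self.P))
                   for _ in range(t)]

    def _h(self, i, j):
        a, b = self.hs[i]
        return ((a * j + b) % self.P) % self.k

    def _g(self, i, j):
        a, b = self.gs[i]
        return 1 if ((a * j + b) % self.P) % 2 == 0 else -1

    def update(self, j, c=1):
        # add c copies of item j with the row's random sign
        for i in range(self.t):
            self.A[i][self._h(i, j)] += self._g(i, j) * c

    def estimate(self, j):
        # median over rows of the sign-corrected counter
        return statistics.median(self._g(i, j) * self.A[i][self._h(i, j)]
                                 for i in range(self.t))


cs = CountSketch(k=1000, t=5)
for j in range(20):
    cs.update(j, 10)   # each of 20 items appears 10 times
```

Unlike count-min, colliding items cancel in expectation because of the random signs, which is why the estimate is unbiased and the error is controlled by $|f|_2$ rather than $|f|_1$.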

Analysis

The entry $A[1, h_1(j)] = g_1(j) f_j + X$. We examine the contribution $X$ from the other items by writing $X = \sum_{i \ne j} Y_i$, where $Y_i = \pm f_i$ if $h_1(i) = h_1(j)$ and $0$ otherwise. Note that $\mathbb{E}[Y_i] = 0$, so the expected value of $g_1(j) A[1, h_1(j)]$ is $f_j$.

$$\hat{f_j} = g_1(j) \cdot A[1, h_1(j)] = g_1(j) \cdot \Big(g_1(j) f_j + \sum_{i \ne j,\, h_1(i) = h_1(j)} g_1(i) f_i\Big) = f_j + \sum_{i \ne j,\, h_1(i) = h_1(j)} g_1(j) g_1(i) f_i = f_j + X$$

$$\mathbb{E}[X] = \mathbb{E}\Big[\sum_{i \ne j} \mathbf{1}_{h_1(i) = h_1(j)} \, g_1(j) g_1(i) f_i\Big] = \sum_{i \ne j} \mathbb{E}[\mathbf{1}_{h_1(i) = h_1(j)}] \, \mathbb{E}[g_1(j)] \, \mathbb{E}[g_1(i)] \, f_i = 0$$

The random variables $Y_i$ are pairwise independent as $h_1$ is a pairwise independent hash function, so the variance of $X$ is:

$$\mathrm{Var}(X) = \sum_{i \in [m]} \mathrm{Var}(Y_i) = \sum_{i \in [m]} \frac{f_i^2}{k} = \frac{|f|_2^2}{k}.$$

By Chebyshev’s inequality,

$$\Pr[|X - \mu| > \Delta] \le \frac{\mathrm{Var}(X)}{\Delta^2}.$$

The mean μ=0\mu=0 and the variance is f22k|f|_2^2\over k, choosing δ=ϵf2\delta=\epsilon|f|_2 and k=4/ϵ2k=4/\epsilon^2, we have

$$\Pr[|X - \mu| > \epsilon |f|_2] \le \frac{1}{\epsilon^2 k} \le \frac{1}{4}.$$

For $t = \Theta(\log(1/\delta))$, the probability that the median value deviates from $\mu$ by more than $\epsilon |f|_2$ is less than $\delta$ by a Chernoff bound. That is, the probability that there are fewer than $t/2$ successes in a series of $t$ tosses of a coin with success probability $\frac{3}{4}$ is smaller than $\delta$ for $t = O(\log \frac{1}{\delta})$.
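This amplification step can be checked empirically. The simulation below (a hypothetical check, not part of the notes; seeded for reproducibility) draws $t$ independent coins with success probability $3/4$ and measures how often fewer than $t/2$ succeed, i.e., how often the median would fail; the failure rate drops quickly as $t$ grows.

```python
import random


def median_failure_rate(t, trials=100_000, seed=1):
    """Fraction of trials in which fewer than t/2 of t independent
    success-probability-3/4 estimators succeed (so the median fails)."""
    rng = random.Random(seed)
    fails = 0
    for _ in range(trials):
        successes = sum(rng.random() < 0.75 for _ in range(t))
        fails += successes < t / 2
    return fails / trials
```

For example, the exact failure probability at $t = 5$ is about $0.10$, while at $t = 21$ it falls below $0.01$, matching the exponential decay in $t$ promised by the Chernoff bound.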

Arguing as in the count-min sketch, the space required is $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta} \log n)$ and the update time per item is $O(\log \frac{1}{\delta})$.

Remarks

  • The count sketch approximates $f_j$ within $\epsilon |f|_2$ but requires $\widetilde{O}(\frac{1}{\epsilon^2})$ space,

  • while the count-min sketch approximates $f_j$ within $\epsilon |f|_1$ and requires $\widetilde{O}(\frac{1}{\epsilon})$ space.

  • The approximation provided by the sketch algorithm is meaningful only for items that occur with high frequency.