Lecture notes by Kasper.

Review of Count Sketch

This lecture extends the results for the count-min sketch algorithm (Lecture 5).

We work in the strict turnstile model, where we maintain estimates of some underlying frequency vector $f$. The vector $f$ has coordinates indexed by integers in $[U]$ for some universe size $U$.

We receive updates one at a time.

An update is a pair $(i,\Delta)$ and has the effect of updating $f_i\gets f_i+\Delta$, where $i\in[U]$ and $\Delta\in\{-M,\dots,M\}$.

We assume that $M$ is bounded by a polynomial in $n$, i.e., $M=n^{O(1)}$, so that a weight change fits in $O(\log n)$ bits.

We are also promised that at any given time, $f_i\ge0$ for all $i$.

In the count sketch lecture, we learned how to use the count-min sketch to support frequency estimation queries; i.e., we have a data structure using $O(\frac{1}{\epsilon}\log\frac{1}{\delta})$ space that, upon receiving an index $i\in[U]$ as a query, can return a value $\hat{f_i}$ that with probability at least $1-\delta$ satisfies $f_i\le\hat{f_i}\le f_i+\epsilon\|f\|_1$, where $\|f\|_1=\sum_i|f_i|$.

Update time of the count-min sketch: $O(\log\frac{1}{\delta})$.

Query time of the count-min sketch: $O(\log\frac{1}{\delta})$.
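
For concreteness, here is a minimal Python sketch of such a count-min structure. This is an illustrative implementation under standard assumptions (a 2-wise independent hash family built from a Mersenne prime, table dimensions $w=\lceil 2/\epsilon\rceil$ and $d=\lceil\log_2(1/\delta)\rceil$), not the lecture's reference code.

```python
import math
import random

class CountMin:
    """Minimal count-min sketch in the strict turnstile model.

    Uses w = ceil(2/eps) columns (controls the eps*||f||_1 error) and
    d = ceil(log2(1/delta)) rows (controls the failure probability delta).
    Estimates never underestimate when all frequencies are non-negative.
    """

    def __init__(self, eps, delta):
        self.w = max(1, math.ceil(2 / eps))
        self.d = max(1, math.ceil(math.log2(1 / delta)))
        self.table = [[0] * self.w for _ in range(self.d)]
        # One 2-wise independent hash h(x) = ((a*x + b) mod p) mod w per row.
        self.p = (1 << 61) - 1  # Mersenne prime, assumed larger than the universe
        self.hashes = [(random.randrange(1, self.p), random.randrange(self.p))
                       for _ in range(self.d)]

    def _col(self, row, i):
        a, b = self.hashes[row]
        return ((a * i + b) % self.p) % self.w

    def update(self, i, delta):
        """O(log(1/delta)) time: add delta to one counter per row."""
        for r in range(self.d):
            self.table[r][self._col(r, i)] += delta

    def query(self, i):
        """O(log(1/delta)) time: returns f_hat with
        f_i <= f_hat <= f_i + eps*||f||_1 with probability >= 1 - delta."""
        return min(self.table[r][self._col(r, i)] for r in range(self.d))
```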

Problem Formulation

In many applications, we care about finding heavy hitters rather than estimating frequencies.

An $\epsilon$-heavy hitter is an index $i$ such that $f_i\ge\epsilon\|f\|_1$.

We still receive frequency updates. However, there is just one query. To answer a query, we must produce a list $\mathcal L$ satisfying the following:

  • All $\epsilon$-heavy hitters are in $\mathcal L$, i.e., all indices $i$ with $f_i\ge\epsilon\|f\|_1$ are in $\mathcal L$.
  • No index $i$ with $f_i<\frac{\epsilon}{2}\|f\|_1$ is in $\mathcal L$.

The algorithm is allowed to fail with probability $\delta$.

We have no requirements on indices $i$ such that $\frac{\epsilon}{2}\|f\|_1\le f_i<\epsilon\|f\|_1$.
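
To make the requirements concrete, the hypothetical checker below validates a candidate output list against an exact frequency vector; indices in the grey zone between $\frac{\epsilon}{2}\|f\|_1$ and $\epsilon\|f\|_1$ are accepted either way.

```python
def is_valid_output(f, L, eps):
    """Check the two heavy hitter requirements against an exact frequency vector f."""
    norm1 = sum(abs(x) for x in f)
    heavy = {i for i, fi in enumerate(f) if fi >= eps * norm1}
    too_light = {i for i, fi in enumerate(f) if fi < (eps / 2) * norm1}
    # All eps-heavy hitters must appear; no index below (eps/2)*||f||_1 may appear.
    # Indices in between may or may not appear.
    return heavy <= set(L) and not (too_light & set(L))
```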

Solution of Cormode and Muthukrishnan

For simplicity, assume $U$ is a power of $2$.

We build a complete binary tree $\mathcal T$ on top of $[U]$ (one leaf per index $i\in[U]$). Such a tree has $\log U+1$ levels; we denote the level of the leaves by $0$ and the root's level by $\log U$. For a node $v\in\mathcal T$, let $g_v$ denote the sum of the $f_j$'s over all $j$ corresponding to leaves in the subtree rooted at $v$.

We think of the list of values $g_v$ for the $U/2^i$ nodes at level $i$ as forming another frequency vector on a universe of size $U/2^i$. We use $f^i$ to denote this vector of frequencies for the $i$-th level of the tree.

Clearly $f_j^i=\sum_{h=0}^{2^i-1}f_{j2^i+h}$; the frequency of a node is also just the sum of the frequencies of its two children.
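
As a sanity check on this dyadic decomposition, here is a hypothetical helper computing all level vectors $f^i$ directly from $f$; the actual data structure never materializes these vectors, it only sketches them.

```python
def level_vectors(f):
    """Return [f^0, f^1, ..., f^{log U}], where f^i[j] is the total frequency
    below the j-th node at level i. Assumes len(f) is a power of two."""
    levels = [list(f)]  # f^0 is f itself
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # A node's frequency is the sum of its two children's frequencies.
        levels.append([prev[2 * j] + prev[2 * j + 1] for j in range(len(prev) // 2)])
    return levels
```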

We focus on the case where we are aiming at an error probability $\delta$ of $1/U$, i.e., polynomially small in the universe size. The data structure is as follows:

  • For every level $i$ of $\mathcal T$, we store a count-min sketch of $f^i$ with error probability $\delta'=1/U^2$ and approximation $\epsilon'=\epsilon/2$. On an update $(j,\Delta)$, we conceptually add $\Delta$ to the frequency of all nodes on the root-to-leaf path leading to the leaf corresponding to $j$.

    This can be done by adding $\Delta$ to $f^i_{\lfloor j/2^i\rfloor}$ for $i=0,\dots,\log U$. Thus an update makes $O(\log U)$ updates to count-min data structures, each with error probability $1/U^2$, so the total update time is $O(\log^2 U)$. Similarly, we get a space usage of $O(\frac{1}{\epsilon}\log^2 U)$.

  • The query algorithm: Start at the root node $r$. When visiting a node $v$ at level $i$, if $i=0$ ($v$ is a leaf), add $v$ to the output list. Otherwise, $v$ is an internal node at some level $i>0$. Query the count-min sketch on the vector $f^{i-1}$ to obtain estimates of the frequencies stored at the two children of $v$. Recurse into those children for which the estimate is at least $\epsilon\|f\|_1$. (A code sketch of the whole structure appears at the end of this section.)

    We now argue that the query algorithm works. First, define a node $v$ in $\mathcal T$ to be heavy if its frequency is at least $(\epsilon/2)\|f\|_1$; otherwise call it light. Since we are in the strict turnstile model and all frequencies in $f$ are non-negative, it follows that $\|f\|_1=\|f^0\|_1=\cdots=\|f^{\log U}\|_1$. We make the following observations.

Observation 1. For any level $i$ in $\mathcal T$, there are no more than $2/\epsilon$ heavy nodes.

If we had more than $2/\epsilon$ heavy nodes at some level $i$, then we would have $$\|f^i\|_1>(2/\epsilon)(\epsilon/2)\|f\|_1=\|f\|_1=\|f^i\|_1,$$ a contradiction.

Observation 2. If $i$ is the index of an $\epsilon$-heavy hitter, then all the nodes on the root-to-leaf path leading to the leaf corresponding to $i$ are heavy.

Since all frequencies are non-negative, the frequency of an internal node is always at least as large as that of any of its children.

By the argument for Observation 2, every node on the root-to-leaf path of an $\epsilon$-heavy hitter has frequency at least $\epsilon\|f\|_1$, and since count-min estimates never underestimate, the query algorithm always recurses into these nodes; hence the list $\mathcal L$ produced by the query algorithm always contains all the heavy hitters. Now let $i$ be an index with $f_i<(\epsilon/2)\|f\|_1$, and note that the leaf $v$ corresponding to $i$ is light. It follows that if $i$ is added to $\mathcal L$, then the query algorithm visited $v$'s parent and obtained an estimate of $v$'s frequency that was off by more than $(\epsilon/2)\|f\|_1=\epsilon'\|f\|_1$. This happens with probability no more than $1/U^2$ by our setting of parameters for the count-min sketches. We can thus union bound over all light leaves and conclude that none of them are added to $\mathcal L$ with probability at least $1-1/U$.

To analyze the query time, observe that if no estimate made by a count-min sketch fails, then we only visit heavy nodes. By Observation 1, there are $O(\frac{1}{\epsilon}\log U)$ heavy nodes in $\mathcal T$. Thus the query time is $O(\frac{1}{\epsilon}\log^2 U)$, since in each visited node we make two queries to count-min sketches with error probability $1/U^2$.
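
Spelled out, this combines the heavy node count from Observation 1 with the cost of a single count-min query at error probability $\delta'=1/U^2$:

$$\underbrace{O\!\left(\frac{1}{\epsilon}\log U\right)}_{\text{heavy nodes visited}}\cdot\underbrace{O\!\left(\log\frac{1}{\delta'}\right)}_{O(\log U)\text{ per node}}=O\!\left(\frac{1}{\epsilon}\log^2 U\right).$$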

Note that if we want the query time to be worst case, we can simply abort with an empty output list in case we visit more than $(2/\epsilon)(\log U+1)$ nodes in $\mathcal T$, since this means that at least one light node was visited, and this happens with probability no more than $1/U$.

To summarize, we have a data structure with worst case update time $O(\log^2 U)$, worst case query time $O(\frac{1}{\epsilon}\log^2 U)$ and space $O(\frac{1}{\epsilon}\log^2 U)$.
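
The code sketch promised above: a hypothetical Python rendering of the whole structure, reusing the CountMin class from earlier. It assumes $\|f\|_1$ is available at query time; in the strict turnstile model it can be maintained exactly as a single counter under updates.

```python
import math

class HeavyHitters:
    """Sketch of the Cormode-Muthukrishnan dyadic structure over a universe of
    size U (a power of two), reusing the CountMin class from above."""

    def __init__(self, U, eps):
        self.eps = eps
        self.logU = int(math.log2(U))
        # One count-min sketch per level, with eps' = eps/2 and delta' = 1/U^2.
        self.sketches = [CountMin(eps / 2, 1 / U**2) for _ in range(self.logU + 1)]

    def update(self, j, delta):
        """Add delta to the ancestor of leaf j at every level: O(log^2 U) total."""
        for i in range(self.logU + 1):
            self.sketches[i].update(j >> i, delta)

    def query(self, norm1, worst_case=True):
        """Return the candidate heavy hitter list; norm1 = ||f||_1, maintained
        exactly outside this sketch."""
        out = []
        stack = [(self.logU, 0)]                    # start at the root
        budget = (2 / self.eps) * (self.logU + 1)   # max heavy nodes in the tree
        while stack:
            level, j = stack.pop()
            budget -= 1
            if worst_case and budget < 0:
                return []       # worst-case abort: a light node was visited
            if level == 0:
                out.append(j)   # leaf reached: report its index
                continue
            for child in (2 * j, 2 * j + 1):
                # Estimate the child's frequency via the sketch one level down;
                # recurse if the estimate is at least eps * ||f||_1.
                if self.sketches[level - 1].query(child) >= self.eps * norm1:
                    stack.append((level - 1, child))
        return out
```

With worst_case=True, the budget implements the abort described above; it fires only if a light node was visited, which happens with probability at most $1/U$.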

Solution of Larsen, Nelson, Nguyen and Thorup

This solution improves all parameters by a factor of $\log U$, at the cost of moving to an expected query time bound.

The change is simple: Keep the exact same data structure as above, except that all the count-min sketches stored for levels $i>0$ are replaced by count-min sketches with error probability $1/4$ instead of $1/U^2$. The count-min sketch at the leaves still has error probability $1/U^2$. The query and update algorithms remain the same; only the construction changes, as sketched below.
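
In terms of the hypothetical HeavyHitters class above, the change is a one-line edit to the constructor (update is inherited unchanged, and query should now be run with worst_case=False, since the improved bound is on the expected query time):

```python
class HeavyHittersFast(HeavyHitters):
    """Larsen-Nelson-Nguyen-Thorup variant: constant-error count-min sketches
    at the internal levels, an accurate one (delta' = 1/U^2) only at level 0."""

    def __init__(self, U, eps):
        self.eps = eps
        self.logU = int(math.log2(U))
        self.sketches = [CountMin(eps / 2, 1 / U**2 if i == 0 else 1 / 4)
                         for i in range(self.logU + 1)]
```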

The space usage drops to $O(\frac{1}{\epsilon}\log U)$, as each internal level now uses space only $O(\frac{1}{\epsilon})$.

Similarly, the update time drops to $O(\log U)$: each internal level costs $O(1)$ to update, and the level $0$ sketch costs $O(\log U)$. For the correctness probability, note that count-min sketches never underestimate, so all ancestors of heavy hitters are still visited, and the argument from the previous section only concerns queries made to the count-min sketch at level $0$; thus the correctness probability still holds.

The only thing that changes is the query time, due to more nodes in $\mathcal T$ being visited.

We want to bound the expected number of light nodes visited.

For a light node $v$, define its light depth $d(v)$ as the distance to its nearest heavy ancestor in $\mathcal T$.

Consider all light nodes at levels $i\ge2$, and let $V$ be the set of such nodes. If we visit such a node, we spend $O(1)$ time making two queries to a count-min sketch. Thus if we define an indicator random variable $X_v$ for each such node $v\in V$, taking the value $1$ if $v$ is visited by the query algorithm and $0$ otherwise, then the expected time spent visiting light nodes $v\in V$ is

$$O\left(\mathbb E\left[\sum_{v\in V}X_v\right]\right)=O\left(\sum_{v\in V}\mathbb E[X_v]\right)=O\left(\sum_{v\in V}\Pr[v\text{ is visited}]\right).$$

Thus we need to bound $\Pr[v\text{ is visited}]$. For this, the trick is to use the light depth.

Let $V_i\subseteq V$ be the set of nodes in $V$ with light depth $i$, for $i=1,\dots,\log U$. Then we have

$$O\left(\sum_{v\in V}\Pr[v\text{ is visited}]\right)=O\left(\sum_{i=1}^{\log U}\sum_{v\in V_i}\Pr[v\text{ is visited}]\right).$$

Let $v\in V_i$ and consider $\Pr[v\text{ is visited}]$.

  • For $v$ to be visited, it must be the case that the query made at every single ancestor up to its nearest heavy ancestor failed, i.e., returned an estimate that was off by at least $\epsilon'\|f\|_1$.

  • Since these are queries to $d(v)=i$ independent count-min data structures, each with failure probability $1/4$, the probability that they all fail is at most $4^{-i}$.

    Hence, we have

    $$O\left(\sum_{i=1}^{\log U}\sum_{v\in V_i}\Pr[v\text{ is visited}]\right)=O\left(\sum_{i=1}^{\log U}\sum_{v\in V_i}4^{-i}\right).$$

Next we bound $|V_i|$ (by examining heavy nodes).

  • Any heavy node can be the nearest heavy ancestor of at most $2^i$ light nodes with light depth $i$. This is simply because any node has no more than $2^i$ descendants $i$ levels down.

  • By Observation 1, there are no more than $O(\epsilon^{-1}\log U)$ heavy nodes in $\mathcal T$, and therefore we have $|V_i|\le O(2^i\epsilon^{-1}\log U)$.

    We thus conclude that

    $$O\left(\mathbb E\left[\sum_{v\in V}X_v\right]\right)=O\left(\sum_{i=1}^{\log U}\sum_{v\in V_i}4^{-i}\right)=O\left(\sum_{i=1}^{\log U}2^i4^{-i}\frac{1}{\epsilon}\log U\right)=O\left(\frac{1}{\epsilon}\log U\sum_{i=1}^{\log U}2^{-i}\right)=O(\epsilon^{-1}\log U),$$

    as claimed.

It remains to bound the cost of the queries made at light nodes at level $i=1$. Recall that these query the data structure at level $0$, which has an error probability of $1/U^2$, and thus the cost of visiting such a light node is $O(\log U)$ rather than $O(1)$. If this time we define $V_i$ as the set of light nodes at level $1$ that have light depth $i$, then by an argument similar to the above, the expected time spent visiting nodes $v\in V_i$ is $O(\sum_{i=1}^{\log U}|V_i|4^{-i}\log U)$.

Fortunately, the new sets $V_i$ are a factor $\log U$ smaller than before. To see this, note that any node in $V_i$ must have its nearest heavy ancestor at level exactly $1+i$. But Observation 1 tells us that there are only $O(\frac{1}{\epsilon})$ heavy nodes at level $1+i$, and each such node has only $2^i$ descendants at level $1$.

Therefore we have $|V_i|=O(\epsilon^{-1}2^i)$. Inserting this, we conclude that

$$O\left(\sum_{i=1}^{\log U}|V_i|4^{-i}\log U\right)=O\left(\sum_{i=1}^{\log U}\frac{1}{\epsilon}2^i4^{-i}\log U\right)=O\left(\frac{1}{\epsilon}\log U\right).$$

Summary

Improved space to $O(\frac{1}{\epsilon}\log U)$.

Worst case update time $O(\log U)$.

Expected query time $O(\frac{1}{\epsilon}\log U)$.