Randomized Algorithms 6: Streaming Heavy Hitters
Lecture notes by Kasper.
Review of the Count-Min Sketch
This lecture extends the results for the count-min sketch algorithm from Lecture 5.
We work in the strict turnstile model, where we maintain estimates of some underlying frequency vector $x$. The vector has coordinates indexed by integers in $[n] = \{1, \dots, n\}$ for some universe size $n$.
We receive updates one at a time.
An update is a pair $(i, \Delta)$ and has the effect of updating $x_i \leftarrow x_i + \Delta$ for $i \in [n]$ and $\Delta \in \mathbb{Z}$.
Assume: $|\Delta|$ is bounded by a polynomial in $n$, i.e. $|\Delta| \le n^c$ for some constant $c$. (A weight change fits in $O(\lg n)$ bits.)
We are also promised that at any given time, $x_i \ge 0$ for all $i \in [n]$.
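For example (a tiny stream over $n = 8$): the updates $(5, +3), (2, +1), (5, -2)$ leave $x_5 = 1$ and $x_2 = 1$, and indeed no prefix of this stream ever drives a coordinate below $0$.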
In Lecture 5, we learned how to use the count-min sketch to support frequency estimation queries, i.e., we have a data structure using $O(\varepsilon^{-1}\lg(1/\delta))$ words of space that, upon receiving an index $i$ as a query, can return a value $\hat{x}_i$ that with probability at least $1 - \delta$ satisfies $x_i \le \hat{x}_i \le x_i + \varepsilon\|x\|_1$, where $\|x\|_1 = \sum_j |x_j|$.
Update time for the count-min sketch: $O(\lg(1/\delta))$.
Query time for the count-min sketch: $O(\lg(1/\delta))$.
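To make the interface concrete, here is a minimal Python sketch of a count-min structure for the strict turnstile setting. It uses one standard parameter choice ($\lceil 2/\varepsilon \rceil$ counters per row and roughly $\lg(1/\delta)$ rows) and Python's built-in hash with random seeds as a stand-in for the hash functions analyzed in Lecture 5; the class name and the hashing are illustrative assumptions, not the exact construction from the lecture.

```python
import math
import random

class CountMin:
    """Count-min sketch: w = ceil(2/eps) counters per row, d ~ lg(1/delta) rows.
    In the strict turnstile model the estimate never underestimates x_i."""

    def __init__(self, eps, delta):
        self.w = math.ceil(2 / eps)                        # counters per row
        self.d = max(1, math.ceil(math.log2(1 / delta)))   # number of rows
        self.seeds = [random.randrange(1 << 61) for _ in range(self.d)]
        self.C = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, r, i):
        # Illustrative hash of index i into one of the w buckets of row r.
        return hash((self.seeds[r], i)) % self.w

    def update(self, i, delta):
        # O(d) = O(lg(1/delta)) counter updates.
        for r in range(self.d):
            self.C[r][self._bucket(r, i)] += delta

    def estimate(self, i):
        # Minimum over rows: x_i <= estimate <= x_i + eps*||x||_1 with probability >= 1 - delta.
        return min(self.C[r][self._bucket(r, i)] for r in range(self.d))
```

The space is $d \cdot w = O(\varepsilon^{-1}\lg(1/\delta))$ counters, matching the bounds quoted above.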
Problem Formulation
In many applications, we are concerned with finding heavy hitters rather than estimating frequencies.
An $\varepsilon$-heavy hitter is an index $i$ s.t. $x_i \ge \varepsilon\|x\|_1$.
We still receive frequency updates. However, there is just one query. To answer a query, we must produce a list $L$ satisfying the following:
- All $\varepsilon$-heavy hitters are in $L$, i.e., all indices $i$ with $x_i \ge \varepsilon\|x\|_1$ are in $L$.
- No index $i$ with $x_i < (\varepsilon/2)\|x\|_1$ is in $L$.
The algorithm is allowed to fail with probability $\delta$.
We have no requirements on indices $i$ s.t. $(\varepsilon/2)\|x\|_1 \le x_i < \varepsilon\|x\|_1$.
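As a concrete instance of the guarantee: if $\|x\|_1 = 100$ and $\varepsilon = 1/10$, then every index with $x_i \ge 10$ must appear in $L$, no index with $x_i < 5$ may appear, and an index with, say, $x_i = 7$ may or may not be reported.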
Solution of Cormode and Muthukrishnan
For simplicity, assume $n$ is a power of $2$.
We build a complete binary tree $T$ on top of $[n]$ (one leaf per index $i \in [n]$). Such a tree has $\lg n + 1$ levels, and we denote the level of the leaves by level $0$ and the root's level by $\lg n$. For a node $v$, let $x_v$ denote the sum of $x_i$'s over all $i$ corresponding to leaves in the sub-tree rooted at $v$.
We think of the list of values $x_v$ for the nodes $v$ at level $j$ as forming another frequency vector on a universe of size $n/2^j$. We use $x^j$ to denote this vector of frequencies for the $j$-th level of the tree.
Obviously $x^0 = x$, and the frequency of an internal node is just the sum of the frequencies of its two children.
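For example, with $n = 8$ and $x = x^0 = (3, 1, 0, 2, 0, 0, 5, 1)$, we get $x^1 = (4, 2, 0, 6)$, $x^2 = (6, 6)$ and $x^3 = (12)$; note that $\|x^j\|_1 = 12$ for every level $j$.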
We focus on the case where we are aiming at an error probability of $1/n$, i.e., polynomially small in the universe size. The data structure is as follows:
For every level $j$ of $T$, we store a count-min sketch of $x^j$ having error probability $1/n^2$ and approximation $\varepsilon/2$. On an update $(i, \Delta)$, we conceptually add $\Delta$ to the frequency of all nodes on the root-to-leaf path leading to the leaf corresponding to $i$.
This can be done by adding $\Delta$ to $x^j_{\lceil i/2^j \rceil}$ for $j = 0, \dots, \lg n$. Thus an update makes $\lg n + 1$ updates on count-min data structures, each with error probability $1/n^2$, thus the total update time is $O(\lg^2 n)$. Similarly, we get a space usage of $O(\varepsilon^{-1}\lg^2 n)$ words.
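As a small check of the indexing (with $n = 8$ and leaves numbered $1, \dots, 8$): an update to $i = 6$ touches node $\lceil 6/1 \rceil = 6$ at level $0$, node $\lceil 6/2 \rceil = 3$ at level $1$, node $\lceil 6/4 \rceil = 2$ at level $2$, and the root $\lceil 6/8 \rceil = 1$ at level $3$.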
The query algorithm: Start at the root node of $T$. When visiting a node $v$ at level $j$, if $j = 0$ (leaf), add the corresponding index to the output list $L$. Otherwise, $v$ is an internal node at some level $j \ge 1$. Query the count-min sketch on the vector $x^{j-1}$ to obtain estimates of the frequencies stored at the two children of $v$. Recurse into those children for which the estimate is at least $\varepsilon\|x\|_1$.
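The following Python sketch puts the update and query procedures together, reusing the hypothetical CountMin class from above. The class name HeavyHitters, the 1-indexed node numbering, and the explicit stack instead of recursion are implementation choices made here for illustration, not part of the original construction.

```python
import math

class HeavyHitters:
    """Dyadic heavy-hitters structure: one count-min sketch per level of the
    implicit complete binary tree over [n] (leaves are level 0, root is level lg n)."""

    def __init__(self, n, eps, delta_per_sketch):
        self.n = n
        self.eps = eps
        self.l1 = 0                                # ||x||_1; a running sum suffices since x_i >= 0
        self.levels = int(math.log2(n))
        # sketch[j] summarizes the level-j vector x^j, whose universe has size n / 2^j.
        self.sketch = [CountMin(eps / 2, delta_per_sketch) for _ in range(self.levels + 1)]

    def update(self, i, delta):
        self.l1 += delta
        # The node containing leaf i at level j is ceil(i / 2^j) (1-indexed).
        for j in range(self.levels + 1):
            self.sketch[j].update((i + (1 << j) - 1) >> j, delta)

    def query(self):
        threshold = self.eps * self.l1
        out = []
        stack = [(self.levels, 1)]                 # start at the root
        while stack:
            j, v = stack.pop()
            if j == 0:
                out.append(v)                      # leaf: report its index
                continue
            # The two children of node v live at level j - 1 and have indices 2v - 1 and 2v.
            for child in (2 * v - 1, 2 * v):
                if self.sketch[j - 1].estimate(child) >= threshold:
                    stack.append((j - 1, child))
        return out
```

With the parameters from the text, one would instantiate this as HeavyHitters(n, eps, 1 / n**2).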
We now argue that the query algorithm works. First define a node in $T$ to be heavy if its frequency is at least $(\varepsilon/2)\|x\|_1$. Otherwise call it light. Since we are in the strict turnstile model and all frequencies in $x$ are non-negative, it follows that $\|x^j\|_1 = \|x\|_1$ for every level $j$. We have some observations as follows.
Observation 1. For any level $j$ in $T$, there are no more than $2/\varepsilon$ heavy nodes.
If we had more than $2/\varepsilon$ heavy nodes at some level $j$, then we would have $\|x^j\|_1 > (2/\varepsilon) \cdot (\varepsilon/2)\|x\|_1 = \|x\|_1$, a contradiction.
Observation 2. If $i$ is the index of an $\varepsilon$-heavy hitter, then all the nodes on the root-to-leaf path leading to the leaf corresponding to $i$ are heavy.
Since all frequencies are non-negative, it follows that the frequency of an internal node is always at least as large as any of its children's; in particular, every node on the path has frequency at least $x_i \ge \varepsilon\|x\|_1 \ge (\varepsilon/2)\|x\|_1$.
By Observation 2, the list produced by the query algorithm always contains all the heavy hitters: every node on the root-to-leaf path of an $\varepsilon$-heavy hitter has frequency at least $\varepsilon\|x\|_1$, and in the strict turnstile model the count-min estimates never underestimate, so the algorithm always recurses all the way down to such a leaf. Now let $i$ be an index with $x_i < (\varepsilon/2)\|x\|_1$. Note that the leaf corresponding to $i$ is light. It follows that if $i$ is added to $L$, then the query algorithm visited $i$'s parent and made an estimate of $i$'s frequency that was off by more than $(\varepsilon/2)\|x\|_1$. This happens with probability no more than $1/n^2$ by our setting of parameters for the count-min sketches. We can thus union bound over all light leaves and conclude that none of them are added to $L$ with probability at least $1 - 1/n$.
To analyze the query time, observe that if no estimate made by a count-min sketch fails, then we only visit heavy nodes. By Observation 1, there are $O(\varepsilon^{-1}\lg n)$ heavy nodes in $T$. Thus the query time is $O(\varepsilon^{-1}\lg^2 n)$, since in each visited node we make two queries to count-min sketches with error probability $1/n^2$, each costing $O(\lg n)$ time.
Note that if we want the query time to be worst case, we can simply abort with an empty output list in case we visit more than $2\varepsilon^{-1}(\lg n + 1)$ nodes in $T$, since this means that at least one light node was visited, and by a union bound over all light nodes this happens with probability no more than $O(1/n)$.
To summarize, we have a data structure with worst case update time $O(\lg^2 n)$, worst case query time $O(\varepsilon^{-1}\lg^2 n)$ and space $O(\varepsilon^{-1}\lg^2 n)$ words.
Solution of Larsen, Nelson, Nguyen and Thorup
This solution improves by a $\lg n$ factor in all parameters, at the cost of moving to an expected query time.
The change is simple: Keep the exact same data structure as above, except that all the count-min sketches stored for levels $1, \dots, \lg n$ are replaced by count-min sketches with error probability $1/4$ instead of $1/n^2$. The count-min sketch at the leaves (level $0$) still has error probability $1/n^2$. The query algorithm and the update algorithm remain the same.
The space usage is $O(\varepsilon^{-1}\lg n)$ words, as each internal level now uses space only $O(\varepsilon^{-1})$.
Similarly, the update time drops to $O(\lg n)$ because each internal level costs $O(1)$ to update. For the correctness probability, note that the argument from the previous section only argues about queries made to the count-min sketch at level $0$, thus the correctness probability still holds.
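In terms of the hypothetical Python sketch above, the change is confined to how the per-level sketches are constructed; the helper below is again only a sketch under the same assumptions (the CountMin class and the level numbering from before), not the authors' implementation.

```python
import math

def make_level_sketches(n, eps):
    """Per-level count-min sketches for the modified structure: level 0 keeps
    error probability 1/n^2, while every internal level (1, ..., lg n) only
    needs a constant error probability."""
    levels = int(math.log2(n))
    deltas = [1 / n**2] + [1 / 4] * levels     # deltas[j] is the error probability at level j
    return [CountMin(eps / 2, deltas[j]) for j in range(levels + 1)]
```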
The only thing that changes is the query time, due to more nodes in $T$ being visited.
We want to bound the expected number of light nodes visited.
For a light node, define its light depth as the distance to its nearest heavy ancestor in $T$.
Consider all light nodes at levels $2, \dots, \lg n$. Let $S$ be the set of such nodes. If we visit such a node, we spend $O(1)$ time making two queries to a count-min sketch with constant error probability. Thus if we define an indicator random variable $X_v$ for each such node $v \in S$, taking the value 1 if $v$ is visited by the query algorithm, and 0 otherwise, then the expected time spent on visiting all these light nodes is $O\big(\mathbb{E}\big[\sum_{v \in S} X_v\big]\big) = O\big(\sum_{v \in S} \Pr[X_v = 1]\big)$.
Thus we need to bound $\sum_{v \in S} \Pr[X_v = 1]$. For this, the trick is to use the light depth.
Let $S_d$ be the set of nodes in $S$ with light depth $d$, for $d = 1, \dots, \lg n$. Then we have $\sum_{v \in S} \Pr[X_v = 1] = \sum_d \sum_{v \in S_d} \Pr[X_v = 1]$.
Let $d \ge 1$ and consider a node $v \in S_d$.
For $v$ to be visited, it must be the case that the query made in every single ancestor of $v$ up to its nearest heavy ancestor failed, i.e., returned an estimate that was off by more than $(\varepsilon/2)\|x\|_1$.
Since these are queries to $d$ independent count-min data structures with failure probability $1/4$ each, the probability they all fail is at most $4^{-d}$.
Hence, we have $\Pr[X_v = 1] \le 4^{-d}$ for every $v \in S_d$, and therefore $\sum_{v \in S} \Pr[X_v = 1] \le \sum_d |S_d| \cdot 4^{-d}$.
Next we bound $|S_d|$ (by examining a heavy node).
Any heavy node can be the nearest heavy ancestor of at most $2^d$ light nodes with light depth $d$. This is trivial because any node has no more than $2^d$ descendants $d$ levels down.
By Observation 1, there are no more than $2\varepsilon^{-1}(\lg n + 1)$ heavy nodes in $T$ and therefore we have $|S_d| \le 2\varepsilon^{-1}(\lg n + 1) \cdot 2^d$.
We thus conclude that $\sum_d |S_d| \cdot 4^{-d} \le \sum_{d \ge 1} 2\varepsilon^{-1}(\lg n + 1) \cdot 2^d \cdot 4^{-d} = 2\varepsilon^{-1}(\lg n + 1) \sum_{d \ge 1} 2^{-d} \le 2\varepsilon^{-1}(\lg n + 1) = O(\varepsilon^{-1}\lg n)$,
as claimed.
It remains to bound the cost of the queries made in light nodes at level $1$. Recall that these query the data structure at level $0$, which had an error probability of $1/n^2$, and thus the cost of visiting such a light node is $O(\lg n)$ rather than $O(1)$. If we this time define $S_d$ as the set of light nodes at level $1$ that have a light depth of $d$, then by an argument similar to the above, the expected time spent visiting these nodes is $O\big(\sum_d |S_d| \cdot 4^{-d} \cdot \lg n\big)$.
Fortunately, the new sets $S_d$ are a factor $\lg n$ smaller than before. To see this, note that any node in $S_d$ must have its nearest heavy ancestor at level exactly $1 + d$. But Observation 1 tells us that there are only $2/\varepsilon$ heavy nodes at level $1 + d$, and each such node has only $2^d$ descendants at level $1$.
Therefore we have $|S_d| \le 2\varepsilon^{-1} \cdot 2^d$. Inserting this, we conclude that the expected time spent on these nodes is $O\big(\sum_{d \ge 1} 2\varepsilon^{-1} \cdot 2^d \cdot 4^{-d} \cdot \lg n\big) = O(\varepsilon^{-1}\lg n)$.
Summary
Improved space to $O(\varepsilon^{-1}\lg n)$ words.
Update time $O(\lg n)$ in the worst case.
Expected query time $O(\varepsilon^{-1}\lg n)$.