Student: Yinmeng Zhang School: Carnegie Mellon University Email: ynz@andrew.cmu.edu Research Area(s): Data Streams Project Name: Zipf Analysis of Algorithms for Approximating Quantiles Faculty Advisor: Dr. S. Muthu Muthukrishnan, DIMACS Rutgers University Dr. Graham Cormode, DIMACS Rutgers University

### Project Description

Recently there has been a lot of interest in using a small sample to approximate very large distributions. As technology has advanced, we are faced with massive data streams, especially from the internet, e.g. email data and IP traffic logs, from which we wish to glean useful information.

There are many different questions that we could ask; for example, what are the k most frequent elements in the stream, or how many times did a given element appear, or what was the median element. What makes these questions interesting are the severe space constraints. Because of the size of the data streams, we cannot store all the data! It is only feasible to keep polylogarithmic space. Because of the data stream model, we can often only afford to make a single pass over the data. These constraints make the seemingly mundane summary problems extremely interesting, not to mention difficult. For example, it is known (Munro, Paterson) that to find the exact median of a set of numbers in a single pass requires Omega(N) space. However, this begs the question why we would need the exact median in the first place; as we are attempting to make some sort of high level observation of the data, it should be, and often is, sufficient to, say, give a number which is within epsilon*N of the actual median. Furthermore, many data stream algorithms offer probabilistic guarantees; that is, the output is correct within epsilon with a probability of 1-delta.

This summer I considered specifically the Count-Min Sketch, a sketch algorithm which answers point queries, e.g. how frequent was a specified element, and can be used to find heavy hitters, e.g. elements which appeared more than some phi percent of the time. The algorithm uses space O(1/epsilon log(1/delta)) given no assumptions about the input. However, we often do know something about the distribution of the input; in practice, many a Zipfian distribution is common. A Zipfian distribution looks like 1/q^z, where the frequency of the occurance of the qth ranked item is proportional to 1/q^z. Note that when z=0, we get the uniform distribution back. However, when z is large, we see that a few elements account for the majority of the distribution, while most elements have negligible effect. Intuitively, this might allow for savings in space. In fact, for z>1, we find the CM sketch only requires O(1/epsilon^z log(1/delta)) space, though it is not clear that any savings can be made for smaller smaller values of z. It would also be interesting to attempt to apply the same techniques to other algorithms, such as, for example, the Greenwald-Khanna algorithm, which makes quantile summaries.