Student: Khanh Do Ba School: Dartmouth College Email: kdb [at] cs [dot] dartmouth [dot] edu Research Area: Project Name: Advisors:

### Project Background and Description

Data stream mananagement systems (DSMSs) and some database management systems (DBMSs) often must deal with massive flows of data much too long to store locally yet need to be mined for useful information. That is, any processing must require only sublinear (ideally polylogarithmic) space and be performed in a single pass. In particular, most functions on the data will therefore be impossible to compute exactly, and must be approximated to some degree of accuracy using randomization.

A data stream can be modeled in its simplest form as a sequence of integers within the range from 1 to a given n, with a query being some statistic on this sequence that needs to be estimated. One such family of functions that have been extensively studied consists of the frequency moments. If mi is the number of occurrences of i in the sequence, then the kth frequency moment is defined by

.

Algorithms exist that will estimate F0, F1 and F2 in logarithmic space [1], and it has been proven that Fk for k ≥ 3 requires polynomial space [2].

The first part of our project this summer was designing an algorithm to approximate the quantity

which we call the entropy norm. Although closely related to the empirical (Shannon) entropy, which, defined over the implied distribution of the input stream, is given by

,

it is structurally closer to the frequency moments.

The second part of our project involved designing an algorithm to approximate the empirical entropy H above. It is easy to check that given FH exactly, as well as m (which requires only a simple counter), we can compute H exactly. However, since we are only able to estimate FH, it turns out to not be possible to estimate H well in this indirect manner. An efficient algorithm to directly approximate H was therefore the goal.

Both these statistics, and particularly the entropy, are natural ones to compute and find applications in much recent work in the networks and database communities [3, 4, 5]. To the best of our knowledge, there were no existing algorithms that efficiently and effectively estimates either.

### Results

Based closely on the algorithm of Alon et al. [1] for estimating the frequency moments, we have devised

a (1 + e)-approximation algorithm for the entropy norm that requires only polylogarithmic space and a single pass, provided FH is sufficiently large. We also proved that if indeed FH is too small to satisfy said condition, no polylogarithmic space algorithm can approximate FH to even a constant factor. In other words, up to polylogarithmic factors, our algorithm is optimal.

We also designed a one-pass (1 + e)-approximation algorithm for the empirical entropy that uses space O(m2/3), and a two-pass (1 + e)-approximation algorithm that uses polylogarithmic space.

Only the first algorithm (for FH) was complete by the end of the REU, at which point we gave this final presentation. We have since published our results as a technical report (DIMACS-TR-2005-33), as well as submitted them for refereed publication. The latest copy of the paper can be found here. We also presented our work at the Third Annual Dartmouth Undergraduate Research Poster Session.

Update: Our paper, Estimating Entropy and Entropy Norm on Data Streams, has been accepted to the 23rd International Symposium on Theoretical Aspects of Computer Science (STACS 2006) and the journal Internet Mathematics.

### References

[1] N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency moments. ACM STOC, 1996.

[2] A. Chakrabarti, S. Khot, X. Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. IEEE CCC, 2003.

[3] Y. Gu, A. McCallum and D. Towsley. Detecting anomalies in network traffic using maximum entropy estimation. Internet Measurement, 2005.

[4] A. Wagner and B. Plattner. Entropy based worm and anomaly detection in fast IP networks. IEEE WET ICE, STCA Security Workshop, 2005.

[5] K. Xu, Z. Zhang, and S. Bhattacharya. Profiling internet backbone traffic: behavior models and applications. ACM SIGCOMM, 2005.

 My current website can be found here.