Student: 
Khanh Do Ba 
School: 

Email: 
kdb [at] cs [dot] 
Research Area: 

Project Name: 

Advisors: 
S. Muthu Muthukrishnan, Dept. of Computer Science, Rutgers University Amit Chakrabarti, Dept. of Computer Science, Dartmouth College 
Project Background and Description
Data stream mananagement systems (DSMSs) and some database management systems (DBMSs) often must deal with massive flows of data much too long to store locally yet need to be mined for useful information. That is, any processing must require only sublinear (ideally polylogarithmic) space and be performed in a single pass. In particular, most functions on the data will therefore be impossible to compute exactly, and must be approximated to some degree of accuracy using randomization. A data stream can be modeled in its simplest form as a sequence of integers within the range from 1 to a given n, with a query being some statistic on this sequence that needs to be estimated. One such family of functions that have been extensively studied consists of the frequency moments. If m_{i} is the number of occurrences of i in the sequence, then the k^{th} frequency moment is defined by . Algorithms exist that will estimate F_{0}, F_{1} and F_{2} in logarithmic space [1], and it has been proven that F_{k} for k ≥ 3 requires polynomial space [2]. The first part of our project this summer was designing an algorithm to approximate the quantity
which we call the entropy norm. Although closely related
to the empirical ( , it is structurally closer to the frequency moments. The second part of our project involved designing an algorithm to approximate the empirical entropy H above. It is easy to check that given F_{H} exactly, as well as m (which requires only a simple counter), we can compute H exactly. However, since we are only able to estimate F_{H}, it turns out to not be possible to estimate H well in this indirect manner. An efficient algorithm to directly approximate H was therefore the goal. Both these statistics, and particularly the entropy, are natural ones to compute and find applications in much recent work in the networks and database communities [3, 4, 5]. To the best of our knowledge, there were no existing algorithms that efficiently and effectively estimates either. 
Results
Based closely on the algorithm of Alon et al. [1] for estimating the frequency moments, we have devised a (1 + e)approximation algorithm for the entropy norm that requires only polylogarithmic space and a single pass, provided F_{H} is sufficiently large. We also proved that if indeed F_{H} is too small to satisfy said condition, no polylogarithmic space algorithm can approximate F_{H} to even a constant factor. In other words, up to polylogarithmic factors, our algorithm is optimal. We also designed a onepass (1 + e)approximation algorithm for the empirical entropy that uses space O(m^{2/3}), and a twopass (1 + e)approximation algorithm that uses polylogarithmic space. Only the first algorithm (for F_{H}) was complete by the end of the REU, at which point we gave this final presentation. We have since published our results as a technical report (DIMACSTR200533), as well as submitted them for refereed publication. The latest copy of the paper can be found here. We also presented our work at the Third Annual Dartmouth Undergraduate Research Poster Session. Update: Our paper, Estimating Entropy and Entropy Norm on Data Streams, has been accepted to the 23rd International Symposium on Theoretical Aspects of Computer Science (STACS 2006) and the journal Internet Mathematics. 
References
[1] N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency moments. ACM STOC, 1996. [2]
A. Chakrabarti, [3] Y. Gu, A. McCallum and D. Towsley. Detecting anomalies in network traffic using maximum entropy estimation. Internet Measurement, 2005. [4] A. Wagner and B. Plattner. Entropy based worm and anomaly detection in fast IP networks. IEEE WET ICE, STCA Security Workshop, 2005. [5]
K. Xu, Z. Zhang, and S. Bhattacharya. Profiling internet backbone traffic:
behavior models and applications. ACM SIGCOMM, 2005. 
My current website can be found here. 