
Justin Thaler

REU 2007

I'm working with Drs. S. Muthu Muthukrishnan and Graham Cormode on algorithms for analyzing massive data streams. A massive data stream is a sequence of information so long that it cannot be stored completely in memory; as a result, algorithms on data streams must be not just time-efficient but space-efficient as well. According to Estimating Entropy and Entropy Norm on Data Streams by Chakrabarti, Do Ba, and Muthukrishnan, "Sublinear space is mandatory and polylogarithmic space is often the goal."

Specifically, my project involves streaming entropy computation. The entropy of a data stream can be informally viewed as the amount of information contained in a single character of the stream: the more "random" the data stream, the higher its entropy. One application of fast entropy computation on massive data streams is detecting worms and other anomalies in IP networks. According to Entropy Based Worm and Anomaly Detection in Fast IP Networks by Wagner and Plattner, "The connection between entropy and worm propagation is that worm traffic is more uniform or structured than normal traffic in some respects and more random in others."
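To make this precise: if token i occurs m_i times in a stream of length m, the (empirical) entropy is H = sum_i (m_i/m) log2(m/m_i). The following minimal C sketch computes this quantity offline by keeping a full frequency table; the identifiers and the small alphabet bound are illustrative assumptions, not part of my implementation.

/* A minimal offline sketch (not the streaming implementation discussed
 * below): computes H = sum_i (m_i/m) * log2(m/m_i) by keeping a full
 * frequency table. Tokens are assumed to be ints in [0, ALPHABET). */
#include <math.h>
#include <stdio.h>

#define ALPHABET 256

double empirical_entropy(const int *stream, int m) {
    int counts[ALPHABET] = {0};
    for (int j = 0; j < m; j++)
        counts[stream[j]]++;
    double h = 0.0;
    for (int i = 0; i < ALPHABET; i++)
        if (counts[i] > 0) {
            double p = (double)counts[i] / m;
            h -= p * log2(p);   /* equivalently, p * log2(1/p) */
        }
    return h;
}

int main(void) {
    int stream[] = {1, 1, 2, 3, 1, 2, 1, 1};
    printf("H = %f bits/token\n", empirical_entropy(stream, 8));
    return 0;
}

Note that this approach uses space proportional to the alphabet size, which is exactly what a streaming algorithm cannot afford on massive inputs.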

I'm currently working on speeding up an algorithm for entropy computation presented in A Near-Optimal Algorithm for Computing the Entropy of a Stream by Chakrabarti, Cormode, and McGregor. I’ve implemented three versions of the algorithm: a “slow” version that adheres to the pseudocode in the aforementioned paper, a “fast” version, and a “naïve” version that does not take backup samples and hence is not guaranteed to work well on streams with low entropies. An easy-to-use C implementation of all three versions of the algorithm may be found here.
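For intuition, at the heart of the Chakrabarti-Cormode-McGregor approach is an unbiased basic estimator in the style of Alon, Matias, and Szegedy: sample a uniformly random position j in the stream, let R be the number of occurrences of the sampled token at or after position j, and output f(R) - f(R-1), where f(r) = r log2(m/r). The expectation of this quantity telescopes to the empirical entropy, and averaging many independent copies concentrates the estimate. The sketch below illustrates only this basic estimator on an in-memory array; a real streaming implementation maintains the sample and the count R in a single pass, and the “backup samples” mentioned above are the paper’s device for handling low-entropy streams. All names here are illustrative.

/* Illustrative sketch of the unbiased basic estimator underlying the
 * Chakrabarti-Cormode-McGregor algorithm (NOT their full algorithm):
 * sample a uniform position j, let R = # occurrences of stream[j] at
 * positions j..m-1, and return f(R) - f(R-1) with f(r) = r*log2(m/r). */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double f(double r, double m) {
    return (r == 0.0) ? 0.0 : r * log2(m / r);
}

double basic_estimate(const int *stream, int m) {
    int j = rand() % m;           /* uniform sample position (illustrative) */
    int R = 0;
    for (int k = j; k < m; k++)   /* count suffix occurrences of the sample */
        if (stream[k] == stream[j]) R++;
    return f(R, m) - f(R - 1, m);
}

int main(void) {
    int stream[] = {1, 1, 2, 3, 1, 2, 1, 1};
    int m = 8, trials = 100000;
    double sum = 0.0;
    for (int t = 0; t < trials; t++)
        sum += basic_estimate(stream, m);
    printf("mean estimate ~ %f bits/token\n", sum / trials);
    return 0;
}

Averaged over the random choice of j, the sum telescopes: each token i contributes f(m_i)/m, so the expectation is exactly sum_i (m_i/m) log2(m/m_i).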

Excel files containing results from experiments run using all three implementations, as well as a writeup of the trends observed in these experiments, can be found here. The files titled “timevlengthfast.xls”, “timevlengthslow.xls”, and “timevlengthnaive.xls” each contain two graphs, both with stream length as the independent variable and running time or percent error as the dependent variable. The files “timevzipffast.xls”, “timevzipfslow.xls”, and “timevzipfnaive.xls” also contain two graphs, both with the Zipf parameter (the skewness of the distribution) as the independent variable. The fast version of the algorithm performed extremely well in these experiments, achieving excellent accuracy while processing 400,000-1,000,000 tokens per second.
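For anyone wishing to reproduce the shape of these experiments, test streams can be drawn from a Zipf distribution, where token i is emitted with probability proportional to 1/i^s and a larger s means a more skewed (lower-entropy) stream. Below is a hedged C sketch of such a generator via inverse-CDF sampling; the alphabet size, parameter values, and identifiers are illustrative, not those of my actual experiment harness.

/* Illustrative generator for Zipf-distributed test streams: token i in
 * [1, N] is drawn with probability proportional to 1/i^s. Built by
 * inverse-CDF sampling over a precomputed cumulative table. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000                        /* alphabet size (illustrative) */

int main(void) {
    static double cdf[N];
    double norm = 0.0, s = 1.2;       /* s = Zipf skew parameter */
    for (int i = 1; i <= N; i++)
        norm += 1.0 / pow((double)i, s);
    double acc = 0.0;
    for (int i = 1; i <= N; i++) {
        acc += (1.0 / pow((double)i, s)) / norm;
        cdf[i - 1] = acc;
    }
    for (int t = 0; t < 20; t++) {    /* emit a short sample stream */
        double u = (double)rand() / RAND_MAX;
        int lo = 0, hi = N - 1;       /* binary-search the CDF */
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        printf("%d ", lo + 1);
    }
    printf("\n");
    return 0;
}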