General Information

Student: Dan Stubbs
Office: CoRE 450
School: University of Massachusetts, Amherst
E-mail: dstubbs@cs.umass.edu
Project: Standardization of Mergeable Quantile Summaries

Project Description

The primary models for dealing with very large data sets are the streaming model and the distributed model. Fully mergeable summary algorithms function in either setting, using a merge operation to allow approximate summaries over disjoint summaries to be turned into a summary of the union of those sets while maintaining approximation guarantees. There already exist a number of mergeable algorithms, and given that this is a fairly young area of study, each has its own notation, assumptions, and way of viewing space/error tradeoffs. In this project, I'll be looking at the existing algorithms for maintaining mergeable quantile approximation summaries, simulating them, comparing them, and determining which, if any, of these algorithms are identical, which are strictly better, and which are better under specific assumptions.

Weekly Log

Week 1:
I met with Graham, and we discussed possible directions our work might take. Between getting settled in, set up, and working on my presentation, I've been alternating between reading papers and coding up the algorithms therein.
Week 2:
I'm definitely getting into the swing of things, research- and coding-wise. I put together a simulator for the algorithm as presented in the Greenwald and Khanna paper, and, after a little quality time with the paper to figure out how the authors were implying the merging operation should work, I got merge working too. Next I'll use the simulator to play around with the underlying structure of the algorithm and see what's necessary in practice and what's only necessary in theory.
Week 3:
This week began with a visit to IBM, where we heard speakers talk about a variety of research topics being worked on there, including cryptology, parallel processing, and of course, question-answering systems like Watson. All told, they made a pretty strong case for research in industry, and I'm glad to have gone. Research is continuing apace as well, looking into worst case inputs for the algorithm I've been working on as well as exploring some variations. I'm definitely getting to know how it behaves in practice from testing it, which ought to provide helpful intution going forward.
Week 4:
This week I moved back to looking at the paper and its consequences, working more on the underlying theory, strenghtening my understanding of the central proof and many of the concepts that underlie the workings of the algorithm I've been simulating. I also had opportunity to help a student in another group to find an efficient algorithm for his quantile-related difficulties in graph pruning, which did wonders for my feelings of the relevance of my research. The other highlights of the week were excellent talks by Daniel Reem and Swastik Kopparty on Voronoi diagrams and random walks, the latter of which certainly spoke to my probabilistic inclinations and fondness for stochastic matrices.
Week 5:
Graham came back from his trip, and we had a very productive meeting about an alternate construction for part of the algorithm and the central proof I was working on last week. In particular, there was one lemma we looked at whose details currently stand in the way of major changes to the algorithm. I think I've made some progress with that, as well, and functional improvements might be around the corner if it pans out.
Week 6:
This week things have really been coming together. I did get a handle on the proof as a whole, and found ways to change the algorithm that are likely novel, and possibly even useful. It's nice to shift from eliminating strategies that don't work to finally finding one that does.
Week 7:
Worked on improving and writing up my results and prepared my final presentation.
Week 8:
This week I finished and turned in my final report and went home.


Additional Information