About me

My name is Dan Halperin, and I am a joint Mathematics and Computer Science major at Harvey Mudd College, member of the class of 2006. I am here at Rutgers University for the summer of 2004 DIMACS REU program, working with Dr. Manish Parashar of the Rutgers Center for Advanced Information Processing (CAIP).

About my project

My project deals with Squid, a decentralized information discovery system that guarantees that any results present in the peer-to-peer (P2P) network will be found, developed by Cristina Schmidt.

In the Squid peer-to-peer framework, the peers (any machine on the network) are organized in a structured overlay that supports efficient lookup. The nodes are given IDs in a range of numbers, and the overlay lets any peer find another peer that corresponds to a given number quickly. Squid uses keyword searches, and it bounds both the length of the keywords and the number of keywords that can be used in a search. If we treat every keyword as a number, then a query, or set of keywords to make a search, can be considered as a point in a multi-dimensional keyword space. A space-filling curve (SFC) runs through the space so that it intersects every point in the space exactly once, and the points in the keyword space are then numbered sequentially based on where they occur on the SFC. A query corresponds to some subregion of the keyword space, and the parts of the SFC that it intersects correspond to the IDs of the peers that are relevant (e.g. store the results for) that search. This way of storing data is called a distributed hash table (DHT). Read the Squid website for more details.

There are a variety of SFCs. One that is used in many applications is the Hilbert SFC. It has been proven to preserve locality better than other known discrete SFCs. To preserve locality is to provide the property that points that are close in the multi-dimensional space are close in the one-dimensional space that the SFC provides a mapping into. The basic idea is that if points are close in the multi-dimensional space, they are probably related to each other somehow, and therefore they should be as close as possible in the one-dimensional space so that their relationship is preserved. The Hilbert SFC was been shown to preserve locality better then other curves when using arbitrary shapes (e.g. circles, convex hulls, arbitrary rectangles, etc) in the multi-dimensional space.

For my project, I looked at one specific application of Squid, text-based queries as in a file-sharing application, and tried to optimize its performance in this situation. For a file-sharing application, we can make certain assumptions that do not hold in general applications.

  • Axes are interchangeable. What this means is that the order of the keywords doesn't matter. If I search for a copy of the U.S. Constitution, whether I search for "Constitution United States" or "United States Constitution", I should get the same results.
  • No ranges. In Squid, there are three kinds of searches: complete queries, wildcard queries, and ranged queries. A complete query would correspond to a single word, for instance "computer". A wildcard query would correspond to part of a word with any possible ending, for instance "comp*", which would match "computer" as well as "computational", "company", "compartment", etc. A ranged query would correspond to a query which allows a specific set of endings, for instance "comp[l-o]" which could be represented as "compl-compo". It can be easily seen that the first two make sense when performing text-based queries, while the ranged query does not make sense in a textual application.

Under these assumptions, I decided to investigate the effects of replacing the Hilbert SFC with a modified Z-Curve. There were a variety of reasons to consider this, but most notable is that the Z-Curve can be customized to the alphabet being used where the Hilbert SFC must be used in base 2. For a 26 character alphabet (e.g. the letters of the English alphabet), the Hilbert SFC must use 5 bits, or space for 32 characters, which then corresponds to a space waste of 1 - (26/32)^(# of dimensions). For 3 dimensions, the Hilbert SFC with a 26-character alphabet uses only 53.6% of the space. For letters and numbers, or 36 characters, the Hilbert SFC must use 6 bits, or space for 64 characters, and in 3 dimensions uses only 17.8% of the total space. For further justification and more detailed explanations as well as some nice visuals, you can see an Adobe Acrobat PDF of my Interim Presentation for the REU program.

After deciding on this as my problem this summer, I modified the existing Squid implementation to use the Z-Curve instead. I was successful at this implementation, but the huge overhead of the full implementation as well as time crunch problems were not providing useful data to analyze the performance of the Z-Curve, so I then wrote a simulator for a P2P network and tested the performance of the Z-Curve there. I ran several tests measuring performance in many areas, and came up with some promising results as well as deeper insight into the problem and areas for further research. This is presented in an Adobe Acrobat PDF of my Final Presentation for this summer.

I am currently exploring the "Future Research" areas laid out in my presentation. I hope to provide rigorous mathematical and algorithmic analysis and verification of the conclusions I drew in my presentation. I also hope to investigate other issues such as the effects of load-balancing on these algorithms and the importance of clustering on the efficiency of searching. My ultimate goal is to put all of my research together into a paper and submit it for publication.

If you are interested in this project and have any questions or comments, please don't hesitate to contact me. My contact information is below.

Acknowledgements

I would really like to thank my advisors, Dr. Manish Parashar and Cristina Schmidt, for all of their mentorship and collaboration with me this summer. I would like to thank DIMACS and Rutgers University for hosting this program, and the National Science Foundation (NSF) for providing funding. I would also like to thank Vincent Matossian for assistance with implementation. Finally, I would like to thank all of my REU-mates this summer for the amazing experience this program has been.

Links

Contact me (June 7, 2004 - July 30, 2004)

Mail: Rutgers University
CoRE Building, 4th Floor/DIMACS Center
96 Frelinghuysen Road
Piscataway, NJ 08854-8018
Attn: Dan Halperin
Email: dhalperi _at_ dimax.rutgers.edu
Phone: (732) 373-3155

Contact me (through 2006)

Mail: Dan Halperin
340 E Foothill Blvd
Claremont, CA 91711
Email: dhalperi _at_ cs.hmc.edu