## Project #: DIMACS2003-01

Mentor: Vladimir A. Gurvich, gurvich@rutcor.rutgers.edu, RUTCOR

Dr. Gurvich has four problems to work on, students selected to work with Dr. Gurvich may work on one or more of these problems.

I. Let us consider the game of Nim. There are n piles of k1,k2, ..., kn stones. Two players move alternatingly. For one move it is allowed to take any number of stones from any, but only from ONE, pile. In the standard version of Nim the player who takes the last stone wins. In the so-called misere version, this player lost. In 1902, exacly one hundred years ago, Bouton published in Annals of Mathematics his paper: Nim, a game with a complete mathematical threory, where he gave an elegant solution of the game. Rather surprisingly the winning strategies are ``almost'' the same for the standard and misere Nim. More presisely, there are only 2n positions where the optimal moves differ, while in the rest of (potentially infinite set of) the positions the optimal moves in Nim and misere Nim always coincide. We conjecture that this is not a paradox but rather general property of games with perfect information and suggest to verify this conjecture for different Nim-like and other types of games.

II. It is known that, given two mxn matrices A and B, there exist m+n potentials p1,...,pm; q1,...,qn. Such that

max aij + pi - qj = min bij - pi + qj = v for all i = 1,...,m; j = 1,...,n.

This theorem has important applications in game theory. But how to find such potentials? No polynomial algorithm is known yet, though the problem is in the intersection of NP and co-NP.

III. Difference graphs. Given n finite sets S1,...,Sn , let us define a graph G with n vertices v1,...,vn in which 2 vertices vi and vj are connected by an edge IFF both differences Si \ Sj and Sj \ Si have at least k elements each. EVERY graph can be realized in such a way if k is arbitrarily large. But is it still true if k is bounded by a constant? Even for k=2 no counter-example is known. If k=1 then only co-occurrence graphs can be realized.

IV. In 1912 Zermelo proved that games with perfect information (e.g. Chess) can be solved in pure stratagies, that is without randomizing. This theorem generalizes the case of n players as follows. In 1951 Nash introduced his famous concept of equilibrium and in 1953 Kuhn proved that every finite n-person game with perfect information has at least one Nash equilibrium in pure strategies.

Yet, positional games, even with perfect information, may have cycles, that is the same position can appear twice. A strategy is called STATIONARY if in every position the move may only depend on this position but not on the preceding moves. Does Zermelo and Kuhn's theorems generalize this case? Is it true that games with perfect information can be solved in PURE STATIONARY strategies? In general, it is NOT true. Yet, if if the local cost function is non-negative for each player and move, or even weaker, if for every player and for every directed cycle the sum of the local costs is non-negative, then the question is open. Moreover, a positive answer is known for some special cases.

## Project #: DIMACS2003-02

Project title: Text Categorization

Text categorization is the automated assigning of documents to predefined content-based categories. Applications of text categorization include controlled vocabulary indexing (in both traditional bibliographic databases and newer Web-based directories), content filtering of email and Web pages, subsetting information feeds, and data mining on records with both textual and nontextual fields. This project will involve developing and evaluating novel Bayesian statistical methods for text categorization and applying them to very large collections of documents. I am especially interested in algorithms that deal with one document at a time rather than large batches of documents.

Prerequisites for this work are a basic proficiency in computer programming as well as an undergraduate course in probabiliity and statistics.

Key References include:

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B., Bayesian Data Analysis, Chapman and Hall, 1995.

Keim, M., Madigan, D., and Lewis, D., ``Bayesian Information Retrieval,'' Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, 1997, 303--310.

Lewis, D.D., Schapire, R.E., Callan, J.P., and Papka, R. (1996). ``Training Algorithms for Linear Text Classifiers,'' SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Konstanz. Hartung-Gorre Verlag, 1996, 298--306.

Titterington, D.M. (1984). Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society (Series B), 46, 257-267.

## Project #: DIMACS2003-03

Mentor: Graham Cormode, graham@dimacs.rutgers.edu, DIMACS postdoc

Project title: String similarity methods

String edit distance has been studied for over 30 years, the minimum number of character insertions or deletions necessary to transform one string into another[1]. Recently, there has been renewed interest in questions beyond simply calculating edit distance thanks to its application in analyzing biological sequences, intercepted messages, and advanced (web) searching applications. A number of new questions present themselves: what about approximating these distances, allowing substring moves, computing a distnace by communicating or with small space usage. Can string edit distance, or something like it, be used to cluster sequences, search collections based on string similarity, and so on? There are many interesting questions in this area, many of which it will be practical to make progress on.

A course in Algorithms and Data Structures is very helpful. Information Theory, or knowledge of Compression would be useful but not vital.

[1] See, for example, "Algorithms on Strings, Trees and Sequences" by Dan Gusfield for more information.

## Project #: DIMACS2003-04

Mentor: Leonid Khachiyan, leonid@cs.rutgers.edu, Computer Science

Project Title: Dualization of Posets

## Project #: DIMACS2003-05

Mentor: Wilma Olson, olson@rutchem.rutgers.edu, Department of Chemistry

Project title: Protein-induced DNA Looping

Many genetic processes are mediated by proteins that bind at separate, often widely spaced, sites on the double helix, tethering the intervening DNA into a loop [1, 2]. Examples of these processes include gene expression and its control, DNA replication, genetic rearrangements, multi-site cutting by restriction enzymes, and DNA packaging. A DNA loop stabilized by such long-range protein-protein contacts constitutes a discrete topological unit. As long as the ends of the DNA stay in place and the duplex remains unbroken, the linking number, Lk, or number of times the two strands of the double helix wrap around one another, is conserved. This constraint in Lk underlies the well known supercoiling of DNA: the stress induced by positioning the ends of the polymer in locations other than the natural (relaxed) state perturbs the overall coiling of the chain axis and/or the twisting of successive base-pairs in the intervening parts of the chain [3]. As a first step in understanding the effects of specific proteins and drugs on DNA looping, we propose to study the imposed bending and twisting of neighboring base pairs [4] in known complexes of proteins and drugs with double helical DNA stored in the Nucleic Acid Database [5]. By subjecting a DNA segment of the same chain length as that found in a given complex to the observed orientation, displacement, and superhelical stress and setting the elastic moduli to sufficiently high values, we can use existing programs, e.g., [6], to simulate the presence of a rigidly bound molecule at arbitrary positions on circular DNA molecules or to model specific systems in which DNA looping plays an important role, e.g., the lac repressor-operator assembly in EscherichiaA0coli [7]. One student could devote a summer project to the collection of geometric information. Other students could apply this information to simulations of DNA loops and folds in subsequent years. Prerequisites: Students should have an interest (but not necessarily formal training) in molecular biology, familiarity with linear algebra in order to understand the parameters used to describe DNA structure, and knowledge of basic chemistry and physics to understand the nature of the DNA molecules and the elastic treatment of the double helix.

[1] Halford, S. E., Gowers, D. M., and Sessions, R. B., ``Two are Better than One,'' Nature Struct. Biol., 7, (2000), 705-707.

[2] Schleif, R., ``DNA Looping,'' Annu. Rev. Biochem., 61, (1992), 199-223.

[3] Bauer, W. R. ``Structure and Reactions of Closed Duplex DNA,'' Annu. Rev. Biophys. Bioeng., 7, (1978), 287-313.

[4] Olson, W. K., Gorin, A. A., Lu, X.-J., Hock, L. M., and Zhurkin, V. B., ``DNA Sequence-dependent Deformability Deduced from Protein-DNA Crystal Complexes,'' Proc. Natl. Acad. Sci. USA, 95, (1998), 11163-11168.

[5] Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S.-H., Srinivasan, A. R., and Schneider, B., ``The Nucleic Acid Database: A Comprehensive Relational Database of Three-dimensional Structures of Nucleic Acids,'' Biophys. J., 63, (1992), 751-759.

[6] Westcott, T. P., Tobias, I., and Olson, W. K., ``Modeling Self-contact Forces in the Elastic Theory of DNA Supercoiling,'' J. Chem. Phys., 107, (1997), 3967-3980.

[7] Muller-Hill, B., The Lac Operon, Walter de Gruyter, Berlin, 1996.

## Project #: DIMACS2003-06

Mentor: Doron Zeilberger, zeilberg@math.rutgers.edu, Math Dept.

Project title: Experimental Yet Rigorous Mathematics

The research program of Doron Zeilberger (see http://www.math.rutgers.edu/~zeilberg/) consists in using computer algebra to, experimentally, find rigorous proof of open problems in mathematics, especially in combinatorics. A good knowledge of Maple, and programming experience (in Maple or other languages) is preferred.

## Project #: DIMACS2003-07

Mentor: S. Muthu Muthukrishnan, muthu@cs.rutgers.edu, Dept. of Computer Science

Project Title: Monitoring methods for topic drift in message streams

Consider the series of emails one receives. Hot topics appear and topics that were once hot disappear over time. How to develop mathematics needed to track the topics of current interest, design algorithms for analyzing this process and build systems to monitor the email stream? These issues are relevant to many applications including in homeland security.

All aspects of this problem need to be understood: Markov Modelling methods in Mathematics, Streaming algorithms for tracking topics, Usable software for this problem, are some of the examples.

I am looking for REUs who can address any of these aspects of this problem.

## Project #: DIMACS2003-08

Mentor: Michael Capalbo, mcapalbo@dimacs.rutgers.edu, DIMACS postdoc

Consider the following problem. We are given a necklace with N beads, where each bead is colored one of k colors. We want to cut up the necklace into as few pieces as possible, in such a way that for each color i, the number of beads of color i that is in the odd piece(s) is the same as the number of beads of color i that is in the even piece(s). For example, if every bead in the neckace is the same color (k =1), then all we need to do is to cut the necklace into only 2 pieces of the same size. On one hand, it has been shown by Goldberg and West that there is always such a way of cutting the necklace that uses only k+1 pieces. On the other hand, for large k, there is no efficient algorithm for actually finding how to cut the necklace using even a much larger number of cuts, say, even k log N cuts; all we know how to do now is check many of the possibilities, which takes too long. Thus, the open problem is to come up with an efficient algorithm for finding such a way of cutting the necklace that uses only a 'small' number of pieces.

## Project #: DIMACS2003-09

Mentor: James Abello, abello@dimacs.rutgers.edu, DIMACS

Project Title: External Memory Graph Algorithms

Description: When a graph does not fit in RAM many of the classical algorithms break down. Operations that we take for granted, like graph traversing, get wretched when they are faced with the I/O bottleneck. Efficient external memory algorithms have been devised for Connectivity and Minimum Spanning Trees. However, no satisfactory I/O results exist for very fundamental problems like BFS and DFS. We would like to proceed with our investigations on these and related graph problems like strongly connected components, shortest path trees and quasi-clique computations.

From the experimental side we want to apply the developed algorithms to the DIMACS and ACM members graphs and to some segments of the internet repository.

Requirements: Students interested in joining this research should have background on Algorihtms and Data Structures. Some proficiency in C/C++ and Java is preferable.

References:

[1] B. Cipra, Massive Graphs Pose Big Problems, Siam News, 1999. http://www.siam.org/siamnews/04-99/massive.htm

[2] J. Abello, A. Buschbaum, J. Westbrook, A Functional Approach to External Graph Algorithms, Algorithmica, Vol. 32, Issue 3, pp 437-458. http://www.research.att.com/areas/visualization/abstracts/2001.12.htm

[3] J. Abello, J. Vitter(eds), External Memory Algorithms, Vol 50 of the AMS-DIMACS Series, 1999.

## Project #: DIMACS2003-10

Mentor: James Abello, abello@dimacs.rutgers.edu, DIMACS

Project Title: Computational Geometry

Description: One interesting growing set of problems has emerged in the area of Geometric Graphs. From the complexity point of view some of these problems are known to be only in PSPACE. The type of questions that we are addressing ask for the existence of a geometric configuration that satisfies some a priori combinatorial condition. An example of this is the problem of Visibility Graph Recognition. It is not even known to be in NP or NP-Hard. We want to tackle this and similar realizability questions.

Requirements: Some knowledge and taste of computational geometry and combinatorics are good starting points for this research.

References: Visibility Graphs of Staircase Polygons and The Weak Bruhat Order, Discrete and Computational Geometry, Vol. 14, No 3, 1995, pp. 331-358.

## Project #: DIMACS2003-11

Mentor: James Abello, abello@dimacs.rutgers.edu, DIMACS

Project Title: Visualization and Graph Mining

Description: A variety of massive data sets have an underlying graph structure. Examples are the Web, the Internet and Call detail graphs. They have low diameter, are sparse and obey a power law. We are just beginning to understand how these characteristics can be exploited to drive a data mining engine in charge of extracting useful information from the data. A central question of interest is how to represent, navigate and visualize a graph with millions of nodes and edges. The main problem to be solved is termed the Screen bottleneck. A good set of techniques to attack this question are based on Graph Hierarchy Trees. We want to further develop these techniques.

A collection of exploratory algorithms will be applied to a variety of DIMACS and ACM member graphs and to some segments of the internet repository.

Requirements: Students interested in joining this research should have background on Algorithms and Data Structures. Knowledge of C/C++ and Java is preferably. Some graphics expertise is recommended but no necessary.

References:

[1] FOCS Workshop on Algorithms and Models for the Web Graph, http://www.almaden.ibm.com/cs/people/ravi/waw.html

[2] J. Abello, J. Korn, MGV: A System for Visualizing Massive MultiDigraphs, IEEE Transactions on Visualization and Computer Graphics, Vol. 8, No 1, January-March 2002 http://www.research.att.com/areas/visualization/abstracts/2001.11.htm

## Project #: DIMACS2003-12

Mentor: Alexander Barg, abarg@dimacs.rutgers.edu, DIMACS

Project Title: Constructions and performance of binary codes from bipartite graphs

Recent progress in construction of good binary code families is associated with codes from bipartite graphs decodable by simple iterative procedures. This project is concerned with bipartite-graph codes for which convergence of decoding algorithms is proved relying upon expanding properties of the underlying graph. Several steps in theoretical understanding of these codes were made in [1],[2],[3] and subsequent papers.

The purpose of this project is to construct and establish performance of several specific examples of expander codes. In the first stage the student will have to study and implement in software constructions of expander graphs. Completion of this stage will result in good understanding of expander graphs and codes, which both are in the forefront of research in theoretical CS.

The second stage suggests to simulate error probability of several codes from the family described and to compare their performance with another popular code family, the so-called low-density parity-check codes decoded by a message-passing iterative algorithm, for which the error probability of decoding is known and appears in the literature. The second stage, depending on several factors, may result in a research publication.

Prerequisites: Good programming skills. Some prior exposure to graph theory, algebra, and coding theory would be beneficial, but can be obtained through reading in the first part of the project.

References:
[1] M. Sipser and D. Spielman, "Expander codes" IEEE Trans. Inform. Theory, vol. 42, pp. 1710-1722, 1996.
[2] G. Zemor, On expander codes, ibid., vol. 47, pp. 835-837, 2001.
[3] A. Barg and G. Zemor, Error exponents of expander codes, ibid., vol. 48, pp. 1725-1729, 2002.

Index of REU program