
Mentor: Eric Allender, Computer Science Department
There are many clever algorithms to check whether a graph is planar. Hopcroft and Tarjan [1] gave a linear-time algorithm almost 30 years ago, but efficient parallel algorithms proved much harder to find. Ja'Ja' and Simon [2] gave NC algorithms, and a log-time PRAM algorithm with essentially optimal processor and time bounds was given by Ramachandran and Reif [3]. Still, it remained unknown exactly which complexity class planarity belongs to, until Allender and Mahajan [4] showed that the Ramachandran and Reif algorithm can be reduced to the reachability problem in undirected graphs, placing planarity in the class SL. The Ramachandran and Reif algorithm is complicated, because much effort is spent minimizing the number of processors. In this project we will study their algorithm and see whether a simpler algorithm can be found that still shows planarity is in SL. Prerequisites: A course in graph theory.
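While full planarity testing requires the sophisticated machinery discussed above, a quick necessary (but far from sufficient) condition follows from Euler's formula: every simple planar graph on v >= 3 vertices has at most 3v - 6 edges. The sketch below (the function name is ours, invented for illustration) applies this bound:

```python
def passes_euler_bound(num_vertices, edges):
    """Return False if the edge count alone already rules out planarity.

    Euler's formula implies a simple planar graph with v >= 3 vertices
    has at most 3v - 6 edges; this is necessary but not sufficient.
    """
    v, e = num_vertices, len(edges)
    if v < 3:
        return True
    return e <= 3 * v - 6

# K5, the complete graph on 5 vertices, has 10 > 3*5 - 6 = 9 edges,
# so the bound already certifies that it is non-planar.
k5_edges = [(i, j) for i in range(5) for j in range(i + 1, 5)]
print(passes_euler_bound(5, k5_edges))  # False
```

Note that the converse fails: K3,3 has 6 vertices and 9 <= 12 edges, yet is non-planar, which is why real planarity tests need much more work.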
[1] Hopcroft, J., and Tarjan, R., ``Efficient Planarity Testing,'' Journal of the ACM, 21, (1974), 549-568.
[2] Ja'Ja', J., and Simon, J., ``Parallel Algorithms in Graph Theory: Planarity Testing,'' SIAM Journal on Computing, 11, (1982), 314-328.
[3] Ramachandran, V., and Reif, J., ``Planarity Testing in Parallel,'' Journal of Computer and System Sciences, 49, (1994), 517-561.
[4] Allender, E., and Mahajan, M., ``The Complexity of Planarity Testing,'' Proc. 17th Symposium on Theoretical Aspects of Computer Science (STACS) 2000, Lecture Notes in Computer Science 1770, 2000, 87-98.
Mentor: Endre Boros, RUTCOR
Nonlinear binary optimization is one of the most general forms of many combinatorial optimization problems (see, e.g., [1]). Applications are widespread and include VLSI design, statistical mechanics, portfolio selection, various production and scheduling problems, game theoretical problems, reliability problems, etc. In general, it is a very hard problem to solve. Efficiently finding an approximate solution is very important, and this research direction has become very active in the past 10 years. The best methods, however, are based on formulations in much higher dimensions than the original problem, posing great computational difficulties when these methods are applied. A promising approach to finding approximations without increasing the dimension of the problem, introduced in [2], is based on finding tight convex majorants of the objective function. This leads to the question proposed here: given a multilinear polynomial f(x), find a convex function g(x) with f(x) less than or equal to g(x) for all x in the unit cube [0,1]^{n}, such that the maximum difference g(x) - f(x) over the binary vectors x in {0,1}^{n} is as small as possible. For instance, if f(u,v)=uv in n=2 dimensions, then we know that a best convex majorant of f is g(u,v)=((u+v)/2)^{2}. It is not known, however, what the tightest convex majorants are, even for 3-variable multilinear polynomials, e.g., f(u,v,w)=2uv-5vw. Best linear majorants were considered in the literature, leading to good bounds [3,4] and to several interesting open questions about polyhedral combinatorics. Possible formulations for finding a tightest convex majorant include linear programming based on row generation, or semidefinite programming, both of which are interesting and somewhat challenging already for small (n=3,4) dimensions.
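For such small dimensions, the quality of a candidate majorant can be checked by brute force over the binary vectors. The sketch below (function names are ours, not from the references) does this for the two-variable example mentioned above:

```python
from itertools import product

def max_gap_on_binary(f, g, n):
    """Largest g(x) - f(x) over the binary vectors x in {0,1}^n."""
    return max(g(*x) - f(*x) for x in product((0, 1), repeat=n))

# The two-variable example from the text: f(u,v) = uv with the best
# convex majorant g(u,v) = ((u+v)/2)^2.  Indeed g - f = (u-v)^2/4 >= 0
# on all of [0,1]^2, and the gap on {0,1}^2 peaks at (1,0) and (0,1).
f = lambda u, v: u * v
g = lambda u, v: ((u + v) / 2) ** 2

print(max_gap_on_binary(f, g, 2))  # 0.25
```

The same routine could score candidate majorants for the open 3-variable cases, though verifying convexity and majorization over the whole cube of course requires more than enumeration of the binary points.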
Prerequisite: An introductory course in operations research or linear programming would help, but an entire course is not needed, since the necessary background concerning LP, semidefinite models, etc., can be learned to the extent needed. We will probably only consider problems involving 3-4 variables.
[1] Padberg, M., ``The Boolean Quadric Polytope: Some Characteristics, Facets and Relatives,'' Mathematical Programming, 45, (1989), 139-172.
[2] Badics, T., ``Approximation of Some Nonlinear Binary Optimization Problems,'' Ph.D. Thesis, RUTCOR, Rutgers University, 1996.
[3] Boros, E., Lari, I., and Simeone, B., ``Block Linear Majorants in Quadratic 0-1 Optimization,'' RUTCOR Research Report, RRR 18-2000, Rutgers University, March, 2000.
[4] Hammer, P.L., Hansen, P., and Simeone, B., ``Roof Duality, Complementation and Persistency in Quadratic 0-1 Optimization,'' Mathematical Programming, 28, (1984), 121-155.
Mentor: Peter Hammer, RUTCOR
This project is devoted to a frequently encountered problem of data analysis, in which a set of ``observations'' is given, with each observation represented as a vector of binary attribute values. The observations in the data set are of two types, and the type of each observation (e.g., positive or negative) is known. Typical data analysis problems related to such data sets include classification (i.e., identification of the type of a new observation not included in the data set), determination of characteristic properties of observations of the same type, analysis of the role of various attributes, etc. The logical analysis of data (LAD) [1, 2, 3, 4, 5] is a methodology addressing the above kinds of problems. The mathematical foundation of LAD is in discrete mathematics, with a special emphasis on the theory of Boolean functions. Patterns are the key building blocks in LAD [4], as well as in many other rule induction algorithms (such as C4.5 rules, CN2, AQ17-HCI, RISE, RIPPER, and SLIPPER; see for example [6, 7]). Since a typical data set has an exceedingly large number of patterns, all these algorithms are limited to the consideration of small subsets of patterns. In most algorithms, the choice of such a subset of patterns is not explicitly analyzed, in spite of the fact that empirical studies and practical applications have shown some patterns to be more ``suitable'' than others for use in data analysis. The goal of this project is to model various such suitability criteria and to analyze the relevance of these notions of suitability for LAD algorithms. Prerequisites: A little knowledge of graph theory, linear programming, and any elements of discrete mathematics are all useful, but none is absolutely necessary.
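As a toy illustration of the pattern notion (the helper names and the simple coverage criterion below are our own hypothetical choices, not the LAD implementation of the references), one can score a pattern, i.e., a conjunction of fixed attribute values, by how many positive observations it covers while covering no negatives:

```python
def covers(pattern, obs):
    """pattern: dict mapping attribute index -> required bit; obs: tuple of bits."""
    return all(obs[i] == bit for i, bit in pattern.items())

def pattern_coverage(pattern, positives, negatives):
    """A crude suitability score: number of positives covered,
    or 0 if the pattern also covers any negative observation."""
    if any(covers(pattern, o) for o in negatives):
        return 0
    return sum(covers(pattern, o) for o in positives)

positives = [(1, 1, 0), (1, 0, 1), (1, 1, 1)]
negatives = [(0, 1, 0), (0, 0, 1)]

# The pattern "attribute 0 equals 1" covers all three positives and no negative.
print(pattern_coverage({0: 1}, positives, negatives))  # 3
```

Richer suitability criteria, e.g., weighing pattern length against coverage, are exactly the kind of notions this project would model and analyze.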
[1] Boros, E., Hammer, P.L., Ibaraki, T., and Kogan, A., ``Logical Analysis of Numerical Data,'' Math. Programming, 79, (1997), 163-190.
[2] Boros, E., Hammer, P.L., Ibaraki, T., Kogan, A., Mayoraz, E., and Muchnik, I., ``An Implementation of Logical Analysis of Data,'' IEEE Trans. on Knowledge and Data Engineering, 12, (2000), 292-306.
[3] Boros, E., Ibaraki, T., and Makino, K., ``Logical Analysis of Binary Data with Missing Bits,'' Artificial Intelligence, 107, (1999), 219-263.
[4] Crama, Y., Hammer, P.L., and Ibaraki, T., ``Cause-effect Relationships and Partially Defined Boolean Functions,'' Annals of Oper. Res., 16, (1988), 299-326.
[5] Ekin, O., Hammer, P.L., and Kogan, A., ``Convexity and Logical Analysis of Data,'' Theoretical Computer Science, 244, (2000), 95-116.
[6] Cohen, W.W., and Singer, Y., ``A Simple, Fast, and Effective Rule Learner,'' in Proc. of the Sixteenth National Conference on Artificial Intelligence, AAAI Press, Menlo Park, CA, 1999, 335-342.
[7] Domingos, P., ``Unifying Instance-based and Rule-based Induction,'' Machine Learning, 24, (1996), 141-168.
Mentor: David Madigan, Department of Statistics
Text categorization is the automated assigning of documents to predefined content-based categories. Applications of text categorization include controlled vocabulary indexing (in both traditional bibliographic databases and newer Web-based directories), content filtering of email and Web access, subsetting information feeds, and data mining on records with both textual and non-textual fields. This project will involve developing and evaluating novel Bayesian statistical methods for text categorization and applying them to very large collections of documents. Prerequisites for this work are a basic proficiency in computer programming as well as an undergraduate course in probability and statistics.
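As a point of reference for what a Bayesian text categorizer looks like, here is a minimal multinomial naive Bayes classifier with Laplace smoothing. This is a standard baseline sketched by us; the project's novel methods would go well beyond it, and the example data are made up:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (category, list-of-words) pairs."""
    cat_counts, word_counts, vocab = Counter(), {}, set()
    for cat, words in docs:
        cat_counts[cat] += 1
        word_counts.setdefault(cat, Counter()).update(words)
        vocab.update(words)
    return cat_counts, word_counts, vocab

def classify(model, words):
    """Pick the category maximizing log P(category) + sum log P(word|category),
    with add-one (Laplace) smoothing of the word probabilities."""
    cat_counts, word_counts, vocab = model
    total_docs = sum(cat_counts.values())
    def log_posterior(cat):
        wc = word_counts[cat]
        denom = sum(wc.values()) + len(vocab)
        prior = math.log(cat_counts[cat] / total_docs)
        return prior + sum(math.log((wc[w] + 1) / denom) for w in words)
    return max(cat_counts, key=log_posterior)

docs = [("sports", ["goal", "match", "team"]),
        ("sports", ["team", "score"]),
        ("finance", ["stock", "market", "price"])]
model = train(docs)
print(classify(model, ["team", "goal"]))  # "sports"
```

Scaling such models to very large document collections, and replacing the naive independence assumption with better-founded Bayesian machinery, is where the research questions begin.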
Mentor: Wilma Olson, Department of Chemistry
Many genetic processes are mediated by proteins that bind at separate, often widely spaced, sites on the double helix, tethering the intervening DNA into a loop [1, 2]. Examples of these processes include gene expression and its control, DNA replication, genetic rearrangements, multisite cutting by restriction enzymes, and DNA packaging. A DNA loop stabilized by such long-range protein-protein contacts constitutes a discrete topological unit. As long as the ends of the DNA stay in place and the duplex remains unbroken, the linking number, Lk, or number of times the two strands of the double helix wrap around one another, is conserved. This constraint in Lk underlies the well-known supercoiling of DNA: the stress induced by positioning the ends of the polymer in locations other than the natural (relaxed) state perturbs the overall coiling of the chain axis and/or the twisting of successive base pairs in the intervening parts of the chain [3]. As a first step in understanding the effects of specific proteins and drugs on DNA looping, we propose to study the imposed bending and twisting of neighboring base pairs [4] in known complexes of proteins and drugs with double helical DNA stored in the Nucleic Acid Database [5]. By subjecting a DNA segment of the same chain length as that found in a given complex to the observed orientation, displacement, and superhelical stress and setting the elastic moduli to sufficiently high values, we can use existing programs, e.g., [6], to simulate the presence of a rigidly bound molecule at arbitrary positions on circular DNA molecules or to model specific systems in which DNA looping plays an important role, e.g., the lac repressor-operator assembly in Escherichia coli [7]. One student could devote a summer project to the collection of geometric information. Other students could apply this information to simulations of DNA loops and folds in subsequent years.
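The conservation of Lk described above is commonly expressed through White's formula, a standard relation in DNA topology (stated here for orientation; it is not specific to the programs cited):

```latex
% White's formula: for a closed circular duplex the linking number Lk
% decomposes into twist (Tw, the winding of the two strands about the
% duplex axis) and writhe (Wr, the coiling of the axis in space):
\[
  Lk \;=\; Tw \;+\; Wr .
\]
% Because Lk is fixed while the ends stay in place and the duplex is
% unbroken, any protein-imposed change in Tw must be compensated by an
% equal and opposite change in Wr, which is the origin of supercoiling.
```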
Prerequisites: Students should have an interest (but not necessarily formal training) in molecular biology, familiarity with linear algebra in order to understand the parameters used to describe DNA structure, and knowledge of basic chemistry and physics to understand the nature of the DNA molecules and the elastic treatment of the double helix.
[1] Halford, S. E., Gowers, D. M., and Sessions, R. B., ``Two are Better than One,'' Nature Struct. Biol., 7, (2000), 705-707.
[2] Schleif, R., ``DNA Looping,'' Annu. Rev. Biochem., 61, (1992), 199-223.
[3] Bauer, W. R., ``Structure and Reactions of Closed Duplex DNA,'' Annu. Rev. Biophys. Bioeng., 7, (1978), 287-313.
[4] Olson, W. K., Gorin, A. A., Lu, X.-J., Hock, L. M., and Zhurkin, V. B., ``DNA Sequence-dependent Deformability Deduced from Protein-DNA Crystal Complexes,'' Proc. Natl. Acad. Sci. USA, 95, (1998), 11163-11168.
[5] Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S.-H., Srinivasan, A. R., and Schneider, B., ``The Nucleic Acid Database: A Comprehensive Relational Database of Three-dimensional Structures of Nucleic Acids,'' Biophys. J., 63, (1992), 751-759.
[6] Westcott, T. P., Tobias, I., and Olson, W. K., ``Modeling Self-contact Forces in the Elastic Theory of DNA Supercoiling,'' J. Chem. Phys., 107, (1997), 3967-3980.
[7] Müller-Hill, B., The Lac Operon, Walter de Gruyter, Berlin, 1996.
Mentors: Kobbi Nissim and Ofer Melnik, DIMACS
There are many problems that lie on the frontier between machine learning and algorithmics. As an example, one may look at issues of dimensionality: as the number of input variables or model parameters increases, problems become computationally harder and more opaque. Thus the intrinsic challenges revolve around finding ways to address issues of high dimensionality. Approaches to date have involved approximation, dimensionality reduction, re-representation, complexity analysis, generalization/model assumptions, or exploiting regularity.
This research project will involve some programming, design, and analysis. A background in computer science, programming, or math/stats, together with an interest in machine learning, is desired.
Dr. Gurvich has three problems to work on; students selected to work with Dr. Gurvich may work on one or more of them.
max a_{ij} + p_{i} - q_{j} = min b_{ij} - p_{i} + q_{j} = v for all i = 1,...,m; j = 1,...,n.
This theorem has important applications in game theory. But how can such potentials be found? No polynomial-time algorithm is known yet, though the problem lies in the intersection of NP and coNP.
Mentor: Michael Rothkopf, RUTCOR
Most models of auctions ignore bid preparation costs. Most, if not all, of those that do not ignore them treat them as a fixed amount. I am interested in a more nuanced model of bid preparation costs in which bidders may invest more or less effort in bid preparation and in which high effort implies better value estimation. While many approaches are possible, I envision the following class of models: Before an auction of a given type (standard sealed bid, sealed second-price, progressive), each of n a priori symmetric bidders has to decide how much to invest in gathering information about her value for what is to be sold. The value is, in the general case, composed of a private value component and a value component common to all bidders, and the information gathering investment can be allocated between these components.
After each bidder makes her investment and gets the information, the bidders then participate in the auction. Several issues are of interest. How much is spent on evaluation? If there is both private and common information, how is the effort allocated between seeking information on the common value and seeking information on the private value? How inefficient is the allocation of the item compared to the case when information is free? How do these answers depend upon the number of bidders, the cost of information, and the type of auction?
I envision this analysis being done, at least initially, on a simple, stylized model in which random variables have independent uniform distributions. The student working on this should have had a course in mathematical statistics or probability theory. Some background in microeconomics or game theory would be a plus.
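As a baseline for the free-information benchmark, the simplest member of this model class can be simulated directly. The sketch below is our construction, not part of the project description: n bidders with independent private values uniform on [0,1] in a first-price sealed-bid auction, where the standard symmetric equilibrium bid is (n-1)/n times the bidder's value.

```python
import random

def simulate_auction(n, trials, seed=0):
    """Average seller revenue in a first-price sealed-bid auction with
    n bidders whose private values are independent Uniform(0, 1)."""
    rng = random.Random(seed)
    revenue = 0.0
    for _ in range(trials):
        values = [rng.random() for _ in range(n)]
        # Symmetric equilibrium bid with uniform private values: (n-1)/n * v.
        bids = [(n - 1) / n * v for v in values]
        revenue += max(bids)
    return revenue / trials

# Theory predicts expected revenue (n-1)/(n+1); for n = 2 that is 1/3,
# and the Monte Carlo estimate should land close to it.
print(simulate_auction(2, 100_000))
```

Introducing costly, chosen information acquisition before the bidding stage, which is the actual research question, would add a prior investment decision to each simulated bidder.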
Mentor: Rebecca Wright, DIMACS
The exact problem will be chosen from general topic areas that include design and analysis of cryptographic protocols, distributed cryptography, privacy, secure multiparty computation, and fault-tolerant distributed computing. Possible goals include designing protocols, systems, and services that perform their specified computational or communication functions even if some of the participants or underlying components behave maliciously, and that balance individual needs such as privacy with collective needs such as network survivability and public safety.
Depending on the interests and expertise of the student, this project can be more focused on foundational mathematical results, on implementation and experimentation, or on written analysis of software usage and its security implications.
NOTE: Dr. Wright is currently at AT&T Labs. She will be a Visiting Research Associate at DIMACS during the term of the REU program. Background information on her interests can be found at her website: www.kiskeya.net/rwright
Mentor: Jonathan Eckstein, RUTCOR