Proposed projects for the 2005 DIMACS and DIMACS/DIMATIA REU Programs

Project #: DIMACS2005-01

Mentor: Wilma Olson, olson at, Department of Chemistry

Project title: Protein-induced DNA Looping

Many genetic processes are mediated by proteins that bind at separate, often widely spaced, sites on the double helix, tethering the intervening DNA into a loop [1, 2]. Examples of these processes include gene expression and its control, DNA replication, genetic rearrangements, multi-site cutting by restriction enzymes, and DNA packaging. A DNA loop stabilized by such long-range protein-protein contacts constitutes a discrete topological unit. As long as the ends of the DNA stay in place and the duplex remains unbroken, the linking number, Lk, or number of times the two strands of the double helix wrap around one another, is conserved. This constraint in Lk underlies the well known supercoiling of DNA: the stress induced by positioning the ends of the polymer in locations other than the natural (relaxed) state perturbs the overall coiling of the chain axis and/or the twisting of successive base-pairs in the intervening parts of the chain [3]. As a first step in understanding the effects of specific proteins and drugs on DNA looping, we propose to study the imposed bending and twisting of neighboring base pairs [4] in known complexes of proteins and drugs with double helical DNA stored in the Nucleic Acid Database [5]. By subjecting a DNA segment of the same chain length as that found in a given complex to the observed orientation, displacement, and superhelical stress and setting the elastic moduli to sufficiently high values, we can use existing programs, e.g., [6], to simulate the presence of a rigidly bound molecule at arbitrary positions on circular DNA molecules or to model specific systems in which DNA looping plays an important role, e.g., the lac repressor-operator assembly in EscherichiaA0coli [7]. One student could devote a summer project to the collection of geometric information. Other students could apply this information to simulations of DNA loops and folds in subsequent years. Prerequisites: Students should have an interest (but not necessarily formal training) in molecular biology, familiarity with linear algebra in order to understand the parameters used to describe DNA structure, and knowledge of basic chemistry and physics to understand the nature of the DNA molecules and the elastic treatment of the double helix.

[1] Halford, S. E., Gowers, D. M., and Sessions, R. B., ``Two are Better than One,'' Nature Struct. Biol., 7, (2000), 705-707.

[2] Schleif, R., ``DNA Looping,'' Annu. Rev. Biochem., 61, (1992), 199-223.

[3] Bauer, W. R. ``Structure and Reactions of Closed Duplex DNA,'' Annu. Rev. Biophys. Bioeng., 7, (1978), 287-313.

[4] Olson, W. K., Gorin, A. A., Lu, X.-J., Hock, L. M., and Zhurkin, V. B., ``DNA Sequence-dependent Deformability Deduced from Protein-DNA Crystal Complexes,'' Proc. Natl. Acad. Sci. USA, 95, (1998), 11163-11168.

[5] Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S.-H., Srinivasan, A. R., and Schneider, B., ``The Nucleic Acid Database: A Comprehensive Relational Database of Three-dimensional Structures of Nucleic Acids,'' Biophys. J., 63, (1992), 751-759.

[6] Westcott, T. P., Tobias, I., and Olson, W. K., ``Modeling Self-contact Forces in the Elastic Theory of DNA Supercoiling,'' J. Chem. Phys., 107, (1997), 3967-3980.

[7] Muller-Hill, B., The Lac Operon, Walter de Gruyter, Berlin, 1996.

Project #: DIMACS2005-02

Mentor: Paul Kantor, kantor at , Communication, Information and Library Studies

We would like to be able to distinguish authors by their choice of words when writing. Which words matter? How do we know? These are classic questions posed in discussing the Federalist Papers, but now we are trying to apply the concept to identify authors in millions of email messages. What are the right features? What kind of machine learning is required? What statistical theories may govern the process?

Project #: DIMACS2005-03

Mentor: Vladimir A. Gurvich, gurvich at, RUTCOR

I. It is known that, given two mxn matrices A and B, there exist m+n potentials p1,...,pm; q1,...,qn. Such that

max aij + pi - qj = min bij - pi + qj = v for all i = 1, ..., mj = 1, ..., n.

This theorem has important applications in game theory. But how to find such potentials ? No polynomial algorithm is known yet, though one should exist, since the problem is in the intersection of NP and co-NP.

II. Difference graphs. Given n finite sets S1,...,Sn , let us define a graph G with n vertices v1,...,vn in which 2 vertices vi and vj are connected by an edge IFF both differences Si - Sj and Sj - Si have at least k elements each. EVERY graph can be realized in such a way if k is arbitrarily large. But is it still true if k is bounded by a constant ? Even for k = 2 no counter-example is known. If k = 1 then only co-occurrence graphs can be realized.

III. All edges of a complete graph G are colored by n colors, that is labeled by positive integers [n] = {1,...,n}, where n ³ 3. If G contains a triangle whose 3 edges are colored by 3 pairwise distict colors then there exist C1, ... , Cn Í V(G) such that no edge from Ci is labeled by i and C1 Ç ... Ç Cn = Æ.

The general case n ³ 3 can be reduced to n = 3.

This problem has applications in geometry and theory of positional games

IV. Given a (strongly connected) directed graph D = VA), and a subset V' in V of its vertices, list all MINIMAL subsets of arcs A' in A such that for every two vertices v,v' from V' there is a directed path from v to v' whose all arcs are in A'. This problem is tractable in case V' = V but in general it is still open.

V. Given vectors a1, ..., an in Rk and b1, ..., bn in Rm, list all MINIMAL subsets J of {1 ,..., n} such that both families of vectors {aj|j Î J} and {bj|j Î J} are linearly dependent in Rk and in Rm respectively.

Project #: DIMACS2005-04

Mentor: Richard Mammone, mammone at , Department of Electrical Engineering

Project title: Pattern Matching Using Spectral Graph Theory

Given two different photographic images (2D projections of 3d objects) we would like to determine if the two images are of the same 3D object or are projections of two different objects.For example we may have a photograph of a person's face with the person looking directly into the camera in one image and another image of the same or a different person looking to the side.We would like to estimate the likelihood that these two images are of the same person in different positions or are they projections of two different people. Recently,we have found that the likelihood of a match is directly related to the degree of bijectivity of the mapping from one image into the other. The current approach for mapping the images uses the displacement vector field between the two images and the degree of bijectivity is measured by the count of pixels having a bijective mapping normalized by the total pixel count. Thus, the problem is loosely speaking to determine if the two photographic images are isomorphic in some sense.

The proposed project would be to investigate and analyze a Spectral Graph Theory approach to the problem. The mapping between the images could be modeled by a bipartite graph i.e.a set of nodes and interconnecting edges where the nodes are the pixels and the edges indicate matching links between pixels of the two images.The difference between the first and second eigenvalues of the adjacency matrix is a well known measure of the degree of bijectivity of a graph from spectral graph theory.This measure will be studied in the context of pattern recognition of 3D objects from 2D images.

Project #: DIMACS2005-05

Mentor: Michael Littman, mlittman at , Department of Computer Science

We are examining a set of problems in the area of multiagent sequential decision making. One recent result is that in a two-player general-sum infinite-horizon alternating game, a Nash equilibrium exists in mixed stationary strategies and it can be specified by polynomial-precision relational numbers (thus leading to an NP-style algorithm for finding the equilibrium). Some open problems include whether the same can be said of three-player general-sum infinite-horizon alternating games (we think not) and whether two-player general-sum infinite-horizon alternating games have finite-size deterministic non-stationary strategies (we think so). We're also interested in whether the idea of "threatening" strategies (like tit-for-tat in Prisoner's dilemma) can be used to define an efficient algorithm for finding a non-stationary equilibria in these and related games.

Project #: DIMACS2005-06

Mentor: Wanpracha (Art) Chaovalitwongse, wchaoval at, Department of Industrial Engineering

Title: Classification of Epileptic Brain Activity

The goals of this project are as follows:

1) to explore and develop optimization, data mining techniques, and statistical models for the quantification of the spatio-temporal properties of the epileptic transition based on brain electrical activity (Electroencephalogram - EEG);

2) to develop novel pattern recognition, classification and clustering algorithms for epileptic brain activity classification based upon the brain connectivity graph structure of the dynamical properties of brain electrical activity underlying epileptic processes.

Project #: DIMACS2005-07

Mentor: Wanpracha (Art) Chaovalitwongse, wchaoval at, Department of Industrial Engineering

Title: Heuristics Approaches for Inventory Routing Problems

In this project, we consider problems arising in Supply Chain Management including routing problems, inventory management, and scheduling. Inventory management and vehicle routing are the critical issues in business process. This project will also address the inventory routing problem, which integrates two components of supply chain management: inventory control and vehicle routing. These two problems have been traditionally studied and solved separately. However, the integration of the two can have an astonishing impact on the improvement of overall supply chain system's performance. Computational approaches in integer programming and routing and scheduling heuristics will be developed in this project. These approaches can be applied to problems in finance and economics.

Project #: DIMACS2005-08

Mentor: Wanpracha (Art) Chaovalitwongse, wchaoval at, Department of Industrial Engineering

Title: Computational Approaches for Multi-Quadratic Integer Programming Problems

MQP plays an important role in modeling many diverse problems since it provides more efficient models compared to the simpler linear relaxation of a problem. Indeed, linear mixed 0-1, fractional, bilinear, bilevel, generalized linear complementarity, and many more programming problems are or can easily be reformulated as special cases of MQP. However, there are theoretical and practical difficulties in the process of solving such problems. The nonlinear constraints in MQP define a feasible region, which is in general neither convex nor connected. Moreover, even if the feasible region is a polyhedron, optimizing the quadratic objective function is strongly NP-hard as the resulting problem is considered to be the disjoint bilinear programming problem. Therefore, finding a finite and exact algorithm that solves large MQP problems is impractical. In this project, we aim to advance research in MQP, in both theoretical and computational aspects. Many approaches are being developed to solve practical large-scale MQP problems.

Project #: DIMACS2005-09

Mentor: Wanpracha (Art) Chaovalitwongse, wchaoval at, Department of Industrial Engineering

Title: Clustering via Quadratic Programming

Cluster analysis is an unsupervised multivariate data analysis that seeks to discover hidden patterns in a data set without apriori knowledge about the data. Most of clustering algorithms attempt to measure similarity or dissimilarity coefficients among all samples of the data. In other words, cluster analysis tries to group the most similar samples in the data. The idea of the project is to investigate cluster analysis based on combinatorial optimization approaches. Specifically, cluster analysis can be modeled mathematically as a quadratic programming problem. To this extent, we will explore various combinatorial techniques to solve this problem and will place cluster analysis within the realm of mathematical programming.

Project #: DIMACS2005-10

Mentor: Vadim Lozin, lozin at, RUTCOR

An independent set in a graph is a subset of vertices, no two of which are connected by an edge. The maximum independent set problem is that of finding in a graph an independent set of maximum size. In general, this problem is computationally intractable. Moreover, a maximum independent set is even hard to approximate. On the other hand, the problem admits efficient solutions for graphs in some particular classes, such as bipartite graphs (i.e. graphs partitionable into two independent sets). Finding new classes for which the maximum independent set problem is efficiently solvable is a challenging research direction. It involves both investigation of the structure of graphs in specific classes and development of various algorithmic graph techniques such as graph transformations, graph decompositions, methods of augmenting graphs, etc.

Project #: DIMACS2005-11

Mentor: S. Muthukrishnan, muthu at Department of Computer Science

We would like to study games in which each node of a path, tree or graph has some number of tokens at time 0. At each time step, each node gets some new tokens, some old tokens disappear, and some others move from one node to another. What are steady state situations with special graphs and starting configurations? We will study these problems by simulation, mathematics and doing new algorithms. These problems have many applications in Epidemiology, Distributed computing, etc.


R. J. Anderson, L. Lovasz, P. W. Shor, J. Spencer, E. Tardos and S. Winograd, Disks, balls and walls: Analysis of a combinatorial game, American Math. Monthly 96, pp. 481-493.

Project #: DIMACS2005-12

Mentor: Peter Hammer, hammer at, RUTCOR

This project is devoted to a frequently encountered problem of data analysis, in which a set of ``observations'' is given, with each of the observations being represented as a vector of binary attribute values. The observations in the data set are of two types, and the type of each observation (e.g., positive or negative) is known. Typical data analysis problems related to such data sets include classification (i.e., identification of the type of a new observation not included in the data set), determination of characteristic properties of observations of the same type, analysis of the role of various attributes, etc. The logical analysis of data (LAD) ([1, 2, 3, 4, 5]) is a methodology addressing the above kinds of problems. The mathematical foundation of LAD is in discrete mathematics, with a special emphasis on the theory of Boolean functions. Patterns are the key building blocks in LAD [4], as well as in many other rule induction algorithms (such as C4.5 rules CN2, AQ17-HCI, RISE, RIPPER and SLIPPER -- see for example [6, 7]). Since a typical data set has an exceedingly large number of patterns, all these algorithms are limited to the consideration of small subsets of patterns. In most algorithms, the choice of such a subset of patterns is not explicitly analyzed, in spite of the fact that it has been observed in empirical studies and practical applications that some patterns are more "suitable" than others for use in data analysis. The goal of this project is to model various such suitability criteria and to analyze the relevance of notions of suitability for algorithms for LAD. Prerequisites: A little knowledge of graph theory, linear programming, and any elements of discrete mathematics are all useful, but none is absolutely necessary.

[1] Boros, E., Hammer, P.L., Ibaraki, T., and Kogan, A., ``Logical Analysis of Numerical Data,'' Math. Programming, 79, (1997), 163-190.

[2] Boros, E., Hammer, P.L., Ibaraki, T., Kogan, A., Mayoraz, E., and Muchnik, I., ``An Implementation of Logical Analysis of Data,'' IEEE Trans. on Knowledge and Data Engineering, 12, (2000), 292-306.

[3] Boros, E., Ibarakai, T., and Makino, K., ``Logical Analysis of Binary Data with Missing Bits,'' Artificial Intelligence, 107, (1999), 219-263.

[4] Crama, Y., Hammer, P.L., and Ibaraki, T., ``Cause-effect Relationships and Partially Defined Boolean Functions,'' Annals of Oper. Res., 16, (1988), 299-326.

[5] Ekin, O. Hammer, P.L., and Kogan, A., ``Convexity and Logical Analysis of Data,'' Theoretical Computer Science, 244, (2000), 95-116.

[6] Cohen, W.W., and Singer, Y., ``A Simple, Fast, and Effective Rule Learner,'' in Proc. of the Sixteenth National Conference on Artificicial Intelligence, AAAI Press, Menlo Park, CA, 1999, 335-342.

[7] Domingos, P., ``Unifying Instance-based and Rule-based Induction,'' Machine Learning, 24, (1996), 141-168.

Project #: DIMACS2005-13

Mentor: Endre Boros, boros at, RUTCOR

Project title: Projections Based Learning

A simple generalization of decision tree based learning methods involves, iteratively, the projection of the training set into the space of a small subset of the attributes, derivation of a best classifier in that small space, and the introduction of the obtained signal as a new attribute for the original problem. Under some specific conditions this can be shown to converge to a good classifier. The project would involve extensions of this idea, characterizations of wider classes when the method can be guaranteed to converge, and implementations of such learners.

Project #: DIMACS2005-14

Mentor: Stanley Dunn, smd at, Department of Biomedical Engineering

The accomplishments of modern biology, highlighted by completion of sequencing of the Human Genome, are remarkable and have the potential to lead to unprecedented advances in biotechnology and in health care. The molecular "parts list" of cells and tissues is growing at an accelerating pace, and a major issue limiting the limiting the application of this data is the ability to integrate information on the parts into an understanding of the whole. These needs give rise to a set of activities that we term Molecular Systems Bioengineering (MSB). MSB represents an engineering approach to the understanding and control of biological processes. It encompasses high-throughput microarray data acquisition techniques on genomes, proteomes and other molecular catalogs, and it involves the use of both data-driven and principles-driven modeling techniques to create an understanding of biological phenotype based on a combination of molecular catalogs and environmental conditions.>/p>

The goal of this research is to investigate empirical, or data-driven methods for analyzing DNA microarrays. The aim is to explore data-driven rather than theoretical methods for elucidating regulation network structure from microarrays. Empirical methods, which have been studied in areas such as machine learning, computer vision and cognitive science, have not been widely studied in the context of microarray analysis.

Boosting, bootstrapping and factor space methods will be considered as techniques to develop algorithms for identifying genes clustered by expression as well as identifying regulatory relationships. We will also investigate empirical methods for: statistical analysis of empirical algorithm performance results; empirical comparison of different algorithms; methods / tools / databases for empirical performance evaluation; standardization and independent testing; and design of empirical evaluation methods and protocols. This research program will lead to methods for evaluating algorithms as well as understanding of the role of empirical methods for microarray analysis.

Index Index of REU program
DIMACS Home Page
Contacting the Center
Document last modified on February 16, 2005.