Proposed projects for the 2006 DIMACS and DIMACS/DIMATIA REU Programs

Project #: DIMACS2006-01

Mentor: Wilma Olson, olson at, Chemistry

Project title: Protein-induced DNA Looping

Many genetic processes are mediated by proteins that bind at separate, often widely spaced, sites on the double helix, tethering the intervening DNA into a loop [1, 2]. Examples of these processes include gene expression and its control, DNA replication, genetic rearrangements, multi-site cutting by restriction enzymes, and DNA packaging. A DNA loop stabilized by such long-range protein-protein contacts constitutes a discrete topological unit. As long as the ends of the DNA stay in place and the duplex remains unbroken, the linking number, Lk, or number of times the two strands of the double helix wrap around one another, is conserved. This constraint in Lk underlies the well known supercoiling of DNA: the stress induced by positioning the ends of the polymer in locations other than the natural (relaxed) state perturbs the overall coiling of the chain axis and/or the twisting of successive base-pairs in the intervening parts of the chain [3]. As a first step in understanding the effects of specific proteins and drugs on DNA looping, we propose to study the imposed bending and twisting of neighboring base pairs [4] in known complexes of proteins and drugs with double helical DNA stored in the Nucleic Acid Database [5]. By subjecting a DNA segment of the same chain length as that found in a given complex to the observed orientation, displacement, and superhelical stress and setting the elastic moduli to sufficiently high values, we can use existing programs, e.g., [6], to simulate the presence of a rigidly bound molecule at arbitrary positions on circular DNA molecules or to model specific systems in which DNA looping plays an important role, e.g., the lac repressor-operator assembly in EscherichiaA0coli [7]. One student could devote a summer project to the collection of geometric information. Other students could apply this information to simulations of DNA loops and folds in subsequent years. Prerequisites: Students should have an interest (but not necessarily formal training) in molecular biology, familiarity with linear algebra in order to understand the parameters used to describe DNA structure, and knowledge of basic chemistry and physics to understand the nature of the DNA molecules and the elastic treatment of the double helix.

[1] Halford, S. E., Gowers, D. M., and Sessions, R. B., ``Two are Better than One,'' Nature Struct. Biol., 7, (2000), 705-707.

[2] Schleif, R., ``DNA Looping,'' Annu. Rev. Biochem., 61, (1992), 199-223.

[3] Bauer, W. R. ``Structure and Reactions of Closed Duplex DNA,'' Annu. Rev. Biophys. Bioeng., 7, (1978), 287-313.

[4] Olson, W. K., Gorin, A. A., Lu, X.-J., Hock, L. M., and Zhurkin, V. B., ``DNA Sequence-dependent Deformability Deduced from Protein-DNA Crystal Complexes,'' Proc. Natl. Acad. Sci. USA, 95, (1998), 11163-11168.

[5] Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S.-H., Srinivasan, A. R., and Schneider, B., ``The Nucleic Acid Database: A Comprehensive Relational Database of Three-dimensional Structures of Nucleic Acids,'' Biophys. J., 63, (1992), 751-759.

[6] Westcott, T. P., Tobias, I., and Olson, W. K., ``Modeling Self-contact Forces in the Elastic Theory of DNA Supercoiling,'' J. Chem. Phys., 107, (1997), 3967-3980.

[7] Muller-Hill, B., The Lac Operon, Walter de Gruyter, Berlin, 1996.

Project #: DIMACS2006-02

Mentor: John Kolassa, kolassa at , Statistics

The proposed research will apply a variety of mathematical techniques, including multivariate complex analysis and combinatorics, to open questions concerning inference for small samples. This inference includes standard frequentist techniques including the calculation of p-values and confidence intervals. Techniques from multivariate complex analysis will be used to approximate multivariate tail probabilities. Techniques that allow for increased power of conditional inference while retaining strict control of Type 1 error rates will be extended.

Project #: DIMACS2006-03

Mentor: Vadim Lozin, lozin at, RUTCOR

This project focuses on the study of several fundamental problems of algorithmic graph theory, which are of both practical importance and theoretical significance.

In a graph, an independent set is a subset of vertices, no two of which are adjacent. The maximum independent set problem is that of finding in a graph an independent set of maximum size. Independent domination is the problem of finding an independent set of minimum size, which is maximal with respect to set inclusion. Vertex colorability can be formulated as the problem of partitioning the set of vertices into minimum number of subsets, each of which is independent.

In general, all three problems are computationally intractable. On the other hand, they admit efficient solutions for graphs in some particular classes, such as bipartite graphs (the independent set problem), split graphs (independent domination) or cubic graphs (vertex colorability). Finding new classes of graphs for which the problems are efficiently solvable is a challenging research direction. It involves both investigation of the structure of graphs in specific classes and development of various algorithmic graph techniques such as graph transformations, graph decompositions, etc.

Project #: DIMACS2006-04

Mentor: Kah Loon Ng,, DIMACS

Project title: Firefighting on Graphs

Consider a graph where one vertex is randomly chosen to be "set on fire". At each turn, we have a number of firefighters to position at vertices of our choice so as to protect these vertices from catching the fire. All other vertices who are adjacent to at least one other vertex who is currently on fire on that turn will catch fire if they are not protected by a firefighter. The process continues until either all vertices in the graph are either on fire or protected, or when the fire can no longer spread. This interesting "game" is known as the Firefighter problem and has nice applications to computational epidemiology. In this project, we will be looking at some of the existing work that has been done and possibly extend some of the results known so far.

Project #: DIMACS2006-05

Mentors: Nina Fefferman, , Tufts University, DIMACS and Kah Loon Ng, DIMACS

Project title: Evolution of Social Networks

The theory of social networks uses graphs to describe the connections between individuals in a population (e.g. by marriage, social interaction, money flows, etc.). I am interested in the effect of individual behaviors (association choices) on the total network. Preliminary simulations show that small differences in behaviors can drastically change the resulting global network. At the moment, all of the work is being performed on the individual level. For the summer, it would be interesting to take a look at "small group choices" - what happens if local groups of individuals band together. When does it "profit" individuals to form such groups. The next steps for this research could involve further simulation, game theory, graph theory, combinatorics, possibly even some statistics.

Project #: DIMACS2006-06

Mentor: Nina Fefferman, , Tufts University, DIMACS

Project title: Behavioral Epidemiology

In infectious disease, the secondary spread of disease depends on contact with contagious individuals. Many mathematical models have investigated the various contact rates and thresholds that are required for epidemic outbreaks scenarios, however, these mainly use population-wide averages for contact rates. Assuming small, simplified population structures and contact networks, and limiting the number of behavioral options to something small (probably something like "health care workers" vs "non health care workers", and "single individuals", "individuals in nuclear families", and "individuals living communally" - like college students), it would be interesting to analyze the costs and benefits of healthcare seeking behavior under the assumption of centralized health care (i.e. no home visits) for both individual and 'total societal' protection.

Project #: DIMACS2006-07

Mentor: S. Muthukrishnan, muthu at Computer Science,
Co-mentors: Graham Cormode, Bell Labs and Christian Sohler, Universitat Paderborn

We would like to study games in which each node of a path, tree or graph has some number of tokens at time 0. At each time step, each node gets some new tokens, some old tokens disappear, and some others move from one node to another. What are steady state situations with special graphs and starting configurations? We will study these problems by simulation, mathematics and doing new algorithms. These problems have many applications in Epidemiology, Distributed computing, etc.


R. J. Anderson, L. Lovasz, P. W. Shor, J. Spencer, E. Tardos and S. Winograd, Disks, balls and walls: Analysis of a combinatorial game, American Math. Monthly 96, pp. 481-493.

Project #: DIMACS2006-08

Mentor: S. Muthukrishnan, muthu at Computer Science,
Co-mentors: Graham Cormode, Bell Labs and Christian Sohler, Universitat Paderborn

Project title: Algorithms for designing auctions on the internet

Internet companies like yahoo!, google and ebay perform auctions. There is a nice theory of how to design auctions. This project will be about designing new auctions for some of the emerging internet commodities.

Project #: DIMACS2006-09

Mentor: E.A. Elsayed, , Industrial and Systems Engineering

Project title: Sheet Folding Theory and Applications

Investigate the generation of three dimensional patterns from flat sheet of materials without cutting any parts of the sheet. Investigate methods for folding the generated pattern and construct samples for testing at different conditions. Identify areas of applications with justifications.

Project #: DIMACS2006-10

Mentor: Paul Kantor, , Communication, Information and Library Studies

Project title: Automated Entity Resolution

A challenging problem at the intersection of computation, mathematics and statistics arises when we ask whether two or more real entities are referenced by the same "label". For example, there are hundred of bio-science papers written by "T. Suzuki". How many different real persons does this one label represent? Stephen King writes under several different names: can he be discovered despite this disguise? Analysis is based on finding patterns in the relations between this author, other authors, physical institutions and locations, and abstract concepts such as the "topics of the papers". Students working on this project will have an opportunity to participate in Rutgers/DIMACS efforts on the "KDD Challenge 2006", a highly competitive exercise in entity resolution.

Project #: DIMACS2006-11

Mentor: Ivan Zorych, , DIMACS

Project title: Statistical Methods of Signal Detection

The World Health Organization (WHO) has defined a signal as "Reported information on a possible causal relationship between an adverse event and a drug, the relationship being unknown or incompletely documented previously"[1]. A good overview of existing methods is given in [2].

We will explore different statistical models and methods used to deal with count data and contingency tables [3, 4, 5]. Potential difficulties such as Simpson's paradox will be investigated. All methods will be tested on sample datasets.

Our ultimate goal is to look for adverse drug reaction signals in real data from the FDA Adverse Event Reporting System (AERS) or CADRMP Adverse Reaction Database maintained by Health Canada.

Prerequisite(s): an introductory course in Applied Statistics is a plus; familiarity with statistical packages such as SAS, R, or S-plus will be helpful.


1. WHO web-site. WHO Collaborating Center for International Drug Monitoring. Definitions,
2. Stephen Evans, "Statistical methods of signal detection", Pharmacovigilance, pages 273-280, Wiley, 2002.
2. Peter Congdon, Bayesian Models for Categorical Data, Wiley, 2005.
4. Alan Agresti, An Introduction to Categorical Data Analysis, Wiley, 1996.
5. William DuMouchel, "Bayesian data mining in large frequency tables, with an application to the FDA Spontaneous Reporting System", The American Statistician, vol. 53, pages 177-190, 1999.

Project #: DIMACS2006-12

Mentor: Wanpracha (Art) Chaovalitwongse, wchaoval at, Department of Industrial Engineering

Title: Classification of Epileptic Brain Activity

Epilepsy, which is the second most common brain disorder after stroke, currently afflicts at least 2 million Americans. It is a chronic condition of diverse etiologies with the common symptom of spontaneous recurrent seizures. The goals of this project are: 1) to explore and develop quantitative techniques to investigate the spatio-temporal properties of the epileptic transition based on brain electrical activity (Electroencephalogram - EEG); 2) to develop novel pattern recognition, classification and clustering algorithms for epileptic brain activity classification based upon the brain electrical connectivity underlying the epileptic processes.

Project #: DIMACS2006-13

Mentor: Wanpracha (Art) Chaovalitwongse, wchaoval at, Department of Industrial Engineering

Title: Mathematical Programming Models for Data Mining Tasks

Data mining seeks to discover hidden patterns in a data set without explicit knowledge about the data. Most of data mining tasks attempt to measure similarity or dissimilarity coefficients among all samples of the data for classification, clustering, and feature selection operations. The idea of the project is to investigate mathematical models that represent these data mining tasks. The outcome of this project can be applied to many real-world problems including quantitative finance, computational biology, manufacturing processes, supply chain, and operations management.

Project #: DIMACS2006-14

Mentor: Wanpracha (Art) Chaovalitwongse, wchaoval at, Department of Industrial Engineering

Title: Sibling Relationship Reconstruction

This project is aimed at developing novel solution approaches for the problem of reconstructing sibling relationships based on single generation genetic data without parental information. This problem is very difficult but extremely important because knowledge of familial relationships is needed for many biological applications including the estimation of heritabilities of quantitative characters, studies of mating systems and fitness, and managing populations of endangered species. Typically, biologists have used parental data to establish sibling groups indirectly through parentage assignments. Reconstructing sibling relationships without parental data is a much more difficult problem, but one that faces many investigators who sample and genotype cohorts of offspring rather than parent/offspring groups. One of the most promising ideas is to employ simple genetic constraints on the full-sibling groups, imposed by the Mendelian inheritance rules, to data from codominant DNA markers. To incorporate this idea, novel statistical analyses and optimization techniques will be investigated in this project.

Project #: DIMACS2006-15

Mentor: Stanley Dunn, smd at, Department of Biomedical Engineering

Title: Decomposition of Multidimensional Microarray Datasets

The accomplishments of modern biology, highlighted by completion of sequencing of the Human Genome, are remarkable and have the potential to lead to unprecedented advances in biotechnology and in health care. The molecular "parts list" of cells and tissues is growing at an accelerating pace, and a major issue limiting the limiting the application of this data is the ability to integrate information on the parts into an understanding of the whole. These needs give rise to a set of activities that we term Molecular Systems Bioengineering (MSB). MSB represents an engineering approach to the understanding and control of biological processes. It encompasses high-throughput microarray data acquisition techniques on genomes, proteomes and other molecular catalogs, and it involves the use of both data-driven and principles-driven modeling techniques to create an understanding of biological phenotype based on a combination of molecular catalogs and environmental conditions.

The goal of this research is to investigate methods for analyzing DNA microarrays based on tensor and multidimensional decomposition methods. Early results on yeast sporulation data suggest that multidimensional decompositions that yield non-orthognal bases do cluster genes by phase and function. The aim of this project will be to explore other data sets and consider robust algorithms for microarray analysis.

Project #: DIMACS2006-16

Mentor: Endre Boros, boros at, RUTCOR

Project title: Projections Based Learning

A simple generalization of decision tree based learning methods involves, iteratively, the projection of the training set into the space of a small subset of the attributes, derivation of a best classifier in that small space, and the introduction of the obtained signal as a new attribute for the original problem. Under some specific conditions this can be shown to converge to a good classifier. The project would involve extensions of this idea, characterizations of wider classes when the method can be guaranteed to converge, and implementations of such learners.

Project #: DIMACS2006-17

Mentor: Avy Soffer,, Department of Mathematics

Dispersive wave equations of mathematical physics - analysis and numerics.

Please revisit this page frequently, additional projects will be posted through December.
Index Index of REU program
DIMACS Home Page
Contacting the Center
Document last modified on March 20, 2006.