2023 DIMACS REU Proposed Project Descriptions

IMPORTANT! Read carefully because the application system has changed!

- If you are interested in any of the projects 1-11 you have to apply through the NSF ETAP system through this link: https://www.nsfetap.org/award/473/opportunity/455
- If you are interested in any of the projects 12-21 you have to apply through the DIMACS REU application portal here: https://reu.dimacs.rutgers.edu/apply
- If you are interested in topics in both lists, you will have to apply TWICE using both links above.

*Note that you must be a U.S. Citizen or Permanent Resident to be eligible for these projects.

Tight Bounds on Approximate Minimum Cuts in Streaming Models

Project #: DC2023-01
Mentor: Sepehr Assadi, Computer Science

Abstract: Given a graph G=(V,E), the chromatic number of G is the minimum number of colors needed for coloring vertices of the graph so that no edge receives the same color on both its endpoints. Consider the following problem: we have a graph G=(V,E) whose edges are partitioned between two parties, say, Alice and Bob. How much communication is needed between the players before they can compute an alpha-approximation to the chromatic number of G? This problem is known as the communication complexity of the graph coloring problem and will be the focus on this project. Along the way, there are also several intermediate problems one can examine, for instance, for one-way protocols that only involve Alice sending a single message to Bob.

Hardness of Approximating Clustering objectives

Project #: DC2023-02
Mentor: Karthik Srikanta, Computer Science

Abstract: A lot of attention has been invested over the last three decades by the theoretical computer science community to understand the efficiency of (even approximately) computing optimal clusters. Despite this focus, our understanding of the approximability of these clustering tasks remains poor, and our understanding of the hardness of approximation of these tasks was even worse. Recently, in [1,2,3], the authors have established a conceptual framework to tackle the latter issue of hardness of approximation of k-means and k-median objectives in L_p metrics (and in particular Euclidean metric), through which there are now significantly improved inapproximability bounds. In this project, you will learn about the ideas and techniques developed in [1,2,3] and try to implement them for clustering objectives such as sum of all intracluster pairwise distances, sum of radii of clusters, sum of diameters of clusters, maximum cost of any cluster, and so on. The student is expected to have some basic background in algorithms, graph theory, and geometry (analysis).
[1] https://hal.archives-ouvertes.fr/hal-02360762v2/document
[2] https://arxiv.org/abs/2010.00087
[3] https://eccc.weizmann.ac.il/report/2021/177/

Algorithms for Streaming Tournaments

Project #: DC2023-03
Mentor: Prantar Ghosh, DIMACS

Abstract: A tournament is a digraph (directed graph) where every pair of vertices shares exactly one directed edge. We are interested in studying tournaments in the streaming model where the input graph is built up by a sequence of edges, and we need to maintain a summary of it using low memory so as to answer certains questions about the graph. Sometimes we are allowed multiple passes over the edge stream to reach a solution. A recent paper [Chakrabarti, Ghosh, McGregor, Vorotnikova, SODA2020] shows that in streaming, many classical problems turn out to be hard for general digraphs, i.e., they are provably unsolvable without storing almost the entire graph; on the other hand, they show that these problems are tractable when the digraph is a tournament. They design a collection of efficient algorithms for these problems, but hardly rule out better ones. We want to investigate whether we can design more efficient streaming algorithms for classical problems on tournaments, or whether we can prove optimality of existing algorithms. One concrete problem is finding the minimum Feedback Arc Set (FAS). An FAS in a digraph is a set of arcs (directed edges) whose deletion effectively removes all directed cycles. Since this set can have very large size, a deterrent for the streaming model, we want to solve an equivalent version of the problem where we need to output only an ordering of the nodes such that the number of back edges (edges going backwards in the ordering) is minimized (the back edges form an FAS of the graph). We can also ask the question of simply finding the size of the minimum FAS (FAS-size). While prior work has designed algorithms that give approximate solutions to FAS and FAS-size for tournaments, neither did they rule out exact algorithms nor did they prove that the same approximations cannot be achieved in fewer passes or using smaller space. In fact, even the existing lower bounds for FAS in general digraphs only rule out multiplicative approximations; whether we can achieve estimates within a reasonably small additive factor is unknown and can be an interesting topic of study.
Prerequisites: A basic course on probability

Differential privacy and data visualization

Project #: DC2023-04
Mentor: Anand Sarwate, Electrical and Computer Engineering

Abstract: Differential privacy is mathematical framework for quantifying the privacy risks when computing using sensitive data about individuals. This is a statistical/probabilistic notion of privacy which can be used to generate privacy-protecting summaries of data. The 2020 US Census used differential privacy to publishing results from the 2020 decennial census and many government agencies are trying to learn how to use differential privacy for their work. In many cases, private data is summarized through visualizations and descriptive statistics. Much of the work on differential privacy has focused on machine learning and data publishing. The student on this project will survey existing work in differential privacy and private visualization, choose a type of visualization, and evaluate the methods on data sets. The goal is to implement different data visualization techniques and see how privacy affects our qualitative (visual) understanding of the structure of data.
Research assistants will be responsible for researching current techniques for visualization with privacy and applying them to data sets from an application jointly chosen with the advisor. An ideal candidate will have some exposure to statistics and some experience with plotting data using a programming language (Python preferable, MATLAB is OK too).
The learning objectives are structured around three axes: (i) understanding the structure scholarly research, (ii) translating research into practice (through programming), and (iii) communicating research results. The first objective (i) will be addressed by completing a comprehensive literature review on the topic to get a sense of the "lay of the land" and opportunities to pursue unexplored directions. The second objective (ii) will involve programming and implementing privacy-preserving algorithms using the Python programming language. Additional support and mentoring will be provided by graduate student researchers working in the lab. The third objective (iii) will be achieved through presentation practice at group meetings and poster presentations with feedback from the professor and students in the lab.

Genome folding and function: from DNA base pairs to nucleosome arrays and chromosomes

Project #: DC2023-05
Mentor: Wilma Olson, Chemistry and Chemical Biology

Abstract: Despite the astonishing amount of information now known about the sequential makeup and spatial organization of many genomes, there is much to learn about how the vast numbers of DNA base pairs in these systems contribute to their three-dimensional architectures and workings. The chemical structure of DNA contains information that dictates the observed spatial arrangements of base pairs within the double helix and the susceptibilities of different base sequences to interactions with proteins and other molecules. Our group has been developing software to analyze the spatial and energetic codes within known DNA structures and using these data in computational studies of larger nucleic acid systems (1). By representing the DNA bases as sets of geometric objects, we can capture the natural deformability of DNA as well as the structural details of ligand-bound interactions (2). There are now sufficient data to examine the influence of sequence context, i.e., the effects of immediate neighbors, on the apparent motions of the bases and to develop simple functions that capture this information. The observed behavior can be incorporated in computational treatments to improve our understanding of macromolecular systems in which DNA sequence and protein binding play important roles. For example, DNA elements that regulate gene expression often lie far apart along genomic sequences but come close together during genetic processing. The intervening residues form loops, mediated by proteins, which bind at the sequentially distant sites. We have developed methods that take account of both DNA and protein in simulations of loop formation and successfully captured experimental observations (3,4). Our recent studies of long-range communication between regulatory elements on short nucleosome-decorated DNA (5,6) provide a basis for linking the 'local' arrangements of nucleosomes in these systems to higher-order macromolecular structures, such as the looped architectures detected within chromosomes. We are developing new algorithms to characterize the spatial arrangements of nucleosomes and incorporating the collective information in nucleosome-level depictions of chromosomal DNA (7,8). The build-up of structural components in our treatment of DNA and chromatin offers new insight into the stepwise processes that underlie genome organization and processing.

1. Olson, W.K. 2020. "Insights into DNA and chromatin from realistic treatment of the double helix," in Paul Flory's 'Statistical Mechanics of Chain Molecules' Today, A.C.S. Books, Tonelli, A.E. and Patterson, G.D., Eds., American Chemical Society, Washington, D.C., Chapter 9, pp. 143-159.
2. Tolstorukov, M.Y., A.V. Colasanti, D. McCandlish, W.K. Olson and V.B. Zhurkin. 2007. A novel roll-and-slide mechanism of DNA folding in chromatin: implications for nucleosome positioning. J. Mol. Biol. 371:725-738
3. Wei, J., L. Czapla, M.A. Grosner, D. Swigon and W.K. Olson. 2014. DNA topology confers sequence specificity to nonspecific architectural proteins. Proc. Natl. Acad. Sci. USA 111:16742-16747.
4. Clauvelin, N., W.K. Olson. 2021. Synergy between protein positioning and DNA elasticity: energy minimization of protein-decorated DNA minicircles. J. Phys. Chem. B 125:2277-2287
5. Clauvelin, N., P. Lo, O.I. Kulaeva, E.V. Nizovtseva, J. Dias, J. Zola, M. Parashar, V. Studitsky and W.K. Olson. 2015. Nucleosome positioning and composition modulate in silico chromatin flexibility. J. Phys.: Condens. Matter 27:064112.
6. Nizovtseva, E.V., Todolli, S., Olson, W.K., and Studitsky, V.M. 2017. Towards quantitative analysis of gene regulation by enhancers. Epigenomics 9:1219-1231.
7. Todolli, S., P.J. Perez, N. Clauvelin and W.K. Olson. 2017. Contributions of sequence to the higher-order structures of DNA. Biophys J. 112:416-426.
8. Todolli, S., R.T. Young, A.S. Watkins, A. Bu Sha, J. Yager, and W.K. Olson. 2021. Surprising twists in nucleosomal DNA with implication for higher-order folding. J. Mol. Biol. 167121.

Genomic data-guided computational modeling of cancer

Project #: DC2023-06
Mentor: Subhajyoti De, Rutgers Cancer Institute of New Jersey

Abstract: Cancer is the second leading cause of death in adults, claiming the lives of more than half a million Americans every year. Tumor starts from a single rogue cell in the body, which becomes defective, and the clones of that initial cell start to grow (or die out) and acquire new mutations as the populations of tumor cells grow. During treatment, the patient initially responds, and the tumor shrinks, but remaining cells inevitably develop mutations that give them drug resistance, and the tumor starts to grow again until the patient dies. It is becoming apparent that computational algorithms and mathematical frameworks can identify subtle patterns in patient profiles, which can help with early detection and personalized treatment, improving their outcome. We will develop computational model to capture principals of population dynamics of tumor cells during development and emergence of resistance against cancer drugs, using concepts based on Galton Watson branching process and reaction diffusion system. We will combine computational models with clinical data from cancer patients to investigate whether we can predict the timeline of tumor development and thereby (i) detect cancer at an early stage or (ii) identify emergence of drug resistance very early during treatment. This project will develop fundamental evolutionary concepts, implement relevant computational models, and apply them in clinical settings to help improve cancer care.

Epigenetic dysregulation of human cancer

Project #: DC2023-07
Mentor: Jian Cao, Rutgers Cancer Institute of New Jersey

Abstract: Our genetic information stores in the DNA sequence in our genome. Almost all cells in the body carry an identical genome. However, they have diverse morphology and function, because cells may read genetic codes in different ways. Epigenetics is the study of these processes. "Epi-"means on or above in Greek, and "epigenetics" describes factors beyond the genetics. Tumor cells commonly hijack epigenetic mechanism to benefit tumor growth. My lab is interested in identifying recurrent epigenetic events in human cancer, studying their mechanism, and developing targeted therapies. REU intern will analyze published tumor datasets and perform association studies to find tumor features associated with certain epigenetic dysregulation.
Preference will be given to candidates with bioinformatics and/or statistics experience.

Deep Learning for Quality Prediction in Metal Additive Manufacturing

Project #: DC2023-08
Mentor: Weihong 'Grace' Guo, Industrial and Systems Engineering

Abstract: This is an interdisciplinary research project integrating data science with engineering applications. Additive manufacturing, or three-dimensional (3D) printing, is an emerging technology that enables the direct fabrication of complex shapes and eliminates the waste associated with traditional manufacturing. The wide adoption of 3D printing for functional metal parts is hampered by the poor understanding of the process-quality causal relationship and the absence of an online data-driven prediction method for part quality. This study aims to develop methods for analyzing emission data (point cloud data) from Selective Laser Melting (a technique in metal 3D printing) and correlating such data with process anomaly or part quality. We are particularly interested in exploring methods in transfer learning and continual learning to see how the quality prediction model trained from one 3D printing experiment can be transferred to experiments under different parameter settings.
Requirements: data analytics, machine learning, python or R.

Intelligent sampling for physics-informed multifidelity learning

Project #: DC2023-09
Mentor: Rajiv Malhotra, Mechanical and Aerospace Engineering

Abstract: Our group is working on tackling the curse of dimensionality in machine learning of practical engineering systems. We have found that biasing machine learning models with surprisingly simple first principle models can drastically reduce the experimental and computational cost of data generation for machine learning, in some cases by an order of magnitude. In this context this project will explore incremental sampling techniques that are crucial to the success of our approach. The REU student will work with an existing graduate student, and implement different incremental sampling methods in the context of transfer learning, with very high potential for a journal publication. Knowledge of python programming in the context of machine learning methods like SVRs, random forests, feedforward neural networks is necessary.

Research in Nonparametric Testing and Estimation

Project #: DC2023-10
Mentor: John Kolassa, Statistics and Biostatistics

Abstract: 1. Theil-Sen regression (Theil, 1950) represents a methodology for estimating a regression coefficient for bivariate continuous data; one of these two variables is referred to as an explanatory variable, and the other is referred to as a response variable. Often observed data is not truly continuous, in that the data contains pairs in which the explanatory variables are tied, or in which the response variables are tied, or in which both variables are tied. The regression approach may be modified to work in this context (Sen, 1968, "Estimates of the Regression Coefficient Based on Kendall's Tau", JASA). Some studies of the properties of this estimator relative to other regression estimators exist (ex. https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=3351&context=oa_dissertations); I propose specifically quantifying the impact of discreteness of the Theil-Sen estimator on efficiency of estimation.
2. The Kruskal-Wallis (Kruskal and Wallis, 1952, "The use of ranks ..., JASA) test of homogeneity of multiple groups results in a test statistic that is the quadratic form of statistics supported on an integer lattice. The distribution of this statistic under the null hypothesis is generally approximated using a chi-square distribution. Yarnold (1972, Annals of Statistics) presents refinements to the distribution of this statistic. One refinement is a correction for continuity, and the other is an adjustment for kurtosis. I conjecture that the first of these refinements, which in other contexts, is dominant, will, in this context, prove to be less important than the second.

Optimization, learning and high-dimensional macroscopic limits

Project #: DC2023-11
Mentor: Pierre Bellec, Statistics

Abstract: The last decade has seen the emergence of new phenomena where complex statistical learning problems such as high-dimensional regression and classification can be accurately summarized by simple systems of equations. These simple systems of equations characterize the high-dimensional limit of the statistical learning problem at hand and provide new insights on regularization and the choice of statistical estimators in high-dimension. The project will explore problems in this line of though, requiring and developing skills in probability, statistical/machine learning, numerical programming and computational linear algebra.

↑↑↑ FOR PROJECTS ABOVE THIS LINE PLEASE APPLY HERE: https://www.nsfetap.org/award/473/opportunity/455 ↑↑↑

↓↓↓ FOR PROJECTS UNDER THIS LINE PLEASE APPLY HERE:https://reu.dimacs.rutgers.edu/apply ↓↓↓

Distance-Preserving Graph Compression / Composable Metric Compression

Project #: DC2023-12
Mentor: Zihan Tan, DIMACS

Abstract: REU students may work on either of these projects:
1. Distance-preserving graph compression. A promising paradigm for efficiently processing large graphs is graph compression, where the goal is to compress the given large graphs into smaller and simpler ones (called sparsifiers) while preserving their crucial structural information. In this project, we focus on distance-preserving graph compression, and in particular, constructing better emulators.
Given an edge-weighted graph G and a subset T of its vertices called terminals, an emulator (of G with respect to T) is a small graph H that contains T, such that for every pair of terminals, their shortest-path distance in H is approximately (say, to within a factor of q) the same as their distance in G. Additionally, H is required to inherit some structural properties from G.
We will focus on obtaining better trade-offs between the size and the quality (the factor q) of the emulator, possibly for some special families of graphs first. We may also study other related notions of distance-preserving compression, e.g., spanners.
2. Composable metric compression. Given a dataset, a coreset is a small subset of the dataset that (approximately) preserves optimal solutions (or only the optimal cost) to some combinatorial optimization problems on it. A crucial property of the coreset is composability, which allows the coreset to be composed with other coresets and used for designing efficient algorithms in other computation models such as distributed and streaming settings.
In this project, we focus on constructing composable coresets or other structural sketches for metric problems (like Minimum Spanning Tree, Steiner Tree, etc.) in low-dimensional Euclidean space. For example, one goal is to design a coreset algorithm A, such that for any pair S, T of point sets in the plane, the MST cost on S \cup T can be approximated using only the information provided by A(S)\cup A(T). The crucial trade-off here is between the size of the coresets (A(S), A(T)) and the quality of the approximation.

Structural Exploration of Multi-Robot Path Planning for Improving SOTA Algorithms

Project #: DC2023-13
Mentor: Jingjin Yu, Computer Science

Abstract: Graph-based Multi-Robot Path Planning (MRPP), also known as Multi-Agent Path Finding (MAPF), seeks to quickly find high-quality paths for routing a large number of robots. MRPP/MAPF has many important applications, e.g., enabling the coordination of thousands of robots at Amazon warehouses. Whereas a significant amount of work has been done in the field with many decent algorithms proposed, we are still in the relatively early stages in fully understanding the structure of the problem and utilizing the insight to design more capable MRPP/MAPF algorithms. For example, current algorithms are mostly agnostic to the graph structure - can we leverage that information?
In this summer project, the student researcher will learn about the existing SOTA in MRPP/MAPF, seek a better understanding of particular structures, and use that to improve the performance of the SOTA algorithms.
Jingjin Yu's website: https://arc-l.github.io/pub.html
Some relevant papers:
https://arxiv.org/pdf/1904.02598.pdf (https://www.youtube.com/watch?v=0MUGrg5CphM)
https://arxiv.org/pdf/2201.08976.pdf (https://www.youtube.com/watch?v=v8WMkX0qxXg)

Truth Learning in a Social Setting

Project #: DC2023-14
Mentor: Jie Gao, Computer Science

Abstract: Individuals gather information by both independent observations and communication with others in a social environment. Social Learning studies how dispersed and self-interested agents aggregate information. In this project we study whether social learning can enable truth learning. It is well known that learning can fail due to herding, where agents collectively focus on an early false consensus and ignore subsequent observations. Therefore a general goal in social learning is to characterize when truth learning is guaranteed. In this project we study the computational problem of social learning with different observation models, the environment of learning and social processes such as peer pressure, social influence and network effect.

Criminal Enterprises & Social Media

Project #: DC2023-15
Mentor: Christie Nelson, Masters of Business and Science in Analytics and CCICADA

Abstract: Human trafficking is modern day form of slavery. Child trafficking is a particularly tragic part of the large, lucrative criminal enterprise of human trafficking. In 2016, New York City's Administration for Children's Services identified 2,400 child trafficking victims in New York City and, in that same year, Texas identified about 79,000 child trafficking victims. In 2017, the National Center for Missing and Exploited Children (NCMEC) received more than 10.2 million reports of child exploitation. This number greatly exceeds the current national and global estimates for a recognized pandemic - Covid-19. Most of law enforcement's approach reflects a reactive effort to support the investigation and interdiction of human trafficking. A more proactive effort would be to disrupt human trafficking strategically with evidenced-based predictive models of how traffickers adapt.
This project will explore social media messages to begin to identify features of advertisements that are thought to be made by organized criminal enterprises versus a single individual. It will involve supervised and unsupervised data modeling approaches. It will also involve data wrangling, visualization and more.

Projects in Pure Math

Project #: DC2023-16
Mentor: Faculty of the Math Department, Mathematics

Abstract: The Math REU Coordinator is Prof. Kristen Hendricks. Several projects will be run in 2023 by members of the pure math faculty, in addition to the ones described below. Applicants interested in a project in pure mathematics should list this project in their application and indicate their specific Math interests; further details on the projects will be worked out in the Spring.

Concordance Invariants of Satellite Knots

Project #: DC2023-17
Mentor: Kristen Hendricks, Mathematics & Abhishek Mallick, Mathematics

Abstract: A knot is a smooth embedding of the circle into the three-sphere S^3. A fundamental operation of producing new knots out of existing ones is called the satellite operation. One such set of satellite knots comes from the so-called Mazur pattern. In general, knots are often studied up to a notion of equivalence, called knot concordance. Two knots are said to be concordant if they jointly form the boundary of an annulus in S^3 x [0,1]. P. Ozsvath and Z. Szabo defined an invariant of the concordance class of a knot, called the tau-invariant. This invariant has been very useful in distinguishing concordance classes of knots.
In 2016, A. Levine used a knot invariant called bordered Floer homology which is well-adapted to studying the satellite construction to give a formula of calculating the tau-invariant of satellite knots with Mazur patterns. In this project, we will generalize Levine's work by considering more general Mazur patterns. These patterns arise from the so-called (1,1)-patterns: patterns that can be drawn on the surface of a torus. We will then give an explicit formula of the tau-invariant for many different families of (1,1)-Mazur patterns. This will involve studying the bordered Heegaard diagrams of (1,1)-knots and the computation of the box tensor product of the bordered Type A and D modules, the work of which is essentially a combinatorial computation.

Aspects of Lagrangian Fillings

Project #: DC2023-18
Mentor: Christopher Woodward, Mathematics


Simplifying explicit equations of fake projective planes

Project #: DC2023-19
Mentor: Lev Borisov, Mathematics

Abstract: Fake projective planes are special complex surfaces (four real dimensions). It is known that there are exactly 100 of them. Some of these surfaces are given by explicit equations which allows one to investigate their more subtle features, which is the goal of the project. Computations will involve using various software packages, most notably Mathematica, Magma and/or Macaulay2. An ideal participant will have some knowledge of algebraic geometry and some familiarity with mathematical computing.

Correlation inequalities in Algebra, Combinatorics and Probability

Project #: DC2023-20
Mentor: Swee Hong Chan, Mathematics

Abstract: Consider three teams -- Aqua Antlers, Blue Bisons, Crimson Cheetahs-- playing in a fantasy football league. Aqua Antlers is playing against Blue Bisons this week and will be playing against Crimson Cheetahs next week. Intuitively, if Aqua Antlers wins against Blue Bisons this week, then one will expect that Aqua Antlers will become more likely to win against Crimson Cheetahs next week. This intuitive idea turns out to have a rigorous mathematical counterpart called the XYZ inequality and was proved using a nontrivial tool known as the FKG inequality in probability. In this project we will explore the algebraic, combinatorial, and probabilistic aspects of these inequalities, with the aim of generalizing these inequalities and/or proving new correlation inequalities.
Prerequisites: Mathematical maturity; Interests in either one of the fields of algebra, combinatorics, or probability. Familiarity with Python or SageMath will be useful but is not required.

Polyhedral geometry of surfaces

Project #: DC2023-21
Mentor: Feng Luo, Mathematics

Abstract: Polyhedral surfaces are 2-dimensional geometric figures obtained by gluing of Euclidean triangles. The basic examples include the boundary of convex polytopes. The subject is a rich source of inspiration for mathematics. For instance the Euler characteristic comes from topological study of polyhedral 2-spheres and Cauchy rigidity of convex polyhedral surfaces is the first rigidity phenomenon in geometry. In this REU project, we will investigate special types of polyhedral surfaces coming from circle packing and some of the recent works on conformal geometry of polyhedral surfaces.

Pieri formulas for flag varieties

Project #: DC2023-22
Mentor: Anders Buch, Mathematics

Abstract: Many problems in enumerative geometry can be solved by a computation in the cohomology ring of a flag variety. One strategy for efficient computations in this ring is to choose a set of generators such that multiplication with a generator can be done efficiently, and so that arbitrary ring elements can be efficiently expressed as polynomials in the generators. Suitable sets of generators are known for many flag varieties, but not all. The project will look for good sets of generators in the missing cases.

Projects in Graph Theory

Project #: DC2023-23
Mentor: Bhargav Narayanan, Mathematics


Projects in Graph Theory

Project #: DC2023-24
Mentor: Sophie Spirkl, Department of Combinatorics and Optimization, University of Waterloo


Mathematical Logic

Project #: DC2023-25
Mentor: Simon Thomas, Mathematics


Return to REU Home Page
DIMACS Home Page
DIMACS Contact Information
Page last modified on December 21, 2022.