DIMACS is very proud for the achievements of the 2023 DIMACS REU students during their summer research experience. This page includes the abstracts of the final papers that the students submitted to the program, summarizing their summer research.
Abstract: This project was focused on flag varieties and computations in cohomology rings. I studied symmetric polynomials and covered different basis for this family of functions, as well as computations in one of these basis, the set of Schur functions. I also studied root systems, Weyl Groups and the classification of irreducible root systems. Finally, I worked on some actual cohomology rings and how they connect to root systems and multiplication of Schur functions.
Abstract: Logistic regression is used to make a prediction between two different outcomes based on a data set. Typically, when the sample size is much larger compared to a fixed dimension, maximum likelihood estimation is used to estimate the parameters of the model, in which case the estimate has many convenient properties such as being unbiased. It was shown by Sur and Candes (2018) as well as Salehi et al. (2019) that these assumptions do not hold in the high dimensional regime where the sample size and dimension are proportional. While these two groups were able to find adequate methods of characterizing model performance in high dimensions, there is an absence of work on the performance and impact of bagging on high dimensional logistic regression models. In our case, bagging refers to the method of dividing the larger data set into two or more overlapping subsets and fitting a logistic regression model on each subset before aggregating the parts together. This work aims to show a single scalar that would be able to predict the correlation between the two unaggregated estimates. Drawing from previous results in linear regression and the unbagged setting, we were able to successfully infer this result.
Abstract: We are interested in designing social networks which support the spreading of true information while limiting the ability for disinformation to spread. Having access more information doesn't always lead to discovering the truth as phenomena known as "information cascades" can lead to herd mentality propagating false information. Representing social networks with graphs, we studied two areas: high degree and low degree graphs. High degree graphs have the ability to quickly aggregate information and decipher truth into a few high value nodes with many neighbors. One example we studied was the preferential attachment model by Albert-Barabasi. In low or constant degree graphs, a more nuanced approach is needed to allow truth learning to succeed. We studied two examples of these: an adjusted Connected Caveman graph and a grid structure. Moreover, in large-scale networks such as news portals or content-sharing platforms, having a solid grasp of the underlying network structure and ordering helps optimize the process of dissemination and improve the accuracy of information spreading. In modern society, misinformation is a pervasive issue that can have large societal consequences. As we strive to combat the spread of false information, studying the ways information spreads in a variety of networks becomes crucial.
Abstract: We examined a combinatorial game called Division game. There are 2n coins and two players, Alice and Bob, who want to split the coins so that each player has exactly n coins. In the first turn, Alice picks a coin and Bob decides who gets it. Then, the roles swap and the game continues until one of the players gets n coins. The other player then obtains the remaining coins and the player with the larger sum of coins wins the game. We conjecture that Bob always has a non-losing strategy which we proved for a small number of coins and for special classes of games.
Abstract: In this article, we study the problem of classifying graphs that admit `KT orientations'. A~directed graph $G$ is said to have a KT orientation if there is at most one directed path between any pair of vertices of $G$. We extend the known examples of classes of graphs that admit a KT orientation and those graph families which do not. We show that the problem of determining whether a given graph admits a KT orientation is NP-complete, in particular this also applies if we restrict ourselves to planar graphs. Moreover we provide an algorithm to decide if a digraph with max-degree at most 3 has a KT orientation, while for graphs with max-degree $4$, this remains NP-complete. Finally we construct a graph family with small independence number (sub-linear in the number of vertices), and thus has unbounded fractional chromatic number, that admits a KT orientation.
Abstract: The ribosome is responsible for creating proteins. Understanding how the ribosome itself folds into its final 3-dimensional structure is critical to deepening our understanding of the mechanisms behind proteins, and thus life itself. There is much research that has been done in understanding the 3-dimensional structure of ribosomes, however many of these studies have fallen short in giving an intuitive understanding of the 3D structure. In this project, we used computational models to simulate large subunit ribosomal RNA of Escherichia Coli.
Abstract: In this project, we studied Steiner minimal trees for the points given by the corners of the regular simplex. We provide an explicit formula for the coordinates of the Steiner points of a conjectured Steiner minimal tree for simplexes of dimension 2k-1. Also, we explore the function APTC(T), a notion used to characterize trees with respect to the sum of all distances between the terminals, and proved that the conjectured topology for Steiner minimal trees for the regular simplex minimizes this function over all tree with n terminals and n-2 interior points.
Abstract: Finding the minimum cut of a graph has been studied extensively on its own as a fundamental graph problem and as a tool for understanding other important problems such as connectivity and network flow. We studied the minimum-cut problem in the insertion-only semi-streaming model, where edges of G are received one at a time in an input stream and space usage is confined to Õ(n) = O(n poly log n) bits. As such, we can't afford to store the entire graph at once, so we are forced to compute an approximation to the minimum cut. We present improved upper bound and lower bound results for the (1+ε)- approximation to the minimum cut problem in the insertion-only semi-streaming model. We achieve new results that are optimal in their dependence on ε.
Abstract: Nonparametric statistics is a sub-field of statistics involving minimal assumptions about the distribution of data, making it applicable to the analysis of real-world phenomena. The test of Prentice is a non-parametric statistical test for the two-way analysis of variance using ranks. The null distribution of this test is approximated using the Chi-square distribution. However, the exact null distribution deviates from the Chi-square approximation in certain cases commonly found in applications, motivating adjustments to the distribution. This summer, we presented adjustments to this null distribution, and that of related tests with non-polynomial scoring systems, correcting for continuity, skewness, and kurtosis in the multivariate case.
Abstract: Grid Multi-Robot Path Planning (MRPP) has been extensively studied, but much less is known about the corner-following constraint (CFC) variant, which has applications ranging from automated garages to grocery packing warehouses. On an m1 x m2 grid, the problem has a tight upper and lower bound for the makespan objective when there are Θ(m1m2) escorts, but there is a gap when there are only Θ(m1+m2) or Θ(1) escorts. In addition, nothing is known about the intractability of this problem or its related variants. In this work, we close the gap for an arbitrary number of escorts, finding an expected constant factor algorithm when there are at most min(m1,m2) escorts and a high probability constant factor algorithm when there are more. We also show that the problem is NP-hard under the makespan objective, and by the same reduction, standard grid MRPP is NP-hard also, unifying the two grid MRPP makespan hardness results. In addition, we show that when there is a single escort, the two-colored variant and partial variant are both NP-hard, which significantly impacts the design of efficient systems using the MRPP with CFC paradigm. These results contribute to our understanding of the design of efficient algorithms from both an upper and lower bound perspective.
Abstract: A fake projective plane is a complex surface with the same Betti numbers as CP2 but not biholomorphic to it. We studied the fake projective plane P2fake = (a = 7, p = 2, ∅, D327) in the Cartwright-Steger classification. In this summer, we exploit the large symmetries given by Aut(P2fake) = C3xC7 to construct a simpler embedding of this surface into CP5 as a system of 56 sextics with coefficients in Q(\sqrt{-7}). For each torsion line bundle T ∈ Pic(P2fake), we also compute and study the linear systems |nH + T| with small n, where H is an ample generator of the Neron-Severi group. Finally, we constructed explicit equations of the fake projective plane $(a = 7, p = 2, ∅, D3X7), a closely related fake projective plane to P2fake.
Abstract: We consider the problem of approximate clustering with the diameter objective function. When the number of clusters is allowed to be unbounded, the inapproximability ratio is fairly well understood. However, this problem has not been well-studied when the number of clusters is constant. In this work, we show improved hardness bounds when there are a fixed number of clusters. Specifically, we show that the problem is hard to approximate within a factor of 1.5 in Hamming space, and within 1.304 in Euclidean space. Additionally, we prove several barrier results to known methods of proving hardness, and we provide a polynomial time algorithm for the problem when the number of dimensions is constant.
Abstract: We present the first (non-trivial) Ω(n2) lower bound on streaming algorithms for tournaments on n vertices, demonstrating that (exactly) solving the minimum feedback arc set (FAS) problem on tournaments is hard. This complements recent work on upper bounds for the (approximate) FAS problem on tournaments, and shows that even though acyclicity testing can be solved in near-linear memory on tournaments, some important related questions remain hard. We also investigate a number of fundamental graph problems on tournaments: we prove new upper and lower bounds for s-t distance and reachability, and settle the streaming complexity of acyclicity testing. Finally, we settle the streaming complexity of sink finding in general directed graphs.
Abstract: We study vertex sparsification for distances in planar graphs: given a weighted planar graph G and a subset of k terminal vertices, the goal is to construct an emulator, which is a smaller (weighted) planar graph G' that contains the terminals and exactly preserves pairwise distances between the terminals. We construct exact planar emulators of size O(k^2) in the special where all terminals lie on exactly 2 faces.
Abstract: The 1/3?2/3 Conjecture states that in every finite partially ordered set that is not totally ordered, there exists a pair of elements x and y with the property that at least 1/3 and at most 2/3 of the linear extensions of the partial order place x earlier than y. This conjecture has become one of the more prominent conjectures in set theory, and while no one has succeeded to prove such a result for all partially ordered sets, there are partial results for posets with certain properties. Our goal this summer was to prove that this conjecture holds for a specific set P3,n. which consists of {x1, y1, z1, x2, y2, z2, ..., xn, yn, zn} where xi≤xi+1, yi≤yi+1, zi≤zi+1, and xi≤yi≤zi, for all i.
Abstract: Tumors grow from somatic cells inside the human body, often undetected without any major symptoms. But tumors shed cell-free tumor DNA (cfDNA), as well as proteins, intact tumor cells, and other molecules into the blood and other bodily fluids. As cfDNA travels through the bloodstream, it is degraded by various nucleases, salt, etc. such that when blood is drawn from patients for liquid biopsy for diagnostic purposes, the cfDNA has been fragmented into smaller pieces. The methylation, length, and molecular signatures of these cfDNA fragments have the ability to provide information about the location and type of cancer. However, the methods through which different types of cfDNA are degraded remain a mystery, and there is currently no broad, all-encompassing cancer diagnostic tool that utilizes this data. We utilize a mathematical technique called non-negative matrix factorization (NMF) to identify predominant characteristics of end-sequences of cfDNA from cancer patients with multiple types of cancer.
Abstract: In data-driven research fields like healthcare and technology, access to sensitive individual information is crucial for generating valuable insights. However, privacy regulations pose challenges for publishing such data, hindering researchers' access. Differential privacy has emerged as a promising solution, allowing researchers to learn from sensitive data while protecting individual privacy. This research focuses on generating differentially private scatterplots, a common data visualization tool, while retaining visual utility. The approach combines strategies of partitioning data, adding calibrated noise, and post-processing to suppress noise. The study explores factors like ε (privacy level), algorithms, bin size, data distribution, and sample size to understand their impact on visual utility. Two small datasets are used, derived from the US Census and World Health Organization, with direct application for Differential Privacy. The results highlight the trade-offs between privacy and visualization integrity, aiding researchers in making informed decisions to balance privacy and data visualization accuracy. Future directions include exploring more sophisticated point regeneration techniques and integrating differentially private heatmaps techniques to improve visual accuracy
Abstract: Satellite knots are a method of producing new knots from existing ones, utilizing a pattern knot P inside a solid torus and a companion knot K. Two knots are said to be concordant if they form the boundary of an embedded annulus in S3 x [0,1]. P. Ozsvath and Z. Szabo defined an invariant of the concordance class of a knot, called the tau-invariant, and in 2014, A. Levine computed tau for satellite knots with Mazur patterns, which loop twice clockwise around the torus before turning around and looping once counterclockwise and joining back. We determined a formula for tau for generalized Mazur patterns with m clockwise turns and n counterclockwise turns by computing the box tensor product of the CFA- and CFD modules associated to the pattern and exterior of the companion knot, respectively.
Abstract: Powder Bed Fusion (PBF) is an extremely lucrative additive manufacturing technique that allows for highly customizable parts, increased resource efficiency, and reduced cost. However, PBF is prone to overheating which results in part defects. One possible solution to this deficiency is to employ a machine learning based, data driven method to predict overheating before it occurs. Such a method must predict process behaviors with high accuracy, provide explanations and clarification for the prediction, and obey physics principles. In this paper, methods of meeting these three conditions are explored, and a resulting technique is proposed.
Abstract: Given a Legendrian submanifold Λ of a circle-fibred contact manifold Z over a symplectic base, one can ask whether Λ admits a Lagrangian filling L. In this paper, we disprove the existence of such a filling for the Legendrian given by the n-fold power of the Riemman Sphere CP1. Wehrheim and Woodward proved a dimension requirement of the image of H(Λ) in H(L). We use this obstruction and a combinatorial count of relations between elements in H1(Λ) and Hn-1(Λ) induced by the filling to show an obstruction for fillings of (CP1) n for n>2.
Abstract: In this study we focused on the prevalence of ARID2 mutations in publicly available datasets and their association with known melanoma genetic features. Leveraging publicly available data from cBioPortal, we conducted comprehensive mutation analysis and statistical tests to explore the associations between ARID2 mutations and other melanoma-associated mutations, including BRAF, NF1, NRAS, KRAS, and HRAS. Our findings shed light on the potential functional significance of ARID2 mutations in melanoma biology
Abstract:
For each positive integer n, let Sn be the symmetric group on { 1,2, ..., n }. Then a group G is said to be locally embeddable into finite groups if for every finite subset F ⊂ G, there exists an injection φ : F → Sn for some n ≥ 1 such that whenever g, h, gh ∈ F, then φ(gh) = φ(g)φ(h). In this case, we say that G is an LEF group.
In the group theoretic literature, LEF groups are usually characterized in terms of embeddings into ultraproducts of finite symmetric groups. It is natural to ask whether there is a characterization in terms of the more concrete notion of a reduced product of finite symmetric groups. In more detail, let P = ∏n ≥ 1 Sn be the full direct product and let N be the normal subgroup of elements (πn) ∈ P such that πn = 1 for all but finitely many positive integers n. Then the reduced product is the quotient P0 = P/N.
We have shown that it is neither provable nor disprovable using the classical ZFC axioms of set theory that if G is a group such that |G| ≤ 2ℵ0, then G is an LEF group if and only if G embeds into P0. We have obtained results concerning characterizations in terms of nonprincipal ultrafilters. Certain subtle questions about the nature of the independence remain. We have obtained analogous results for sofic groups.
Abstract: Logistic regression is a supervised classification algorithm that predicts the probability of an event given a set of features. This technique relies on maximum likelihood estimation (MLE) to estimate the regression parameters for the log-odds of the observed data. The MLE, however, does not always exist. In particular, if there exists a linear decision boundary separating the classes, then the MLE will not converge. In 2018, Candes and Sur proved a theoretical phase transition curve for the existence of the MLE in binary logistic regression parametrized by the ratio of features to number of observations p/n and the norm of the regression coefficients. In our project, we are interested in finding an empirical and theoretical phase transition curve for the existence of the MLE in unregularized multinomial regression for K classes (K>2).