General Information
Project Description
Clustering is one of the most prevalent problems in computer science, with applications ranging from machine learning to bioinformatics. For many objective functions, finding an optimal clustering is NP-hard, which motivates the study of the problem's hardness of approximation. There are many gaps between known hardness results and existing approximation algorithms, an example of which is minimum diameter clustering in the L2 metric. We aim to close gaps of this nature, thus advancing our understanding of clustering.
Weekly Log
- Week 1 (5/31-6/4)
- We went over some foundational results in the hardness of clustering, particularly those concerned with the diameter objective function. I learned about basic reductions from vertex cover and graph coloring to approximate clustering. I was also introduced to existing approximation algorithms, the most well-known of which is Gonzalez's 2-approximation algorithm. After getting up to speed, I spent the week helping compile a document of known hardness results and proofs, brainstorming potential approaches to closing some hardness gaps, and preparing for our initial presentation.
- Week 2 (6/5-6/11)
- We organized our understanding of the different parameters involved in clustering. The two main parameters are the number of clusters (either a fixed constant or potentially unbounded) and the Lp metric space of choice. In the setting of an unbounded number of clusters, hardness within a factor of 2 is known for both the L1 and L∞ metrics, making it a tight bound due to Gonzalez's Algorithm. Meanwhile, only hardness within a factor of 2 · cos(10°) ≈ 1.97 is known for the L2 metric. After unsuccessfully trying to tighten this L2 gap with a modified construction, we decided to move on and consider the setting of a constant number of clusters. A simple reduction from graph coloring shows hardness within a factor of sqrt(5/4) ≈ 1.12, and we were unable to find a way to improve this bound.
- Week 3 (6/12-6/18)
- We began this week realizing that our struggle to push the loose L2 lower bound was a sign that perhaps its upper bound was less rigid than we had thought. We achieved our first nontrivial result of the program, an improved √3-approximation algorithm for L2 clustering with a constant number of clusters. By the end of the week, we were able to extend this algorithm to essentially achieve an approximation ratio of √2. Simultaneously, we began to consider ideas in coding theory to establish hardness bounds for the L1 metric in the setting of a constant number of clusters.
- Week 4 (6/19-6/25)
- This week, we established our first nontrivial result on the side of hardness of approximation, specifically hardness within a factor of 5/4 for a constant number of clusters in L1. We did this by exhibiting a particular hard instance of 3-coloring that can be embedded nicely with a Hadamard-like code. We considered ways to improve this bound even further, but decided that we would likely need a different hard instance of 3-coloring to reduce from. On the algorithmic side of things, we were able to work out an exact clustering algorithm when the number of dimensions of our pointset, in addition to the number of clusters, is constant (in any Lp space).
- Week 5 (6/26-7/2)
- We spent the start of this week committed to finding the appropriate hard instance of graph coloring for proving hardness of clustering in L1. Each of our ideas failed, but we slowly began to suspect the existence of a structural barrier to our techniques. We observed that certain graphs, including paths of seven vertices, cannot be naively embedded in L1. Since it is actually polynomially easy to 3-color graphs that do not contain these particular induced subgraphs, we concluded that our current line of techniques could not give us hardness within a factor of 2. By reframing this notion of naive embeddability as a linear program, we were able to computationally determine an exact threshold ratio of 5/3, beyond which proving hardness would necessitate a new approach.
- Week 6 (7/3-7/9)
- This week, we made significant advances in our hardness results. In L2, we constructed a nice geometric object on the surface of a sphere that gave us naturally embeddable hard instances of 3-coloring; we used it to obtain hardness of essentially sqrt(1 + 1/√2) ≈ 1.306. Meanwhile in L1, we devised a simple embeddability gadget to obtain hardness within a factor of 4/3. By employing a more complicated gadget related to the Chvátal graph (my favorite graph of the summer), we were able to further improve this factor to 3/2. Despite all this progress, we still considered these results minor optimizations, as our techniques were still subject to the 5/3 barrier.
- Week 7 (7/10-7/16)
- This week, we tried to think of ways to achieve hardness within a factor of 2 in L1, which would also imply hardness within a factor of √2 in L2, thus solving both of our problems in a single stroke. Even stronger than our condition for clearing the 5/3 barrier, we realized that our hard instance of 3-coloring would need unbounded odd girth. After observing an even more restrictive condition, we realized that the gulf between our current techniques and the ones needed to close the hardness gap was wider than we had anticipated. In the interest of time, we decided to pause thinking about new ideas and begin working on our final report and presentation.
- Week 8 (7/17-7/23)
- The bulk of this week was spent crafting and preparing for our final presentation. After presenting, we got started on our final project writeup. On Friday, I flew to Prague as part of the exchange program with Charles University.
- Week 9 (7/24-7/30)
- We spent this week completing our final writeup. Organizing all of our results and explaining them neatly turned out to be quite valuable. In particular, we realized that the proof of hardness in L1 was much more ad hoc than the proof of hardness in L2, perhaps a sign that we need to start from a more general framework in the former case. Not much research was done this week, as in addition to the writeup we attended seminars at Charles University in Ramsey Theory, discrete geometry and algorithmic game theory. My favorite talk of the week involved extremal properties of visibility problems.
- Week 10 (7/31-8/4)
- This was the final week of the program, and it coincided with a combinatorics workshop at Charles University. Much of the content seemed to be geared towards experts in particular subfields of combinatorics, but it was definitely nice to learn about the various active areas of combinatorics research. We spent our free time this week doing a ton of sightseeing. Most of our tourism was inside of Prague, but my favorite excursion was actually a hike outside of the city, to Koněprusy Caves.
Presentations
Additional Information
- Mentors: Karthik C.S. , Henry Fleischmann
- Acknowledgements: This work was carried out while I was a participant in the 2023 DIMACS REU program at Rutgers University, supported by NSF grant CNS-2150186.