About me

I am a CS undergraduate at the Faculty of Mathematics and Physics of Charles University in Prague, Czechia. I study artificial intelligence, am interested in logic and abstract algebra, and have done some quantum computing. I also play a lot of jazz piano, from solo to big band.

You can also visit my personal website.

General information

Introduction

In Euclidean data clustering, we seek to assign input data points to clusters so that the maximum distance between points within the same cluster is minimized (or, alternatively, so that the maximum distance of points from their cluster's center is minimized). Through reductions from graph problems (\(k\)-coloring, vertex cover), one can show that in \(\ge 3\) dimensions not only is the exact task NP-complete, but even approximating it to within certain factors is NP-hard. Our aim in this project is to improve the known approximation factors, both on the NP-hardness side and on the polynomial-time algorithm side.

Def. Let \((X, \text{dist})\) be a metric space and \(C \subset X\). Define \[ \text{diam}(C) := \text{max } \{ \text{dist}(x,y) \mid x, y \in C \}. \]

Def. Further, for a collection of subsets \(C_1, C_2, \dots, C_k \subset X \), \[ \text{diam}(\{C_1, C_2, \dots, C_k\}) := \text{max } \{ \text{dist}(x, y) \mid i \in [k] \;\; \& \;\; x, y \in C_i \}. \]

Max-\(k\)-Diameter - let \((X, \text{dist})\) be a metric space and let \(k\) be a constant. Given as input a finite \(P \subset X\), find a \(k\)-clustering \(C_1, C_2, \dots, C_k\) of \(P\) that minimizes \(\text{diam}(\{C_1, C_2, \dots, C_k\})\).

\(r\)-approximate Max-\(k\)-Diameter - given \((X, \text{dist})\), \(k\), and input \(P \subset X\), let \(\Delta := \text{min }\text{diam}(\{C_1, C_2, \dots, C_k\})\) over all possible \(k\)-clusterings of \(P\). Find a \(k\)-clustering with diameter at most \(r\Delta\).
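To make the objective concrete, here is a minimal Python sketch of the diameter computations (the function names are ours, and dist stands for an arbitrary metric):

    import itertools

    def diam(cluster, dist):
        """diam(C): the largest pairwise distance within a single cluster."""
        return max((dist(x, y) for x, y in itertools.combinations(cluster, 2)),
                   default=0.0)

    def clustering_diameter(clusters, dist):
        """diam({C_1, ..., C_k}): the largest within-cluster distance overall."""
        return max(diam(C, dist) for C in clusters)

    # Example under the Hamming distance on {0,1}^n:
    hamming = lambda x, y: sum(a != b for a, b in zip(x, y))
    print(clustering_diameter([[(0, 0, 0), (0, 0, 1)], [(1, 1, 1)]], hamming))  # 1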

See the following table for the state-of-the-art approximation factors \(r\).

Metric               NP-hardness \(r\)    PTIME \(r\)
\(\ell_\infty\)      \(2-\epsilon\)       \(2\)
\(\ell_0 / \ell_1\)  \(1.5-\epsilon\)     \(1.304\)
\(\ell_2\)           \(2-\epsilon\)       \(1.415\)

State-of-the-art approximation factors. [1]

Weekly updates

Week 1

We had several orientation talks after our arrival and met our mentor Karthik C. S. and his doctoral student Mursalin. I read the paper [1], which outlines the relevant state of the art and the known barriers to further research.

Week 2

We gave a short introductory talk in which we outlined the area and directions for further research. We started working on some problems (proving that a \((2 - \epsilon)\)-approximation can be computed in polynomial time in the Boolean space \(\{0, 1\}^n\) under the Hamming distance, with some additional constraints), building on previous results and on a better understanding of the techniques they use.

Week 3

We started looking into \(k\)-clustering with outliers, where the task is to exclude an arbitrary subset of points of bounded size so that the \(k\)-clustering of the remaining points is optimized. This task has many variations [2]. Of particular interest is \(k = 1\), where the task essentially reduces to finding the minimum enclosing ball of the remaining points. We tried adapting Trevisan's encoding using Hadamard codes, as well as \(r\)-cloud systems [1], to find approximation NP-hardness bounds for this problem.
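For intuition, here is a brute-force sketch of the \(k = 1\) variant (our own toy illustration, stated with the diameter objective from above rather than the ball radius; it enumerates all outlier sets, so it is feasible only for tiny inputs):

    import itertools

    def one_clustering_with_outliers(points, z, dist):
        """Enumerate every set of z outliers and return the choice that
        minimizes the diameter of the remaining points."""
        best_diam, best_outliers = float("inf"), None
        for outliers in itertools.combinations(range(len(points)), z):
            rest = [p for i, p in enumerate(points) if i not in outliers]
            d = max((dist(x, y) for x, y in itertools.combinations(rest, 2)),
                    default=0.0)
            if d < best_diam:
                best_diam, best_outliers = d, set(outliers)
        return best_diam, best_outliers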

Week 4

Although approximating \(k\)-clustering with outliers with respect to the cluster diameters alone is hard, relaxing the problem so that the number of outliers is also approximated makes it more tractable. We applied the idea of core-sets [3] to solve this relaxed problem efficiently, and we started writing our ideas down in a shared Overleaf document. We also established a \(\frac{4}{3}\)-inapproximability of simple \(1\)-clustering with outliers for the \(\ell_1\) metric, a \(\sqrt{\frac{4}{3}}\)-inapproximability for the \(\ell_2\) metric, and a \(2\)-inapproximability for the \(\ell_\infty\) metric.
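For the minimum enclosing ball, a standard core-set iteration in the spirit of [3] (a simplified sketch, not our exact algorithm) computes a \((1+\epsilon)\)-approximation by repeatedly nudging the center toward the farthest point:

    import numpy as np

    def approx_meb(points, eps=0.1):
        """(1+eps)-approximate minimum enclosing ball: O(1/eps^2) rounds,
        each moving the center a shrinking step toward the farthest point;
        the farthest points encountered form a core-set."""
        pts = np.asarray(points, dtype=float)
        center = pts[0].copy()
        for t in range(1, int(np.ceil(1.0 / eps ** 2)) + 1):
            farthest = pts[np.argmax(np.linalg.norm(pts - center, axis=1))]
            center += (farthest - center) / (t + 1)  # steps 1/2, 1/3, ...
        return center, np.linalg.norm(pts - center, axis=1).max()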

Week 5

This week we started searching for graph embeddings into metric spaces that would improve on the previously demonstrated inapproximability factors, ideally getting as close to 2 as possible under the \(\ell_1\) metric. We wrote a computer program that helps us verify our graph embedding ideas and that we can also use to search for efficient embeddings by brute force, by random sampling, or with a genetic algorithm. We found efficient edge-embeddings of \(P_3\), \(P_4\), and \(C_4\), but unfortunately we were unable to find efficient edge-embeddings for bigger graphs.
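Roughly, the core check our program performs can be sketched as follows (a simplified illustration; here we take a \(c\)-edge-embedding to mean an assignment of points of \(\{0,1\}^n\) to the vertices such that the smallest Hamming distance across an edge is at least \(c\) times the largest distance across a non-edge):

    import itertools

    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    def embedding_ratio(vertices, edges, point_of):
        """Smallest edge distance divided by largest non-edge distance;
        under our working definition, a c-edge-embedding has ratio >= c."""
        edge_set = {frozenset(e) for e in edges}
        edge_d, non_edge_d = [], []
        for u, v in itertools.combinations(vertices, 2):
            bucket = edge_d if frozenset((u, v)) in edge_set else non_edge_d
            bucket.append(hamming(point_of[u], point_of[v]))
        return min(edge_d) / max(non_edge_d) if non_edge_d else float("inf")

    # Toy check: a 2-edge-embedding of the path P_3 (our illustration).
    pts = {0: (0, 0, 0, 0, 0, 0), 1: (1, 1, 1, 1, 0, 0), 2: (1, 0, 0, 0, 0, 1)}
    print(embedding_ratio([0, 1, 2], [(0, 1), (1, 2)], pts))  # 2.0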

Week 6

We found efficient edge-embeddings for all graphs on 4 vertices and for some relevant graphs on 5 vertices, and we tried to generalize these to bigger graphs. We experimented with different embedding ideas, including further Hadamard codes, adjacency-based characteristic functions of all vertex subsets, and generalizing patterns from the \(\{0,1\}^n\) embeddings found for smaller graphs by shuffling or composing them. We extended our program to accommodate the new types of embeddings and help us search through them. The goal is to find an efficient embedding of \(C_n\) for arbitrary \(n\).

Week 7

We were able to find a 2-edge-embedding of \(C_6\), but generalizing to bigger \(C_n\) proved very difficult. We unsuccessfully tried to construct larger edge- or vertex-embeddings inductively (e.g., \(C_{n+1}\), \(C_{2n}\), or \(C_{2n+1}\) from \(C_n\)). Had this worked, it could have allowed us to efficiently embed general 3- or 4-edge-colorable graphs and improve our bounds.

We then started trying a "reverse" approach: we encode the desired pairwise distances of the embedded graph into a matrix and use the Cholesky decomposition to obtain corresponding points. This yielded embeddings that were sometimes as good as our previous attempts; however, we were not able to derive a useful generalization from this either.
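Concretely, this step is classical multidimensional scaling: double-center the squared distances to obtain a Gram matrix, then read the coordinates off its Cholesky factor (a minimal numpy sketch with an illustrative target; the Gram matrix is positive semidefinite exactly when the distances are realizable in \(\ell_2\)):

    import numpy as np

    def points_from_distances(D):
        """Recover points whose pairwise Euclidean distances match D."""
        D = np.asarray(D, dtype=float)
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
        G = -0.5 * J @ (D ** 2) @ J                   # Gram matrix G = X X^T
        L = np.linalg.cholesky(G + 1e-9 * np.eye(n))  # tiny ridge: G may be singular
        return L                                      # rows are the recovered points

    # Example: three points at pairwise distance 1 (an equilateral triangle).
    D = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
    X = points_from_distances(D)
    print(np.linalg.norm(X[0] - X[1]))  # ~1.0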

Week 8

We prepared our final presentation for the REU participants. This, together with a few last meetings with Karthik, also gave us some directions for the final report due in week 9. In the presentation, we outlined our findings and explained some of the background constructions needed to prove them. We also prepared for our departure for Prague.

Week 9

We are now in Prague. We attended some interesting lectures and exercise sessions at Charles University while working on our final report.

Acknowledgements

This work was carried out while the author, Patrik Zavoral, was a participant in the 2024 DIMACS REU program at Rutgers University, supported by CoSP, a project funded by the European Union's Horizon 2020 research and innovation programme, grant agreement No. 823748.

I am part of a group of students from Charles University which includes Ben Bencik, Adam Dzavoronok, Guillermo Gamboa, Jelena Glisic, Robert Jaworski, Tymur Kotkov, Sofiia Kotsiubynska, Julia Krizanova, Volodymyr Kuznietsov, Tymofii Reizin, Jakub Sosovicka, Filip Uradnik, and Patrik Zavoral.

References

  1. Fleischmann, H., Karlov, K., Padaki, A., & Zharkov, S. (2023). Inapproximability of Maximum Diameter Clustering for Few Clusters. arXiv preprint arXiv:2312.02097.
  2. Charikar, M., Khuller, S., Mount, D. M., & Narasimhan, G. (2001, January). Algorithms for facility location problems with outliers. In SODA (Vol. 1, pp. 642-651).
  3. Bădoiu, M., Har-Peled, S., & Indyk, P. (2002, May). Approximate clustering via core-sets. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing (pp. 250-257).