We are interested in extending various graph algorithms to a setting where we have only limited queries to a strong oracle but unlimited queries to a noisier weak oracle. This paradigm was proposed in KwikBucks [ICLR 2023] to solve the correlation clustering problem, and we believe it can also be applied to the Densest Subgraph problem.
Research Log
Week 1 (5/29-6/2)
I began this summer by familiarizing myself with the KwikBucks paper [3], which introduced the strong and weak oracle paradigm, as well as with some of the existing greedy algorithms for densest subgraph [1][2]. I also prepared a brief presentation on the basics of my project to present to the REU group.
In this setting, we have access to two different oracles to obtain information about our graph. First, we have unlimited node queries to our γ-noisy weak oracle, which returns a superset of the neighbors of the queried node of size at most 1+γ times the true neighborhood's size. Additionally, we have a limited budget of o(n^2) edge queries to a strong oracle that correctly outputs whether an edge is present in the graph.
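To make the model concrete for myself, here is a minimal Python sketch of the two interfaces. The class name, the way the weak oracle pads a neighborhood with uniformly chosen non-neighbors, and the budget bookkeeping are my own illustrative choices, not part of the KwikBucks specification.

import random

class TwoOracleGraph:
    """Illustrative wrapper exposing a gamma-noisy weak oracle (unlimited node
    queries) and an exact strong oracle (limited budget of edge queries)."""

    def __init__(self, adj, gamma, strong_budget):
        self.adj = adj                      # true adjacency: node -> set of neighbors
        self.gamma = gamma                  # noise parameter of the weak oracle
        self.strong_budget = strong_budget  # o(n^2) edge queries allowed in total

    def weak_query(self, v):
        # Return a superset of N(v) of size at most (1 + gamma) * |N(v)|.
        true_nbrs = set(self.adj[v])
        num_fake = int(self.gamma * len(true_nbrs))
        non_nbrs = [u for u in self.adj if u != v and u not in true_nbrs]
        fake = random.sample(non_nbrs, min(num_fake, len(non_nbrs)))
        return true_nbrs | set(fake)

    def strong_query(self, u, v):
        # Exact edge query; consumes one unit of the limited budget.
        if self.strong_budget <= 0:
            raise RuntimeError("strong oracle budget exhausted")
        self.strong_budget -= 1
        return v in self.adj[u]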
Week 2 (6/3-6/9)
This week, I focused on the greedy densest subgraph algorithm of Bahmani et al. [1]. It operates by making passes over the graph, each time removing all nodes whose degree is less than the threshold 2(1+ε)ρ(S), where ρ(S) is the density of the remaining node set S, and it ultimately obtains a 2(1+ε) approximation in log_{1+ε}(n) passes. I came up with two naive extensions of this algorithm using only the information generated by the weak oracle; both are instances of the generic peeling loop sketched below. By using the same threshold for the observed degrees, we underprune nodes, i.e. remove only a subset of the nodes that the actual algorithm would remove. Alternatively, we can set a threshold of 2(1+γ)(1+ε)ρ(S) and instead overprune nodes, making sure that we remove every node that would be removed by the actual algorithm.
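For reference, here is a rough Python sketch of that common peeling loop, with the threshold multiplier left as a parameter; the function signature and the stall check are my own, and the degree and density callables stand in for whichever (exact or weak-oracle) information is available.

def greedy_peel(nodes, degree_of, density_of, factor):
    """Generic peeling loop in the style of Bahmani et al. [1].

    degree_of(v, S) -- degree of v inside S (exact, or estimated from weak-oracle neighborhoods)
    density_of(S)   -- |E(S)| / |S| computed from the same information
    factor          -- threshold multiplier, e.g. 2*(1 + eps) or 2*(1 + gamma)*(1 + eps)
    """
    S = set(nodes)
    best, best_density = set(S), density_of(S)
    while S:
        rho = density_of(S)
        to_remove = {v for v in S if degree_of(v, S) < factor * rho}
        if not to_remove:
            # With exact degrees this only happens once S has no edges left (some node
            # always sits at or below the average degree 2*rho), but inflated
            # weak-oracle degrees can also stall the loop early.
            break
        S -= to_remove
        if S and density_of(S) > best_density:
            best, best_density = set(S), density_of(S)
    return best

Running this with exact degrees and factor = 2(1+ε) is the baseline algorithm, while plugging in weak-oracle degree estimates with factor 2(1+ε) or 2(1+γ)(1+ε) gives the under- and over-pruning variants described above.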
I then worked through the calculations to see how far the (up to) 1+γ multiplicative noise on the observed degrees, and thus on the density, would propagate through the analysis. The overpruning algorithm achieves a 2(1+γ)^3(1+ε) approximation while running in the same number of passes, whereas the underpruning algorithm obtains a substantially better 2(1+γ)(1+ε) approximation, at the cost of no longer running in a logarithmic number of passes for non-negligible γ. However, both of these algorithms rely on a crucial, and incorrect, assumption: that the 1+γ upper bound on neighborhood size continues to hold as the true neighborhood shrinks during node pruning. It remains to be seen whether this assumption can be corrected for by judicious use of the strong oracle.
Week 3 (6/10-6/16)
I first explored different approaches for strong oracle queries that would let us lower the 1+γ upper bound on neighborhood size as the true neighborhood shrinks. However, all of the remotely viable approaches we found required Ω(n^2) strong oracle queries and were therefore ineffective. Further thought on this led to the conclusion that our blackbox formulation of the weak oracle is not sufficient for the densest subgraph problem. Specifically, the performance of the correlation clustering algorithm is an aggregate of its performance at the subgraph level across the graph; the algorithm can therefore tolerate poor performance on the portions of the graph where the strong oracle is not queried and still maintain a decent solution. On the other hand, the greedy densest subgraph algorithms [1][2] rely on the existence, at some iteration of the execution, of a set S such that c·ρ(S) ≥ ρ(S*), where S* is the densest subgraph and c is the approximation ratio.
In light of this, I looked to construct a probabilistic weak oracle. Specifically, we want the observed neighborhood size to equal 1+γ times the true neighborhood's size in expectation. This means the weak oracle is more likely to output non-neighbors of a node when that node has higher degree. While at first this task seemed reasonable, I ran into the complication that the two endpoints of an edge can have different degrees. I was unable to get around this issue, and ultimately could not come up with any consistent probabilistic weak oracle with the desired behavior.
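To record where the construction breaks down (this is my own way of making the obstruction explicit, under the assumption that each non-neighbor of a node is reported with the same probability): for the observed neighborhood of a node v with true degree d(v) to have expected size (1+γ)·d(v), the oracle must report γ·d(v) of v's n-1-d(v) non-neighbors in expectation, so each non-edge {u,v} would need to be reported with probability γ·d(v)/(n-1-d(v)) to satisfy v, but with probability γ·d(u)/(n-1-d(u)) to satisfy u. These two requirements conflict whenever d(u) ≠ d(v), which is exactly the inconsistency above.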
Week 4 (6/17-6/23)
This week, we pivoted away from the KwikBucks [3] strong and weak oracle paradigm and moved towards a new probabilistic model, partially inspired by Vihan Shah's talk on Learning-augmented Maximum Independent Set at the Rutgers Theory of Computation Reading Group the previous Wednesday. In our new model of computation, we observe a noisy graph in which existing edges appear with probability 1-γ and missing edges appear with probability γ. The benefits of this model are immediate: since each edge is a simple Bernoulli variable, the observed degree of each node is a random variable equal to the sum of two binomial variables. From this, I calculated the expected values of the observed degree and observed density in terms of the true degree and density, respectively. I then began looking into how to extend the algorithm of Bahmani et al. [1] to this setting.
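For my own reference, here is the form these expectations take, writing d(v) for the true degree of v, d'(v) for its observed degree, ρ(S) = |E(S)|/|S| for the density convention of [1], and ρ'(S) for the observed density (if a different normalization is used, the constants change accordingly). The observed degree is the sum of a Binomial(d(v), 1-γ) and a Binomial(n-1-d(v), γ) variable, so

E[d'(v)] = (1-γ)·d(v) + γ·(n-1-d(v)) = (1-2γ)·d(v) + γ·(n-1),

and, since each of the |S|(|S|-1)/2 pairs inside a set S is flipped independently,

E[ρ'(S)] = (1-2γ)·ρ(S) + γ·(|S|-1)/2.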
Week 5 (6/24-6/30)
Taking a step back to look at the big picture, my mentor Shahrzad Haddadan and I first laid out the statements necessary for the greedy algorithm's analysis to go through in this noisy setting. A key idea is that the observed degree of nodes in the densest subgraph S* must not be much smaller than the true degree, and likewise, the observed density of the current subgraph S must not be much greater than the true density. If both of these hold, we are able to continue with the original analysis and obtain a functional algorithm. At a significant sacrifice in the quality of the approximation, I was able to prove these statements with high probability via a pair of Chernoff-Hoeffding bounds. This led to my first result of the program: an approximation with log(n) multiplicative error and even worse additive error.
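For concreteness, the bounds in question are of the standard multiplicative Chernoff-Hoeffding form, stated here generically (the exact parameters in my write-up differ): if X is a sum of independent Bernoulli random variables with mean μ, then for any 0 < δ < 1,

Pr[X ≤ (1-δ)·μ] ≤ exp(-δ^2·μ/2)   and   Pr[X ≥ (1+δ)·μ] ≤ exp(-δ^2·μ/3).

Roughly, the first form is applied to the observed degrees of nodes in S*, the second to the observed number of edges inside the current S, and a union bound over the nodes and the iterations of the algorithm yields the high-probability guarantee.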
On another note, on Wednesday 6/26 I gave an hour-and-a-half talk at the Rutgers Theory of Computation Reading Group! I presented my research from this spring, Most Vital and k Most Vital Nodes for Max-Flow in Degree 2-Bounded Satellite Networks, carried out at the University of Southern California's Quantitative Evaluation and Design Group under Leana Golubchik, Marco Paolieri, and Samir Khuller (of Northwestern University).
Week 6 (7/1-7/7)
This week, we decided to split our proof into two cases, depending on whether |S| is larger or smaller than c·log(n)/ε for some constant c. From further calculation, I found that this reduces the log(n) multiplicative error down to a constant, and additionally lowers the additive error somewhat. This is a great improvement, but the additive error is still far larger than we would like. We think that the strong oracle will come into play in the second part of the algorithm, when S is small: the number of edges in the remaining subgraph S should be o(n^2), so we can query each edge and run the original greedy algorithm without any noise. A rough sketch of how the two phases fit together appears below.
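Here is a rough Python sketch of that two-phase structure; the function arguments, the cutoff c·log(n)/ε, and the exact-recovery step are placeholders tied to the analysis above, not a finalized algorithm. In particular, I have kept the original 2(1+ε) threshold in the first phase, where the real algorithm would use the corrected noisy threshold.

import math

def hybrid_densest_subgraph(nodes, noisy_degree, noisy_density, strong_query, eps, c):
    """Two-phase sketch: noisy peeling while |S| is large, exact recovery once |S| is small."""
    n = len(nodes)
    size_cutoff = c * math.log(n) / eps
    S = set(nodes)

    # Phase 1: peel using noisy degree information while |S| > c*log(n)/eps.
    # (The 2*(1 + eps) factor is a placeholder for the corrected noisy threshold.)
    while len(S) > size_cutoff:
        rho = noisy_density(S)
        to_remove = {v for v in S if noisy_degree(v, S) < 2 * (1 + eps) * rho}
        if not to_remove:
            break
        S -= to_remove

    # Phase 2: |S| is small, so querying all O(|S|^2) = o(n^2) pairs with the
    # strong oracle is affordable; recover the exact subgraph induced on S.
    S_list = list(S)
    exact_adj = {v: set() for v in S_list}
    for i, u in enumerate(S_list):
        for v in S_list[i + 1:]:
            if strong_query(u, v):
                exact_adj[u].add(v)
                exact_adj[v].add(u)

    # The noise-free greedy algorithm of [1]/[2] would now run on exact_adj; how its
    # output combines with the phase-1 pruning is what the analysis above pins down.
    return exact_adj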
Week 7 (7/8-7/15)
Continuing my analysis from last week, I finished calculating the messy details of the algorithm in the case of large noise. From this, I then analyzed the algorithm's effectiveness in cases with a constant amount of noise. It took a while to figure out how to piece together the bounds we needed, but ultimately we obtained constant additive and multiplicative error, which was our goal.
Week 8 (7/16-7/22)
This Wednesday, I gave my final presentation on my research from this summer. Most of my time went into preparing for it, and then subsequently traveling to Prague, where I will be staying at Charles University for two weeks! On the research side, I conjectured a lower bound on the magnitude of the additive error and began considering techniques to prove it.
Week 9 (7/23-7/30)
I spent the last week of my REU at Charles University in Prague, where I got to attend daily seminars in Combinatorics and Discrete Geometry. I also worked hard on preparing a research paper containing all of my work from this summer, which was both rewarding and insightful.
References
[1] Bahmani, B., Kumar, R., and Vassilvitskii, S. (2012). Densest subgraph in streaming and MapReduce. CoRR, abs/1201.6567.
[2] Charikar, M. (2000). Greedy approximation algorithms for finding dense components in a graph. In Approximation Algorithms for Combinatorial Optimization (APPROX 2000), volume 1913 of Lecture Notes in Computer Science, pages 84–95.
[3] Silwal, S., Ahmadian, S., Nystrom, A., McCallum, A., Ramachandran, D., and Kazemi, S. M. (2023). KwikBucks: Correlation clustering with cheap-weak and expensive-strong signals. In The Eleventh International Conference on Learning Representations (ICLR).
Acknowledgements
This work was carried out while the author Eli Friedman was a participant in the 2024 DIMACS REU program at Rutgers University, supported by NSF grant CNS-2150186.