How do you provide information for large dynamic graphs while maintaining full privacy if a breakin occurs? This question has been asked before on general large statistical datasets, but has not yet been addressed for queries specific to graphs.
In this project, I plan to formulate what pan-privacy [DNPRY] means for graph streaming algorithms. Then, I will try to develop algorithms in the semi-streaming model for graph queries that maintain pan-privacy. I will also try to understand what limitations pan-privacy imposes on graph streaming algorithms.
Week 1:
I spent this week reading background material relevant to my project. In particular, I read [CM1] in order
to better understand streaming algorithms and [DNPRY] as an introduction to pan-privacy. I then began to look into
graph streaming algorithms [M] to see how to apply ideas from [DNPRY] to queries such as connectivty in graphs.
I created a presentation introducing and motivating my project for the summer.
Week 2:
I continued reading background material for my project this week. I read [MMNW] as an example of sketches for
pan-private streaming algorithms. These work in the dynamic setting, where insertions and deletions are allowed.
I also looked at a survey [G] on property testing literature for graphs. Property testing could motivate other
approaches or queries for pan-private streaming algorithms.
I then started trying to apply techniques of [DNPRY] and [AGM] to estimating the fraction of small subgraphs, like $k$-Cliques,
in a graph. These techniques apply to the insertion-only setting.
Week 3:
I worked on pan-private triangle counting in graphs. I ran into problems when trying to come up with a multiplicative error
rather than an additive error. I started reading about non-pan-private techniques used to count triangles in graphs [CJ].
Hopefully pan-private techniques will be applicable to some of these methods to get interesting results.
I started looking at a simplier problem of counting degrees of vertices in a graph.
This is interesting to look at in some variations of the pan-private model described
in [D].
Week 4:
Many non-trivial algorithms on data streams in the graph model require sampling uniformly random edges or vertices.
Unfortunately sampling cannot be done in a pan-private manner, so we would need a completely different
approach to come up with pan-private graph algorithms. As a result, I have turned my attention towards looking at
pan-privacy for geometric problems.
I'm considering the model where points are inserted or deleted in a stream to maintain a current point set.
At any time, I want to answer queries on the current point set such as the area of the convex hull, the diamater
of the points, or the cost of a minimum spanning tree connecting the points. Not much has been done in this area
related to differential privacy, let alone pan-privacy. [CM2] and [I] provide nice solutions to various geometric queries on data
streams. [FFKN] studies private coresets for summarizing point sets.
I am now turning my attention to problems like these. I first need to come up with a meaningful definition of privacy
in the geometric stream model. Then I want to try to adapt previous results to get pan-private algorithms for meaningful
queries.
Week 5:
This week I continued to look at pan-privacy in the geometric stream model. Motivated by the approaches used in [I],
I am layering a discretized grid, or a layer of grids, over the space the points can lie in. Then I want to see
what queries I can approximate using only aggregate statistical data on the cells in the grid that can be
computed pan-privately.
One big problem is the apparent need for an additive error term in various geometric approximations.
For example, adding or moving one point in a stream can arbitrarily alter the answer to many queries.
I want to come up with a realistic model that gets past this limitation.
Week 6:
I tried different approaches for pan-privately computing the cost of an MST in the geometric setting. None of these could get past an additive error. This seems to be the case for computing $k$-median, convex hull, and other problems as well. However, a multiplicative error seems plausible for the facility location problem. I am reading through [CLMS] to see if techniques from that paper could be implemented pan-privately.
Week 7:
I continued to work on the facility location problem in the pan-private setting. I also created my final presentation for the REU program.
I also spent some time this week at the DIMACS workshop in cryptography this week.
Week 8:
This week, I travelled to Prague. We saw 4 lectures in prepartion for the conference next week. The lectures covered: 1) Computing the hamilton path of interval graphs in polynomial time, 2) using tree width to approximate the size of the largest independent set in planar graphs, 3) the chromatic polynomial and combinatorial reciprocity, and 4) interesting properties of THE countably infinite random graph.
I also worked on writing my final report for my summer project.
References
[AGM] K. J. Ahn, S. Guha, A. McGregor. Graph Sketches: Sparsification, Spanners, and Subgraphs.
[CJ] G. Cormode and H. Jowhari. A Second Look at Counting Triangles in Graph Streams.
[CM1] G. Cormode and S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applcations.
[CM2] G. Cormode and S. Muthukrishnan. Radial Histograms for Spatial Streams.
[D] C. Dwork. Differential Privacy in New Settings.
[DNPRY] C. Dwork, M. Naor, T. Pitassi, G. N. Rothblum, and S. Yekhanin. Pan-Private Streaming Algorithms.
[FFKN] D. Feldman, A. Fiat, H. Kaplan, and K. Nissim. Private Coresets.
[G] O. Goldreich. Introduction to Testing Graph Properties.
[I] P. Indyk. Algorithms for Dynamic Geometric Problems over Data Streams.
[M] A. McGregor. Graph Stream Algorithms: A Survey.
[MMNW] D. Mir, S. Muthukrishnan, A. Nikolov, and R. N. Wright. Pan-private Algorithms Via Statistics on Sketches.