PROJECT SUMMARY:

Differential privacy is a mathematical framework for quantifying the privacy risks of computing with sensitive data about individuals. It is a statistical/probabilistic notion of privacy that can be used to generate privacy-protecting summaries of data. The US Census Bureau used differential privacy to publish results from the 2020 decennial census, and many government agencies are trying to learn how to use differential privacy in their work. In many cases, private data is summarized through visualizations and descriptive statistics, yet much of the work on differential privacy has focused on machine learning and data publishing. The student on this project will survey existing work in differential privacy and private visualization, choose a type of visualization, and evaluate the methods on data sets. The goal is to implement different data visualization techniques and see how privacy affects our qualitative (visual) understanding of the structure of data.

About Me


Victoria Parizot

Computer Science at Harvey Mudd College

Project: Differential Privacy and Data Visualization
Program: DIMACS at Rutgers University
Office: Core 440
Email: vparizot@g.hmc.edu
Mentor: Professor Anand D. Sarwate

ABSTRACT:

"Visual Utility of Differentially Private Scatterplots Under US Census Data"

Abstract: In data-driven research fields like healthcare and technology, access to sensitive individual information is crucial for generating valuable insights. However, privacy regulations pose challenges for publishing such data, hindering researchers' access. Differential privacy has emerged as a promising solution, allowing researchers to learn from sensitive data while protecting individual privacy. This research focuses on generating differentially private scatterplots, a common data visualization tool, while retaining visual utility. The approach combines partitioning the data, adding calibrated noise, and post-processing to suppress noise. The study explores factors such as ε (the privacy level), the algorithm, bin size, data distribution, and sample size to understand their impact on visual utility. Two small datasets, derived from the US Census and the World Health Organization, are used as direct applications of differential privacy. The results highlight the trade-offs between privacy and visualization integrity, aiding researchers in making informed decisions to balance privacy and data visualization accuracy. Future directions include exploring more sophisticated point regeneration techniques and integrating differentially private heatmap techniques to improve visual accuracy.

Weekly Logs:

Week 1 (5/29/23):
This week, I finally got to meet with my mentor in person, where he explained his expectations and provided a reading list to help me and the other undergraduate students get up to speed. I read explanatory articles and technical papers to better understand the underlying principles of differential privacy and federated learning. I also befriended the other undergraduate students in my lab, and we formed a reading group so that we can share and discuss articles throughout the summer. I also met the other students in the DIMACS REU and constructed my website.

Week 2 (6/5/23):
In preparation for the literature review I aim to complete with my lab mates, we began to sort through a folder of research publications on data visualization and differential privacy. At our advisor's suggestion, we traced the citations to find additional sources. To keep track of our progress, we have begun to tag the papers and share notes. As we continue sorting, we will identify key ideas and narrow the scope of our literature review.

Week 3 (6/12/23):
At the beginning of the week I began analyzing differentially private CDFs constructed from histograms. To do so, I needed to learn the Pandas, NumPy, and Matplotlib libraries. My approach began by generating data from a Gaussian, Laplace, or Beta distribution with chosen parameters. Then, I created histograms and added Laplace or Gaussian noise to the counts to generate differentially private counts. I summed these counts to construct an estimated CDF. I could then compare my approximate 'noisy' CDF against the true CDF (no noise) using the following metrics: mean squared error, total variation distance, and L1 distance. The next step will be to combine my approach with those of my lab mates in a collaborative repository.
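A minimal sketch of that pipeline (not my exact code; the parameter values and the exact metric definitions are illustrative of my approach rather than a definitive implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_cdf_demo(data, bins=50, epsilon=1.0):
    counts, edges = np.histogram(data, bins=bins)
    # Each person falls into exactly one bin, so the histogram has
    # sensitivity 1 and Laplace noise with scale 1/epsilon suffices.
    noisy_counts = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy_counts = np.clip(noisy_counts, 0, None)   # simple post-processing
    true_cdf = np.cumsum(counts) / counts.sum()
    noisy_cdf = np.cumsum(noisy_counts) / noisy_counts.sum()
    mse = np.mean((noisy_cdf - true_cdf) ** 2)
    tv = 0.5 * np.abs(noisy_counts / noisy_counts.sum()
                      - counts / counts.sum()).sum()
    l1 = np.abs(noisy_cdf - true_cdf).sum()
    return noisy_cdf, true_cdf, (mse, tv, l1)

data = rng.normal(loc=0.0, scale=1.0, size=1000)   # Gaussian example
_, _, (mse, tv, l1) = dp_cdf_demo(data, bins=50, epsilon=0.5)
print(f"MSE={mse:.3g}, TV={tv:.3g}, L1={l1:.3g}")
```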

Week 4 (6/19/23):
I've outlined a project for the remainder of the summer. Alongside the literature review I am completing with my peers, I am going to evaluate different techniques for making differentially private scatterplots. Differential privacy aims to make it indistinguishable to an attacker whether a person's data is included in the data set. However, each point in a scatterplot represents an individual, so to make a scatterplot differentially private, one must hide or warp the individual data points. A common technique for making DP scatterplots is to bin the data, count the points in each bin, and add noise to the counts. However, this creates a plot that resembles a heatmap more than a scatterplot, so data curators cannot use traditional scatterplot techniques, like fitting. Moreover, it calls into question the ability to use DP scatterplots for identifying relationships between two variables and observing patterns in the data. Through my analysis, I aim to mitigate the impact of DP on scatterplot analysis, preserving the use of traditional techniques. This will involve analysis of multiple factors, such as the privacy parameters, bin count, sensitivity, post-processing, and pre-processing.
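As a concrete illustration of the bin, count, and add noise technique, here is a hedged sketch (the function name, synthetic data, and parameter choices are my own, not taken from any particular paper):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

def dp_2d_histogram(x, y, bins=20, epsilon=1.0):
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # One individual affects exactly one cell, so the sensitivity is 1.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    return np.clip(noisy, 0, None), xedges, yedges

x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(scale=0.5, size=1000)
noisy, xe, ye = dp_2d_histogram(x, y, bins=20, epsilon=1.0)

plt.imshow(noisy.T, origin="lower",
           extent=[xe[0], xe[-1], ye[0], ye[-1]], aspect="auto")
plt.title("DP binned counts: more heatmap than scatterplot")
plt.show()
```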

Week 5 (6/26/23):
I've been going through the preliminary release and accompanying code for "Investigating the Visual Utility of Differentially Private Scatterplots" by Panavas et al. and "Visualization and Differential Privacy" by Lee. Most notably, Lee converts a differentially private 2D histogram back into a scatterplot by randomly generating points within each bin's area. I plan to combine this concept with data-dependent mechanisms, such as DAWA, AHP, and AGRID, for more accurate results.
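A rough sketch of how I understand the point-regeneration idea (the function name and structure are mine): scatter roughly `count` synthetic points uniformly inside each bin of a noisy 2D histogram, like the one produced in the Week 4 sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

def regenerate_points(noisy_counts, xedges, yedges):
    """Turn noisy 2D bin counts back into a scatterplot-like point cloud."""
    xs, ys = [], []
    for i in range(noisy_counts.shape[0]):
        for j in range(noisy_counts.shape[1]):
            k = int(round(noisy_counts[i, j]))
            if k <= 0:
                continue
            # Place k synthetic points uniformly within bin (i, j).
            xs.append(rng.uniform(xedges[i], xedges[i + 1], size=k))
            ys.append(rng.uniform(yedges[j], yedges[j + 1], size=k))
    if not xs:
        return np.array([]), np.array([])
    return np.concatenate(xs), np.concatenate(ys)
```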

Week 6 (7/03/23):
The technique of regenerating points, outlined in Week 5, creates outliers when noise is added to low-count bins, which heavily warps the line of best fit (LOBF). One method to combat this effect is to implement the CBP mechanism to create an (n,1) "crowd-blending private" scatterplot [Crowd-Blending Privacy, Gehrke et al.]: points for a bin are only generated when the bin count > n; otherwise, no points are generated. One thing to note is that this post-processing step also eliminates outliers that existed in the original data, producing what I believe is a LOBF comparable to one fit to the original data with its outliers removed. There are currently no known ways to preserve outliers while satisfying (1,inf)-DP requirements [Lee]. However, since DP is applied to datasets where we care about the population and not the individual, this does not seem like a serious loss. I plan to further explore and implement pre- and post-processing that addresses these introduced outliers.
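A minimal sketch of that thresholding step (the threshold value is illustrative, not tuned), meant to plug into the regeneration sketch from Week 5:

```python
import numpy as np

def threshold_counts(noisy_counts, n=5):
    """Suppress bins whose noisy count falls below the crowd size n."""
    suppressed = noisy_counts.copy()
    suppressed[suppressed < n] = 0
    return suppressed

# Combined with the Week 5 sketch:
# rx, ry = regenerate_points(threshold_counts(noisy, n=5), xe, ye)
```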

Week 7 (7/10/23):
In Week 7 I implemented post-processing techniques in my code skeleton and explored the possibility of integrating DP heatmap techniques with scatterplot techniques. Inspired by Lee's thesis, I implemented code to generate DP 2D heatmaps, DP scatterplots with regenerated points, and DP scatterplots with regenerated points and thresholding.
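For reference, a quick side-by-side comparison of those three variants, assuming the helper functions and variables (`dp_2d_histogram`, `regenerate_points`, `threshold_counts`, `noisy`, `xe`, `ye`) from the earlier sketches are in scope:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].imshow(noisy.T, origin="lower", aspect="auto")
axes[0].set_title("DP 2D heatmap")

rx, ry = regenerate_points(noisy, xe, ye)
axes[1].scatter(rx, ry, s=5)
axes[1].set_title("Regenerated points")

tx, ty = regenerate_points(threshold_counts(noisy, n=5), xe, ye)
axes[2].scatter(tx, ty, s=5)
axes[2].set_title("Regenerated + thresholding")

plt.tight_layout()
plt.show()
```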

Week 8 (7/17/23):
My final week at Rutgers consisted of preparing for my final presentation and using my code skeleton to run tests with my chosen parameters. My final presentation, which can be found here, outlines the importance and background of my problem and presents some of the parameter tradeoffs and future directions. However, my presentation only includes preliminary results, as I had not yet completed the tests. On Wednesday I presented to my lab and received feedback that I incorporated before my official presentation the following day.
Alongside work on my presentation, I also ran my code skeleton on two datasets derived from the 2021 US Census Microdata and the World Health Organization COVID-19 dataset. These tests look at the interaction between different values of epsilon, bin sizes, and post-processing techniques for small data sets (n ≈ 1000). I generated 144 plots that illustrate the parameter tradeoffs for small sample sizes, with an emphasis on post-processing techniques to increase visual utility. In hindsight, I would automate the testing process so that I would not need to manually change the parameters in my code.
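Something like the sweep below, built on the earlier sketches and with illustrative parameter values, would have saved the manual editing:

```python
import itertools
import matplotlib.pyplot as plt

epsilons = [0.1, 0.5, 1.0, 2.0]
bin_counts = [10, 20, 40]
postprocessing = ["none", "threshold"]

for eps, bins, post in itertools.product(epsilons, bin_counts, postprocessing):
    noisy, xe, ye = dp_2d_histogram(x, y, bins=bins, epsilon=eps)
    if post == "threshold":
        noisy = threshold_counts(noisy, n=5)
    rx, ry = regenerate_points(noisy, xe, ye)
    plt.figure()
    plt.scatter(rx, ry, s=5)
    plt.title(f"eps={eps}, bins={bins}, post={post}")
    plt.savefig(f"dp_scatter_eps{eps}_bins{bins}_{post}.png")
    plt.close()
```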
At the end of this week, I will fly to Prague to take part in a two-week program at Charles University built around a combinatorics workshop.

Week 9 (7/24/23) PRAGUE:
On Sunday, the day after we arrived, Kuba showed us (the American students) around Prague. We saw the popular tourist attractions, like the Charles Bridge and Vysehrad, and familiarized ourselves with the route to the Charles University math building. The first week in Prague consisted of morning lectures followed by practice problems after lunch. Each day had a different lecturer from Charles University and a different topic. My favorite lectures were on the Erdos-Szekeres theorem and open visibility problems, given by Jan Kyncl and Martin Balko, respectively. In the afternoon practice sessions the lecturers would present problems of varying difficulty. They encouraged us to work in groups and would wander around, offering advice and answering questions. I worked with a different group of students each day and was able to solve some of the problems. Typically, the lecturer would note which problems had the most engagement and would lead a discussion of the final solutions. Often, we were given the opportunity to present our own solutions to our peers. On Thursday, I got to see my peers give 30-45 minute presentations on their own problems of interest. The first week of the program fostered a lot of collaboration among me and my peers.
After the lectures and problem sessions we were given our own time, which we filled by working on our research papers and exploring the beautiful city of Prague. Our Czech peers were extremely welcoming, and on Saturday we went on a hike to see the salt caves.
---
A full program for Week 1 in Prague can be found here.

Week 10 (7/31/23) PRAGUE:
The second week in Prague consisted of the Midsummer Combinatorial Workshop XXVIII, where participants gave lectures on problems from all areas of graph theory, combinatorics, and discrete geometry. Participation was invite-only and included people from all over, including local professors from Charles University and professors from the University of St. Andrews. Between lectures, we had coffee breaks, which gave me the opportunity to talk with and ask questions of lecturers and professors who share my interests.
I also completed my research paper, "Visual Utility of Differentially Private Scatterplots Under US Census Data," which presents my work from the summer. The process of writing this paper gave me practice formulating and organizing my ideas, as well as practice with LaTeX formatting. My abstract can be read here.

Acknowledgements:

I would like to thank the NSF for supporting my research this summer under grant CNS-2150186, as well as the DIMACS REU program for this research opportunity.
