Student: | Thomas Chen |
---|---|
Office: | CoRE 450 |
Mentor: | Ruobin Gong |
School: | University of California, Berkeley |
E-mail: | tochen920@berkeley.edu |
Project: | Data Privacy and Applied Social Science Research |
As large public datasets continue to be used for studies and models, concerns are growing over the risks of sharing such data. Many studies have shown that malicious actors can cross-reference public datasets to uncover sensitive information about the individuals in them. One proposed remedy is to release differentially private synthetic datasets in place of the original data. However, no studies so far have evaluated these synthetic data generators on real published studies and measured how well they replicate the original results. In this project, we look at current differentially private synthetic data generators and evaluate them on recent social science studies that use large public datasets. The challenge is to find the right balance between privacy and the usability of the data for research.
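To make the privacy versus usability trade-off concrete, here is a minimal sketch of the Laplace mechanism applied to a one-way histogram, followed by sampling synthetic records from the noisy counts. It is not the project's code and is much simpler than PrivBayes or DPSyn. The category labels, the toy data, and all function names are hypothetical, and the sensitivity of 1 assumes that neighboring datasets differ by adding or removing one record.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(values, categories, epsilon, rng):
    """Release category counts with Laplace noise calibrated to epsilon."""
    counts = Counter(values)
    # Assumption: adding or removing one record changes a single count
    # by 1, so the L1 sensitivity of the histogram is 1.
    scale = 1.0 / epsilon
    return {c: counts.get(c, 0) + laplace_noise(scale, rng) for c in categories}


def sample_synthetic(noisy, n, rng):
    """Draw n synthetic records from the noisy counts (post-processing only)."""
    cats = list(noisy)
    weights = [max(noisy[c], 0.0) for c in cats]  # clip negative counts
    if sum(weights) == 0:
        weights = [1.0] * len(cats)  # fall back to uniform sampling
    return rng.choices(cats, weights=weights, k=n)


def mean_abs_error(values, categories, epsilon, trials, rng):
    """Average absolute error of the noisy counts over repeated releases."""
    true = Counter(values)
    total = 0.0
    for _ in range(trials):
        noisy = noisy_histogram(values, categories, epsilon, rng)
        total += sum(abs(noisy[c] - true.get(c, 0)) for c in categories)
    return total / (trials * len(categories))


if __name__ == "__main__":
    rng = random.Random(0)
    categories = ["low", "mid", "high"]  # hypothetical income brackets
    data = ["low"] * 50 + ["mid"] * 30 + ["high"] * 20  # made-up toy data
    for eps in (0.1, 1.0, 10.0):
        err = mean_abs_error(data, categories, eps, 200, rng)
        print(f"epsilon={eps}: mean absolute count error {err:.2f}")
    print(sample_synthetic(noisy_histogram(data, categories, 1.0, rng), 5, rng))
```

A smaller epsilon gives a stronger privacy guarantee but noisier counts, so analyses run on the synthetic records drift further from those run on the original data.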
This week, I read two papers on current differential privacy methods for synthetic data. We have decided to focus on two methods, PrivBayes and DPSyn, that seem promising for the PSID dataset, which is the dataset we are working with. Additionally, I familiarized myself with the PSID dataset and looked for papers that would be a useful starting point for implementing one of the differential privacy methods above.
I was able to implement PrivBayes, a method described in the differential privacy papers I read last week. The basic implementation has a lot of moving parts that we are investigating further. I decided to start with some data analysis on the specific PSID dataset I chose to see what we can do to improve the methods.
We explored more data synthesis techniques this week and delved deeper into the PrivBayes paper. We were able to identify many of the hyperparameters that we needed to tune for the data synthesis and performed experiments to see which combination performed best in the analysis.
We were able to apply the PrivBayes analysis to one of the PSID studies, which focuses on intergenerational transfers of wealth and time. Unfortunately, the synthetic data still did not perform well compared to the original data on the analyses in the study. We are looking into ways to improve the similarity, including data imputation, in order to get better results.
We were able to figure out what was causing some of the issues with the synthetic data this week: there were problems with how the code was generating the Bayesian network. As such, we decided to implement new code to fix this. Additionally, we've decided to start replicating the tables in the Sandwich study that we have chosen.
This week was spent replicating many of the results from the Sandwich study. I finished implementing the code from last week, which helped with the synthetic data analysis. Additionally, we've decided to look at another study, based on CPS data, that focuses more on longitudinal data, which adds another dimension to the synthetic data analysis.
I was finally able to get the full pipeline for the tables in the Sandwich study to work! We started to observe how some of the variables involved in the synthetic data analysis affect the results from the study. I've also made further progress on the CPS study and on replicating its results.
I spent most of this week working on the final presentation and presented my results on Thursday. We also started to run repeated experiments on some of the variables that affect the synthetic data. I drew up many different visualizations based on this week's results.
It's the final week! Most of my time was spent working on the final paper and polishing up some of the results from the study.
I would like to thank Prof. Lazaros Gallos and the DIMACS program for giving me the opportunity to research at Rutgers University. I would also like to thank Professor Ruobin Gong for her guidance in this project. This work was carried out as a part of the 2024 DIMACS REU program at Rutgers University, supported by NSF grant CNS-2150186.