General Information

Student: Thomas Chen
Mentor: Ruobin Gong
School: University of California, Berkeley
E-mail: tochen920@berkeley.edu
Project: Data Privacy and Applied Social Science Research

Project Description

We live in a digital world in which everyone produces an explosive amount of personal data. When these data are turned into open-source databases accessible for research, they can be an invaluable resource for advancing science, fostering public knowledge, and improving research reproducibility. The richer, more accurate, and more detailed the data are, the more informative they can be for scientific purposes. At the same time, the custodians of these data bear greater responsibility for ensuring the confidentiality of the individual data contributors. Recent developments in the literature on formal privacy, notably differential privacy, present a promising, mathematically rigorous framework for conceptualizing data privacy protection. The challenge is how to strike the right balance between effective privacy protection and the usability of the data for research purposes.

This project evaluates the potential of modern privacy methods for the various data-driven disciplines, with a focus on the quantitative social sciences, including (but not limited to) the behavioral sciences, political science, sociology, and economics. The project asks the following questions: What is the current state and mindset of data privacy protection in the subject-matter discipline? For benchmark data products (including important surveys and official databases), what are the current protocols for confidentiality protection, and what are the viable methods and standards moving forward in light of new developments in formal privacy? In what ways can new data privacy standards help promote data sharing, transparency, and open science? How can existing data analysis methods for privacy-protected data apply to concrete use cases?


Weekly Log

Week 1:

This week, I read two papers on current differential privacy methods for generating synthetic data. We decided to focus on two methods, PrivBayes and DPSyn, that seem promising for the PSID dataset, which is the dataset we are working with. Additionally, I familiarized myself with the PSID dataset and started looking for papers that would be a useful starting point for implementing one of these two methods.


Week 2:

I implemented PrivBayes, one of the methods described in the differential privacy papers I read last week. The basic implementation has many moving parts that we are still investigating. I also started some data analysis on the specific PSID dataset I chose, to see what we can do to improve the methods.


Week 3:

We explored more data synthesis techniques this week and delved deeper into the PrivBayes paper. We identified many of the hyperparameters that need to be tuned for the data synthesis and performed experiments to see which combination performed best in the analysis.
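As a rough illustration of this kind of experiment, the sketch below sweeps one hyperparameter, the privacy budget epsilon, and measures how far a noised one-way marginal drifts from the true one. It is not the PrivBayes implementation used in the project: it only adds Laplace noise to a single histogram as a simplified stand-in, and the function names, the toy column, and the choice of total variation distance as the utility metric are all assumptions made for illustration.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginal(values, categories, epsilon, rng):
    # Histogram counts have sensitivity 1 when one record is added or
    # removed, so Laplace noise with scale 1/epsilon is used here.
    counts = Counter(values)
    noisy = [
        max(counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng), 0.0)
        for c in categories
    ]
    total = sum(noisy)
    if total == 0:
        # Every noisy count was clipped to zero: fall back to uniform.
        return [1.0 / len(categories)] * len(categories)
    return [x / total for x in noisy]


def total_variation(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def sweep_epsilon(values, categories, epsilons, n_trials=50, seed=0):
    # Average distance between the true and noisy marginal per epsilon.
    rng = random.Random(seed)
    counts = Counter(values)
    true = [counts.get(c, 0) / len(values) for c in categories]
    results = {}
    for eps in epsilons:
        dists = [
            total_variation(true, noisy_marginal(values, categories, eps, rng))
            for _ in range(n_trials)
        ]
        results[eps] = sum(dists) / n_trials
    return results


if __name__ == "__main__":
    # Hypothetical toy column, not PSID data.
    toy = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
    for eps, err in sweep_epsilon(toy, ["a", "b", "c"], [0.01, 0.1, 1.0, 10.0]).items():
        print(f"epsilon={eps}: mean total variation distance = {err:.4f}")
```

In a full experiment the same loop shape would presumably extend to a grid over several hyperparameters at once, with the utility metric replaced by whatever the downstream analysis cares about.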


Week 4:

We applied PrivBayes to the data from one of the PSID studies, which focuses on intergenerational transfers of wealth and time. Unfortunately, the synthetic data still did not perform well compared to the original data when we reran the analyses from the study. We are looking into ways to improve the similarity, including data imputation, in order to get better results.
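One simple way to quantify how well synthetic data reproduces an analysis is to fit the same model on both datasets and compare the estimates. The sketch below does this for a single-predictor least-squares slope. It is only an assumed illustration: the study's actual models are not described here, and the helper names and numbers are hypothetical toy values rather than PSID results.

```python
def ols_slope(x, y):
    # Least-squares slope for a simple linear regression of y on x.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def relative_gap(original_estimate, synthetic_estimate):
    # Relative difference between the two estimates; 0 means identical.
    return abs(synthetic_estimate - original_estimate) / abs(original_estimate)


if __name__ == "__main__":
    # Hypothetical toy values standing in for an original and a synthetic sample.
    x_orig, y_orig = [0, 1, 2, 3], [1, 3, 5, 7]
    x_syn, y_syn = [0, 1, 2, 3], [1, 2, 4, 6]
    slope_orig = ols_slope(x_orig, y_orig)
    slope_syn = ols_slope(x_syn, y_syn)
    print("relative gap in slope:", relative_gap(slope_orig, slope_syn))
```

A gap like this can be tracked across synthesis settings to check whether a change such as imputing missing values before synthesis brings the synthetic estimates closer to the original ones.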