zs0811 (at) bard (dot) edu
Generating Private Synthetic Data
Dr. Anand Sarwate
We are gathering ever more digital information about people. As this data becomes more detailed, it becomes easier to “de-anonymize,” or re-identify, individuals from their data alone. However, for most applications all we want is population-level statistics.
To ensure individual privacy while still utilizing the richness of the data, our approach is to publish “synthetic” data: we learn patterns and correlations in the population and then generate fake individuals/records so that the population statistics are similar. The trick is to learn the patterns and correlations in a privacy-preserving way. The privacy model we will use is differential privacy, a new privacy framework that is rapidly gaining popularity.
The hope is to understand how much privacy is possible through this approach, as well as to produce a toolkit that others can use to explore synthetic data generation. The work is a blend of probability and randomized algorithms, with some programming/simulation thrown in for good measure.
- Week 1:
- Met with Dr. Sarwate and the group.
- Explored literature on differential privacy, probabilistic graphical models and the CLThres algorithm.
- Gave the first presentation.
- Week 2:
- Continued literature review from Week 1.
- Worked on the implementation of the Chow-Liu algorithm in Python.
- Studied the connection between sensitivity and differential privacy.
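The sensitivity–privacy connection studied this week can be sketched as follows: if one record can change a statistic by at most Δ (its sensitivity), then adding Laplace noise of scale Δ/ε makes the released value ε-differentially private. A minimal stdlib-only sketch (the function names here are illustrative, not from the project code):

```python
import random

def laplace_noise(scale, rng=random):
    # A Laplace(0, scale) sample is the difference of two
    # independent Exponential(1/scale) samples.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(value, sensitivity, epsilon):
    """Release `value` with epsilon-differential privacy.

    The noise scale is sensitivity / epsilon: the more one record can
    change `value`, or the smaller the privacy budget epsilon,
    the more noise is required.
    """
    return value + laplace_noise(sensitivity / epsilon)

# Example: a counting query has sensitivity 1, since adding or
# removing one person changes the count by at most 1.
data = [1, 0, 1, 1, 0, 1]
private_count = laplace_mechanism(sum(data), sensitivity=1.0, epsilon=0.5)
```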
- Week 3:
- Implemented a simulation of the Chow-Liu algorithm on a doubly symmetric binary source.
- Extended the scope of the code to process generic discrete random variables.
- Worked on the sensitivity of mutual information.
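For generic discrete variables, the Chow-Liu algorithm estimates the pairwise mutual informations from data and then takes a maximum-weight spanning tree. A self-contained sketch of this standard recipe (not the project's actual implementation):

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(samples, i, j):
    """Plug-in (empirical) mutual information, in nats,
    between columns i and j of a list of tuples."""
    n = len(samples)
    pij = Counter((row[i], row[j]) for row in samples)
    pi = Counter(row[i] for row in samples)
    pj = Counter(row[j] for row in samples)
    # sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ), counts form
    return sum((c / n) * log(c * n / (pi[x] * pj[y]))
               for (x, y), c in pij.items())

def chow_liu_tree(samples, d):
    """Maximum-weight spanning tree over d variables, with empirical
    mutual information as edge weights (Kruskal's algorithm)."""
    edges = sorted(((mutual_information(samples, i, j), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

On toy data where column 1 is a copy of column 0, the learned tree picks up the strong (0, 1) edge, since that pair has the largest empirical mutual information.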
- Week 4:
- Continued working on the sensitivity of mutual information.
- Selected and pre-processed data from UCI Machine Learning Repository.
- Week 5:
- Proved the sensitivity of the entropy function.
- Used the sensitivity of the entropy function to give a loose upper bound for the sensitivity of the plug-in estimate of the mutual information function.
- Worked towards a better upper bound.
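The entropy-sensitivity result can be checked numerically. For binary data the plug-in entropy depends only on the count of ones, so changing one record moves that count by at most one, and the sensitivity can be brute-forced exactly (the general proof is in the report; this sketch covers only the binary case):

```python
from math import log

def binary_plugin_entropy(k, n):
    """Plug-in entropy (in nats) of a binary dataset with k ones out of n."""
    h = 0.0
    for c in (k, n - k):
        if c > 0:
            h -= (c / n) * log(c / n)
    return h

def entropy_sensitivity_binary(n):
    """Exact sensitivity of the plug-in entropy for binary datasets of
    size n: the largest change when one record flips, i.e. when the
    count of ones moves by exactly 1."""
    return max(abs(binary_plugin_entropy(k + 1, n) - binary_plugin_entropy(k, n))
               for k in range(n))
```

The worst case occurs at the boundary (all-zeros vs. one single one), and the resulting sensitivity shrinks on the order of log(n)/n as the sample size n grows, which is what makes privately releasing entropy-based quantities feasible for large datasets.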
- Week 6:
- Proved a tight upper bound for the sensitivity of the plug-in estimate of the mutual information function.
- Implemented a simulation of the differentially private Chow-Liu algorithm in Python, based on the planted tree model.
- Prepared for the second presentation.
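One common pattern for making Chow-Liu differentially private is to add Laplace noise, calibrated to the mutual-information sensitivity, to each edge weight before taking the spanning tree. The sketch below illustrates that pattern only; the even budget split and the noise calibration are illustrative assumptions, not necessarily the project's exact scheme:

```python
import random
from itertools import combinations

def private_max_spanning_tree(weights, sensitivity, epsilon, rng=random):
    """Chow-Liu-style tree selection on Laplace-noised edge weights.

    `weights[i][j]` holds the empirical mutual information between
    variables i and j. The privacy budget epsilon is split evenly
    across the d*(d-1)/2 candidate edges, and each weight receives
    Laplace noise of scale sensitivity / (per-edge epsilon).
    """
    d = len(weights)
    pairs = list(combinations(range(d), 2))
    scale = sensitivity / (epsilon / len(pairs))
    noisy = []
    for i, j in pairs:
        # Laplace sample as a difference of two exponentials
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        noisy.append((weights[i][j] + noise, i, j))
    noisy.sort(reverse=True)
    parent = list(range(d))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in noisy:       # Kruskal on the noisy weights
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

When the privacy budget is generous relative to the sensitivity, the noise is small and the true strongest edge survives into the tree; as epsilon shrinks, the recovered structure degrades, which is the utility/privacy trade-off the planted-tree simulations probe.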
- Week 7:
- Evaluated our algorithm on the Adult dataset and recovered a robust tree structure.
- Gave the second presentation.
- Week 8:
- Worked on the extension of differentially private graphical models to continuous data.
- Derived the sensitivities of the sample mean, sample variance, and sample covariance of continuous data.
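The simplest of these results is the sample mean: for data bounded in [a, b], replacing one of the n records moves the mean by at most (b - a) / n. A small brute-force check of that closed form on a coarse grid (the variance and covariance cases, whose constants are more delicate, are in the report):

```python
import itertools

def mean_sensitivity(a, b, n):
    """Closed form: replacing one record in [a, b] moves the
    sample mean of n points by at most (b - a) / n."""
    return (b - a) / n

def empirical_mean_sensitivity(a, b, n, grid=5):
    """Brute-force check on a coarse grid: the max of
    |mean(D) - mean(D')| over grid-valued datasets D, D' in
    [a, b]^n that differ in exactly one coordinate."""
    vals = [a + (b - a) * k / (grid - 1) for k in range(grid)]
    worst = 0.0
    for data in itertools.product(vals, repeat=n):
        m = sum(data) / n
        for i in range(n):
            for v in vals:
                neighbor = list(data)
                neighbor[i] = v
                worst = max(worst, abs(sum(neighbor) / n - m))
    return worst
```

The brute-force value matches the closed form: the worst case swaps an endpoint a for the other endpoint b in a single coordinate.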
- Week 9:
- Worked on the sensitivity of continuous mutual information function based on the results from Week 8.
- Wrote up the final report.