DIMACS
DIMACS REU 2015

General Information

Student: Zexi Song
Office: CoRE 537
School: Bard College
E-mail: zs0811 (at) bard (dot) edu
Project: Generating Private Synthetic Data
Mentor: Dr. Anand Sarwate

Project Description

We are gathering more and more digital information about people. As this data becomes more and more detailed, it’s easier and easier to “de-anonymize” or re-identify individuals from their data alone. However, for most applications all we want is population-level statistics.

To ensure individual privacy while at the same time utilize the richness of the data, our approach is to publish “synthetic” data: we learn patterns and correlations in the population and then generate fake individuals/records so that the population statistics are similar. The trick is to learn the patterns and correlations in a privacy-preserving way. The privacy model we will use is differential privacy, a new privacy framework that is rapidly gaining popularity.

The hope is to understand how much privacy is possible through this approach as well as to produce a toolkit that can be used by others to explore synthetic data generation. The work is a blend of probability, randomized algorithms, with some programming/simulation thrown in for good measure.


Weekly Log

Week 1:
  • Met with Dr. Sarwate and the group.
  • Explored literature on differential privacy, probabilistic graphical models and the CLThres algorithm.
  • Gave the first presentation.
Week 2:
  • Continued literature review from Week 1.
  • Worked on the implementation of Chow-Liu algorithm in Python.
  • Studied the connection between sensitivity and differential privacy.
Week 3:
  • Implemented the simulation of Chow-Liu algorithm based on doubly symmetric binary source.
  • Extended the scope of the code to process generic discrete random variables.
  • Worked on the sensitivity of mutual information.
Week 4:
  • Continued working on the sensitivity of mutual information.
  • Selected and pre-processed data from UCI Machine Learning Repository.
Week 5:
  • Proved the sensitivity of the entropy function.
  • Used the sensitivity of the entropy function to give a loose upper bound for the sensitivity of the plug-in estimate of the mutual information function.
  • Worked towards a better upper bound.
Week 6:
  • Proved a tight upper bound for the sensitivity of the plug-in estimate of the mutual information function.
  • Implemented the simulation of differentially private Chow-Liu algorithm in Python, based on the planted tree model.
  • Prepared for the second presentation
Week 7:
  • Evaluated our algorithm on the Adult dataset and recovered a robust tree structure.
  • Gave the second presentation.
Week 8:
  • Worked on the extension of differentially private graphical models to continuous data.
  • Gave the sensitivities of sample mean, sample variance and sample covariance of continuous data.
Week 9:
  • Worked on the sensitivity of continuous mutual information function based on the results from Week 8.
  • Wrote up the final report.

Presentations