Leah's Webpage for DIMACS REU 2022

Weekly Updates

Citations for articles mentioned are at the bottom of this page.

Weeks 1 & 2:

During these two weeks, we focused on reading literature reviews on data privacy, specifically differential privacy, to gain a better understanding of the subject. The first article was "Differential Privacy: A Primer for a Non-Technical Audience," ₁ which explains the general function of differential privacy, along with its advantages and disadvantages. The second article was "The use of differential privacy for census data and its impact on redistricting: The case of the 2020 U.S. Census," ₂ which weighs the advantages and disadvantages of the U.S. Census Bureau adopting a differential privacy approach. This paper focuses on the development of the disclosure avoidance system (DAS), which the Census Bureau hopes will better protect the privacy of individuals in the Census. The third article we explored was "Statistical Data Privacy: A Song of Privacy and Utility," ₃ which examines the differences between differential privacy and statistical disclosure control.

Week 3:

This week we read a rebuttal to the Kenny et al. article, titled "Statistical Inference is Not a Privacy Violation." ₄ This post offers a contrasting view of the differential privacy approach suggested by the U.S. Census Bureau. We then discussed potential research questions we would be interested in exploring.

Week 4:

To better understand the algorithm used by the DAS, we read "The 2020 Census Disclosure Avoidance System TopDown Algorithm." ₅ This paper discusses the processes carried out by the DAS. We also read "Transparent Privacy is Principled Privacy" ₆ as a companion piece to better understand the importance of transparency and its role in differential privacy.

Week 5:

This week, we attempted to replicate the figures presented in the Kenny et al. article by using the replication material provided on Harvard Dataverse, titled "Replication Data for: The use of differential privacy for census data and its impact on redistricting: The case of the 2020 U.S. Census" ₇. We successfully reproduced Figures 1, 2, and 5, along with a few others that were produced by the authors but not included in the paper. We then decided on our overall research goal for the rest of the summer. The U.S. Census Bureau did not release the 2010 Census data as they stood after noise injection but before post-processing. Our goal is therefore to simulate noisy data from the 2010 Census and compare it to the post-processed Census, to determine whether the biases listed in the Kenny et al. article are truly due to post-processing or are simply a consequence of injecting noise.

Week 6:

This week, we created a GitHub repository to organize and keep track of all of our code and progress. We then replicated Figures 1 and 2 in the Kenny et al. article using the datasets provided on the IPUMS website, ₈ which includes all iterations of the Census DAS demonstration data files. While each data point in Figures 1 and 2 represents a precinct, the IPUMS data files organize data only by state, county, tract group, tract, and block. We found that the block-level datasets contained many zero values, so we focused on tracts. We then created figures that plotted the same quantities as Figures 1 and 2 in the Kenny et al. article to see whether we could replicate their figures using the IPUMS datasets.

Week 7:

We had planned to spend this week producing noisy measurements, reproducing our figures, and comparing them to the figures made with the DAS demonstration data. However, as we further analyzed the figures presented in the Kenny et al. article, we noticed that some of them, specifically Figures 1 and 2, were misleading. These figures were meant to illustrate the difference in Census values between the post-processed Census and the swapped 2010 Census, which was taken as the ground truth. However, rather than plotting the true difference in values, the authors plotted the fitted difference (the fitted error). The fitted error was produced by fitting a generalized additive model (GAM) and plotting the outputs of that model against either non-white percentage (Figure 1) or the Herfindahl-Hirschman Index (HHI), a measure of diversity (Figure 2). We suspected that plotting the fitted model would be misleading, especially because the GAM used non-white percentage and HHI as predictor variables. After noticing this, my research group and I decided to put producing noisy measurements on hold, and we set out to determine the extent to which this plotting choice affected the overall results of the study. We first reproduced Figures 1 and 2 by plotting the true error in place of the fitted error. We found that these new graphs were similar to the original figures, but the trends were less pronounced. They also more closely resembled our figures created using the IPUMS demonstration data, suggesting that the trends shown in the figures are consistent across other geographic levels, specifically tracts.
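To make the two quantities above concrete, here is a minimal Python sketch of how the HHI and the true error could be computed for a single geographic unit. It assumes the standard definition of the HHI as the sum of squared group shares, so lower values indicate a more even mix of groups. The function names and the example counts are hypothetical and are not taken from the Kenny et al. replication code.

```python
import numpy as np

def hhi(group_counts):
    """Herfindahl-Hirschman Index: sum of squared group shares.

    Returns 1.0 when a single group holds the whole population and
    1/k when k groups are equally sized. Returns NaN for an empty unit.
    """
    counts = np.asarray(group_counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return float("nan")
    shares = counts / total
    return float(np.sum(shares ** 2))

def true_error(post_processed, reported):
    """Raw difference between post-processed and reported counts (no model fit)."""
    return np.asarray(post_processed, dtype=float) - np.asarray(reported, dtype=float)

# Hypothetical counts, for illustration only.
print(hhi([50, 50]))                        # two equal groups
print(true_error([105, 98], [100, 100]))    # per-unit differences
```

Plotting `true_error` directly against non-white percentage or HHI shows the raw differences, whereas a fitted error shows only what a model predicts from those same variables.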

Week 8:

Now that we had reproduced the Kenny et al. figures to allow for better comparison, we resumed our goal of creating noisy measurements. Since the DAS injects Gaussian noise into the data, as explained in the Abowd et al. article, we worked on writing code that injects Gaussian noise and reproduces our figures using the noisy measurements in place of the post-processed ones. We also started brainstorming analyses we could do to further explore the effect of plotting the fitted error in place of the true error in the original Kenny et al. figures. We started by perturbing the error so that it became disconnected from the other data. The purpose of this was to eliminate any trends that were potentially present. Then, we plotted the fitted error using this perturbed error to see whether trends appear where they should not.
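The two steps described above can be sketched in Python as follows. This is only an illustration under several assumptions, not our actual code. Continuous Gaussian noise from NumPy stands in for the DAS noise mechanism. The noise scale `sigma` and the simulated tract data are hypothetical. Shuffling the errors across units is one possible way to "perturb" them so they are disconnected from the other data. A cubic polynomial fit is swapped in as a simple smoother in place of a GAM.

```python
import numpy as np

def inject_gaussian_noise(counts, sigma, rng):
    """Return counts plus independent zero-mean Gaussian noise (no post-processing)."""
    counts = np.asarray(counts, dtype=float)
    return counts + rng.normal(loc=0.0, scale=sigma, size=counts.shape)

def shuffled_error(error, rng):
    """Permute errors across units, breaking any link to the predictors."""
    return rng.permutation(np.asarray(error, dtype=float))

def fitted_error(predictor, error, degree=3):
    """Stand-in smoother (not a GAM): polynomial least-squares fit at each unit."""
    coeffs = np.polyfit(predictor, error, deg=degree)
    return np.polyval(coeffs, predictor)

rng = np.random.default_rng(0)

# Hypothetical tract populations and non-white shares, for illustration only.
true_counts = rng.integers(500, 8000, size=2000)
nonwhite_share = rng.uniform(0.0, 1.0, size=2000)

noisy_counts = inject_gaussian_noise(true_counts, sigma=10.0, rng=rng)
error = noisy_counts - true_counts

# Disconnect the error from the predictor, then fit it anyway.
perturbed = shuffled_error(error, rng)
fit = fitted_error(nonwhite_share, perturbed)
```

If the curve in `fit` still shows a visible trend against `nonwhite_share`, that trend is produced by the fitting step and is not present in the shuffled errors themselves.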

Week 9:

This week, being our last week, was dedicated to finishing up my presentation on our research project, as well as writing up a paper to summarize our findings. We plan to continue working on this research project through the Fall 2022 semester.

Citations

1: Wood, Alexandra, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, et al. (2018) Differential Privacy: A Primer for a Non-Technical Audience. Vanderbilt Journal of Entertainment & Technology Law 21 (1): 209.

2: Kenny, C. T., Kuriwaki, S., McCartan, C., Rosenman, E. T., Simko, T., & Imai, K. (2021). The Use of Differential Privacy for Census Data and its Impact on Redistricting: The Case of the 2020 U.S. Census. Science Advances, 7(41). https://doi.org/10.1126/sciadv.abk3283

3: Slavkovic, A., & Seeman, J. (2022). Statistical Data Privacy: A song of privacy and utility. arXiv.org. Retrieved June 10, 2022, from https://doi.org/10.48550/arxiv.2205.03336

4: Bun, M., Desfontaines, D., Dwork, C., Naor, M., Nissim, K., Roth, A., Smith, A., Steinke, T., Ullman, J., & Vadhan, S. (2021, June). Statistical Inference is Not a Privacy Violation. DifferentialPrivacy.org. Retrieved June 17, 2022, from https://differentialprivacy.org/inference-is-not-a-privacy-violation/

5: Abowd, J. M., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., & Zhuravlev, P. (2022, April 19). The 2020 Census Disclosure Avoidance System TopDown Algorithm. arXiv.org. Retrieved June 27, 2022, from https://arxiv.org/abs/2204.08986

6: Gong, R. (2022). Transparent Privacy is Principled Privacy. Harvard Data Science Review, Special Issue 2. https://doi.org/10.1162/99608f92.b5d3faaa

7: Kenny, Christopher T., Kuriwaki, Shiro, McCartan, Cory, Rosenman, Evan, Simko, Tyler & Imai, Kosuke. 2021. "Replication Data for: The use of differential privacy for census data and its impact on redistricting: The case of the 2020 U.S. Census." Harvard Dataverse V4. https://doi.org/10.7910/DVN/TNNSXG

8: David Van Riper, Tracy Kugler, and Jonathan Schroeder. IPUMS NHGIS Privacy-Protected 2010 Census Demonstration Data, version 20210428_12-2 [Database]. Minneapolis, MN: IPUMS. 2021. https://www.nhgis.org/privacy-protected-demonstration-data