DIMACS
DIMACS REU 2024

General Information

Student: Nigel Seymour
Office: 440 CoRE Building
School: University of Maryland, Baltimore County
E-mail: nigels1@umbc.edu
Mentor: Anand Sarwate

Project Description

Differential Privacy and Visualization:
Differential privacy is a mathematical framework for quantifying the privacy risks when computing using sensitive data about individuals. This is a statistical/probabilistic notion of privacy which can be used to generate privacy-protecting summaries of data. The US Census Bureau used differential privacy to publish results from the 2020 decennial census, and many government agencies are trying to learn how to use differential privacy for their work. In many cases, private data is summarized through visualizations and descriptive statistics. Much of the work on differential privacy has focused on machine learning and data publishing. The student on this project will survey existing work in differential privacy and private visualization, choose a type of visualization, and evaluate the methods on data sets. The goal is to implement different data visualization techniques and see how privacy affects our qualitative (visual) understanding of the structure of data.


Weekly Log

Week 1:
On my first day I moved into the apartment, met other DIMACS peers, and played some games with them. The following day I met with my PI and we discussed the project and our goals for the summer. The first goal is to learn different ways to measure correlation and then implement and test them on data. Before I could start on that, I had to read up on differential privacy, and I also read a few chapters of a probability textbook. Over the weekend I worked on my presentation for the following week. I also explored the campus by scooter :)
Week 2:

In my second week I finally met with my mentor in person rather than on Zoom. This week I learned about different types of correlation, including the “Big Three”: Pearson’s correlation coefficient, Spearman’s, and Kendall’s. Pearson’s correlation is the most common way of measuring a linear relationship between two variables. The coefficient is a number between -1 and 1 that measures the direction and strength of that relationship. If the coefficient is between 0 and 1, the correlation is positive, meaning that as one variable increases the other tends to increase as well. If the coefficient is between 0 and -1, the variables are anti-correlated: as one increases, the other tends to decrease. If the Pearson correlation coefficient is 0, there is no linear relationship between the variables.

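Pearson's coefficient: $r = \dfrac{\sum_i (x_i - m_x)(y_i - m_y)}{\sqrt{\sum_i (x_i - m_x)^2 \, \sum_i (y_i - m_y)^2}}$, where $m_x$ and $m_y$ are the means of the $x_i$ and the $y_i$.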
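To make the formula concrete, here is a minimal sketch that computes Pearson's coefficient using only the Python standard library. The function name `pearson_r` and the toy numbers are made up for illustration and are not taken from the project data.

```python
from math import sqrt
from statistics import fmean


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = fmean(xs), fmean(ys)
    # Numerator: sum of products of deviations from the means.
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations.
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    if den == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return num / den


# Made-up toy data (hypothetical, not from the project).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(f"r = {r:.4f}")  # 6 / sqrt(10 * 6) = sqrt(0.6), about 0.7746
```

For this toy data the deviations give a numerator of 6 and a denominator of sqrt(60), so the coefficient is positive and fairly strong. Negating every y value flips the sign of r without changing its magnitude.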

Now for the fun stuff. On Saturday my mentor invited me to go hiking with him and some of the people in his lab. I loved the sights and we got great pictures out of it. The next day a few other people from DIMACS and I took the train to Brooklyn, where we crossed the famous bridge and went to the flea market. I definitely want to visit New York again.

Week 3:
This week, I attended several lectures from the "Current Trends in Mathematics" series held by DIMACS. The talks were highly engaging and covered topics ranging from the relative complexity of mathematical problems to Lagrangians and Markov numbers. In terms of my research, my mentor and I decided it was time to implement the correlations I have been studying. He mentioned having a Home Mortgage Disclosure Act (HMDA) dataset. With that in mind, I decided to use this dataset to investigate whether there is a correlation between mortgage approvals and race. To do this, I had to download specific applications and learn how to manage this extensive dataset.
Week 4:
This week in Jupyter Lab, I made significant progress analyzing Census data. I focused on extracting key demographics: age, marital status, education, sex, wages, and race. To visualize these factors, I created tables and graphs that show how each one is distributed. Next, I looked more closely at the relationship between wages and age. Using my own implementation of the Pearson correlation coefficient, I found a very weak negative correlation (magnitude about 0.006), which means there is essentially no linear relationship between the two. I double-checked this result against the Pearson correlation in the pandas library. Finally, I used Matplotlib to create a scatter plot comparing the wage distribution between males and females. Overall, this week's work gave me useful insight into the Census data and strengthened my data analysis skills.
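The cross-check against pandas can be sketched roughly as follows. This is only an illustration: the data is synthetic, the column names `age` and `wage` are assumptions, and the hand-rolled version here is written as covariance divided by the product of standard deviations, which may differ from how my own function was written.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Census extract. The column names "age" and
# "wage" and all values are assumptions made for this illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=500),
    "wage": rng.normal(50_000, 15_000, size=500),
})

# Hand-rolled Pearson r: sample covariance divided by the product of the
# sample standard deviations (pandas uses n - 1 normalization for both).
manual_r = df["age"].cov(df["wage"]) / (df["age"].std() * df["wage"].std())

# Built-in pandas version for comparison.
pandas_r = df["age"].corr(df["wage"], method="pearson")

print(f"manual: {manual_r:.6f}  pandas: {pandas_r:.6f}")
```

The two values should agree up to floating-point error. Because the synthetic columns are generated independently, the coefficient should be close to 0, but that says nothing about the real Census result.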
Week 5:
This week, I tackled a new challenge in analyzing the sensitivity of Pearson correlation. I explored how removing a single data point (agent) affects the correlation coefficient. This involved manual calculations and valuable guidance from my mentor's graduate student.

Acknowledgments

I would like to express my deepest gratitude to Dr. Sarwate for his invaluable guidance and support throughout my research. My thanks also go to the DIMACS REU program for providing a platform to conduct and develop my research skills. Additionally, I am grateful for the financial support provided by the National Science Foundation under grant number CNS-2150186, which made this research possible. Thank you all for your contributions to my academic and professional growth.