DIMACS REU 2023

General Information

Student: Iris Chang
Office: 417 CoRE Building
School: Columbia University
E-mail: ijc@columbia.edu
Mentor: Pierre Bellec

Project Description

Optimization, learning and high-dimensional macroscopic limits:
The last decade has seen the emergence of new phenomena where complex statistical learning problems such as high-dimensional regression and classification can be accurately summarized by simple systems of equations. These simple systems of equations characterize the high-dimensional limit of the statistical learning problem at hand and provide new insights into regularization and the choice of statistical estimators in high dimensions. The project will explore problems in this line of thought, requiring and developing skills in probability, statistical/machine learning, numerical programming, and computational linear algebra (from the DIMACS website).


Weekly Log

Week 1:
I worked on understanding logistic regression and then learned how to use Python to construct a logistic regression model from a given dataset. The first section of my project aims to understand and replicate results from Section 4 of this paper (Salehi et al. 2019), so I began reading the relevant sections and experimenting with how to generate a dataset as described in the paper. I also made this website!
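
For reference, here is a minimal sketch of the kind of synthetic experiment this involves, assuming Gaussian features and a logistic link as in the paper's setup; the dimensions, signal strength, and regularization level below are illustrative placeholders, not values from the paper:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Illustrative sizes: n samples, p features (placeholders)
    n, p = 1000, 200
    rng = np.random.default_rng(0)

    # True coefficient vector and Gaussian design
    beta = rng.normal(size=p) / np.sqrt(p)
    X = rng.normal(size=(n, p))

    # Labels from the logistic model: P(y = 1 | x) = 1 / (1 + exp(-x . beta))
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    y = rng.binomial(1, prob)

    # Fit (nearly) unregularized logistic regression; large C means a weak penalty
    model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
    beta_hat = model.coef_.ravel()
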
Week 2:
At the beginning of the week, I presented to the REU group about my project and goals for the summer (linked below!). Then, I continued working on replicating the figure and was able to successfully produce the empirical results of Figure 1 from the paper mentioned above. Afterwards, I worked on the theoretical result, which I will hopefully finish at the beginning of next week. So far, I think my understanding of logistic regression and my comfort with Python have improved the most.
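
The empirical side of such a figure boils down to repeating an experiment like the Week 1 sketch many times and averaging; here is a hedged sketch of that Monte Carlo loop, where one_trial is a hypothetical stand-in for one data-generation-plus-fit run returning whatever statistic the figure plots:

    import numpy as np

    def one_trial(rng):
        # Placeholder: generate data, fit the model (as in the Week 1 sketch),
        # and return the statistic being plotted.
        return rng.normal()

    rng = np.random.default_rng(1)
    stats = [one_trial(rng) for _ in range(200)]
    print(np.mean(stats), np.std(stats))
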
Week 3:
I was able to successfully reproduce the results from Figure 1c of the 2019 paper I have been working with. I continued to optimize my code to make it run more efficiently. Alongside this, I delved into the proof of Theorem 1 in the same work to understand how the authors arrived at the system of six equations (Eqns 6 on page 4). My hope is that I can apply a similar process to the next part of my project, which focuses on bagging.
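
Evaluating the theoretical curve means solving that system of six coupled nonlinear equations numerically; a minimal sketch of one way to do this with scipy, where the residuals below are placeholders standing in for Eqns 6, not the actual system:

    import numpy as np
    from scipy.optimize import fsolve

    def residuals(v):
        # Placeholder residuals: the real system is Eqns 6 of Salehi et al. 2019,
        # six coupled equations in six scalar unknowns.
        a, s, g, t, l, r = v
        return [a - 1.0, s - 0.5, g - a * s, t - 1.0, l - t, r - g]

    solution = fsolve(residuals, np.ones(6))
    print(solution)
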
Week 4:
I looked at a more generalized notation for the same system of six equations from Theorem 1 that I have been studying. I spent the first couple of days of the week continuing to work through the proof of Theorem 1 and connecting it to the "teacher-student scenario" described here. Then, I looked at how the replica method, which is typically used in statistical physics, can be applied to this problem, as described in Appendix A of this 2022 paper by Loureiro et al. This approach takes the teacher-student model and applies the replica method to reach a result similar to Theorem 2 of Salehi et al.'s 2019 paper.
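
For context, the replica method rests on the standard identity

\[
  \mathbb{E}\log Z \;=\; \lim_{n \to 0} \frac{\mathbb{E}[Z^n] - 1}{n},
\]

where Z is the partition function: one computes \(\mathbb{E}[Z^n]\) for integer n (introducing n coupled "replicas" of the system) and then analytically continues n to 0. This is the textbook form of the trick, not anything specific to either paper.
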
Week 5:
We have pivoted the focus of the project to using the replica trick to characterize the performance of a bagging model. Therefore, I aimed to better understand the replica trick and also connected the two notations from last week more definitively. I also began to think about actually applying this method to the larger problem.
Week 6:
I continued my work using the replica trick to find equations that characterize the performance of the L2 regularized regression model with bagging. I essentially made a first attempt and will continue to correct my model next week.
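
As a concrete reference for what bagging an L2-regularized model means on the empirical side, here is a minimal sketch under my own illustrative assumptions (half-size subsamples without replacement, averaged coefficient vectors, C = 1.0); it is not the exact scheme from my derivation:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def bagged_fit(X, y, n_bags=2, rng=None):
        # Fit an L2-penalized logistic regression on random subsamples and
        # average the coefficient vectors (one simple form of bagging).
        if rng is None:
            rng = np.random.default_rng(0)
        n = X.shape[0]
        coefs = []
        for _ in range(n_bags):
            idx = rng.choice(n, size=n // 2, replace=False)
            m = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
            m.fit(X[idx], y[idx])
            coefs.append(m.coef_.ravel())
        return np.mean(coefs, axis=0)
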
Week 7:
During the first half of the week, I continued what I was doing the previous week but with a few simplifications (setting the logistic regression model parameter to the zero vector [0, ..., 0]). After meeting with my mentor, we decided to pivot the goal of my project back to the 2019 Salehi et al. paper. I am now looking at the unregularized bagging setting, where the response variable y gives no information about the predictor variable x.
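
In this null setting the labels carry no signal, so the data generation from the Week 1 sketch simplifies to labels drawn independently of the features; a short sketch under that assumption:

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 1000, 200  # illustrative sizes
    X = rng.normal(size=(n, p))
    y = rng.binomial(1, 0.5, size=n)  # y independent of X: no information
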
Week 8:
I continued work on the new problem mentioned in the previous week's entry. For this, I was able to successfully derive a single term approximating the correlation of the estimators of the two subsets. I also worked on my final presentation throughout the week, which I delivered on Thursday, and began working on my final write-up.
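
For the empirical check, I take "correlation" here to mean the normalized inner product of the coefficient vectors fitted on the two subsets (the precise definition is in my write-up); a minimal sketch:

    import numpy as np

    def estimator_correlation(b1, b2):
        # Normalized inner product <b1, b2> / (||b1|| ||b2||) between the
        # estimators fitted on the two subsets.
        return float(b1 @ b2 / (np.linalg.norm(b1) * np.linalg.norm(b2)))
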
Week 9:
Throughout this week, I worked on my final write-up of my work as well as my reflections on my time in the program. It was truly a fulfilling and personally fruitful nine weeks, and I am thankful to have had the opportunity to participate!

Presentations and Paper


Acknowledgements

Thank you to Pierre C. Bellec for his mentorship and generosity. This work was carried out as part of the 2023 DIMACS REU program at Rutgers University, supported by NSF grant CNS-2150186.