DIMACS REU 2023

General Information

Student: Iris Chang
Office: 417 CoRE Building
School: Columbia University
E-mail: ijc@columbia.edu
Mentor: Pierre Bellec

Project Description

Optimization, learning and high-dimensional macroscopic limits:
The last decade has seen the emergence of new phenomena where complex statistical learning problems such as high-dimensional regression and classification can be accurately summarized by simple systems of equations. These simple systems of equations characterize the high-dimensional limit of the statistical learning problem at hand and provide new insights into regularization and the choice of statistical estimators in high dimensions. The project will explore problems in this line of thought, requiring and developing skills in probability, statistical/machine learning, numerical programming, and computational linear algebra (from the DIMACS website).


Weekly Log

Week 1:
I worked on understanding logistic regression and then learned how to use Python to construct a logistic regression model from a given dataset. The first section of my project aims to understand and replicate results from Section 4 of this paper (Salehi et al. 2019), so I began reading the relevant sections and experimenting with how to generate a dataset as described in the paper. I also made this website!
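
For reference, here is a minimal sketch of the kind of synthetic experiment this involves, assuming Gaussian features and a logistic link as in the paper's setup; the dimensions, signal strength, and regularization level below are illustrative placeholders, not values from the paper:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Illustrative sizes: n samples, p features (placeholders)
    n, p = 1000, 200
    rng = np.random.default_rng(0)

    # True coefficient vector and Gaussian design
    beta = rng.normal(size=p) / np.sqrt(p)
    X = rng.normal(size=(n, p))

    # Labels from the logistic model: P(y = 1 | x) = 1 / (1 + exp(-x . beta))
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    y = rng.binomial(1, prob)

    # Fit (nearly) unregularized logistic regression; large C means a weak penalty
    model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
    beta_hat = model.coef_.ravel()
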
Week 2:
At the beginning of the week, I presented to the REU group about my project and goals for the summer (linked below!). Then, I continued working on replicating the figure and was able to successfully produce the empirical results of Figure 1 from the paper mentioned above. Afterwards, I worked on the theoretical result, which I will hopefully finish at the beginning of next week. So far, I think my understanding of logistic regression and my comfort with Python have improved the most.
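
The empirical side of such a figure boils down to repeating an experiment like the Week 1 sketch many times and averaging; here is a hedged sketch of that Monte Carlo loop, where one_trial is a hypothetical stand-in for one data-generation-plus-fit run returning whatever statistic the figure plots:

    import numpy as np

    def one_trial(rng):
        # Placeholder: generate data, fit the model (as in the Week 1 sketch),
        # and return the statistic being plotted.
        return rng.normal()

    rng = np.random.default_rng(1)
    stats = [one_trial(rng) for _ in range(200)]
    print(np.mean(stats), np.std(stats))
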
Week 3:
I was able to successfully reproduce the results from Figure 1c of the 2019 paper I have been working with. I continued to optimize my code to make it run more efficiently. Alongside this, I delved into the proof of Theorem 1 in the same work to understand how the authors arrived at the system of six equations (Eqns 6 on page 4). My hope is that I can apply a similar process to the next part of my project, which focuses on bagging.
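
Evaluating the theoretical curve means solving that system of six coupled nonlinear equations numerically; a minimal sketch of one way to do this with scipy, where the residuals below are placeholders standing in for Eqns 6, not the actual system:

    import numpy as np
    from scipy.optimize import fsolve

    def residuals(v):
        # Placeholder residuals: the real system is Eqns 6 of Salehi et al. 2019,
        # six coupled equations in six scalar unknowns.
        a, s, g, t, l, r = v
        return [a - 1.0, s - 0.5, g - a * s, t - 1.0, l - t, r - g]

    solution = fsolve(residuals, np.ones(6))
    print(solution)
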
Week 4:
I looked at a more generalized notation for the same system of six equations from Theorem 1 that I have been studying. I spent the first couple of days of the week continuing to work through the proof of Theorem 1 and connecting it to the "teacher-student scenario" described here. Then, I looked at how the replica method, which is typically used in statistical physics, can be applied to this problem, as described in Appendix A of this 2022 paper by Loureiro et al. This approach takes the teacher-student model and applies the replica method to reach a result similar to Theorem 2 of Salehi et al.'s 2019 paper.
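
For context, the replica method rests on the standard identity

\[
  \mathbb{E}\log Z \;=\; \lim_{n \to 0} \frac{\mathbb{E}[Z^n] - 1}{n},
\]

where Z is the partition function: one computes \(\mathbb{E}[Z^n]\) for integer n (introducing n coupled "replicas" of the system) and then analytically continues n to 0. This is the textbook form of the trick, not anything specific to either paper.
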
Week 5:
We have pivoted the focus of the project to using the replica trick to characterize the performance of a bagging model. Therefore, I aimed to better understand the replica trick and also connected the two notations from last week more definitively. I also began to think about actually applying this method to the larger problem.
Week 6:
I continued my work using the replica trick to find equations that characterize the performance of the L2 regularized regression model with bagging. I essentially made a first attempt and will continue to correct my model next week.
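
As a concrete reference for what bagging an L2-regularized model means on the empirical side, here is a minimal sketch under my own illustrative assumptions (half-size subsamples without replacement, averaged coefficient vectors, C = 1.0); it is not the exact scheme from my derivation:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def bagged_fit(X, y, n_bags=2, rng=None):
        # Fit an L2-penalized logistic regression on random subsamples and
        # average the coefficient vectors (one simple form of bagging).
        if rng is None:
            rng = np.random.default_rng(0)
        n = X.shape[0]
        coefs = []
        for _ in range(n_bags):
            idx = rng.choice(n, size=n // 2, replace=False)
            m = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
            m.fit(X[idx], y[idx])
            coefs.append(m.coef_.ravel())
        return np.mean(coefs, axis=0)
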
Week 7:
During the first half of the week, I continued what I was doing the previous week but with a few simplifications (setting the logistic regression model parameter to the zero vector [0, ..., 0]). After meeting with my mentor, we decided to pivot the goal of my project back to the 2019 Salehi et al. paper. I am now looking at the unregularized bagging setting, where the response variable y gives no information about the predictor variable x.
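
In this null setting the labels carry no signal, so the data generation from the Week 1 sketch simplifies to labels drawn independently of the features; a short sketch under that assumption:

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 1000, 200  # illustrative sizes
    X = rng.normal(size=(n, p))
    y = rng.binomial(1, 0.5, size=n)  # y independent of X: no information
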
Week 8:
I continued work on the new problem mentioned in the previous week's entry. For this, I was able to successfully derive a single term approximating the correlation of the estimators of the two subsets. I also worked on my final presentation throughout the week, which I delivered on Thursday, and began working on my final write-up.
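
For the empirical check, I take "correlation" here to mean the normalized inner product of the coefficient vectors fitted on the two subsets (the precise definition is in my write-up); a minimal sketch:

    import numpy as np

    def estimator_correlation(b1, b2):
        # Normalized inner product <b1, b2> / (||b1|| ||b2||) between the
        # estimators fitted on the two subsets.
        return float(b1 @ b2 / (np.linalg.norm(b1) * np.linalg.norm(b2)))
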
Week 9:
Throughout this week, I worked on my final write-up of my work as well as my reflections on my time in the program. It was truly a fulfilling and personally fruitful nine weeks, and I am thankful to have had the opportunity to participate!

Presentations and Paper


Acknowledgements

Thank you to Pierre C. Bellec for his mentorship and generosity. This work was carried out as part of the 2023 DIMACS REU program at Rutgers University, supported by NSF grant CNS-2150186.