Sumi's REU Page

Project #1: Intelligent sampling for physics-informed multifidelity learning

Mentor: Professor Rajiv Malhotra

Abstract: Our group is working on tackling the curse of dimensionality in machine learning of practical engineering systems. We have found that biasing machine learning models with surprisingly simple first-principles models can drastically reduce the experimental and computational cost of data generation for machine learning, in some cases by an order of magnitude. In this context, this project will explore incremental sampling techniques that are crucial to the success of our approach. The REU student will work with an existing graduate student and implement different incremental sampling methods in the context of transfer learning, with very high potential for a journal publication. Knowledge of Python programming in the context of machine learning methods such as SVRs, random forests, and feedforward neural networks is necessary.

Weekly Logs

Week 1
May 31-June 2
The first week was primarily orientation. I settled into my apartment, attended the orientation lecture on Wednesday morning, and began reading through some literature related to my project, which is about physics-informed multifidelity learning. Since the project focuses on incremental sampling techniques, I also read about asynchronous Bayesian optimization, an optimization technique that lets us evaluate multiple models simultaneously and intelligently select the next set of solutions to evaluate. This is useful in physics-informed machine learning (PIML) because model evaluation is often very computationally expensive. I also made an introductory presentation on my topic, which I will present on Monday.
Week 2
June 5-June 9
This week moved a little slowly for me. I met with my mentor for the first time on Monday, and he sent me some data to train a neural network on. A glance through the data shows that it has to do with printing: we are given the nozzle-to-plate distance, extruder speed, filament feed rate, and the resulting average line width, and we want to train a model to predict the average line width from the other three quantities. I trained a neural network and did some visualizations of the data. I also explored some Python libraries for hyperparameter tuning in order to improve the model even more. I continued reading about PIML and got a better grasp of multifidelity learning, as well as specific algorithms for asynchronous batch Bayesian sampling (Thompson sampling and fantasizing).
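To make this concrete, here is a minimal sketch of the kind of model I trained. The file name and column names below are hypothetical stand-ins for the dataset my mentor sent, and the network size is just an illustrative choice, not the one I ended up using.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import r2_score

    # Hypothetical file and column names standing in for the printer dataset.
    df = pd.read_csv("printer_data.csv")
    X = df[["nozzle_to_plate_distance", "extruder_speed", "filament_feed_rate"]]
    y = df["avg_line_width"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Scale the inputs, then fit a small feedforward network to predict line width.
    scaler = StandardScaler().fit(X_train)
    nn = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
    nn.fit(scaler.transform(X_train), y_train)
    print("test R^2:", r2_score(y_test, nn.predict(scaler.transform(X_test))))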
Week 3
June 12-June 16
My mentor had a busy schedule and wasn't able to meet this week, so I worked on the printer dataset some more. I fit an SVR and learned how to do hyperparameter tuning for the SVR as well. This was the first time I had implemented an SVR model, so it was fun to see it in motion. I also ran simple linear regression and experimented with regularization terms, but the SVR performed the best out of all of the models. I also attended several talks from the Modern Techniques in Graph Algorithms workshop. I understood very little of the content, but I am interested in graph theory and have done an extensive project on graph neural networks, so it was interesting to see more advanced techniques. I plan to rewatch a few of the lectures and see if I can understand more of them next week.
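The SVR hyperparameter search looked roughly like the sketch below, reusing the train split from the Week 2 sketch. The grid values are illustrative rather than the ones I actually settled on.

    from sklearn.svm import SVR
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV

    # Put the scaler inside a pipeline so it is refit within each CV fold.
    pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
    param_grid = {
        "svr__C": [0.1, 1, 10, 100],
        "svr__gamma": ["scale", 0.01, 0.1, 1],
        "svr__epsilon": [0.01, 0.1, 0.5],
    }
    search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
    search.fit(X_train, y_train)   # same split as in the Week 2 sketch
    print(search.best_params_, search.best_score_)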
Week 4
June 19-June 23
This week was a little hectic, but I am ultimately very happy with how it turned out. On Monday, I met with my mentor, and he suggested that I try symbolic regression for the printer dataset. I had never heard of symbolic regression, so I read a few articles and watched some YouTube videos. Explainable AI is one of my interests, so it was cool to see how it can be applied to regression tasks. Ultimately, however, I opted to change course and move to a new project focused on logistic regression in high dimensions. I met with my new mentor on Thursday, and I am optimistic about the new research that I will be doing in this area. On Friday, I started reading the paper that he had assigned me, with the goal of recreating one of the figures from it. I also attended the Data Science Bootcamp lectures on Tuesday and Wednesday. While they were interesting, I decided to direct more time to reading about logistic regression, especially because my previous coursework in data science had already introduced me to many of the concepts discussed in the bootcamp.

Readings

Project #2: Optimization, learning and high-dimensional macroscopic limits

Mentor: Professor Pierre Bellec

Abstract: The last decade has seen the emergence of new phenomena where complex statistical learning problems such as high-dimensional regression and classification can be accurately summarized by simple systems of equations. These simple systems of equations characterize the high-dimensional limit of the statistical learning problem at hand and provide new insights on regularization and the choice of statistical estimators in high dimensions. The project will explore problems in this line of thought, requiring and developing skills in probability, statistical/machine learning, numerical programming, and computational linear algebra.

Weekly Logs

Week 5
June 26-June 30
This week, I became better acquainted with my new project. We want to define a curve for the existence of the maximum likelihood estimate (MLE) in high-dimensional logistic regression, parametrized by the ratio $\kappa = p/n$, where $p$ is the number of features and $n$ is the number of observations, and by $\gamma = \mathrm{Var}(\textbf{x}_i^T\beta)$, where $\textbf{x}_i$ is the $i$-th row of the feature matrix $X$ and $\beta$ is the vector of logistic regression coefficients. The curve is already known for logistic regression with two classes, and we want to find a similar curve for multinomial logistic regression. This week, I worked on recreating Figure 2a from this paper, which shows both empirical data and the theoretical curve of the phase transition for the existence of the MLE in binomial logistic regression. After reading a few sections of textbooks on logistic regression, I was able to generate an approximate figure, but it still looks wonky. I will work on it more next week.
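To pin down what one cell of the figure involves, here is a rough sketch of a single Monte Carlo trial, assuming standard Gaussian features and labels in {-1, +1}. The constants and scaling are placeholders rather than the paper's exact conventions, and the separability check here is the linear-programming version.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)

    n, kappa, gamma = 1000, 0.2, 5.0
    p = int(kappa * n)

    # Simulate a logistic model with Var(x_i^T beta) = gamma, as defined above.
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[0] = np.sqrt(gamma)
    probs = 1.0 / (1.0 + np.exp(-X @ beta))
    y = np.where(rng.random(n) < probs, 1.0, -1.0)

    # The data are strictly separable iff some b satisfies y_i * x_i^T b >= 1
    # for all i, a linear feasibility problem; if it is feasible, the MLE does
    # not exist. (Ties, i.e. quasi-complete separation, have probability zero
    # under this Gaussian design.)
    res = linprog(c=np.zeros(p), A_ub=-(y[:, None] * X), b_ub=-np.ones(n),
                  bounds=[(None, None)] * p, method="highs")
    mle_exists = res.status != 0   # status 0 means a separating b was found
    print("MLE exists:", mle_exists)

Repeating this over many trials for each $(\kappa, \gamma)$ pair gives the empirical side of the phase transition plot.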
Week 6
July 3-July 7
I continued working on recreating Figure 2a from this paper, and I was finally able to recreate the image after making some tweaks to how I evaluate the existence of the MLE. The original paper uses linear programming to check whether maximizing $\sum_{i=1}^n y_i(\textbf{x}_i^T\beta)$ has a solution for the given parameters, which determines whether the MLE exists. However, we can also just test whether $y_i(\textbf{x}_i^T\beta) \geq 0$ for all $i$. This makes the code run faster, and it gives the correct result. I was also able to recreate the theoretical curve defined in the paper and show that it matches the empirical results.
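In code, the shortcut amounts to replacing the linear program from the Week 5 sketch with a sign check against the $\beta$ used to simulate the data (same X, y, and beta as in that sketch):

    import numpy as np

    def mle_does_not_exist(X, y, beta):
        # If y_i * x_i^T beta >= 0 for every observation, the simulated data are
        # already separated by beta itself, so the MLE cannot exist.
        return bool(np.all(y * (X @ beta) >= 0))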
Week 7
July 10-July 14
This week, I shifted focus to evaluating the existence of the MLE for multinomial logistic regression. I started the week by reading this paper, which discusses the existence of the MLE for Gaussian mixture models, and I also read some portions of a statistical learning textbook to learn the probability distributions for multi-class logistic regression. Following my recreation of the binomial logistic regression phase transition, I tried to show an empirical phase transition for logistic regression with three classes. I once again ran into an issue with testing whether the MLE exists, as my previous method of testing whether a separating hyperplane exists only works with two classes. Instead, my mentor advised me to compare the predicted values and the actual values of the simulated data and test whether they are exactly the same. If they are, then the classes are linearly separable and the MLE does not exist. For logistic regression with 3, 4, and 5 classes, I noticed that the phase transition did exist and followed a similar curve to the two-class case, but the curve seems to shift to the left. Next week, I will try to use the proof of the theoretical phase transition for binomial logistic regression to explain these results.
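As a sketch (my own reading of the advice, not my mentor's exact code), the multi-class check can be written with scikit-learn by making the regularization negligible and testing whether the fitted model reproduces every training label:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def mle_does_not_exist_multiclass(X, y):
        # Effectively unregularized fit (C very large); with more than two
        # classes the default lbfgs solver fits a multinomial model.
        clf = LogisticRegression(C=1e10, max_iter=10000)
        clf.fit(X, y)
        # If the predictions match the simulated labels exactly, the classes
        # are linearly separable and the MLE does not exist.
        return bool(np.all(clf.predict(X) == y))

Here y holds integer class labels 0, ..., K-1 for K = 3, 4, or 5 classes.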
Week 8
July 17-July 21
This was the last week of the REU before a group of us left for a combinatorics workshop in Prague. I spent the beginning of the week finalizing my plots and preparing my presentation, which I gave on Thursday, July 20, and the latter half of the week was spent packing and briefly reviewing my combinatorics problem sets from last year in order to refresh my memory.
Week 9
July 24-July 28
This was the first week in Prague! In the mornings, we had introductory lectures on probabilistic graph theory, combinatorial geometry, visibility problems, and algorithmic game theory. On Thursday, we had student presentations, and I gave an expository talk on graph neural networks, as I had done a project on them this past semester. In our free time, we traveled around Prague and worked on our final papers.

Readings