Weekly Logs
Week 1
May 31-June 2
The first week was primarily orientation. I settled into my apartment, attended the orientation lecture on
Wednesday morning, and began reading through some literature related to my project, which is about physics-informed
multifidelity learning. Since the project focuses on incremental sampling techniques, I also read about asynchronous
Bayesian optimization, an optimization technique that allows us to evaluate multiple models simultaneously while intelligently selecting the next set
of solutions to evaluate. This is useful in physics-informed machine learning (PIML) because model evaluation is often very computationally expensive. I also made
an introductory presentation on my topic, which I will present on Monday.
Week 2
June 5-June 9
This week moved a little slowly for me. I met with my mentor for the first time on Monday, and he sent me some
data for training a neural network. A glance through the data shows that it has to do with 3D printing: we are given the
nozzle-to-plate distance, extruder speed, filament feed rate, and average line width, and we want to train a
model to predict the average line width. I trained a neural network and made some visualizations of the data. I also
explored some Python libraries for hyperparameter tuning for my neural network in order to improve the model even more.
I continued reading more on PIML and got a better grasp of multifidelity learning, as well as specific algorithms for
asynchronous batch Bayesian sampling (Thompson sampling and fantasizing).
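A minimal sketch of the kind of regression network I trained, using scikit-learn's MLPRegressor. The data below is a synthetic stand-in: the real printer dataset, its exact columns, and my actual architecture and tuning are not reproduced here.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the printer data: three process parameters
# (nozzle-to-plate distance, extruder speed, filament feed rate)
# and a fabricated average-line-width response.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=(500, 3))
y = 0.4 * X[:, 0] + 0.2 * X[:, 1] * X[:, 2] + rng.normal(0, 0.02, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale the inputs, then fit a small fully connected network.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
)
model.fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)  # R^2 on the held-out split
```

Standardizing the inputs before the network matters here, since the process parameters live on different scales.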
Week 3
June 12-June 16
My mentor had a busy schedule and wasn't able to meet this week, so I worked on the printer dataset even more. I fit an SVR (support vector regression) model and learned
how to do hyperparameter tuning for the SVR as well. This was the first time I had implemented an SVR model, so it was
fun to see it in motion. I also ran simple linear regression and experimented with regularization terms, but the SVR performed the
best of all the models. I also attended several talks from the Modern Techniques in Graph Algorithms workshop. I understood very
little of the content, but I am interested in graph theory and have done an extensive project on graph neural networks, so it was interesting
to see more advanced techniques. I plan to rewatch a few of the lectures and see if I can understand more of them next week.
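The SVR tuning followed the usual scikit-learn grid-search pattern; here is a sketch on the same kind of synthetic stand-in data as before (the grid values are illustrative, not the ones I actually searched over).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the printer data (same shape as the real dataset).
rng = np.random.default_rng(1)
X = rng.uniform(0.1, 2.0, size=(300, 3))
y = 0.4 * X[:, 0] + 0.2 * X[:, 1] * X[:, 2] + rng.normal(0, 0.02, 300)

# Scale features, then fit an RBF-kernel SVR; tune C, epsilon, and gamma
# with 5-fold cross-validation.
pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])
grid = GridSearchCV(
    pipe,
    param_grid={
        "svr__C": [0.1, 1, 10],
        "svr__epsilon": [0.01, 0.1],
        "svr__gamma": ["scale", 0.1],
    },
    cv=5,
)
grid.fit(X, y)
best = grid.best_params_  # best hyperparameter combination found
```

Wrapping the scaler and SVR in one pipeline keeps the cross-validation honest: the scaler is refit on each training fold rather than on the full dataset.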
Week 4
June 19-June 23
This week was a little hectic, but I am ultimately very happy about the outcomes of the week. On Monday, I met with my mentor, and he suggested
that I try symbolic regression for the printer dataset. I had never heard of symbolic regression, so I read a few articles and watched some YouTube videos.
Explainable AI is one of my interests, so it was cool to see how it can be implemented in regression tasks. Ultimately, however, I opted to change course
and move to a new project focused on logistic regression in high dimensions. I met with my new mentor on Thursday, and I am optimistic about the new research
that I will be doing in this area. On Friday, I got started reading the paper that he had assigned me, with the goal of recreating one of the figures from the paper.
I also attended the Data Science Bootcamp lectures on Tuesday and Wednesday. While they were interesting, I decided to direct more time to reading about logistic regression,
especially because my previous coursework in data science had already introduced me to many of the concepts discussed in the bootcamp.
Project #2: Optimization, learning and high-dimensional macroscopic limits
Mentor: Professor Pierre Bellec
Abstract: The last decade has seen the emergence of new phenomena where complex statistical
learning problems such as high-dimensional regression and classification can be accurately summarized
by simple systems of equations. These simple systems of equations characterize the high-dimensional limit
of the statistical learning problem at hand and provide new insights on regularization and the choice of
statistical estimators in high dimensions. The project will explore problems in this line of thought,
requiring and developing skills in probability, statistical/machine learning, numerical programming and
computational linear algebra.
Weekly Logs
Week 5
June 26-June 30
This week, I became better acquainted with my new project. We want to define a curve for the existence of the maximum likelihood estimate (MLE) in high-dimensional logistic regression, parametrized by the ratio $\kappa = p/n$, where $p$ is the number of features and $n$ is
the number of observations, and by $\gamma = \mathrm{Var}(\mathbf{x}_i^T\beta)$, where $\mathbf{x}_i$ is the $i$-th row of the feature matrix $X$ and $\beta$ is the vector of logistic regression coefficients.
The curve is already defined for logistic regression with two classes, and we want to find a similar curve for multinomial logistic regression. This week, I worked on recreating
Figure 2a from this paper, which shows both empirical data and the theoretical curve of the phase transition for the existence of the MLE in binomial logistic regression.
After reading a few sections of textbooks on logistic regression, I was able to generate an approximate figure, but it still looks wonky. I will work on it more next week.
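Each empirical point in the figure needs a simulated dataset at a given $(\kappa, \gamma)$. Here is a sketch of the sampling step as I understand it, with Gaussian features and $\beta$ scaled so that $\mathrm{Var}(\mathbf{x}_i^T\beta) = \gamma$; the paper's exact scaling conventions may differ.

```python
import numpy as np

def simulate(n, kappa, gamma, rng):
    """Simulate one binomial logistic dataset at aspect ratio
    kappa = p/n and signal strength gamma = Var(x_i^T beta)."""
    p = int(kappa * n)
    # Rows of X are i.i.d. N(0, I_p).
    X = rng.standard_normal((n, p))
    # Scale beta so that Var(x_i^T beta) = ||beta||^2 = gamma.
    beta = np.full(p, np.sqrt(gamma / p))
    # Draw labels in {-1, +1} from the logistic model.
    probs = 1.0 / (1.0 + np.exp(-X @ beta))
    y = np.where(rng.random(n) < probs, 1.0, -1.0)
    return X, y, beta

rng = np.random.default_rng(0)
X, y, beta = simulate(100, 0.2, 5.0, rng)
```

Sweeping `kappa` and `gamma` over a grid and recording how often the MLE exists at each point gives the empirical side of the phase-transition plot.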
Week 6
July 3-July 7
I continued working on recreating Figure 2a from this paper, and I was finally able to recreate the image after making some tweaks to how I evaluate the existence of the MLE. The
original paper uses linear programming methods to determine whether the MLE exists, checking whether maximizing $\sum_{i=1}^n y_i(\mathbf{x}_i^T\beta)$ has a solution given certain parameters. However, we can also just
test whether $y_i(\mathbf{x}_i^T\beta) \geq 0$ for all $i$. This makes the code run faster, and it gives the correct result. I was also able to recreate the theoretical curve defined in the paper and show that it matches the
empirical results.
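One way to implement the faster check is to fit (nearly) unregularized logistic regression and test the margins directly. This is a sketch of the idea, not my exact code; the large `C` in scikit-learn stands in for the unpenalized MLE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mle_exists(X, y):
    """Heuristic check for the existence of the binomial MLE.

    Fit nearly unregularized logistic regression (large C) and test
    whether every margin y_i * x_i^T beta_hat is positive. If so, the
    data are linearly separable and the MLE does not exist.
    Labels y are assumed to be in {-1, +1}.
    """
    clf = LogisticRegression(C=1e6, max_iter=5000, fit_intercept=False)
    clf.fit(X, y)
    margins = y * (X @ clf.coef_.ravel())
    separable = bool(np.all(margins > 0))
    return not separable  # MLE exists iff the data are NOT separable
```

Compared with solving a linear program per trial, this reuses a fast, well-tested solver, which matters when the phase-transition plot needs many Monte Carlo repetitions per grid point.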
Week 7
July 10-July 14
This week, I shifted focus towards evaluating the existence of the MLE for multinomial logistic regression. I started the week by reading this
paper, which discusses the existence of the MLE for Gaussian mixture models, and I also read some portions
of a statistical learning textbook in order to learn the probability distributions for multi-class logistic regression. Following my recreation of the binomial logistic
regression phase transition, I tried to show an empirical phase transition for logistic regression with three classes. I once again ran into an issue with testing whether the MLE
exists, as my previous method of testing whether a separating hyperplane exists only works for two classes. Instead, my mentor advised me to compare the predicted values and the actual values
of the simulated data to test whether they are exactly the same. If they are, the classes are linearly separable and the MLE does not exist. For logistic regression with 3, 4, and 5 classes,
I noticed that the phase transition did exist and followed a curve similar to the two-class one, but shifted to the left. Next week, I will try to use the proof of the theoretical phase
transition for binomial logistic regression to explain these results.
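A sketch of the prediction-matching check my mentor suggested, again using a large `C` in scikit-learn to approximate the unregularized multinomial MLE; the details of my actual implementation may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def multiclass_mle_exists(X, y):
    """Check for the existence of the multinomial MLE by prediction matching.

    Fit nearly unregularized multinomial logistic regression and compare
    the in-sample predictions with the true labels. If the fit reproduces
    every label exactly, the classes are linearly separable and the MLE
    does not exist.
    """
    clf = LogisticRegression(C=1e6, max_iter=5000)
    clf.fit(X, y)
    perfectly_separated = np.array_equal(clf.predict(X), y)
    return not perfectly_separated  # MLE exists iff not perfectly separated
```

Unlike the binary margin test, this check works for any number of classes, since "separable" here just means some set of linear decision boundaries classifies every training point correctly.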
Week 8
July 17-July 21
This was the last week of the REU before a group of us left for a combinatorics workshop in Prague. I spent the beginning of the week
finalizing my plots and preparing my presentation, which I gave on Thursday, July 20, and the latter half of the week was spent packing and
briefly reviewing my combinatorics problem sets from last year in order to refresh my memory.
Week 9
July 24-July 28
This was the first week in Prague! In the mornings, we had introductory lectures on probabilistic graph theory, combinatorial geometry,
visibility problems, and algorithmic game theory. On Thursday, we had student presentations, and I gave an expository talk on graph neural networks, as
I had done a project on them this past semester. In our free time, we traveled around Prague and worked on our final papers.