DIMACS
DIMACS REU 2021

General Information

Name: Emily Thompson
Email: thompsoe@southwestern.edu
Home Institution: Southwestern University
Mentor: Dr. Shashanka Ubaru
Project: Predicting Dissolution Rates of Volcanic Glass Using Graph Neural Networks

About My Project

I am using graph neural networks to predict the dissolution rates of volcanic glasses from their atomic structure.

Abstract: Graph neural networks (GNNs) have gained popularity in recent years due to the increasing need for machine learning models that accommodate graph-structured data. Interdisciplinary fields such as materials informatics use graph neural networks to conduct research on materials that are otherwise difficult to analyse. One of these materials, volcanic glass, is used to store nuclear waste due to its durability in extreme conditions. This durability, measured by the glass's dissolution rate, is difficult to determine in a traditional lab environment due to the extensive time and resources needed to collect and analyse samples. I seek to implement a GNN model that performs regression in order to predict the dissolution rate of ten different types of volcanic glass. In addition to implementing the GNN, I explore various methods of optimizing the performance of the model. Results demonstrate that unknown glass materials and their accompanying dissolution rates can be accurately determined by the GNN in a short period of time.

Weekly Summary

Week 1

This week was the first week of the program! After orientation, I met my mentor Dr. Ubaru and discussed the specifics of the research I will be doing this summer. After deciding what I will be working on, I spent the week reading papers about graph neural networks, variational autoencoders, graph partitioning, and the paper my research will expand upon. On Friday, I met with my mentor to go over each of the papers and clarify the material at a high level. At the end of the week, I started creating a presentation about the basics of my research.

Week 2

At the beginning of this week I gave a short presentation about the research I'll be doing this summer. It went well, and it was exciting for me to see the variety of research being done. My mentor and I were most interested in the research projects related to graphs! After the presentation I focused on finishing up some additional papers I was assigned as well as getting the Python code up and running. There were some issues getting the code to run, but eventually I was able to get it to compile! At the end of the week, my mentor and I met to discuss in detail what the code does and how it was implemented. Since Python is not my strongest programming language and the code I am working with contains thousands of lines, it was a long meeting. At the end of the meeting, I was tasked with becoming more comfortable with PyTorch and Python before our next meeting at the beginning of next week.

Week 3

This week I focused on improving my understanding of my mentor's code by implementing less complex versions and experimenting with parameters to improve their accuracy. In particular, I implemented a simple GNN, a GCN (graph convolutional network), and a VAE (variational autoencoder). I tested accuracy using public datasets such as Cora, CiteSeer, and Fashion-MNIST. Through this I gained a much better understanding of how all three of these implementations work in my mentor's code and was ready to begin modifying the code I was given.
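To make the message-passing idea concrete, here is a minimal NumPy sketch of a single GCN layer, following the standard propagation rule H' = ReLU(D^(-1/2) (A+I) D^(-1/2) H W). The toy graph, features, and weights are made up for illustration and are not from my mentor's code.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # degrees of the self-looped graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU

# toy 3-node path graph with 2-d features and an identity weight matrix
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
H = np.eye(3, 2)
W = np.eye(2)
out = gcn_layer(A, H, W)
print(out.shape)  # (3, 2)
```

Stacking two or three such layers (with a learned W per layer) is essentially what the GCN I implemented does; PyTorch's version just adds trainable weights and backpropagation.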

Week 4

In a surprising twist, I completely changed the focus of my research! This week my mentor and I met with Dr. Jie Chen at the MIT-IBM Watson AI Lab about collaborating on one of his current research projects pertaining to using machine learning to predict the dissolution rate of different types of glasses based on their atomic structure. This week I focused on learning some of the background material and starting to write code to accomplish this. To start, my mentors and I decided to use PyTorch Geometric, due to its versatility, to create a Graph Isomorphism Network (GIN) for regression (i.e., using regression to predict the dissolution rate). While it is scary to change to a new research topic in the middle of the program, I'm excited to see where this new research will take me!
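The GIN update rule is h_v' = MLP((1 + ε)·h_v + Σ_{u∈N(v)} h_u). A tiny NumPy sketch of one such layer, with a placeholder "MLP" (just a ReLU) purely for illustration:

```python
import numpy as np

def gin_layer(A, H, mlp, eps=0.0):
    """One GIN update: h_v' = MLP((1 + eps) * h_v + sum of neighbor features).
    A @ H computes, for each node, the sum of its neighbors' feature vectors."""
    return mlp((1.0 + eps) * H + A @ H)

# two connected nodes with 2-d features; a stand-in "MLP" (ReLU only)
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
H = np.array([[1.0, 0.0],
              [0.0, 1.0]])
out = gin_layer(A, H, mlp=lambda x: np.maximum(x, 0.0), eps=0.1)
print(out)  # rows: [1.1, 1.0] and [1.0, 1.1]
```

In the real model, PyTorch Geometric's GINConv supplies a trainable MLP and ε, and a readout over all nodes feeds a final regression head that outputs the dissolution rate.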

Week 5

This week I focused on extracting all the data from the files that were given to me. While it sounds easy, the atomic structure of the glass is stored in a Protein Data Bank (pdb) file, a file format that has a reputation for being difficult to read. The pdb file stores the structure of the glass by listing the x, y, z coordinates of each atom in the first half of the file and the indices of the atoms that are connected to each other in the second half. Additionally, there are few resources on how to use pdb files for machine learning, so I was mostly on my own. First I tried Python packages specialized for biology, such as BioPython and BioPandas. While both of these packages have functions that can extract data from a pdb file, neither was optimal or extracted the data correctly, so I instead went with the simpler option of using open() to read the file and place every line into a list. With the pdb file in a list, I started writing some basic functions to extract the features (which for now are just the coordinates) and the connections between the atoms, which will be used to construct graphs later on (aka the edge indices).
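A sketch of the open()-based approach, assuming simplified ATOM records (coordinates in the last three columns) and CONECT records (first index is the atom, the rest are its bonded neighbors). Real pdb files are fixed-width, so the whitespace split below only works for well-behaved files; the sample lines are invented for illustration.

```python
def parse_pdb(lines):
    """Extract atom coordinates and bond pairs from simplified PDB-style lines.

    ATOM lines hold the x, y, z coordinates; each CONECT line lists one atom's
    serial number followed by the serial numbers of the atoms bonded to it.
    """
    coords, edges = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ATOM":
            # assumes coordinates are the last three columns of the record
            coords.append(tuple(float(v) for v in fields[-3:]))
        elif fields[0] == "CONECT":
            src = int(fields[1])
            for dst in fields[2:]:
                edges.append((src - 1, int(dst) - 1))  # 0-based indices
    return coords, edges

sample = """\
ATOM      1  SI  GLS     1       0.000   0.000   0.000
ATOM      2  O   GLS     1       1.600   0.000   0.000
CONECT    1    2
CONECT    2    1
"""
coords, edges = parse_pdb(sample.splitlines())
print(coords)  # [(0.0, 0.0, 0.0), (1.6, 0.0, 0.0)]
print(edges)   # [(0, 1), (1, 0)]
```

The coordinate list becomes the node features and the edge pairs become the edge index for building graphs later on.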

Week 6

This week I finished the first versions of the functions that create the input X (aka the features) and the edge_index from the pdb file. Sadly, when trying to use these functions in PyTorch Geometric, I ran into several issues due to how Geometric is structured. While Geometric is easy to use with premade datasets like MNIST and MUTAG, it's much trickier to use with a custom dataset. After a lot of debugging and reading the Geometric documentation, my mentors advised me to split the atoms into subgraphs. To see how this works, imagine the atoms in the glass as one large cube; the goal is to break this cube into smaller cubes that don't overlap. While it sounds simple in principle, the way the pdb file is formatted makes this tricky. One issue is that the indices of the atoms are not listed in any particular order, so in practice atom #1 may be closest to atom #1024 instead of atom #2. Additionally, some atoms will be connected to atoms in other cubes (we don't want this!), so we have to ensure that any edges connecting atoms outside of their respective cubes are removed. I spent the week trying to implement this, though I did not have a working version by the end of the week.
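A sketch of the cube-splitting idea: bin each atom's coordinates into a splits×splits×splits grid over the bounding box (splits=4 gives 64 cubes), then drop every edge whose endpoints land in different cubes. The toy coordinates and edges are invented for illustration.

```python
import numpy as np

def cube_subgraphs(coords, edges, splits=4):
    """Assign each atom to one of splits**3 non-overlapping cubes by binning
    its coordinates, then keep only edges whose endpoints share a cube."""
    coords = np.asarray(coords, dtype=float)
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    # bin index along each axis; clip so atoms on the far face stay in range
    idx = np.clip(((coords - lo) / (hi - lo) * splits).astype(int),
                  0, splits - 1)
    cube_id = idx[:, 0] * splits**2 + idx[:, 1] * splits + idx[:, 2]
    kept = [(u, v) for u, v in edges if cube_id[u] == cube_id[v]]
    return cube_id, kept

# three atoms: two close together, one far away, with one crossing edge
coords = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (9.0, 9.0, 9.0)]
edges = [(0, 1), (1, 2)]
cube_id, kept = cube_subgraphs(coords, edges, splits=2)
print(kept)  # [(0, 1)]  (the edge into the far cube is dropped)
```

Note that binning by coordinates sidesteps the atom-ordering problem entirely: it never matters that atom #1's nearest neighbor might be atom #1024.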

Week 7

This week I successfully implemented the first version of the subgraph function, which generates 64 fixed subgraphs for each file. I also started rewriting some functions in order to reduce runtime. While this fixed-subgraph function can be used for the model to perform regression, the testing loss indicated that more improvements needed to be made. My mentors suggested that I generate more subgraphs per file and randomize the positioning of each cube. Lastly, I implemented the final version of the 12 features: the first 3 features are the coordinates of the atom, and the other 9 form a one-hot vector indicating the atom type.
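The 12-feature construction can be sketched as follows; the list of nine element types is hypothetical (the real glass structures determine the actual species).

```python
import numpy as np

# hypothetical element types; the real glasses determine the actual nine
SPECIES = ["Si", "O", "Al", "Na", "Ca", "Mg", "K", "Fe", "Ti"]

def node_features(coords, elements):
    """Build the 12-d feature vector per atom: x, y, z coordinates plus a
    9-d one-hot vector identifying the element."""
    X = np.zeros((len(coords), 3 + len(SPECIES)))
    X[:, :3] = coords
    for i, elem in enumerate(elements):
        X[i, 3 + SPECIES.index(elem)] = 1.0
    return X

X = node_features([(0.0, 0.0, 0.0), (1.6, 0.0, 0.0)], ["Si", "O"])
print(X.shape)  # (2, 12)
```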

Week 8

This week I reworked how the subgraphs were generated, focusing instead on generating 200 random, overlapping subgraphs. After this, I fixed how the subgraphs were divided into training and testing sets. In the final design, two different glass types were set aside for the testing set (six pdb files in total). This ensures that the model doesn't see the dissolution rates for all 10 glass types while training. After successfully implementing this, I ran the model, obtaining an average testing loss of 0.35 over the last 100 epochs.
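A sketch of both steps: sampling a randomly positioned (possibly overlapping) cube of atoms, and holding out entire glass types for testing. The file dictionaries, glass names, and side length below are made up for illustration.

```python
import numpy as np

def random_cube(coords, side, rng):
    """Return the indices of atoms inside one axis-aligned cube of the given
    side length, placed at a random position within the bounding box.
    Repeated calls yield overlapping subgraphs."""
    coords = np.asarray(coords, dtype=float)
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    corner = lo + rng.random(3) * np.maximum(hi - lo - side, 0.0)
    inside = np.all((coords >= corner) & (coords <= corner + side), axis=1)
    return np.flatnonzero(inside)

def split_by_glass(files, test_types):
    """Hold out every file belonging to the given glass types, so the model
    never sees those dissolution rates during training."""
    train = [f for f in files if f["glass"] not in test_types]
    test = [f for f in files if f["glass"] in test_types]
    return train, test

files = [{"glass": "basalt"}, {"glass": "basalt"}, {"glass": "obsidian"}]
train, test = split_by_glass(files, {"obsidian"})
print(len(train), len(test))  # 2 1
```

Splitting by glass type rather than by subgraph matters: random subgraphs from the same file are highly correlated, so mixing them across train and test would leak the held-out dissolution rates.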

Week 9

This week I focused on collecting final results for optimizing the glass dissolution model, creating the final presentation, and writing the final report. These past 9 weeks went by so fast!

References & Links

Here are some relevant resources:
  1. How Powerful are Graph Neural Networks?
  2. Graph Attention Networks
  3. Semi-Supervised Classification with Graph Convolutional Networks
  4. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
  5. A Practical Guide to Graph Neural Networks
  6. Machine learning in materials informatics: recent applications and prospects
  7. The dissolution rates of natural glasses as a function of their composition at pH 4 and 10.6, and temperatures from 25 to 74°C

Funding & Acknowledgements

I would like to thank Dr. Shashanka Ubaru at IBM Watson and Dr. Jie Chen at MIT-IBM Watson for their guidance throughout the research process. I also would like to thank Rutgers University and DIMACS for hosting the program. This work was funded by NSF grant CCF-1852215.