DIMACS
DIMACS REU 2023

General Information

Student: Elm Markert
Office: CoRE 448
School: Smith College
E-mail: elm.markert@gmail.com
Project: Genomic Data-guided Computational Modeling of Cancer

Project Description

When cancer cells die, they shed their genetic material into the bloodstream. As such, we can use the presence of cancerous DNA in the bloodstream to detect tumors. In this project, we are attempting to create a model that will screen for cancer by using the DNA found in patients' blood. However, this process is very complicated and has many variables. As a result, a large part of this project is simply untangling the many interactions and factors that affect human cancer.


Weekly Log

Week 1:
This week, I got my bearings with the project. I met with Dr. De and Dr. Kabiraj. We talked about the biology and the statistics behind the project, and decided which techniques we wanted to use. I'm doing a lot of background reading to wrap my head around the biology, but am also getting my first small data set to begin working on today. I'm going to finish up my slides for my first presentation, and I'm excited to get to work!
Week 2:
I taught myself a lot of biological concepts and R coding this week, which is generally one of my favorite parts of doing research -- the independent learning. I applied these concepts to the data set that I was given at the end of last week, and have found the parts of the sequence that we need for the next step of analysis. I also wrote a script to find base frequencies. With all the new code I learned, I also got a lot of practice debugging. My biggest issue so far has been run times. I've never worked with data sets this big before, and they definitely require more patience than previous work I've done. I also have Covid at the moment, which makes progress somewhat slow, but I'm excited to get going again once I'm better.
Week 3:
This week was another week of learning lots of new concepts, like figuring out how to use R to write .fasta files and make sequence logos. Learning about position weight matrices (PWMs) along with the sequence logos has been really interesting, and helps me understand some of the literature better. Writing the code to produce the PWMs and sequence logos was really interesting because I learned about a lot of tools in R that I had no idea existed. There's so much out there specifically for working with biological data in R! I also studied up on some new machine learning techniques. I've never worked with machine learning before, so it's interesting to see how many of the concepts are similar to concepts I've learned in my math, statistics, and computer science classes and research.
The most exciting thing for me this week has been getting the run times on my code down a ton! The long run times were frustrating me, so I started doing some pretty in-depth research about how R works as a language behind the scenes. It turns out vectors are a whole lot more powerful than I thought they were. I ended up replacing all of my for loops and completely recoding the pipeline I'd been using. This resulted in the run time for a single file going from multiple days to an hour or two. However, writing the fasta files in R is also taking a long time because I'm having a lot of trouble vectorizing the operation. Hopefully it will be quick enough, though!
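As a rough illustration of what a PWM and a sequence logo summarize, here is a small Python sketch. It is not the R code from this project: the function names, the four toy sequences, the pseudocount of 1, and the uniform background of 0.25 are all assumptions made for the example. It counts bases per position, converts the counts to log2-odds scores, and computes the information content per position, which is the total height of the letter stack in a logo (ignoring the small-sample correction that some logo tools apply).

```python
import math

BASES = "ACGT"

def position_frequency_matrix(sequences):
    """Count each base at each position of equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    counts = [{b: 0 for b in BASES} for _ in range(length)]
    for seq in sequences:
        for pos, base in enumerate(seq.upper()):
            if base in BASES:  # skip ambiguous bases such as N
                counts[pos][base] += 1
    return counts

def position_weight_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds scores against a uniform background (assumed)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def information_content(counts):
    """Bits per position: 2 minus the Shannon entropy of the base frequencies."""
    bits = []
    for column in counts:
        total = sum(column.values())
        if total == 0:  # column with no usable bases carries no information
            bits.append(0.0)
            continue
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in column.values() if c > 0
        )
        bits.append(math.log2(len(BASES)) - entropy)
    return bits

# Toy input, invented for this example
motifs = ["ACGT", "ACGA", "ACTT", "AGGT"]
counts = position_frequency_matrix(motifs)
pwm = position_weight_matrix(counts)
bits = information_content(counts)
```

In this toy input the first position is always A, so it carries the maximum 2 bits, while the more variable positions carry less.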
Week 4:
Throughout this project, my computer skills have really been tested and have grown. This week I figured out how to do things in the command line and gained more familiarity with bash scripts. I'm working on an even faster method for data processing because the last one, though a significant improvement, still isn't really fast enough for the volume of data we need to process (which is around 1000 files, quite a lot for my laptop to handle!). I met with Dr. De and Dr. Kabiraj, and we discussed the speed at which the data needs to be processed. We also talked about some of the background biology, and the math behind the upcoming analysis I'll be doing. I'm very excited to get to the analysis.
I've implemented a new tool - bedtools - which Dr. Kabiraj showed me. It makes extracting fasta files much quicker than my R script does! It required a lot of troubleshooting, but now I know a lot more about how the Windows Subsystem for Linux works (especially with regard to files), and can interpret the errors Ubuntu outputs with a lot more ease. I'd never used a tool that I had to call from the command line or with a bash script on my own before, so I feel a lot more confident in my ability to do that in the future now. The issue I'm solving now is that while the data wrangling and extracting the fasta files is going quite quickly, the frequency table generation in R is still giving me a lot of trouble with speed, and I've tried a myriad of methods to vectorize or just cut down overhead. As a result, I started learning some C++ today (Friday) because I think that figuring it out will actually be faster than optimizing in R given the sheer size and volume of data we have. I've also started on some of the analysis, though we need to process a lot more data to get to the meat of things.
Week 5:
My C++ script worked! And its runtime did significantly decrease the total runtime. Now, from the start of data wrangling to producing the final frequency table, each file takes only around 10 minutes. I wrote a bash script to bring together an initial R script to wrangle the raw data, bedtools to get the sequences we needed, and the C++ script. It took a while (until Tuesday) to get the bash script working because I've only ever used bash with a lot of supervision and help before, and don't have much experience. I'm also quite proud that I was able to get the C++ script to work in such a short time given that I knew nothing about the language until Friday and managed to get it running by Monday. It felt so good when the first files went through the whole thing and turned out how they were supposed to.
With this new compilation of tools and scripts, we should be able to process data pretty quickly. That means that we're about to get to the really interesting part of the project -- the analysis. I can't overemphasize how excellent that is. The analysis I'm doing now is already so exciting. I'm really enjoying exploring the NMF package in R, although admittedly it's a steep learning curve at the moment. I'm also reading even more papers, specifically on NMF applications in cancer. There's a lot of very interesting research out there. It's also led me down a rabbit hole of learning about support vector machines (SVMs).
I was hoping to get the new files so I could process them on Monday or Wednesday (I stayed at CoRE instead of the CINJ all day on Tuesday for a very interesting REU seminar), but unfortunately there have been very significant tech issues. Since I don't have access to or permissions for the server the files are on and the files are quite large, it's a struggle to get access to them. I met with Dr. De and Dr. Kabiraj, and we've decided that I'll give my code to Dr. Kabiraj so he can run it on the server. In the meantime, I'll start working on a smaller sample of data, and will probably end up mostly assessing a single type of cancer. As I don't have that data yet, I'm continuing to experiment with NMF with the data that I do have.
Week 6:
On Monday, I figured out how to get our output files into one big matrix for analysis with NMF. I also started practicing interpreting the NMF results and figuring out what rank is best. I still didn't have the files I will actually be analyzing, so it was mostly practice. Additionally, I started to write the introduction and abstract for my final report. I started outlining the other sections, but it is too early to write anything truly substantial.
Tuesday was the 4th of July, so I celebrated with some family! On Wednesday I was able to get the smaller data set of files and processed all of them to get the frequency tables. This means that starting on Thursday, I was able to focus on the NMF! It was very interesting because I've never worked with a lot of the concepts involved. I've been reading a myriad of papers on the concept, and have managed to figure out how choosing ranks works. All of them (for the starting and ending sequences as well as the size groups of fragments) hover around the same rank, which makes sense because the data sets are all similar sizes. I've now also constructed preliminary coefficient and basis matrices for the short fragments, longer fragments, and all fragments. I'm going to continue reading as well as working on my final presentation and report.
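To make the idea of basis and coefficient matrices and of comparing ranks a little more concrete, here is a minimal pure-Python sketch of NMF. It is not the R NMF package used in this project and makes several assumptions: it uses the Lee-Seung multiplicative updates for the squared (Frobenius) error, which may differ from the algorithm the R package runs by default, the helper names are hypothetical, and the 4x4 matrix V is invented. It factors V into a basis matrix W and a coefficient matrix H for ranks 1 to 3 and records the residual for each, which is one of the quantities one can look at when choosing a rank.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def residual(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum(
        (x - y) ** 2 for row_v, row_wh in zip(v, wh) for x, y in zip(row_v, row_wh)
    )

def nmf(v, rank, iterations=500, seed=0, eps=1e-9):
    """Factor V (n x m) into W (n x rank) and H (rank x m), all nonnegative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

# Invented matrix with two obvious blocks, so rank 2 should fit far better than rank 1
V = [[1, 1, 0, 0],
     [2, 2, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 1, 1]]

residuals = {}
for rank in (1, 2, 3):
    W, H = nmf(V, rank)
    residuals[rank] = residual(V, W, H)
```

A sharp drop in the residual followed by little further improvement suggests that the extra components are no longer capturing real structure. In practice, rank selection usually also relies on repeated runs from different random starts and on stability measures.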
Week 7:
As of Monday, I'd finalized what I believe are the ranks and matrices for my NMF analyses, based on measures such as residuals and the cophenetic correlation coefficient. Much of this week was dedicated to working on my final presentation and report. This involved reading a lot of papers in depth, including ones that I'd already read, to make sure I understand the material well and am producing an accurate and meaningful report of my results. On Wednesday I met with Dr. De and Dr. Kabiraj. We have added several more cohorts to analyze.
On Thursday we went on a field trip to IBM, which was really cool! It took the entire day, and gave me a much better idea of what industry jobs in mathematics, statistics, and computer science look like. There were a lot of talks about AI, and we got to see the quantum computers they're developing in person. We got back to Rutgers pretty late, but I saw that Dr. Kabiraj had sent me the extra data, so I started in on data processing. That's most of what I did Friday as well. I also created a lot of heatmaps, though they need to be cleaned up before they can be put in the final presentation.
Week 8:
At the start of the week, I finished making heat maps and started making them more legible by creating them from smaller subsets of data (e.g., by cancer type or fragment length rather than everything all together). I also switched R packages several times (I ended up using ComplexHeatmap from Bioconductor) because I wasn't happy with how the legends and dendrograms from the original ones looked. They're much more comprehensive now. I also made a ton of sequence logos to look at differences between the start/end of different cancer types and fragment sizes. On Tuesday, I was almost done with my final presentation, but needed to finish the results/discussion section and add a few figures.
I made even more heat maps on Wednesday and met with Dr. De and Dr. Kabiraj to go over my presentation. Overall they were pleased, but they helped me clean up some of the content and now it will flow much better. It's very useful to have people that are used to explaining biological concepts to a variety of audiences. We decided that I'd normalize the rows of my NMF matrix because the clustering in the heat maps looked off and it was likely due to different total numbers of samples from a given group. Afterwards, it turned out that the clustering we'd seen actually wasn't there, so our result is essentially a negative. This isn't a bad thing as it raises plenty of research questions to delve into.
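Here is a small Python sketch of why row normalization can change what the clustering shows. It assumes that "normalizing the rows" means dividing each row by its sum, which is only one of several possible normalizations, and the function name and the counts are made up for the example. Two groups with identical proportions but very different totals look far apart as raw counts and identical once normalized.

```python
def normalize_rows(matrix):
    """Scale each row so its entries sum to 1; all-zero rows are left unchanged."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([x / total for x in row] if total else list(row))
    return normalized

# Hypothetical counts, invented for this example
counts = [
    [10, 30, 60],      # group with 100 observations in total
    [100, 300, 600],   # same proportions, ten times as many observations
]
profiles = normalize_rows(counts)
```

After normalization both rows describe the same profile, so any distance-based clustering would no longer separate them just because one group was sampled more heavily.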
The first round of final presentations started on Thursday, and it was really interesting to hear what everyone had been working on all summer, especially since most people are in different specialties (and general areas) than I am. There was a lot of cool statistics. I finished up my slides in the afternoon and spent the rest of the day practicing my presentation. On Friday, I presented. It went quite well! Several friends and family attended via Zoom, which I really appreciated. I'm now working on my manuscript, and hope to have it done early next week. I already have a first draft done!
Week 9:
I officially finished my technical report (manuscript) on Monday, and I also completed my review of the REU! I cleaned and commented all the code that I wrote and handed it off to the De lab. I was available for the rest of the week to help with any issues with the code, and will continue to be available for minor technical issues. I spent the rest of the week having fun and exploring the area (and packing) before leaving on Friday. I'm so thankful to have had this experience and truly learned so much. Thank you to everyone at DIMACS (especially Dr. Gallos and Caleb), Dr. De and Dr. Kabiraj, and my fellow REU students!

Presentations


Additional Information


Acknowledgements

This work is being done at the Rutgers DIMACS REU program. Thank you to all those who work to keep the program running. Thank you as well to the NSF for funding this project through grant CNS-2150186. Thank you especially to Dr. De and Dr. Kabiraj for providing guidance and support.