Connie Zhang, DIMACS REU 2021

General Information


About My Project

Understanding complexity, dynamics, and stochastic patterns in genomic data - concepts native to physics and mathematics - is critical for elucidating how disease states originate and evolve. In this project, we focus on the application of the Tunable Biclustering Algorithm (TuBA) to examine genetic and clinical data of cancer patients, aiming to identify genetic markers that can provide us insight into how cancers evolve. Our ultimate goal is to develop novel statistical platforms for fast translation of genomic data into clinical practice.


Research Log

Week 1 May 24 - 28

On the first day of the REU, I attended the orientation and met some of the other participants over Zoom. Then, I spent the first half of the week familiarizing myself with the Tunable Biclustering Algorithm (TuBA), which is what I will implement over the summer.

During the second half of the week, I prepared for the presentation that will introduce my project. I gained more understanding about the direction I will take for the project this summer. I also downloaded the copy number and probe map data, and I wrote a function that combines these data with the bicluster information. This function allows me to combine bicluster information for any cancer with the copy number and probe map data, making future explorations more convenient. I started exploring some biclusters by creating visualizations in R.

Week 2 May 31 - June 4

I spent a lot of time this week cleaning data and making visualizations. I created a Shiny app for these visualizations, which made it much easier to compare visualizations across biclusters. From these visualizations, I can assess the chromosome enrichment in each bicluster and the copy number distribution in each chromosome and bicluster. The most challenging part of the week was combining information across different datasets. I had to extract the information I needed from one dataset and combine it with information from one or two other datasets while making sure the combined data could generate the visualizations I wanted.

I also got to analyze some of the visualizations. I noticed that some biclusters are enriched in genes from a single chromosome. The next step will be analyzing how copy number relates to chromosome enrichment in each bicluster. Next week, I will also work on assessing the statistical significance of the chromosome enrichments.
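One common way to put a number on "enriched in genes from a single chromosome" is a one-sided hypergeometric test. The sketch below is only an illustration under that assumption; the log does not say which test was used, and the function name and arguments are hypothetical.

```python
from math import comb


def chromosome_enrichment_pvalue(total_genes, genes_on_chrom, bicluster_size, hits):
    """One-sided hypergeometric p-value, P(X >= hits).

    Hypothetical helper (not from the original analysis).
    total_genes    -- number of genes in the background set
    genes_on_chrom -- background genes located on the chromosome of interest
    bicluster_size -- number of genes in the bicluster
    hits           -- bicluster genes located on that chromosome
    """
    upper = min(genes_on_chrom, bicluster_size)
    denominator = comb(total_genes, bicluster_size)
    # Sum the probability of seeing `hits` or more genes from the chromosome
    # when drawing `bicluster_size` genes without replacement.
    numerator = sum(
        comb(genes_on_chrom, k)
        * comb(total_genes - genes_on_chrom, bicluster_size - k)
        for k in range(hits, upper + 1)
    )
    return numerator / denominator


# Toy numbers: 10 background genes, 5 on the chromosome, a bicluster of 2 genes,
# both on the chromosome.
p = chromosome_enrichment_pvalue(10, 5, 2, 2)
```

Because every bicluster is tested against every chromosome, p-values obtained this way would normally need a multiple-testing correction before being called significant.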

Week 3 June 7 - June 11

I attended the TRIPOD data science bootcamp, and I learned a lot about experiment design and active learning in Python. I also went to a talk & tea event, where I got to talk to Dr. Gallos and some other DIMACS participants.

Now I'm running Kaplan-Meier analysis on the biclusters to see if any biclusters show significantly different survival times compared to others. I am also researching the biological processes of the genes to see if any biclusters exhibit biological processes related to cancer survival. The goal is to identify genes in these biclusters and use them to compute a survival score for each patient.

In order to gain a better understanding of my project, I also spent some time reading papers and watching YouTube videos on related topics.
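For readers unfamiliar with the Kaplan-Meier estimator, here is a minimal sketch of how the survival curve is computed from follow-up times and event indicators. It is a plain-Python illustration, not the code used in this project (which was done in R). The function name and the toy data are made up. Testing whether two curves differ significantly would additionally require something like a log-rank test, which is not shown.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed event.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t.
        at_risk -= leaving
    return curve


# Toy data: four patients, the second one censored at time 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
# curve == [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Computing one such curve for the samples inside a bicluster and one for the samples outside it gives the two groups that a survival comparison would be based on.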

Week 4 June 14 - June 18

I adjusted the parameters of TuBA. Now the results are more interpretable: I have about 90 biclusters whose samples have significantly different survival times from samples not in the biclusters. I also applied TuBA to samples exhibiting the lowest expression.

I spent a lot of time looking into these biclusters and researching the genes in them. I found that a couple of the smaller biclusters have overlapping genes with the largest bicluster. I also found biclusters with genes on the same chromosome, and another bicluster with significantly better survival than others. The fact that these biclusters contain genes that have been previously associated with cancer is really great news :)

This week, I attended a seminar by Prof. Amy Ogan. I enjoyed hearing how she applied her research in the real world. I also liked Lazaros' talk on ethics in research.

Week 5 June 21 - June 25

This week, I spent most of my time making heatmaps. The heatmaps visualize where biclusters and samples are most concentrated. For example, if we see a group of samples repeatedly appearing in multiple biclusters, we would want to look into the similarity between these samples and biclusters. I have never made heatmaps before, so I did a lot of research on the different heatmap packages in R. Every package has its own pros and cons; some packages are very interactive, which makes it easy to look into details on the heatmap, and some packages make it easy to add bars on the top of the heatmap for further annotation.

Throughout the week, I was also reading a paper on the molecular characterization of bladder cancer. This paper identifies gene copy number changes and mutations related to bladder cancer survival. I tried to incorporate this information into my heatmaps to see whether there are groups of samples exhibiting the same mutation or copy number changes.

Week 6 June 28 - July 2

I changed the method for calculating distance on the heatmap to binary and reannotated the clusters on the map. I presented my findings at Dr. Khiabanian's lab meeting. I really appreciated the opportunity to present my findings; it allowed me to summarize what I have found, and I can reuse some of these materials for my presentation at the end of the program. During the meeting, we also talked about the possibility of extending the analysis I have done for bladder cancer to a pan-cancer analysis. I think this is a really cool idea!
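As background on the binary distance mentioned at the start of this entry: for two 0/1 membership vectors, it is the fraction of positions where exactly one vector is "on", among the positions where at least one is "on". The sketch below illustrates that definition in plain Python. The function name and the example vectors are hypothetical, and returning 0.0 for two all-zero vectors is a convention chosen here.

```python
def binary_distance(a, b):
    """Binary (asymmetric binary / Jaccard-style) distance between two vectors.

    Non-zero entries count as "on", zero entries as "off".
    """
    pairs = list(zip(a, b))
    on_in_either = sum(1 for x, y in pairs if x or y)
    on_in_exactly_one = sum(1 for x, y in pairs if bool(x) != bool(y))
    if on_in_either == 0:
        # Assumed convention for two all-zero vectors.
        return 0.0
    return on_in_exactly_one / on_in_either


# Two samples described by their membership in four biclusters (1 = member).
sample_a = [1, 1, 0, 0]
sample_b = [1, 0, 1, 0]
d = binary_distance(sample_a, sample_b)  # 2 mismatches out of 3 "on" positions
```

This kind of distance suits a sample-by-bicluster membership matrix because shared absences (both samples outside a bicluster) do not make two samples look more similar.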

I also finally decided to use the ComplexHeatmap package in R for my visualizations. It does not provide the best interactivity, but it produces the best-looking plots and makes it easy to build custom heatmaps. Using this package, I added top annotations on the heatmap. I found that biclusters with immune response-related genes are very likely to have better survival.

On Thursday, I attended the scientific writing workshop. It was very informative and gave me a structure for writing the upcoming paper.

Week 7 July 6 - July 9

Previous results show that biclusters enriched with a single chromosome are also very likely to have amplified or over-amplified copy number. I found a list of biclusters like this and filtered for biclusters with 20 or more samples. Among these biclusters, some samples and genes appear repeatedly. I did research on these genes and samples to see if they have any association with cancer.

I learned about different areas of AI at this week's AI conference. It provided great information about academic and industry career options in the AI field.

Week 8 July 12 - July 16

We hypothesize that the samples that repeatedly appear in biclusters (enriched by a single chromosome and copy number amplification) may exhibit homologous recombination deficiency (HRD). Patients with HRD tend to have a higher number of mutations, which could make them more sensitive to immunotherapy. Another postgrad working at the Khiabanian lab has been doing research in this area, so I emailed him a list of samples that appeared in 10 or more biclusters, and hopefully our hypothesis can be confirmed!
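The sample list described above can be produced by a simple count. The sketch below assumes the biclusters are available as a mapping from a bicluster ID to its sample IDs. That data layout, the function name, and the toy IDs are all hypothetical. The default thresholds (20 samples per bicluster, 10 biclusters per sample) are the ones mentioned in this log.

```python
from collections import Counter


def recurrent_samples(biclusters, min_samples=20, min_biclusters=10):
    """Return samples that appear in at least `min_biclusters` biclusters,
    considering only biclusters with at least `min_samples` samples.

    biclusters -- dict mapping bicluster ID -> iterable of sample IDs
    """
    counts = Counter()
    for samples in biclusters.values():
        unique = set(samples)  # count each sample once per bicluster
        if len(unique) >= min_samples:
            counts.update(unique)
    return sorted(s for s, n in counts.items() if n >= min_biclusters)


# Toy example with small thresholds so the result is easy to check by hand.
toy = {
    "bc1": ["s1", "s2", "s3"],
    "bc2": ["s1", "s3"],
    "bc3": ["s1"],  # too small, ignored when min_samples=2
}
shortlist = recurrent_samples(toy, min_samples=2, min_biclusters=2)
# shortlist == ["s1", "s3"]
```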

I also started working on my final paper and presentation.

Week 9 July 19 - July 23

This week, I worked on my final presentation and paper. I wrote down everything I have done so far on a piece of paper and organized it into a cohesive story for my research. I then wrote a very detailed bullet-point draft for the paper, and this strategy helped me structure my thoughts when I transferred my ideas into academic writing.

I was able to present to my mentor on Tuesday, and he gave me a lot of helpful suggestions on how to structure the overall presentation and how to make the slides more engaging.

On Thursday and Friday, I listened to the student presentations. All of them were super informative and I really enjoyed seeing what other people have been doing this summer! It has been a great experience participating in the DIMACS REU program. I learned a lot about statistical genetics and was able to apply what I have learned in class to research.

This work was carried out while the student Connie Zhang was a participant in the 2021 DIMACS REU program at Rutgers University, supported by NSF grant CCF-1852215.