DIMACS
DIMACS REU 2019

General Information

me
Name: Theodora Katsarou
Email: Theodora.Katsarou@stonybrook.edu
Office: CoRE 434
Home Institution: Stony Brook University
Project: Statistical Inference of Mutational Patters in Clonal Genomic Data

About My Project

This project aims to capture genomic diversity in sub-population structures while utilizing tree dimensionality reduction to better understand evolution. Tree dimensionality reduction is beneficial for easier statistical analysis while creating a uniform metric to calculate distance between data. Phylogenetic trees have been used in biology to graphically represent hierarchical relationships between genes. Hence, reducing the dimensionality of large phylogenetic trees is a key to better understanding large data sets. Throughout this project, I studied what methods are best suited for tree dimensionality reduction and how I can use those methods to reconstruct trees that suit my data in order to analyze it further. I studied DNA sequences of influenza hemagglutinin (HA) dating from 1993-2018 and used several statistical techniques and algorithms to diminish the number of nodes in the phylogenetic tree that captures all the data. Visualizing these trees In a three dimensional space has shown the different branching patterns within those years that exemplify the small chronological changes that occur as the virus replicates, as well as the frequent reassortment of influenza. The methods used throughout the project are beneficial in visualizing and analyzing any type of clonal genomic data.

The goal of the summer research project is to plot data into phylogenetic trees, and analyze them statistically. To accomplish this, I will be using Python to write algorithms that sort the data into phylogenetic trees.

Research Log

Week 1

This week I was busy with move in and orientation, and hope to be able to do more research next week. I also met with my mentor and discussed the articles I have read on cancer genomics, and high-throughput sequencing as well as prepared my presentation.

Week 2

This week, I began writing a program to sort phylogenetic trees in python starting with researching molecular phylogenetics through an article published in nature reviews by Yang and Rannala. Earlier in the week, I analyzed the geometry of the space of phylogenetic trees in a paper written by Billera, Holmes, and Vogtmann. This paper is especially helpful in my research because it shows how to combine phylogenetic trees that are in different spaces, and joins them together into a Euclidean space. This is helpful in order to statistically analyze trees using Euclidean properties.

Week 3

This week, I continued working on the code that calculates distance of phylogenetic trees and found that the best way to do this is to use a neighbor joining algorithm. After completing the neighbor joining algorithm in python, I made a Pdist function in Matlab using hamming distance to check that my code calculates pairwise distances correctly. I will need to continue working on the algorithm because although it is running, it is only calculating one tree at a time and I need it to produce hundreds of trees per year.

Week 4

This week, I learned how to use the Mega program in order to calculate the distance matrix for my data. My data consists of over a thousand DNA sequences for the EpiFlu between the dates 1993-2018, and each year has about 50 sequences. The Mega program calculated the distances between these taxa and produced a matrix that is 1700x1700, as well as a phylogenetic tree with more than one thousand nodes. There is a slight problem here because these trees are so large that it is hard to analyze them, so within the next weeks, I plan to make an algorithm that sorts the sequences chronologically. I will do this in hopes of producing phylogenetic trees that capture a specific years data instead of data between random years that are not chronological. This is important because phylogenetic trees are made and used in biology to show evolutionary relationship, and if the data is not arranged in the right order the results are almost pointless.

Week 5

I decided the best way to reduce the dimensionality of the trees is to tweak the neighbor joining algorithm into producing trees that have 3 nodes each rather than thousands of nodes. This means that the algorithm would need an upper boundary of about 100 trees per year, otherwise it could create thousands per year. The algorithm would take the large distance matrix I have previously calculated but use a 3x3 distance every time until all the distances are used. The neighbor joining algorithm is perfect for this because in each iteration of the algorithm, the two nearest nodes of the tree are chosen and defined as neighbors. This is done repeatedly until all of the nodes are paired together. This algorithm starts with a star tree that represents the distance matrix perfectly because the columns and rows of the distance matrix represent the nodes of the tree while i, and j represent the distance between node i, and j. Then you add a node between the nodes with the lowest value in the matrix and compute the branch lengths for the branches that have been joined. Then, all the steps are repeated over and over again. This algorithm was originally proposed by Saitou and New in 1987 in a paper called The Neighbor Joining Method: A New Method for Reconstructing Phylogenetic Trees, and I have summarized how my algorithm works below in 8 steps.

1) Align the sequences

2) Build distance matrix between the sequences

3) Compute the ri’s for each one of the nodes in our matrix

       a. ri is the sum between the distance i and k.

4) Compute M(ij). Mij= dij-(ri+rj)/(N-2)

5) Define a new node whose three branches join i, j and the rest of the tree.

       a. Define the lengths of the tree branch from u to i and j:

       b. L(iu) = dij/2+(ri-rj)/2(N-2)

       c. Lju=dj-Liu

       d. These distances are the lengths of the new branches

6) Define the distance from u to each other node as: Dku = (d(ik)+(d(jk))/2

7) Remove the distances to node i and j from the data matrix, and decrease N by 1.

8) If more than two nodes remain, go back to step 1. Otherwise, the tree is fully defined except for the length of the branch joining the two remaining nodes (i and j).

Week 6

The algorithm is now mostly complete and I have used it to calculate the new branch lengths of the distance matrix. The algorithm produced the trees in a newick tree format which, as I have learned, is a representation of trees in computer readable form using nested parentheses and commas. Newick tree format is hard to visualize and read for humans, but it is useful for exchanging trees between different types of software or using it as an input for a different code. This week, I am working on using the output of the neighbor joining algorithm and using it as an input in another script that makes it easier to visualize the trees in a three dimensional space. From the tree in newick format, you can calculate the branch lengths, and those branch lengths represent the position of the point in the positive quadrant in R3 (three dimensional).

Week 7

This week is the week of the final presentations. Although I am not done with my research, I have included the work I have done thus far. In addition, I am still trying to learn how to use python packages such as BioPython in order to convert my newick tree format into three dimensional structures in order to visualize them. I have also started working on the final paper.

Week 8

I have completed the program that converts newick trees into three dimensional structures using Python scripts. This script aids in calculating he branch lengths that represent the position of the points in the positive quadrant of R3. After calculating the branch lengths, the coordinates get mapped into a three dimensional structure. The trees that have three nodes were mapped onto a triangle where all the points fall within vectors <1,0,0>,(0,1,0>,<0,0,1>. This code works through a function that takes a three leaf tree and returns the two dimensional coordinates for external projective space. The newick trees that have four nodes connecting them are mapped onto a three dimensional tetrahedron through a function that returns the three dimensional coordinates for external projective space from a four leaf tree. The function also takes a four leaf tree and returns which of the tree topologies it displays, along with the lengths of the internal branches. After looking at the topologies of certain years, I decided to compare and extremely branched structure (2006), with moderately branched (2011) and linear structures(2014), to understand more as to why some years of influenza exhibit a more branched model of growth than other years. Through branching of data, it is clear that influenza reassorts over time creating a new virus that holds genetic information from different strains. The linear data exhibits antigenic drift, which also is a mechanism for variation. There are clear small changes in the genes of influenza virus that happen continually over time as the virus replicates. In other words, influenza had a greater rate of reassortment between years 2002-2006 than 2007-2011.

Week 9

I decided to compare the results from the previous week to the Centers for Disease Control and Prevention (CDC) Seasonal flu vaccine effectiveness studies, which measure how well flu vaccines are working for particular years. The data on the CDC website for the years mentioned last week(2006, 2011, 2014) show that the overall VE %(vaccine effectiveness) for 2006 has an average of 19%, 2011 has an average of 48.5%, and 2014 has an average of 52%. Thus, these percentages show a negative correlation between high VE % and branching models of growth. In other words, the lower the vaccine effectiveness, the more branching occurs in the topology structure. This can be explained through the lack of effectiveness that vaccines have when influenza virus reassorts. These high mutation rates and genetic reassortment enable the virus to cause repetitive influenza outbreaks by evading immune recognition. The amount of vaccine effectiveness is significantly lowered by these mutation rates as the protective effect induced by one strain may be reduced or lost as a function of time, resulting in individuals being unprotected against the new strains in circulation. We have concluded that evolutionary patterns can be inferred based on graphs of diverged or linear phylogenetic trees. By visualizing the data sets in the tree topology graph, we are able to make conclusions about the reassortment of influenza viruses, but more importantly these graphs can be used for easier statistical inference in any clonal genomic data.

Additional Information

Here are my mentor's website, and the REU website: