I am a rising senior at Carnegie Mellon University pursuing a double major in mathematical sciences and materials science.
I am participating in the DIMACS REU program to gain research experience that will be valuable for my future educational endeavors.
Phylogenetic trees are representations of the evolutionary relationships between species as labeled trees.
They can be created by sequencing the DNA of different individuals and examining the overlapping sequences; however, this generally results in several possible phylogenetic tree candidates.
In order to analyze these candidates, it is helpful to have a notion of distance between two phylogenetic trees.
The Robinson-Foulds metric is a measure of the level of similarity between two labeled trees.
Efficient algorithms are known for computing the Robinson-Foulds metric between a pair of trees.
Suppose we have a large collection of trees, and desire to find all pairs of trees which have a Robinson-Foulds metric below a particular threshhold.
Calculating the Robinson-Foulds metric between every pair of trees is a very time consuming way to accomplish this goal.
We intend to find a method of identifying all pairs of trees with a small enough Robinson-Foulds metric that has a lower time complexity than the brute-force approach.
Note: For the first two weeks of the program, I was working on a different project with Dr. Abello.
The goals of that project are described in the initial presentation.
- Week 1:
- I became acquainted with the REU program, Dr. Abello, and the group I would be working with.
We planned out the features of our project and started on an introductory project that would showcase many of these features by making a website using data from the World Cup.
The intent is to create a website centered around a graph-based presentation of information, where each node of the graph is a "card" describing an item of interest.
For the World Cup website, each card will be a participating player; later on, we will make each card a project from different REU programs.
On Friday, we presented the main concepts and goals of the project.
- Week 2:
- We continued to work on the World Cup Project.
I was responsible for constructing and accessing databases for data on the soccer players and users of the website.
We also worked on the interface and design of the website.
I started doing research for the phylogenetic tree project with Dr. Zola, becoming acquainted with the basic terminology of the field and existing techniques for computing the Robinson-Foulds metric.
- Week 3:
- I began to investigate different ways of representing trees that might be useful for estimating the Robinson-Foulds metric.
I also learned how to use the dendropy Python library as a convenient way of automatically calculating the Robinson-Foulds metric between any two trees for testing purposes.
Dr. Zola and I outlined the approach we plan on using to reduce the runtime necessary to find sets of trees that are close together.
I researched information on MinHash, which is a method of estimating the degree of similarity between two sets in linear time.
Such a method will be very valuable if we can find a way to reduce the problem of comparing trees to a problem of comparing sets.
- Week 4:
- The main obstacle we have to overcome with tree comparisons using a set-based approach is finding an algorithm that can be used to hash trees in linear time and that is min-wise independent, which means any value of a set is equally likely to be the minimum after hashing.
I looked up information on hashing functions that are known to be approximately min-wise independent to see if any of them could be adapted for use with trees.
Eventually, I found a hashing technique that is close to min-wise independent that I was able to adapt to be computed recursively over the nodes of a tree.
I wrote a Python program that randomly generates pairs of trees and uses MinHash to estimate the Robinson-Foulds metric between them.
Using this program, I was able to test the accuracy of the hashing approach by comparing its estimate to the exact Robinson-Foulds metric between pairs of trees.
- Week 5:
- I made the tree generation and randomization algorithms more efficient to enable testing with much larger trees.
I ran several tests on trees with up to 20000 leaves to evaluate the accuracy of the hashing technique, using the R² value between the actual Robinson-Foulds distances and the hashing approximations to measure the quality of the estimator.
Unfortunately, after the tests I noticed a bug in my code that was causing the estimated Robinson-Foulds distances to be slightly off in some cases.
Hopefully this will only require minor tweaking to remedy.
- Week 6:
- It required a lot more than minor tweaking to remedy.
After purging the bug from my program, I computed some more Robinson-Foulds distances, this time using the hash function to get much smaller subsets than I was taking before (which speeds up the algorithm at the expense of accuracy).
This data will be included in the paper we are planning to eventually write.
Then, we noticed a more fundamental issue with the algorithm: in the worst case, it does no better than the existing technique we were trying to improve upon.
I meditated on methods of getting around this issue, but couldn't come up with anything that would work.
- Week 7:
- I checked the distribution of some actual phylogenetic data to see if the worst-case situation would occur frequently in practice.
Unfortunately, it seems that such an occurrence would be common.
For now, it looks like our approach to the problem has been thwarted, but hopefully future work will be able to build on what we've found thus far.
I spent the rest of the week working on my final presentation and report.