University of Pennsylvania
apershan (at) sas.upenn.edu
Statistical Algorithms in Population Genetics
I'm a rising senior from Chicago, Illinois, interested in probability and in seeing real-world applications of the math I learn. I'm also interested in relaxing anywhere with a good book, playing frisbee, and learning about my own heritage.
The problem I am working on this summer is local ancestry detection in an admixed genome. For example, African Americans have some European ancestry and some West African ancestry, so their genomes are a mosaic of segments of European and West African ancestry. For many applications it is important to figure out the local ancestry of each segment. For example, a recent study found that Native American ancestry is associated with a higher occurrence of relapse in children suffering from a certain type of leukemia. The specific summer project is to conduct experiments with a number of different algorithms for detecting local ancestry, using a set of population genetics data that Dr. Chen's group has already worked with extensively.
- Week 1:
- After moving in and orientation, I got to work reading some papers and learning about Hidden Markov Models (HMMs), since they are used in many of the local ancestry algorithms I've seen so far. I found a good concrete example in the Wikipedia article and an explanation of some of the math here. I met with Dr. Chen and we worked through a paper describing the LAMP-LD software package, which uses a hierarchy of HMMs as well as window-based processing of the genome. Otherwise I've been working on my presentation, which I give this Friday, and exploring the campus.
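To make the HMM idea concrete, here is a toy sketch of Viterbi decoding for a two-state model in which the hidden states are ancestries and the observations are 0/1 alleles. It is not LAMP-LD (which uses a hierarchy of HMMs and windows). The state labels and all of the probabilities below are made-up values chosen only for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for `obs` (computed in log-space)."""
    # best[t][s] = log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical ancestry labels and probabilities (assumptions, not real estimates).
states = ["EUR", "AFR"]
start_p = {"EUR": 0.5, "AFR": 0.5}
trans_p = {"EUR": {"EUR": 0.9, "AFR": 0.1}, "AFR": {"EUR": 0.1, "AFR": 0.9}}
emit_p = {"EUR": {0: 0.8, 1: 0.2}, "AFR": {0: 0.2, 1: 0.8}}

path = viterbi([0, 0, 0, 1, 1, 1], states, start_p, trans_p, emit_p)
print(path)
```

The high self-transition probability (0.9) is what makes the decoded path come out as long runs of one ancestry rather than flipping at every site, which is the "mosaic of segments" picture.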
- Week 2:
- This week I focused on running LAMP-LD on the yeast data, which for me mostly involved programming in Python. For example, the genetic code (written with A, T, G, C) had to be encoded into 0s and 1s, where 0 represents the more common allele (e.g. 'C') at that genetic location, and 1 represents the alternate allele. This had to be done across 35 strains of yeast, with over 170,000 alleles, so writing a short script made that a lot easier. And as of yesterday afternoon, I have the local ancestries of all the mosaic strains of yeast.
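The 0/1 encoding step can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the function name `encode_biallelic`, the toy sequences, and the assumption that every site carries at most two alleles are all hypothetical, and the real input format for LAMP-LD is not shown here.

```python
from collections import Counter

def encode_biallelic(strains):
    """Encode each site as 0 (more common allele) or 1 (any other allele).

    `strains` is a list of equal-length strings over A/T/G/C, one per strain.
    Returns one list of 0/1 values per strain.
    """
    n_sites = len(strains[0])
    encoded = [[] for _ in strains]
    for site in range(n_sites):
        column = [s[site] for s in strains]
        # The most frequent base at this site is treated as the major allele.
        # On a tie, Counter keeps the base that was seen first.
        major = Counter(column).most_common(1)[0][0]
        for row, base in zip(encoded, column):
            row.append(0 if base == major else 1)
    return encoded

# Toy data: three "strains", four sites (made up for illustration).
toy = ["ACGT", "ACGA", "TCGA"]
print(encode_biallelic(toy))
```

Working column by column keeps the "which allele is more common" decision local to each site, which is all the encoding needs.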
- Week 3:
- This week I started working with the data from LAMP-LD. Gametes are formed with whole chromosomes from each parent, so a process called recombination (swaps of chunks of genes) is what contributes to the mosaic nature of the admixed genome. What's interesting is that since this process happens only once per generation (for a given genealogy), the length of continuous ancestry tracts in the genome can be used to figure out the number of generations since admixture. If the tracts are very short, there have likely been many swaps, i.e. many generations. This week I read a paper that discussed some mathematical models for this relationship. The author, Gravel, has equations for both the mean tract length and variance that depend on two parameters: generations since admixture and initial migration rate. It's a bit tricky so far trying to solve these equations using the observed data I have, and I'm still in the process of trying to arrive at a good guess for this timescale.
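The intuition that shorter tracts mean more generations can be illustrated with a back-of-the-envelope sketch. It assumes a deliberately simplified picture that is not Gravel's model: breakpoints accumulate at about one per Morgan per generation, so after T generations tract lengths are roughly exponential with mean 1/T Morgans. The migration-rate parameter is ignored entirely, and the data are simulated rather than real.

```python
import random

def estimate_generations(tract_lengths_morgans):
    """Rough estimate of generations since admixture as 1 / (mean tract length).

    Assumes tract lengths are measured in Morgans and are exponentially
    distributed with rate T (a simplification, see the text above).
    """
    mean_length = sum(tract_lengths_morgans) / len(tract_lengths_morgans)
    return 1.0 / mean_length

# Simulated tracts for a hypothetical admixture event 25 generations ago.
random.seed(0)
true_generations = 25
simulated = [random.expovariate(true_generations) for _ in range(10_000)]

estimate = estimate_generations(simulated)
print(round(estimate, 1))
```

If the observed tract lengths do not look exponential, this kind of estimate should not be trusted, since the whole calculation rests on that assumption.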
- Week 4:
- This week I continued in my effort to infer a time of admixture. A couple of dead ends from last week mean that I'm still working on a good optimization technique for figuring out which combination of these two parameters fits our data best.
- Week 5:
- After talking to Dr. Chen on Monday, it seems that the tract lengths for some reason just aren't exponentially distributed. This in turn means that I can't fit the Gravel model's equations to the data as of right now. Dr. Chen instead had me research a newly published method specifically for inferring the date of admixture, called ROLLOFF. ROLLOFF basically looks at the drop-off of admixture-based LD (linkage disequilibrium) between all pairs of SNPs. The paper itself also presents some pretty interesting anthropological findings of African admixture in some Middle Eastern and Southern European populations, so that's been interesting as well. I'm presenting the paper on Monday to both Dr. Chen and Liyang.
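The core idea behind dating admixture from LD drop-off can be sketched as a curve fit. This is not the ROLLOFF software, which computes a weighted LD statistic over SNP pairs binned by distance. The sketch only assumes that the admixture LD curve decays roughly like A·exp(-n·d), where d is genetic distance in Morgans and n is the number of generations since admixture, and fits n by least squares on the log of the curve. The function name and the synthetic, noise-free curve are assumptions for illustration.

```python
import math

def fit_decay_rate(distances_morgans, correlations):
    """Fit correlations ~ A * exp(-n * d) by linear least squares on log(correlation).

    Returns (n, A). Requires all correlations to be positive.
    """
    xs = distances_morgans
    ys = [math.log(c) for c in correlations]
    n_pts = len(xs)
    mean_x = sum(xs) / n_pts
    mean_y = sum(ys) / n_pts
    # Ordinary least-squares slope and intercept of log(correlation) against distance.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic, noise-free decay curve for a hypothetical 20 generations since admixture.
true_n = 20
distances = [0.005 * k for k in range(1, 21)]  # 0.5 cM bins out to 10 cM
curve = [0.3 * math.exp(-true_n * d) for d in distances]

n_hat, amplitude = fit_decay_rate(distances, curve)
print(round(n_hat, 3), round(amplitude, 3))
```

With real data the curve is noisy and can dip below zero at large distances, so a log transform like this one breaks down and a nonlinear fit would be the safer choice.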
- Week 6:
- This week I focused on understanding the ROLLOFF method. The ROLLOFF paper doesn't go into a huge amount of mathematical detail, so I found it a little hard to code up myself. The code I finally produced was slow, and the output wasn't as clean as I would have liked. In order to ensure an accurate answer, Dr. Chen sent the authors of the paper an email asking for the software they produced in conjunction with the paper. I'll be starting to run that next week, and hopefully I'll finally have an answer for the dates of admixture. Just in time for my final presentation next Thursday.
- Week 7:
- I wrapped up ROLLOFF this week and got some interesting results. Using the pairwise outputs from ROLLOFF, I was able to construct a rough idea of the phylogenetic tree of the mosaic strains. There was also an interesting post-facto confirmation of those results from the old tract length data. The pictures and results are all in my final presentation, which I gave on Thursday. Hard to believe the program is wrapping up already.