I will be investigating the relationship between edit distance and q-gram distance, two measures of comparison between strings, as they apply to read mapping to reference genomes. Read mapping consists of matching shorter sequences of DNA sequences to a longer, reference sequence. Edit distance and q-gram distance are related by the following equation:
QD ≤ 2qED
My goal is to further define the relationship between edit distance and q-gram distance under various cases and possibly determine the probability distribution of the q-gram distance with respect to edit distance.
- Week 1:
- I met with my mentor and co-mentor to discuss my goals for the project, which is to investigate the relationship between q-gram distance and edit distance. After discussing with my mentor/co-mentor and doing some initial reading, I created my first presentation that was delivered on Friday.
- Week 2:
- I did further reading and discussed with my mentor/co-mentor about the theorems and proofs that I will need to understand and possibly expand upon. After getting a grasp of these concepts, I worked out a few examples for my own benefit and then attempted to apply the ideas to slightly different cases.
- Week 3:
- I discussed with my mentor and co-mentor to further define some short term goals. I was also introduced to Latex. I worked on developing a proof that describes the upper bound of the probability of the L1 distance when considering 1-grams and one indel. I typed up my proof in Latex.
- Week 4:
- I made revisions to my 1-gram proof as per suggestions from my mentor and co-mentor. I also started working towards creating a similar proof for 2-grams. I had two meetings with my mentor and co-mentor in which we discussed ideas for the 2-gram proof.
- Week 5:
- I made further revisions to my 1-gram proof and continued working on the 2-gram proof. Unfortunately, we have not been able to formulate a good argument yet, but my co-mentor came up with an idea that involves Eulerian paths that I will try to further define next week.
- Week 6:
- I finished up my 1-gram proof for now and focused on developing the 2-gram proof. I did some reading on Eulerian paths and worked out some complications with the 2-gram proof.
- Week 7:
- We formulated a promising conjecture for the 2-grams, although we must now develop a solid proof for it. I created my final presentation that was delivered on Thursday.
- Week 8:
- In Prague.
with q-grams and maximal
matches, Esko Ukkonen
Indel-tolerant Read Mapping with Trinucleotide
Frequencies using Cache-Oblivious kd-Trees, Md Pavel Mahmud, John Wiedenhoeft, and Alexander Schliep.