General Information

Student: Theresa Lye Hill 264A Rutgers, The State University of New Jersey tlye (at) eden.rutgers.edu The Relationship Between Edit Distance and Q-Gram Distance

Project Description

I will be investigating the relationship between edit distance and q-gram distance, two measures of comparison between strings, as they apply to read mapping to reference genomes. Read mapping consists of matching shorter sequences of DNA sequences to a longer, reference sequence. Edit distance and q-gram distance are related by the following equation:

QD ≤ 2qED

My goal is to further define the relationship between edit distance and q-gram distance under various cases and possibly determine the probability distribution of the q-gram distance with respect to edit distance.

Weekly Log

Week 1:
I met with my mentor and co-mentor to discuss my goals for the project, which is to investigate the relationship between q-gram distance and edit distance. After discussing with my mentor/co-mentor and doing some initial reading, I created my first presentation that was delivered on Friday.
Week 2:
I did further reading and discussed with my mentor/co-mentor about the theorems and proofs that I will need to understand and possibly expand upon. After getting a grasp of these concepts, I worked out a few examples for my own benefit and then attempted to apply the ideas to slightly different cases.
Week 3:
I discussed with my mentor and co-mentor to further define some short term goals. I was also introduced to Latex. I worked on developing a proof that describes the upper bound of the probability of the L1 distance when considering 1-grams and one indel. I typed up my proof in Latex.
Week 4:
I made revisions to my 1-gram proof as per suggestions from my mentor and co-mentor. I also started working towards creating a similar proof for 2-grams. I had two meetings with my mentor and co-mentor in which we discussed ideas for the 2-gram proof.
Week 5:
I made further revisions to my 1-gram proof and continued working on the 2-gram proof. Unfortunately, we have not been able to formulate a good argument yet, but my co-mentor came up with an idea that involves Eulerian paths that I will try to further define next week.
Week 6:
I finished up my 1-gram proof for now and focused on developing the 2-gram proof. I did some reading on Eulerian paths and worked out some complications with the 2-gram proof.
Week 7:
We formulated a promising conjecture for the 2-grams, although we must now develop a solid proof for it. I created my final presentation that was delivered on Thursday.
Week 8:
In Prague.

References

• Approximate string-matching with q-grams and maximal matches, Esko Ukkonen
• Indel-tolerant Read Mapping with Trinucleotide Frequencies using Cache-Oblivious kd-Trees, Md Pavel Mahmud, John Wiedenhoeft, and Alexander Schliep.