General Information

Student: David Orion Girardo
Office: Hill Center, Room 270
School: Worcester Polytechnic Institute
E-mail: dogirardo@wpi.edu
Project: Single Cell Assembly with Reliable Data

Project Description

Recent advances in sequencing technologies have created an overflow of raw genomic data, greatly accelerating the pace of biology research. However, sequencing technology produces only many short random fragments of the genome. These 'reads' must be 'assembled' computationally to produce a quality representation of the genome. Designing faster and more accurate algorithms for genome assembly is an ongoing field of research.

Assembly of genomic data obtained from single cells presents unique difficulties. The small amount of genetic material must be amplified asymmetrically before sequencing, which alters the distribution of sequences and makes it harder to filter sequencing errors. My project is to develop an effective method of sequence filtering for single cell data. We start by considering the Hamming neighborhood around frequently observed sequences: because any particular mutation is rare, the most frequent true sequences will spawn many different similar sequences. The merits of this initial method are analyzed, and the method is improved upon.
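
As a rough illustration of the Hamming neighborhood idea, the sketch below counts k-mers in a set of reads and discards any k-mer that has a much more frequent neighbor at Hamming distance 1. It is only a minimal sketch under stated assumptions, not the project's actual method: the function names, the restriction to distance 1, and the `min_ratio` cutoff are all hypothetical choices made for the example.

```python
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def hamming_neighbors(kmer, alphabet="ACGT"):
    """Yield every sequence at Hamming distance exactly 1 from kmer."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def filter_kmers(reads, k, min_ratio=5):
    """Return {k-mer: count} for k-mers that look like true sequences.

    Assumption (hypothetical rule): a k-mer is treated as a likely
    sequencing error when some distance-1 neighbor is observed at
    least min_ratio times as often as the k-mer itself.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = {}
    for km, count in counts.items():
        # Count of the most frequent neighbor; 0 if none was observed.
        best = max(counts.get(n, 0) for n in hamming_neighbors(km))
        if best < min_ratio * count:
            kept[km] = count
    return kept
```

For example, `filter_kmers(["ACGT"] * 10 + ["ACTT"], k=4)` keeps `ACGT` and drops `ACTT`, since `ACTT` is one substitution away from a sequence seen ten times as often. A fixed ratio like this is exactly what uneven single cell amplification makes unreliable, which is why it serves only as a starting point.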

Weekly Log

Week 1:
I spent the first week attending seminars and choosing between possible projects. This involved reading many papers about genome assembly, single cell genomics, cancer research, and web tools. We finally settled on the project described above. I then put together a presentation and, after revising and practicing it a few times, presented on Friday. Over the weekend I began familiarizing myself with bioinformatics tools that may be useful.
Week 2:
This week I explored the connection between k-mer coverage and original read coverage. I developed a tool to identify areas of low k-mer coverage and return information on the original read coverage in those areas. I also spent time exploring the setup of the UCSC Genome Browser, but it proved unnecessary.
Week 3:
Analysing results from the previous week, we found that our technique is effective for some datasets but not others. The remainder of the week was spent investigating which differences between datasets make the technique effective on some and not on others.
Week 4:
To think about last week's problem, I generated many different summaries of the data to study. I spent most of this week starting the final report to help generate ideas.


Additional Information
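
The comparison between k-mer coverage and read coverage described in Week 2 can be illustrated with a small sketch. This is not the actual tool; it assumes, purely for the example, that a reference sequence is available and that each read comes with a known start position on it. The function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def kmer_coverage(reference, reads, k):
    """For each k-mer position in the reference, count how often
    that exact k-mer occurs across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return [counts[reference[i:i + k]] for i in range(len(reference) - k + 1)]


def read_coverage(ref_len, placements):
    """Count how many reads overlap each reference position.

    placements is a list of (start, read) pairs; the start positions
    are assumed to be known (for example, from an aligner).
    """
    cov = [0] * ref_len
    for start, read in placements:
        for pos in range(start, min(start + len(read), ref_len)):
            cov[pos] += 1
    return cov


def low_kmer_regions(reference, placements, k, threshold):
    """Report (position, k-mer coverage, minimum read coverage over
    the k-mer) wherever k-mer coverage falls below threshold."""
    reads = [read for _, read in placements]
    kcov = kmer_coverage(reference, reads, k)
    rcov = read_coverage(len(reference), placements)
    return [
        (i, c, min(rcov[i:i + k]))
        for i, c in enumerate(kcov)
        if c < threshold
    ]
```

A position where read coverage is adequate but k-mer coverage is low suggests that the reads covering it contain errors, since a single wrong base removes up to k correct k-mers from the counts.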