Logan Everett
e-Mail: loganism@gmail.com
Endre Boros
e-Mail: boros@rutcor.rutgers.edu

Abstract: Project in Protein String Encoding

A common task in Bioinformatics is the search of sequence databases for matching sequences. In protein sequence databases, searching is further complicated by both the increased amount of data and the complexity of sequence similarity. Protein similarity is not simply a matter of character matching, but rather is determined by a matrix of scores assigned to every match and mismatch [4]. One strategy to increase search speed is to map sequences into a binary space where the hamming distance between strings is comparable to the similarities of the original sequences [5]. Within binary hamming spaces, statistically proven sampling methods can be used for fast, reliably sensitive searches [6]. However, mapping the protein alphabet to a binary hamming space often comes with a certain level of distortion. We have employed Linear Programming techniques to model and study the nature of these mapping schemes. Specifically, we have found the theoretically minimum distortion achievable for several biological scoring matrices, as well as corresponding encoding weights. We have also analyzed the use of these encoding weights to intelligently generate pseudo-random encoding schemes that approach the theoretical minimum distortions.

Table of Contents:

  1. Problem Model
    1. Background: Genomics vs. Proteomics
    2. Simple String Comparison by Hamming Distance
    3. Score Based String Comparison
    4. Mapping Proteins to N-Length Binary Strings
    5. Problem of Distortion
    6. Problem Model
    7. Reduction of Problem Size
    8. Generating a Weighted Partition Matrix
    9. Solving For Weights and Minimum Theoretical Distortion
  2. Analysis and Results
    1. Interpretation of Results
    2. Perturbed Data Sets
    3. Optimal Distortion Values
    4. Random Generation Experiment
    5. Selected Results from Random Generation
  3. Works Cited