
DIMACS REU 2004
Participant:
Logan Everett
e-Mail: loganism@gmail.com
|
|
Mentor:
Endre Boros
e-Mail: boros@rutcor.rutgers.edu
|
Abstract: Project in Protein String Encoding
A common task in Bioinformatics is the search of sequence
databases for matching sequences. In protein sequence databases,
searching is further complicated by both the increased amount of
data and the complexity of sequence similarity. Protein
similarity is not simply a matter of character matching, but
rather is determined by a matrix of scores assigned to every match
and mismatch [4]. One strategy to increase search speed
is to map sequences into a binary space where the hamming distance
between strings is comparable to the similarities of the original
sequences [5]. Within binary hamming spaces,
statistically proven sampling methods can be used for fast,
reliably sensitive searches [6]. However, mapping the
protein alphabet to a binary hamming space often comes with a
certain level of distortion. We have employed Linear Programming
techniques to model and study the nature of these mapping schemes.
Specifically, we have found the theoretically minimum distortion
achievable for several biological scoring matrices, as well as
corresponding encoding weights. We have also analyzed the use of
these encoding weights to intelligently generate pseudo-random
encoding schemes that approach the theoretical minimum
distortions.
Table of Contents:
- Problem Model
- Background: Genomics vs. Proteomics
- Simple String Comparison by Hamming Distance
- Score Based String Comparison
- Mapping Proteins to N-Length Binary Strings
- Problem of Distortion
- Problem Model
- Reduction of Problem Size
- Generating a Weighted Partition Matrix
- Solving For Weights and Minimum Theoretical Distortion
- Analysis and Results
- Interpretation of Results
- Perturbed Data Sets
- Optimal Distortion Values
- Random Generation Experiment
- Selected Results from Random Generation
- Works Cited
Downloads:
- Results Using Updated Model (New!) - We have developed a new model which takes into
consideration both the average and maximum distortion. Initial results from this new model
are presented in these report files.
- Results on BLOSUM40
- Results on BLOSUM45
- Results on BLOSUM62
- Results on BLOSUM80
- Results on PAM40
- Results on PAM120
- Results on PAM250
- Scoring Matrix Data Files - Note: the first line was added to tell our generator
program the size of the scoring matrix and what portion is filled (the rest can be filled in
by symmetry).
- Program Source Files - Contains all code used to create our
Bioinformatics Linear Programing Tool (BLPTool). The source is designed for a Unix or Linux
system with g++. There is a make file, so just type "make tool" and then "make run" to use the
tool. The software can generate CPLEX problems from the scoring matrix data listed above and
also analyze the solutions from CPLEX stored in .bin files.
- Summary Charts - These Excel files contain an analysis of the maximum,
mean, and median distortion at various encoding lengths using each of the three methods
being presented in our upcoming paper.
Method Number Key - This is a brief synopsis of the three methods used for the above data.
- Method 1: Simple Rounding - The weights found in the CPLEX solution are scaled to achieve a
specific total length. Any non-integral values are rounded (more specifically, we used the floor function)
so that the actual length can come out to be slightly less than intended.
- Method 2: Random Rounding - The weights found in the CPLEX solution are again
scaled to a specific total length. However, in this method the non-integral values are
rounded up or down randomly, with a probability of rounding up equal to the decimal portion
of the value (i.e. 10.4 has a .4 probability of being rounded to 11 and a .6 probability of being rounded to 10).
In this case the actual length can come out slightly more or less than the intended length,
but the expected value of the length is the exact intended length.
- Method 3: Random Selection - A random encoding cross-section is selected from a distribution of
all encoding cross-sections in the solution basis, each one weighted by its solution value from CPLEX.
The random selection is repeated and appended to previous selections until an encoding of desired length is
achieved. In this method, the actual length always matches the intended length.
- Summaries of CPLEX Results - These text files contain the optimal epsilon,
the distance vector used, and the non-zero x variables from the solution basis. They were created
by stripping only the necessary data out of CPLEX's rather detailed solution files (which are too
large to put on the website as well).
- Results on BLOSUM40
- Results on BLOSUM45
- Results on BLOSUM62
- Results on BLOSUM80
- Results on PAM40
- Results on PAM120
- Results on PAM250
- Presentations - Powerpoint Slides from my presentations.