DIMACS REU 2004
Participant:
Logan Everett
eMail: loganism@gmail.com


Mentor:
Endre Boros
eMail: boros@rutcor.rutgers.edu

Abstract: Project in Protein String Encoding
A common task in Bioinformatics is the search of sequence
databases for matching sequences. In protein sequence databases,
searching is further complicated by both the increased amount of
data and the complexity of sequence similarity. Protein
similarity is not simply a matter of character matching, but
rather is determined by a matrix of scores assigned to every match
and mismatch [4]. One strategy to increase search speed
is to map sequences into a binary space where the hamming distance
between strings is comparable to the similarities of the original
sequences [5]. Within binary hamming spaces,
statistically proven sampling methods can be used for fast,
reliably sensitive searches [6]. However, mapping the
protein alphabet to a binary hamming space often comes with a
certain level of distortion. We have employed Linear Programming
techniques to model and study the nature of these mapping schemes.
Specifically, we have found the theoretically minimum distortion
achievable for several biological scoring matrices, as well as
corresponding encoding weights. We have also analyzed the use of
these encoding weights to intelligently generate pseudorandom
encoding schemes that approach the theoretical minimum
distortions.
Table of Contents:
 Problem Model
 Background: Genomics vs. Proteomics
 Simple String Comparison by Hamming Distance
 Score Based String Comparison
 Mapping Proteins to NLength Binary Strings
 Problem of Distortion
 Problem Model
 Reduction of Problem Size
 Generating a Weighted Partition Matrix
 Solving For Weights and Minimum Theoretical Distortion
 Analysis and Results
 Interpretation of Results
 Perturbed Data Sets
 Optimal Distortion Values
 Random Generation Experiment
 Selected Results from Random Generation
 Works Cited
Downloads:
 Results Using Updated Model (New!)  We have developed a new model which takes into
consideration both the average and maximum distortion. Initial results from this new model
are presented in these report files.
 Results on BLOSUM40
 Results on BLOSUM45
 Results on BLOSUM62
 Results on BLOSUM80
 Results on PAM40
 Results on PAM120
 Results on PAM250
 Scoring Matrix Data Files  Note: the first line was added to tell our generator
program the size of the scoring matrix and what portion is filled (the rest can be filled in
by symmetry).
 Program Source Files  Contains all code used to create our
Bioinformatics Linear Programing Tool (BLPTool). The source is designed for a Unix or Linux
system with g++. There is a make file, so just type "make tool" and then "make run" to use the
tool. The software can generate CPLEX problems from the scoring matrix data listed above and
also analyze the solutions from CPLEX stored in .bin files.
 Summary Charts  These Excel files contain an analysis of the maximum,
mean, and median distortion at various encoding lengths using each of the three methods
being presented in our upcoming paper.
Method Number Key  This is a brief synopsis of the three methods used for the above data.
 Method 1: Simple Rounding  The weights found in the CPLEX solution are scaled to achieve a
specific total length. Any nonintegral values are rounded (more specifically, we used the floor function)
so that the actual length can come out to be slightly less than intended.
 Method 2: Random Rounding  The weights found in the CPLEX solution are again
scaled to a specific total length. However, in this method the nonintegral values are
rounded up or down randomly, with a probability of rounding up equal to the decimal portion
of the value (i.e. 10.4 has a .4 probability of being rounded to 11 and a .6 probability of being rounded to 10).
In this case the actual length can come out slightly more or less than the intended length,
but the expected value of the length is the exact intended length.
 Method 3: Random Selection  A random encoding crosssection is selected from a distribution of
all encoding crosssections in the solution basis, each one weighted by its solution value from CPLEX.
The random selection is repeated and appended to previous selections until an encoding of desired length is
achieved. In this method, the actual length always matches the intended length.
 Summaries of CPLEX Results  These text files contain the optimal epsilon,
the distance vector used, and the nonzero x variables from the solution basis. They were created
by stripping only the necessary data out of CPLEX's rather detailed solution files (which are too
large to put on the website as well).
 Results on BLOSUM40
 Results on BLOSUM45
 Results on BLOSUM62
 Results on BLOSUM80
 Results on PAM40
 Results on PAM120
 Results on PAM250
 Presentations  Powerpoint Slides from my presentations.