Reverse Engineering of Biological Systems

Skip ahead to the math or software

Bare-bones description

Suppose you are given a collection of data points describing the behavior of a collection of variables over time-- these collections come in runs of short trials, say ten to twenty lines long, and you posess perhaps two dozen of them.

Is this enough information to determine the relationships between the variables? More precisely, can we discover from this a directed graph which indicates which variables are effecting which others, and even more can we produce signs on the edges of this graph which indiciate the type of causation-- at risk of biasing you by our choice of words, activation or inhibition?

Using an algorithm and software developed by Paola Vera-Licona for her Ph. D. thesis at Virginia Tech, I hope to provide both a systematic process for producing these diagrams along with a rigorous evaluation of its merits and problems.

Some math

The class of process which Vera-Licona's algorithm attempts to model is known as a finite dynamical system. Suppose, given a set S and a positive integer n, we are provided with a collection of ordered collections of n-tuples of S (think about several runs of an experiment which produces output in S in n variables). What we seek is a function f : Sn -> Sn such that for any given element of our collection of collections of n-tuples (think of a single line of the experimental run) we can predict what its next value will be.

The problem seems fairly straightfoward until you attempt it with dozens of variables and data sets containing thousands upon thousands of points. The first simplifying task we can perform to reduce the mind-boggling high dimensionality of the problem is to pair down S from something entirely unreasonable (like, say, several hundred rational numbers) to something much more tractable (for example, a small finite number of integer values).

Vera-Licona's algorithm attempts to surmount this problem in the most drastic way possible-- by reducing the number of states to two; that is, by setting S to be {0,1}. My first task then has been to produce a reliable discretization of the data that seeks to preserve as much of the original information as possible.

Once we have such a function in hand, one simple way of producing our desired directed graph is simply to ask ourselves, for each fi, what variables constitute its domain? As part of my project, I hope to also consider ways of deciding whether this interaction is positive or negative given the form of the function. Such a project is greatly aided by the fact that, as functions over the binary field, every such f is actually a polynomial of S in up to n variables.

Software

The software I have written so far for tying together the parts of the analysis have been written in a mixture of OCaml and unix scripts. OCacml is a fast, strongly-typed mixed-paradigm language, capable of quality functional programmin, imperative and object-oriented programming, and perl-like string manipulation.

They say that the first 90% of writing software is easy, the next 90% is hard, and the last one percent is the hardest. For now this is alpha software. You can get it here.

Please please please read the README. Feel free to email me at mshron@uchicago.edu if you have questions.