Student: | Aaron Schankler |
---|---|
Office: | CoRE 448 |
School: | Haverford College |
E-mail: | aarons [at] reu [dot] dimacs [dot] rutgers [dot] edu |
Project: | Labeled Trees and Supply Chain Reconstruction |
My project initially involved making statements about the common contractions of graphs. We defined an edge contraction as an operation which deletes an edge and merges its incident vertices. A graph H is a contraction of a graph G if and only if G can be transformed into H by applying edge contractions. I hoped to design an algorithm to find the largest common contraction of two graphs that are known to share a common contraction and then generalize this to report common contractions of arbitrary sets of graphs.
The problem was initially motivated by a desire to trace the distribution process of drugs (perhaps by defining the number of contractions needed to reach the common contraction as a measure of similarity between samples), but I intend to focus on designing the algorithm rather than this specific potential application. Unfortunately, we found graph contraction problems to be computationally difficult, so we have begun to explore different methods of reconstructing the supply chain.
We are currently attempting to use numerical techniques similar to simulated annealing to reconstruct a supply chain where the proportion of cutting agents present at each of the distributors is known.
This week I got most of the details of my project sorted out and read some of the papers that initially motivated the problem and which dealt with applications of graph theory to drug profiling. I also prepared my first presentation. Near the end of the week, I coded some functions in Mathematica which expand vertices and contract edges. Hopefully these will be called by implementations of my contraction-finding algorithm.
I realized that the methods I used to expand vertices and contract edges required computationally expensive conversions to and from Mathematica graph objects, so I implemented similar functionality on adjacency matrix representations of graphs. These new functions perform much better on large graphs, when mapped onto large datasets, and especially when applied recursively onto their own output. I also implemented functions that record a “lineage” for vertices so that we can see what aspects of the original graph produce features in the expansions.
Using these new tools, I first began looking at isomorphisms in the possible expansions of test graphs. I identified several distinct “types” of isomorphism. The first arises because I treat the new vertices as distinct from one another. Depending on the application, this behavior may or may not be desirable, but either way, this behavior produces one isomorph for almost every expansion. Other types of isomorphism are due to equivalent vertices in the parent graph, though the multiplicity of each isomorphism (and so the number of topologically distinct expansions) is dependent on not just the number of equivalent vertices but also how equivalent they are.
Next week, I intend to more rigorously state my findings on isomorphisms, and also try to find features that persist through expansions.
This week I tried to formulate a “simplest case” of the graph contraction problem to see if it was solvable. The problem I reached is “given graphs G and H, can H be obtained by a sequence of edge contractions on G?” I then wrote a naïve program which tries to contract all possible combinations of edges as a proof of concept to be improved upon. I also found a few cases where the question can immediately be answered with a “no,” though some generalizations of these checks are computationally difficult.
Simplifying the problem also made it possible to find some previous research. It turns out that my formulation of the problem with H fixed has been previously studied as the “H-Contractability Problem.” Of particular note is that the problem is NP-complete unless H is exceptionally small and dense; for example, H-Contractability has been proven NP-complete for H=P_{4} and H=C_{4} (a length four path or cycle respectively). Because the problem is very difficult and we do not have enough background on the applied problem to make simplifying assumptions about the input, we intend to move from this contraction method to a different method for reconstructing the drug distribution network.
Our new model will represent the supply chain as a tree or a directed acyclic graph. We will first make some assumptions – for example that each distributer will use exactly one cutting agent – and then attempt to solve this simplified problem before relaxing these constraints.
Our new model for a supply chain is a tree where each node represents a distributer. We make several simplifying assumptions; the most important is that we know the proportion of cutting agents in samples at the leaves of the tree. We hope to find a way to reconstruct tree structure (more precisely sibling relationships) with this information.
To begin addressing this problem, I tried to recover the parameters (the proportion of cutting agent added at each node) in a depth two binary tree where each node cuts with the same agent but at a different proportion from the other nodes. This will give us an idea of what happens when we predict a correct structure and possibly could be extended into a method of verifying or rejecting proposed tree structures (if we can not recover the parameters the structure may not be correct).
To recover the parameters, I used a numerical technique similar to simulated annealing that starts at a random guess of the parameters and then randomly samples nearby guesses, favoring those that reduce the total error. This week with some help from my mentor I constructed an implementation which can get relatively close to the correct parameters of a depth two binary tree. Next week I intend to refine this into a way of differentiating between correct and incorrect structures.
This week, I started with some significant restructuring of the code that I wrote last week. This restructure reduced unnecessary work done at each step by pregenerating symbolic expressions that can be evaluated at each step and streamlining the way that new guesses are generated. I also put in place a routine that tends to mimic successful moves in the hope that they will continue to be beneficial. Thanks to this restructure, my code is flexible enough to handle differently shaped trees and, more importantly, finishes in about 10% of the steps that were previously needed to reach a solution.
With a working solver, I began looking at exactly what sorts of solutions it produced, both in terms of accuracy and in terms of the speed that the solutions were reached. Near the end of the week I began running tests with incorrect guesses at tree structures to see how well this method would work at rejecting these incorrect structures. The results were not particularly promising, and it seems that for moderately sized trees, there are enough degrees of freedom for the solver to reach solutions for many incorrect structures.
After the tests that I conducted last week, it became clear that the case with only one cutting agent was not constrained enough to differentiate between correct and incorrect tree structures. With this in mind, we decided to investigate how a second cutting agent would change our results. I modified the structure of my program again in order to accommodate an arbitrary number of cutting agents, as well as to decrease the time it takes the program to reach a solution. The results were as expected: increasing the number of cutting agents increases the number of incorrect solutions eliminated, but makes finding any solution at all more difficult, even on correct structures.
This week, I wrapped up some of the tests of the multi-agent problems and found that in small cases of the problem, many incorrect supply chain configurations can be eliminated. However, a major downside of this method is that it is computationally expensive to test a failing configuration. At the end of the week I started exploring ways to speed up rejection. I also gave my second presentation.
I spent some time this week trying to dynamically construct a statistical model during testing to help reject failing configurations faster. Unfortunately, I ran into some issues because the distributions in question were not normally distributed, because of the way Mathematica handles empirical distributions, and because in the process of exploiting the new information, I distort the new data, creating a sort of self-fulfilling prophesy.
This week was mostly occupied with writing my final report. I did spend a little time looking at the asymptotic complexity of our testing method, and found that the number of distinct “supply chain like” (i.e. rooted and always branching) trees that can be constructed with n leaves is given by the OEIS sequence A000669, which grows roughly exponentially.