Eli Ben-Michael

Columbia University

Office: 454 CoRE

Email: eeb2142 (at) columbia.edu

Faculty Adviser: Kevin Chen

I am a rising junior at Columbia University majoring in Mathematics-Statistics and a visiting researcher at DIMACS for the summer through the DIMACS REU Program.

Designing new machine learning methods using techniques from linear algebra and multivariable calculus

We are developing novel computational techniques for inferring the parameters of probabilistic models for sequential data, such as time series or biological sequences. The specific summer project is to show, either theoretically through formal proofs or empirically by writing simple computer programs, whether certain functions of the data are sufficient to recover the parameters of the model. Our specific biological application is to segment the human genome into distinct domains based on epigenetic marks.

In particular, this summer I am working on implementing a tensor decomposition algorithm to return the parameters of a Hidden Markov Model. We are using a Hidden Markov Model to find chromatin states from chromatin marks.

Faculty Mentor: Dr. Kevin Chen

Graduate Student Mentor: Liyang Diao

I read some papers on tensor decomposition and the tensor power method and how they can be used to obtain the parameters of various probabilistic models. I met with Dr. Chen and Liyang to discuss my summer project and began reading more papers on Hidden Markov Models and their application to chromatin state labeling. In particular, I read about ChromHMM, a program which uses a Hidden Markov Model to find chromatin states from chromatin marks, and about Spectacle, a program written by Dr. Chen's group which modifies the parameter estimation algorithm of ChromHMM. I also gave my first presentation and created this website.

I nailed down the details of my project and began working on it. First I had to reference some previous papers to understand how a Hidden Markov Model can be seen as a multi-view model. Once the Hidden Markov Model is translated into a multi-view model, we can use the tensor decomposition method to retrieve all of the parameters of the model after applying a linear transformation to the data. After reading up on this I began to code the program in Python, to be used on sample chromatin mark data, and got as far as the linear transformation by the end of the week.
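As a sketch of the kind of statistics involved (the function name and details here are my own illustration, not the project's code), the pairs and triples of a discrete observation sequence can be tabulated directly:

```python
import numpy as np

def empirical_moments(obs, d):
    # Tabulate second- and third-order co-occurrence statistics of
    # consecutive observations (x_t, x_{t+1}, x_{t+2}) from a discrete
    # sequence over d symbols -- the kind of "pairs" and "triples" data
    # the multi-view reduction of an HMM works with.
    pairs = np.zeros((d, d))
    triples = np.zeros((d, d, d))
    for a, b, c in zip(obs, obs[1:], obs[2:]):
        pairs[a, c] += 1        # views 1 and 3
        triples[a, b, c] += 1   # views 1, 2, 3
    n = len(obs) - 2
    return pairs / n, triples / n
```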

I worked on the linear transformation of the pairs data and then got to the actual tensor decomposition algorithm. I realized that I didn't know how some of the tensor operations were computed, and went back to the initial reading to find the computational definitions. In the power iteration algorithm, a random vector has to be chosen from the surface of the n-sphere; after a little bit of searching on Stack Exchange, I found an algorithm that samples uniformly from the surface of the n-sphere, rather than sampling uniformly from n-space and normalizing onto the surface. After this, I finally finished the algorithm in its entirety, but I am having some problems. When I attempt to use the algorithm on the histone mark data I get some very strange results: many probabilities are less than 0 or greater than 1. To test where the problem was coming from, I made a few tensors with a given set of eigenvalues and eigenvectors and ran the decomposition algorithm on them. The algorithm returned the expected eigenvectors and eigenvalues most of the time, but for certain eigenvectors the output was entirely inaccurate, and there doesn't seem to be a pattern to the eigenvectors that cause this. Despite this, I have seen that the main problem is the linear transformation of the pairs data, which projects from 3 views of a multi-view model down to 1 view, and I will have to work on this next week.
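A minimal sketch of the power iteration step described above (my own illustration, assuming a symmetric, orthogonally decomposable tensor; the random starting point comes from normalizing a Gaussian draw, a standard way to land uniformly on the unit sphere):

```python
import numpy as np

def tensor_power_iteration(T, n_iter=100, seed=0):
    # Power iteration for a symmetric 3rd-order tensor T:
    # start uniform on the unit sphere (normalized Gaussian draw),
    # then repeat v <- T(I, v, v) / ||T(I, v, v)||.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = np.einsum('ijk,j,k->i', T, v, v)  # T(I, v, v)
        v = w / np.linalg.norm(w)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # eigenvalue estimate
    return lam, v
```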

I realized that a main problem with implementing the linear transformation was that I did not know how to compute the linear transformation on the third-order triples tensor. Another problem was that in the papers that exhibited the linear transformation, the authors were projecting onto the third view, whereas we have to project onto the second view. After talking with Liyang and Jimin, we figured out exactly how to permute the indices and how to compute the linear transformation on the triples tensor, so I got to work implementing it. In general, the pairs matrices are not invertible, and the linear transformation that projects onto a view requires taking the inverse of one pairs matrix, so we must first transform the pairs matrix to make it invertible, which can be done with a singular value decomposition. After implementing all of this I encountered some problems. The tensor decomposition algorithm requires that the tensor has eigenvalues greater than or equal to zero, which means that the transformed pairs matrix has to be positive semi-definite. When I run the linear transformation on the pairs and triples data, the pairs data sometimes has negative eigenvalues, and sometimes even complex eigenvalues. In the original paper it was proven that the linear transformation on the data would produce a positive semi-definite matrix, so I was not sure what was happening. Upon further inspection I found that the negative or complex eigenvalues were exceedingly close to zero, and in fact there were k eigenvalues greater than zero, where k is the number of hidden states and thus the rank of the transformed matrix. It seemed to me that when the algorithm performs the singular value decomposition to make the pairs matrix invertible, some numerical inaccuracy is causing the eigenvalues to shift ever so slightly in the negative or complex direction.
I thought that if I just set my algorithm to ignore all eigenvalues whose absolute value (or complex norm) is less than one tenth of the kth eigenvalue, then I would get decent results, but that did not turn out to be the case. Next I have to figure out how to circumvent this problem if it really is numerical accuracy, or determine what the problem actually is if it is not. In addition, up until now I have been using a subset of the histone mark data, considering only 4 histone marks rather than the 8 that are in the dataset. When I attempted to use all 8 histone marks, thereby making my pairs matrices 2^8 x 2^8 and my triples tensor 2^8 x 2^8 x 2^8, my algorithm ran for 10 minutes before I interrupted it. The algorithm was stuck on the tensor operation that I had implemented last week. The definition in the initial paper of applying matrices to a tensor is given element-wise for the (i, j, k)th entry of the result and involves a triple sum, so a direct implementation uses 6 nested for loops. This O(n^6) time was holding back the entire algorithm, so next week I also have to figure out how to circumvent that huge runtime cost.
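To make the SVD step concrete, here is a minimal sketch (my own, not the project's code) of using a truncated SVD to "invert" a rank-k pairs matrix by whitening it down to its top-k subspace:

```python
import numpy as np

def whitening_matrix(pairs, k):
    # Truncated SVD of the (symmetric, ideally PSD) pairs matrix.
    # W acts as an inverse square root restricted to the top-k
    # subspace, so that W.T @ pairs @ W = I_k.
    U, s, _ = np.linalg.svd(pairs)
    return U[:, :k] / np.sqrt(s[:k])
```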

After meeting with Dr. Chen this week I found out how to turn the pairs matrix from an almost positive semi-definite matrix into a truly positive semi-definite one, and Dr. Chen helped me see why I was getting such bad results for certain numbers of hidden states. Because a singular value decomposition is a key computation in the algorithm, the size of the singular values is important: if there are singular values very close to zero, then their singular vectors are very noisy. After plotting my singular values I found that some were indeed very close to zero, so including too many hidden states includes those noisy singular vectors and therefore causes the algorithm to yield bad results. I also found out that I should have been using an SVD rather than an eigendecomposition at one point in the algorithm. Since singular values are always real, and the eigenvalues sometimes had imaginary components very close to zero due to numerical accuracy and noise, using the SVD ensured that I would not get complex probabilities. Finally, in the original tensor power method paper I found a section about saving space and time. In it the authors explain how to avoid explicitly constructing the whitened tensor (the O(n^6) cost that was giving me problems last week), so this should decrease the runtime of my algorithm immensely.
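One standard way to push an almost-PSD matrix onto the PSD cone (my guess at the kind of fix involved, not necessarily the exact one Dr. Chen suggested) is to symmetrize and clip any negative eigenvalues to zero:

```python
import numpy as np

def make_psd(A):
    # Symmetrize, then zero out any (numerically) negative eigenvalues.
    S = (A + A.T) / 2
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T
```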

My laptop simply does not have enough memory to run the tensor decomposition algorithm in a reasonable amount of time, so I got access to the research group's server and began to run the algorithm on all of the histone mark data, rather than a subset. Even on the server the algorithm took 3 hours to run for 5 hidden states, and almost 8 hours for 15 hidden states. Clearly this is very slow, so I started to see how I could speed up the algorithm. The section that took the most time was the tensor decomposition itself, which requires several tensor-vector multiplications. Since the pairs and triples frequency matrices and tensors are sparse, and become more sparse as the number of histone marks considered grows, Dr. Chen suggested that I consider using the sparse module of the SciPy Python package. To use this package I had to change how tensor-vector multiplication is computed in the algorithm. This became an extension of my problem from last week: I was still using the element-wise definition of tensor-vector and tensor-matrix multiplication, which was slow. After playing around with the terms of the triple summation a little bit, I found a way to write each component of the vector resulting from tensor-vector multiplication as the product of a slice of the tensor with the vector and its transpose. With the tensor-vector multiplication in this form, I could use the scipy.sparse module to run the algorithm, and I spent some time writing a sparse implementation. I ran into some problems: I wanted to construct the pairs matrices and triples tensor using sparse matrices from the very beginning of the algorithm, but I found that constructing a sparse matrix by assigning its elements one at a time was much slower than constructing a normal NumPy array the same way. This was unfortunate, because using only sparse matrices would significantly save on memory cost.
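For example (illustrative values, assuming SciPy's standard interface), building a sparse matrix from coordinate triplets in one shot avoids the slow element-by-element assignment:

```python
import numpy as np
from scipy import sparse

# Collect (row, col, value) triplets first, then build the matrix once;
# this is much faster than assigning entries of a sparse matrix one at
# a time. The values here are made up for illustration.
rows = np.array([0, 2, 5])
cols = np.array([1, 2, 0])
vals = np.array([3.0, 1.0, 2.0])
pairs = sparse.coo_matrix((vals, (rows, cols)), shape=(6, 6)).tocsr()
```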
After I finished the sparse implementation I ran it on the server and was not impressed with the results: the algorithm still ran for a very long time. This seemed strange, since sparse matrix multiplication should be much faster. I looked into the problem by writing another implementation that did not use sparse matrices but did use the new way of computing tensor-vector multiplication. This implementation ran circles around the previous ones. While the original implementation took nearly 8 hours to run for 15 hidden states, the new implementation ran in 89 seconds, taking only 17-18 seconds per hidden state, more than a hundred-fold speedup. I still do not understand why the non-sparse implementation runs so much faster than the sparse one, but I believe it has to do with how NumPy does dense matrix multiplication versus how scipy.sparse multiplies a sparse matrix with a dense one. Regardless, this speed increase seems to be very good news. Next week I plan on comparing my results on the test data with Spectacle's to see if the algorithm returns good results.
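The slice trick can be sketched like this (my own reconstruction of the idea): each component of T(I, v, v) is a quadratic form in a slice of the tensor, so the triple sum collapses into matrix products.

```python
import numpy as np

def tvv_naive(T, v):
    # Element-wise definition: out[i] = sum_{j,k} T[i,j,k] v[j] v[k].
    n = T.shape[0]
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i] += T[i, j, k] * v[j] * v[k]
    return out

def tvv_fast(T, v):
    # Same contraction via matrix products: out[i] = v @ T[i] @ v.
    return (T @ v) @ v
```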

I compared the results of Spectacle with the new algorithm, but I couldn't tell whether the results were very different, so I have to find a way to compare the two sets of parameters. I also checked the runtime of the decomposition algorithm while varying the number of hidden states and the number of chromatin marks analyzed; I found that the training time was linear in the number of hidden states and non-linear in the number of chromatin marks. We had our presentations this week, so I worked on that. I created some graphs of the running time and some heat maps of the parameters output by the decomposition algorithm. I also used NumPy's save-to-CSV feature to save all of the runtime data and the parameters for different numbers of hidden states. For some reason the algorithm still does not produce correct results for more than about 8 hidden states, and I'm not sure why.
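Saving a parameter matrix as CSV with NumPy looks like this (the file name and values are just examples):

```python
import os
import tempfile
import numpy as np

params = np.array([[0.1, 0.9], [0.8, 0.2]])  # example parameter matrix
path = os.path.join(tempfile.gettempdir(), "params_k2.csv")
np.savetxt(path, params, delimiter=",")      # write comma-separated values
loaded = np.loadtxt(path, delimiter=",")     # round-trip it back
```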

I began reading Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions, which is about probabilistic algorithms for matrix factorizations. I also started working on my paper. To deal with the issue that I'm not sure how to compare the output of the tensor decomposition algorithm with Spectacle's, it was suggested that I compare the l2 norms of the differences between random permutations of the tensor decomposition output and the Spectacle output. Once I have a distribution of norms, I can compare it against the norm of the difference between the tensor output ordered in the correct way and the Spectacle output. I did this and found that the norm of the difference after ordering correctly is much lower than the majority of the norms after random permutation.
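A sketch of that comparison (my own illustration; it assumes each parameter matrix has one row per hidden state, so matching states amounts to permuting rows):

```python
import itertools
import numpy as np

def permutation_norm_distribution(A, B):
    # l2 (Frobenius) norms of A[perm] - B over all row permutations
    # of A. A good matching of hidden states should give a norm well
    # below most of this distribution.
    k = A.shape[0]
    return [np.linalg.norm(A[list(p)] - B)
            for p in itertools.permutations(range(k))]
```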

- Tensor Decomposition and Spectral Methods
- Main Tensor Decomposition Paper
- HMMs as Multi-View Models
- Linear Transformation for Multi-View Models
- Hidden Markov Models (HMMs)
- Chromatin and HMMs
- Probabilistic Algorithms