DIMACS
DIMACS REU 2014

General Information

me
Student: Zach J. Wheeler
Office: CoRE 448
School: Grove City College
E-mail: zachw.foss at gmail dot com
Project: The Effect of Feature Distribution on Relational Learning

About Me

I'm a rising Senior at Grove City College, studying Mathematics with a minor in Computer Science. My interests have shifted gradually over the course of my undergraduate experience from computers to math, and consequently I have perhaps a much stronger background in the former than my intended career trajectory would suggest. My goal is to obtain a Ph.D. in some branch of pure mathematics, then teach and do research. In my leisure time, I enjoy music, poetry, pick-up soccer, and classic literature.


Project Description

In the summer of 2013, an REU student stumbled across results that suggested a correlation between how closely a feature set follows Zipf's Law and the performance of "Higher-Order Naive Bayes" (HONB), a recently developed relational learning algorithm. I and Leo Behe are tasked with investigating this correlation in greater detail. We wish to discover whether there is a useful predictive relationship between feature distribution and the performance of relational learning algorithms, and there is also potential that our findings will shed light on the theoretical underpinnings of when and why relational learning is effective as well.

This project is conducted under the oversight of the CCICADA and DIMACS REUs, and mentored by Dr. Christie Nelson and Dr. William Pottenger.


Weekly Log

Week 1:
We moved in, went to several preparatory meetings and talks, and pulled together an initial presentation for the sake of describing our project to our peers and kickstarting our research. As part of our preparation, Leo and I conducted a literature search and established an intial experimental methodology.
Week 2:
I did further background reading on Zipf's law and other models for natural language distribution, and also sought to understand relational learning and HONB in greater detail. Leo and I began running tests to reproduce previous results, and with our mentors we compiled a list of datasets to pursue.
Week 3:

We started an intial sweep of the feature space, beginning to run tests on the datasets previously identified. I also did further research into how to rigorously test for the presence of Zipf's law.

On Friday afternoon, I succeeded in generalizing the formula previously used to count 2nd, 3rd, and 4th -order paths to paths of any order. It turns out that a trick from Order Theory produces the formula, with a simple proof. This solves a problem left open in the 2011 paper that introduced the HONB algorithm.

Additionally, Leo and I gave a second presentation to the CCICADA Research Group, describing the essential goals of our project.

Week 4:

I spent a lot of time this week wrangling with the practical issues of running tests on real world datasets with less-than-perfect code. More reason to study pure mathematics. There were three distinct problems:

  • HONB expects binomial input, but was being fed multinomial input.
  • Our script for tabulating results was unable to handle numeric class labels.
  • An old version of R was installed on the server we were using, and (bizarrely) the code failed when we tried to use non-numeric class labels.

Week 5:

First I wrote a script to automatically coagulate test results, also generating and displaying graphs of the feature distribution for each test case. We've already put it to good use. Along the way I discovered that the code formerly used to tabulate results ran the T-tests incorrectly, but that problem is fixed in this new script.

Later in the week I also began thinking about more efficient ways to count higher-order paths. Presently we use exclusively second-order paths, but at some point we want to explore the effectiveness of longer paths. We will most likely need to focus on approximation algorithms, since the problem is #P-complete.

Week 6:

This week I prepared to run more tests, formatted previous test results, fixed minor bugs in the new script, and thought of another angle to approach the theory side of the project from. I may be able to prove a theorem that helps explain why we see superior performance in HONB on the small sample sizes that NB has so much trouble with.

Week 7:

We spent most of this week pulling results together and preparing our final presentation. That presentation is not included here since it contains publishable results. The possibility of a theorem mentioned previously is still present, but investigating it has been postponed.

Weeks 8 and 9:

These weeks I spent in Prague with a few other American students and our Czech collegues. There was little time to work on this project, but what time I had I devoted to preparing our final report and submitting a paper abstract.


Presentations