|Student:||Zach J. Wheeler|
|School:||Grove City College|
|E-mail:||zachw.foss at gmail dot com|
|Project:||The Effect of Feature Distribution on Relational Learning|
I'm a rising senior at Grove City College, studying Mathematics with a minor in Computer Science. My interests have shifted gradually over the course of my undergraduate experience from computers to math, and consequently I have a much stronger background in the former than my intended career trajectory would suggest. My goal is to obtain a Ph.D. in some branch of pure mathematics, then teach and do research. In my leisure time, I enjoy music, poetry, pick-up soccer, and classic literature.
In the summer of 2013, an REU student stumbled across results suggesting a correlation between how closely a feature set follows Zipf's law and the performance of Higher-Order Naive Bayes (HONB), a recently developed relational learning algorithm. Leo Behe and I are tasked with investigating this correlation in greater detail. We wish to discover whether there is a useful predictive relationship between feature distribution and the performance of relational learning algorithms, and our findings may also shed light on the theoretical underpinnings of when and why relational learning is effective.
We started an initial sweep of the feature space, beginning to run tests on the datasets previously identified. I also did further research into how to rigorously test for the presence of Zipf's law.
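As a quick first-pass diagnostic (not the rigorous goodness-of-fit testing I was researching), one can fit a line to the log-log rank-frequency plot of a feature set: a slope near -1 is consistent with Zipf's law. A minimal sketch, assuming plain token counts as the feature frequencies:

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank).

    A slope near -1 is consistent with Zipf's law.  This is only a rough
    diagnostic, not a rigorous statistical test.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Toy token stream whose frequencies follow an exact 12/rank profile:
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
print(round(zipf_slope(tokens), 2))  # → -1.0
```

A rigorous test would instead compare the empirical distribution against a fitted power law (e.g. via maximum likelihood and a goodness-of-fit statistic), which is exactly the question I was reading up on.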
On Friday afternoon, I succeeded in generalizing the formula previously used to count 2nd-, 3rd-, and 4th-order paths to paths of any order. It turns out that a trick from order theory produces the formula, with a simple proof. This solves a problem left open in the 2011 paper that introduced the HONB algorithm.
Additionally, Leo and I gave a second presentation to the CCICADA Research Group, describing the essential goals of our project.
I spent a lot of time this week wrangling with the practical issues of running tests on real-world datasets with less-than-perfect code. More reason to study pure mathematics. There were three distinct problems:
First, I wrote a script to automatically collate test results, which also generates and displays graphs of the feature distribution for each test case. We've already put it to good use. Along the way I discovered that the code formerly used to tabulate results ran the t-tests incorrectly; that problem is fixed in the new script.
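I won't reproduce the script here, but as a sketch of the kind of significance check such a script performs, here is a paired t statistic over matched per-fold accuracies. Everything below (the test being paired, the accuracy numbers) is an illustrative assumption; our actual script, and the exact nature of the old bug, may differ.

```python
import math

def paired_t(xs, ys):
    """Paired t statistic over matched samples (e.g. per-fold accuracies).

    Illustrative sketch only; the project's real script is not shown here.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Made-up per-fold accuracies for two classifiers evaluated on the same folds:
honb = [0.81, 0.79, 0.84, 0.80, 0.83]
nb = [0.74, 0.76, 0.78, 0.73, 0.77]
t = paired_t(honb, nb)
```

Pairing matters here because both classifiers are evaluated on the same folds; treating the two columns as independent samples would use the wrong variance.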
Later in the week I also began thinking about more efficient ways to count higher-order paths. At present we use second-order paths exclusively, but at some point we want to explore the effectiveness of longer paths. We will most likely need to focus on approximation algorithms, since exact counting is #P-complete.
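To see why exact counting becomes infeasible, it helps to look at a naive enumerator. The sketch below assumes one simplified definition of a higher-order path (an alternating chain of distinct terms and distinct documents, each document containing its two neighboring terms), which may differ from the paper's exact formulation; even so, the permutation sum makes the combinatorial cost apparent.

```python
from itertools import permutations

def count_paths(docs, order):
    """Brute-force count of order-k paths in a term-document bipartite graph.

    ASSUMED definition (may differ from the HONB paper's): an order-k path is
    an alternating chain t1 - d1 - t2 - ... - dk - t(k+1) of k+1 distinct
    terms and k distinct documents, where each document contains both of its
    neighbouring terms.  Paths are counted as undirected.
    """
    terms = sorted({t for doc in docs for t in doc})
    directed = sum(_chains(docs, seq, frozenset())
                   for seq in permutations(terms, order + 1))
    return directed // 2  # each undirected path is counted in both directions

def _chains(docs, term_seq, used_docs):
    # Count ways to link the remaining term sequence with distinct documents.
    if len(term_seq) == 1:
        return 1
    t1, t2 = term_seq[0], term_seq[1]
    return sum(_chains(docs, term_seq[1:], used_docs | {i})
               for i, doc in enumerate(docs)
               if i not in used_docs and t1 in doc and t2 in doc)

# Tiny corpus: "a" and "c" never share a document, but they are linked by
# the single second-order path a - d0 - b - d1 - c.
docs = [{"a", "b"}, {"b", "c"}]
print(count_paths(docs, 1), count_paths(docs, 2))  # → 2 1
```

The permutation sum visits on the order of |T|^(k+1) term sequences, which is why sampling-based approximation looks like the only practical route for longer paths.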
This week I prepared to run more tests, formatted previous test results, fixed minor bugs in the new script, and thought of another angle from which to approach the theory side of the project. I may be able to prove a theorem that helps explain why HONB performs so much better on the small sample sizes that give Naive Bayes (NB) so much trouble.
We spent most of this week pulling results together and preparing our final presentation. That presentation is not included here since it contains publishable results. The theorem mentioned previously remains a possibility, but investigating it has been postponed.
I spent these weeks in Prague with a few other American students and our Czech colleagues. There was little time to work on this project, but what time I had I devoted to preparing our final report and submitting a paper abstract.