General Information

Student: Brian D. Knopp
Office: CoRE 448
School: Montana Tech of the University of Montana
E-mail: bdknopp (at) mtech (dot) edu
Project: Modeling Microtext with Higher Order Learning

Project Description

Microtext classification is an emerging field in machine learning which aims to classify and/or categorize short and informal pieces of text. Utilizing text message data from the 2010 Haiti earthquake, previous work by my project mentors (Dr. Pottenger and Dr. Nelson) has demonstrated that classification accuracy can be improved over that of traditional classification methods by leveraging higher-order paths among microtext.

To this end, a new classification algorithm has been developed: Higher-Order Naive Bayes (HONB). This algorithm is able to utilize higher-order relations and paths by eliminating the independent and identically distributed assumption of traditional Naive Bayes, while retaining that algorithm's attractive efficiency.
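To make the notion of a higher-order path concrete: a second-order path links two terms through a chain term - document - bridge term - document - term, so terms that never co-occur in the same message can still be connected. The sketch below is illustrative only (the project's actual HONB implementation is in R); the toy documents and the `second_order_pairs` helper are invented for this example.

```python
from itertools import combinations

# Toy corpus: each document is modeled as a set of terms (hypothetical data).
docs = [
    {"quake", "help", "water"},
    {"water", "food"},
    {"food", "shelter"},
]

def second_order_pairs(docs):
    """Pairs of terms joined by a path a - doc_i - bridge - doc_j - b,
    where doc_i != doc_j and 'bridge' occurs in both documents."""
    pairs = set()
    for i, j in combinations(range(len(docs)), 2):
        bridges = docs[i] & docs[j]  # terms shared by the two documents
        if not bridges:
            continue
        for a in docs[i] - bridges:
            for b in docs[j] - bridges:
                if a != b:
                    pairs.add(frozenset((a, b)))
    return pairs

print(sorted(tuple(sorted(p)) for p in second_order_pairs(docs)))
# → [('food', 'help'), ('food', 'quake'), ('shelter', 'water')]
```

Here "water" bridges the first two documents, so "quake" and "food" gain a second-order link even though no single message contains both.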

My work this summer will be to investigate ways in which HONB's performance can be improved for disaster-recovery purposes. I will also investigate the theoretical complexity of both the algorithm and the text classification problem. Finally, if time permits, I shall explore whether a closed-form expression exists for the number of paths of a given order.

Weekly Log

Week 1:
During my first week at DIMACS I became more familiar with the work of Dr. Pottenger and Dr. Nelson in statistical methods for text classification. Microtext is difficult to classify for several reasons: it typically contains little information, may contain grammatical or spelling errors, and may need to be translated into English. All of these factors make processing microtext by hand difficult.
Read existing publications on Higher-Order Naive Bayes for classification
Viewed source-code for current HONB implementation
Formulated structure of ongoing classification research
Presented project goals and expected direction
Week 2:
Continued review of existing publications: wrote 1-page summaries of each
Reviewed source code (in R) for the current HONB implementation: drafted a high-level trace of the call tree.
Created experimental design and approach documents outlining plans for project work
Investigated character n-grams
Week 3:
Created automated testing framework for NB/HONB test criteria.
Ran initial tests for non-overlapping character bigrams.
Continued to refine automated testing framework, broadening area of application
Investigated bigram versus class distribution for common bigram removal using Weka.
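For reference on the n-gram terminology used above: overlapping character n-grams slide one character at a time, while the non-overlapping variant steps by n characters. A minimal Python sketch (illustrative only; the project's preprocessing was done in R, and `char_ngrams` is a hypothetical helper):

```python
def char_ngrams(text, n, overlapping=True):
    """Character n-grams of a string; step of 1 gives overlapping grams,
    a step of n gives the non-overlapping variant."""
    step = 1 if overlapping else n
    return [text[i:i + n] for i in range(0, len(text) - n + 1, step)]

print(char_ngrams("haiti", 2))                     # → ['ha', 'ai', 'it', 'ti']
print(char_ngrams("haiti", 2, overlapping=False))  # → ['ha', 'it']
```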
Week 4:
Continued refinements and maintenance of the automated testing framework; introduced greater modularity in line with the current testing plan.
Added Part-of-Speech tagging support to the automated preprocessing software.
Ran character bigram and trigram tests.
Worked around memory issues inherent in R's construction.
Week 5:
Continued testing bigrams plus part-of-speech tags.
Generalised the pre-existing output formatting script, making it applicable to n-class classification.
Gathered results for final tests.
Week 6:
Tested for significant improvement in accuracy, precision, recall, and F-beta (for beta = 0.5, 1, 2).
Further investigated the Zipfian nature of character bigrams and trigrams; microtext bigrams and trigrams appear not to follow a Zipfian distribution.
Reformulated the ground truth by combining several messages together; testing HONB's performance metrics against those of NB revealed a larger improvement of HONB over NB than in previous runs.
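For reference, the F-beta score used in these tests is the weighted harmonic mean of precision and recall: beta < 1 weights precision more heavily, beta > 1 weights recall. A short Python sketch with hypothetical precision and recall values:

```python
def f_beta(precision, recall, beta):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical classifier scores: precision 0.8, recall 0.6.
for beta in (0.5, 1, 2):
    print(beta, round(f_beta(0.8, 0.6, beta), 3))
# → 0.5 0.75
# → 1 0.686
# → 2 0.632
```

Note how beta = 0.5 rewards the higher precision while beta = 2 is pulled toward the lower recall.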
Week 7:
Investigated the complexity of HONB classification to better understand the long run-times; counted the number of higher-order paths.
Collected partial results for bigram runs with part-of-speech tagging.
Used MATLAB to validate that the distribution of word stems is Zipfian.
Presented the work accomplished in the final presentation meetings.
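One quick way to check for a Zipfian distribution, sketched here in Python rather than the MATLAB used in the project, is to fit the slope of log frequency against log rank; Zipfian data gives a slope near -1. The token list below is invented for illustration (in the project it would be the corpus word stems):

```python
import math
from collections import Counter

# Hypothetical token stream standing in for the corpus word stems.
tokens = "the quick the lazy the dog the a a fox".split()

# Frequencies sorted into rank order (rank 1 = most frequent).
freqs = sorted(Counter(tokens).values(), reverse=True)

# Least-squares slope of log(freq) versus log(rank).
xs = [math.log(r) for r in range(1, len(freqs) + 1)]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(round(slope, 2))  # negative; close to -1 suggests a Zipfian fit
```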
Week 8:
Wrote a report on overall progress up to the program's conclusion.
Documented the use cases of the custom preprocessor and postprocessor, creating an easy-to-understand information file.


Additional Information