Microtext classification is an emerging field in machine learning that aims to categorize
short, informal pieces of text. Using text message data from the 2010 Haiti earthquake,
previous work by my project mentors (Dr. Pottenger and Dr. Nelson) has demonstrated that
classification accuracy can be improved over that of traditional classification methods
by leveraging higher-order paths among microtexts.
To this end, a new classification algorithm, Higher-Order Naive Bayes (HONB), has been
developed. HONB is able to exploit higher-order relations and paths by relaxing the
independent and identically distributed (i.i.d.) assumption of traditional Naive Bayes,
while retaining that algorithm's attractive efficiency.
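For context, the multinomial Naive Bayes baseline that HONB extends can be sketched as
follows. This is an illustrative Python implementation of standard Naive Bayes with
Laplace smoothing, not the project's R code; the toy documents and class names below are
invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Train a multinomial Naive Bayes model with Laplace smoothing.

    docs: list of token lists; labels: parallel list of class labels.
    Returns log priors and per-class log conditional word probabilities.
    """
    vocab = {w for d in docs for w in d}
    class_docs = defaultdict(list)
    for d, y in zip(docs, labels):
        class_docs[y].append(d)
    priors, cond = {}, {}
    for y, ds in class_docs.items():
        priors[y] = math.log(len(ds) / len(docs))
        counts = Counter(w for d in ds for w in d)
        total = sum(counts.values())
        # Laplace-smoothed log P(word | class)
        cond[y] = {w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
                   for w in vocab}
    return priors, cond

def predict_nb(priors, cond, doc):
    """Return the class maximizing log prior + summed log likelihoods.
    Words outside the training vocabulary are skipped (contribute 0)."""
    def score(y):
        return priors[y] + sum(cond[y].get(w, 0.0) for w in doc)
    return max(priors, key=score)

# Toy example (invented data):
docs = [["help", "trapped"], ["water", "food"], ["trapped", "rubble"], ["need", "water"]]
labels = ["rescue", "supply", "rescue", "supply"]
priors, cond = train_nb(docs, labels)
```

The i.i.d. assumption lives in the sum over per-word log likelihoods: each word
contributes independently, which is exactly what a higher-order approach relaxes.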
My work for this summer will be to investigate ways in which HONB's performance can be
improved for disaster recovery purposes. I will also be investigating the theoretical
complexity of both the algorithm and the text classification problem. Finally, if time
permits, I shall explore whether a closed-form expression exists for the number of paths
of a given order.
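As a starting point for that question, linear algebra gives an easy count of a related
quantity: entry (i, j) of the k-th power of a graph's adjacency matrix counts the walks
of length k from vertex i to vertex j. Walks may revisit vertices, so this is not the
same as counting simple paths, which is part of what makes a closed form non-trivial.
A small sketch (the toy graph is invented for illustration):

```python
import numpy as np

def walk_counts(adj, k):
    """Entry (i, j) of adj**k counts walks of length k from i to j.
    Walks may revisit vertices, so this overcounts simple paths."""
    return np.linalg.matrix_power(adj, k)

# Toy co-occurrence graph over 4 vertices: edges 0-1, 1-2, 2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

A2 = walk_counts(A, 2)  # e.g. A2[0, 2] == 1: one 2-walk 0-1-2
```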
- Week 1:
- During my first week at DIMACS I became more familiar with the work of Dr. Pottenger
and Dr. Nelson on statistical methods for text classification. Microtext is difficult
to classify for several reasons: it typically contains little information, may contain
grammatical or spelling errors, and may need to be translated into English. All of
these factors make processing microtext by hand difficult.
- Read existing publications on Higher-Order Naive Bayes for classification
- Viewed source-code for current HONB implementation
- Formulated structure of ongoing classification research
- Presented project goals and expected direction
- Week 2:
- Continued review of existing publications: wrote a one-page summary of each
- Reviewed source code (in R) for the HONB implementation: drafted a high-level trace of
the code
- Created experimental design and approach documents outlining plans for project
- Investigated character n-grams
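Character n-grams can be extracted with or without overlap, which matters for the
non-overlapping bigram tests in later weeks. A minimal sketch (the function name and
example string are my own, not from the project code):

```python
def char_ngrams(text, n=2, overlapping=True):
    """Split text into character n-grams.
    Overlapping n-grams advance one character at a time;
    non-overlapping n-grams advance n characters at a time."""
    step = 1 if overlapping else n
    return [text[i:i + n] for i in range(0, len(text) - n + 1, step)]
```

For example, `char_ngrams("haiti", 2)` yields the overlapping bigrams of the word,
while passing `overlapping=False` yields disjoint pairs only.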
- Week 3:
- Created automated testing framework for NB/HONB test criteria.
- Ran initial tests for non-overlapping character bigrams.
- Continued to refine automated testing framework, broadening its area of application
- Investigated bigram-versus-class distributions for common-bigram removal
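One simple way to flag common bigrams for removal is by document frequency: a bigram
appearing in most documents carries little class-discriminating information. A hedged
sketch, assuming a document-frequency cutoff; the threshold value and helper name are
illustrative, not the project's actual removal criterion:

```python
from collections import Counter

def common_bigrams(docs_bigrams, df_cutoff=0.5):
    """Return bigrams whose document frequency exceeds df_cutoff.
    These are candidates for removal, since a feature present in most
    documents does little to separate the classes.
    docs_bigrams: one list of bigrams per document."""
    df = Counter()
    for bigrams in docs_bigrams:
        df.update(set(bigrams))  # count each bigram once per document
    n = len(docs_bigrams)
    return {b for b, c in df.items() if c / n > df_cutoff}
```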
- Week 4:
- Continued refinements and maintenance of automated testing framework; introduced
greater modularity consistent with the current testing plan.
- Added part-of-speech tagging support to the automated preprocessing software.
- Ran character bigram and trigram tests.
- Worked around memory issues inherent in R's design.
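Part-of-speech tags can be folded into the feature space by suffixing each token with
its tag, so that, say, a noun and a verb spelled the same way become distinct features.
A minimal sketch; the rule-based `toy_tagger` below is a stand-in placeholder for a real
tagger (e.g. NLTK's), and is not how the project's preprocessor actually tags:

```python
def tag_tokens(tokens, tagger):
    """Append each token's POS tag, e.g. 'water' -> 'water_NN', so that
    differently tagged occurrences act as distinct downstream features."""
    return [f"{tok}_{tag}" for tok, tag in tagger(tokens)]

def toy_tagger(tokens):
    """Placeholder tagger: crude suffix rule, for illustration only."""
    return [(t, "VB" if t.endswith("ed") else "NN") for t in tokens]
```

Calling `tag_tokens(["trapped", "water"], toy_tagger)` produces
`['trapped_VB', 'water_NN']`.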
- Week 5:
- Continue testing bigrams + part-of-speech tags.
- Generalize the pre-existing output formatting script, making it applicable to n-class
problems.
- Gather results for final tests.
- Week 6:
- Test for significant improvement in accuracy, precision, recall, and F-beta score.
- Further investigate the Zipfian nature of character bigrams and trigrams. Microtext
bigrams and trigrams appear not to follow a Zipfian distribution.
- Reformulate the ground truth by combining several messages. Testing HONB's
performance metrics against those of NB reveals a larger improvement of HONB
over NB than in previous runs.
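A quick numerical check of Zipfian behavior is to fit the slope of log frequency
against log rank; Zipf's law predicts a slope near -1. An illustrative sketch (my own
helper, not the project's MATLAB analysis):

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank).
    A Zipfian sample gives a slope near -1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On a synthetic sample whose frequencies follow 1/rank, the fitted slope comes out
close to -1; a flatter or steeper slope suggests a non-Zipfian distribution.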
- Week 7:
- Investigate the complexity of HONB classification to better understand its long
run times; the number of higher-order paths is counted.
- Collect partial results for bigram runs with part-of-speech tagging.
- Utilize MATLAB to validate that the distribution of word stems is Zipfian.
- Present the work accomplished in the final presentation meetings.
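The path counting above can be illustrated with an incidence matrix: if B is a
document-term matrix, entry (i, j) of B^T B counts the documents containing both term i
and term j, i.e. the second-order connections between those terms through a shared
document. A toy sketch (the matrix is invented for illustration):

```python
import numpy as np

# Document-term incidence matrix B: rows are documents, columns are terms.
B = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])

# (B.T @ B)[i, j] counts documents containing both term i and term j;
# the diagonal gives each term's document frequency.
C = B.T @ B
```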
- Week 8:
- Write a report on total progress up to the program's conclusion.
- Document the use cases of the custom preprocessor and postprocessor, creating an
easy-to-understand information file.