DIMACS
DIMACS REU 2012

General Information

Student: Hannah Keiler
Office: 448 CoRE
School: Columbia University
E-mail: hpk2108 [at] columbia [dot] edu
Project: Finding the optimal number of topics for emergency text classification in Higher Order Latent Dirichlet Allocation

Project Description

I am working to find the optimal number of topics for classifying text with Higher Order Latent Dirichlet Allocation (HO-LDA). The text comes from text messages and social media posts sent in the wake of the 2010 Haitian earthquake. The hope is that HO-LDA can serve as a model that classifies these texts more accurately and so assists emergency response.


Weekly Log

Week 1:
This week I met with my mentors, Professor Pottenger and his graduate student, PhD Candidate Christie Nelson. I familiarized myself with the data set we are using, which contains more than 3,000 text messages, and learned about the history of the 2010 Haitian earthquake and of the Ushahidi software. I downloaded the software I will need and started learning how to use the programs to run LDA/HO-LDA. I located and read parts of 11 papers to help frame a topic for my project. I also prepared a PowerPoint presentation, which I presented on Friday.
Week 2:
I began this week by writing a Java program to preprocess the data. I worked to clean the dataset so that I could start running trials on different topic numbers and sample sizes the next week. I prepared an experimental design that poses the problem of determining the optimal number of topics and outlines ideas for approaching it. I read papers on topic modeling and read through a tutorial. I also read about different measures of fitness and the minimum description length (MDL) principle.
Week 3:
This week I started running some trials. However, speed considerations led us to reconsider the dictionary we were using, so I looked at some features of the words in the dataset to make a better dictionary. I also standardized the labels in the dataset, read more papers on topic modeling, and watched a video tutorial.
Week 4:
This week I found a paper that is very relevant to the question of finding the optimal number of topics and comes with software. I downloaded the software, but I am still in the process of downloading all the Matlab and C++ packages it needs to run. I finished making the dictionary, ran the HO-LDA and LDA programs on the dataset, and did some exploratory work on the data to better understand it. I also started helping my graduate student mentor with one of her projects in the same area (topic modeling using LDA vs. HO-LDA), but on a different data set. For that project, I wrote a Java program to preprocess the data, as well as one to make randomized and stratified training samples. Usually we could put the data in an Excel spreadsheet and use the Weka software to do this, but the dataset is so big (more than 70,000 columns in some cases) that it was difficult to manage in Excel. I began running trials using this newly preprocessed data.
Week 5:
This week was short because of a field trip and the 4th of July. I spent most of my time running and analyzing trials for the project I started last week. On Saturday, I added an attribute selection step to the preprocessing for that project.
Week 6:
I spent this entire week working on the project with my graduate student mentor. I ran trials, analyzed the data, and helped prepare the results for the paper, which was due this Friday.
Week 7:
This week I returned to the Haitian text message data project. I started to relabel the data, making fewer and more general categories. I also worked on my final presentation, which I presented Thursday. You can find the presentation in the links below.
Week 8:
I spent the first half of my final week at DIMACS preparing my final report, which included my results from the Nuclear Detection project, my results from the 2010 Haitian Earthquake project, and a literature review on finding the optimal number of topics. Additionally, I documented and organized all of the code I wrote over the summer. I continued to go through the Haiti data so that I can start running code again in the coming weeks.

Presentations


Additional Information