General Information

Student: Jacob Geiger
Office: CORE 448
School: Yale University
E-mail: jgeiger [at] reu [dot] dimacs [dot] rutgers [dot] edu
Project: SemRel for hierarchical, relational, differentially private topic modelling

Project Description

I am working on SemRel, a probabilistic, generative model for text. It builds on Latent Dirichlet Allocation (LDA), which can be used to identify the topics that a text covers. LDA has been extended by the Pachinko Allocation Model (PAM) to allow a hierarchy of topics (in which some topics are grouped under others), and by Type-LDA to identify semantic relations. SemRel combines these hierarchical and relational aspects, and further adds a provable level of differential privacy. I am testing the accuracy and utility of the hierarchical and differentially private aspects of SemRel, especially in comparison to Type-LDA, using Wikipedia articles.
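For readers unfamiliar with topic models, here is a minimal sketch of the generative story behind LDA, the model SemRel builds on. Everything here (function names, the toy topics) is illustrative, not SemRel's actual code: each document draws a topic mixture from a Dirichlet, then each word draws a topic from that mixture and a word from that topic's distribution.

```python
import random

def dirichlet(alpha):
    """Sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topic_word_dists, alpha, seed=0):
    """Generate one document under the LDA generative story:
    draw a per-document topic mixture, then for each word position
    draw a topic index and then a word from that topic."""
    random.seed(seed)
    n_topics = len(topic_word_dists)
    theta = dirichlet([alpha] * n_topics)  # per-document topic mixture
    doc = []
    for _ in range(n_words):
        topic = random.choices(range(n_topics), weights=theta)[0]
        word = random.choices(range(len(topic_word_dists[topic])),
                              weights=topic_word_dists[topic])[0]
        doc.append((topic, word))
    return doc

# Two toy topics over a four-word vocabulary.
topics = [[0.7, 0.3, 0.0, 0.0],   # topic 0 favors words 0-1
          [0.0, 0.0, 0.4, 0.6]]   # topic 1 favors words 2-3
doc = generate_document(10, topics, alpha=0.5)
```

PAM replaces the flat topic mixture with a hierarchy of distributions over sub-topics, and Type-LDA generates relation tuples rather than bare words; the skeleton above is the common core.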

Weekly Log

Week 1:
I did background reading on differential privacy, LDA, PAM, Type-LDA, and relational modeling in general, and talked with Nir Grinberg and Dr. Pottenger, who developed SemRel, about directions to take for the summer. At the end of the week, I presented a brief overview of my project (slides available here).
Week 2:
I got the SemRel source code set up on my computer and ran a simple cross-validation test, written by Nir, on a corpus of Reuters articles. I discussed an experimental methodology for evaluating SemRel for accuracy and utility, with and without differential privacy. We also discussed the possibility of testing SemRel on microtext. Some final decisions about which specific tests to run still need to be made. In addition, I began to look at how to preprocess raw data into a usable relational tuple format.
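Cross-validation comes up repeatedly in this log, so a quick sketch of the idea may help. This is not Nir's test harness, just the generic k-fold pattern: split the corpus into k folds, and for each fold train on the other k-1 and score on the held-out one.

```python
def kfold_indices(n_items, k):
    """Partition item indices 0..n_items-1 into k contiguous,
    near-equal-sized folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(items, k, train_and_score):
    """For each fold, train on the remaining k-1 folds and score on the
    held-out fold; returns the list of per-fold scores."""
    folds = kfold_indices(len(items), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(train_and_score([items[j] for j in train_idx],
                                      [items[j] for j in test_idx]))
    return scores

# Toy usage: the "score" here is just the held-out fold size.
scores = cross_validate(list(range(25)), 5, lambda train, test: len(test))
```

Averaging the per-fold scores gives a single accuracy or utility estimate for a model configuration.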
Week 3:
I downloaded a full copy of current Wikipedia articles. I'm going to start by working with the featured articles, which comprise approximately 4,000 well-developed articles representing a broad range of topics. Wikipedia allows articles to be downloaded in XML format, with wiki markup in the article text. I learned how to use sed, and used it together with a program I wrote to extract plain text from the XML and to mark the articles with appropriate information. I then worked with Nir Grinberg's code to extract relational tuples from the Wikipedia text using Stanford CoreNLP and MaltParser.
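To give a flavor of this preprocessing step, here is a hedged sketch (my own illustrative code, not the actual sed rules or extraction program) of streaming pages out of a MediaWiki XML export and stripping a couple of common wiki-markup constructs. Real dumps use a versioned XML namespace, so the sketch drops namespace prefixes rather than hard-coding one, and real markup stripping needs far more rules than the two shown.

```python
import re
import xml.etree.ElementTree as ET
from io import StringIO

def iter_pages(xml_source):
    """Stream <page> elements from a MediaWiki XML export, yielding
    (title, wikitext) pairs without loading the whole dump into memory."""
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]      # drop any XML namespace
        if tag == "page":
            title, text = None, ""
            for child in elem.iter():
                ctag = child.tag.rsplit("}", 1)[-1]
                if ctag == "title":
                    title = child.text
                elif ctag == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()                        # free memory as we stream

def strip_markup(wikitext):
    """Very rough plain-text pass, in the spirit of the sed step:
    unwrap [[link|label]] links and remove ''italic''/'''bold''' quotes."""
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", wikitext)
    return re.sub(r"'{2,}", "", text)

# Tiny in-memory stand-in for a dump file.
sample = StringIO("""<mediawiki>
  <page><title>Example</title>
    <revision><text>'''Example''' is a [[test]] article.</text></revision>
  </page>
</mediawiki>""")
pages = list(iter_pages(sample))
```

The plain text produced this way is what then goes to the Stanford CoreNLP / MaltParser tuple-extraction pipeline.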
Week 4:
I discussed how to categorize the tuples; we chose the manual categorization used on Wikipedia for featured articles. Associating categories with the tuples was more difficult than expected, but it worked out in the end. I did some preliminary visualization of the data in Weka.
Week 5:
Nir created more visualizations of the data in R. We discussed its distribution, whether further transformations would help overcome its sparsity, and possible methods of removing noise. Dr. Pottenger suggested adapting a character n-gram approach from another project; I looked at that code and began adapting it to our dataset, although we may not use that particular approach. We decided to aggregate some of the category labels and to group sparse features together.
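For context, the character n-gram idea amounts to profiling text by counts of overlapping fixed-length character windows. A minimal sketch (my own, not the other project's code), with boundary padding so word edges get their own n-grams:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams, with '_' padding so
    the start and end of the string contribute boundary n-grams."""
    pad = "_" * (n - 1)
    padded = pad + text + pad
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

profile = char_ngrams("topic", n=3)
```

Profiles like this are robust to typos and morphology, which is part of the appeal for noisy text.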
Week 6:
I represented each category as a vector whose components correspond to the counts of relation types ("NER pair" types), and used the vector angle between categories to aggregate them into broader categories; these broader categories seem to be better distinguished than the old ones. I also added a filter to the code that runs SemRel to group sparse features into a miscellaneous category. I ran SemRel on the dataset with 10-fold cross-validation over a variety of category counts. The results were difficult to interpret. I discussed them with Dr. Pottenger and Nir; it appears we may need to train the model using bootstrapping or by repeating instances from the training set of documents. In addition, I will try to find the source of a possible bug in the output statistics (too many zeros).
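The vector-angle aggregation can be sketched as follows. This is an illustrative greedy version with made-up category names and counts, not the code or threshold I actually used: each category's relation-type count vector is compared against existing groups, and it joins the first group whose representative lies within a threshold angle.

```python
import math

def angle(u, v):
    """Angle in radians between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def merge_close_categories(profiles, threshold):
    """Greedily merge categories whose count vectors lie within
    `threshold` radians of an existing group's first member."""
    groups = []
    for name, vec in profiles.items():
        for group in groups:
            if angle(profiles[group[0]], vec) <= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Toy relation-type count vectors for three hypothetical categories.
cats = {"battles": [9, 1, 0], "wars": [8, 2, 0], "songs": [0, 1, 9]}
groups = merge_close_categories(cats, threshold=0.3)
```

Count vectors pointing in nearly the same direction indicate categories with similar relation-type mixes, which is what makes the angle a reasonable merging criterion.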
Week 7:
I realized the "possible bug" mentioned in last week's entry was not, in fact, a bug, but rather an artifact of how the program outputs data. No problems here. I ran 10-fold cross-validation over a variety of category counts for both SemRel and Type-LDA. The results confirmed with high confidence the hypothesis that SemRel outperforms Type-LDA. However, some anomalies in the output gamma matrix (i.e., documents-to-metatopics) suggest certain documents are not being modeled correctly, or else are too isolated from other documents for useful information to be extracted. Repeating training instances (as suggested in last week's entry) does not seem to improve performance on these documents. This effect merits further examination. At the end of the week, I presented a summary of my research this summer (slides available here).
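One standard way to attach a confidence level to a comparison like this is a paired t-test over the per-fold scores, since both models are evaluated on the same folds. A minimal sketch with made-up scores (not my actual results; compare the statistic against a t table with k-1 degrees of freedom):

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired per-fold scores of two models evaluated
    on the same cross-validation folds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical per-fold accuracies for two models on five shared folds.
t = paired_t_statistic([0.82, 0.79, 0.85, 0.81, 0.80],
                       [0.74, 0.73, 0.78, 0.75, 0.72])
```

Pairing by fold removes the fold-to-fold variance that both models share, which is why it gives tighter conclusions than comparing two independent averages.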
Week 8:
The final week—hard to believe it's here already. I organized the code I've written or modified during the program and wrote a readme to document its use. I also drafted a final report of my project.


Additional Information