jgeiger [at] reu [dot] dimacs [dot] rutgers [dot] edu
SemRel for hierarchical, relational, differentially private topic modelling
I am working on SemRel, which is a probabilistic, generative model for text.
It stems originally from Latent Dirichlet Allocation (LDA), which can be used to
identify the topics that a text covers. LDA has been extended by the Pachinko
Allocation Model (PAM) to allow a hierarchy of topics (in which some topics are
grouped under others), and by Type-LDA to identify semantic relations. SemRel
combines these hierarchical and relational aspects, and further adds a
guaranteeable level of differential privacy. I am testing the accuracy and
utility of the hierarchical and differentially private aspects of SemRel,
especially in comparison to Type-LDA, using Wikipedia articles.
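To make the starting point concrete, here is a minimal sketch of the plain-LDA generative process that SemRel builds on. All parameter values are illustrative; SemRel itself adds a topic hierarchy (as in PAM) and relational tuples (as in Type-LDA) on top of this:

```python
import numpy as np

def generate_corpus(n_docs, n_topics, vocab_size, doc_len,
                    alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the plain-LDA generative process.

    Illustrative sketch only: SemRel extends this with a topic
    hierarchy and relational tuples.
    """
    rng = np.random.default_rng(seed)
    # One word distribution per topic, drawn from a Dirichlet prior.
    topics = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Per-document topic mixture.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        words = []
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)        # pick a topic
            w = rng.choice(vocab_size, p=topics[z])  # pick a word from it
            words.append(w)
        docs.append(words)
    return docs

corpus = generate_corpus(n_docs=5, n_topics=3, vocab_size=50, doc_len=20)
```

Inference then runs this process in reverse: given only the words, recover the topic distributions.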
- Week 1:
- I did background reading on differential privacy, LDA, PAM, Type-LDA, and
relational modeling in general, and talked with Nir Grinberg and Dr. Pottenger,
who developed SemRel, about directions to take for the summer. At the end of the
week, I presented a brief overview of my project
(slides available here).
- Week 2:
- I got the SemRel source code set up on my computer and ran a simple
cross-validation test written by Nir for a corpus of Reuters articles. I discussed
an experimental methodology for evaluating SemRel for accuracy and utility,
with and without differential privacy. We also discussed the possibility of testing
SemRel on microtext. Some final decisions about which specific tests to run still
need to be made. In addition, I began to look at how to preprocess raw data
into a usable relational tuple format.
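As a rough illustration of what a relational tuple format might look like, here is a hypothetical field layout (subject, relation, object, tagged with a document ID); SemRel's actual input format may differ:

```python
from collections import namedtuple

# Hypothetical tuple layout -- the real SemRel input format may differ.
RelTuple = namedtuple("RelTuple", ["doc_id", "subject", "relation", "obj"])

def parse_tuple_line(line):
    """Parse one tab-separated line like 'doc42\\tEinstein\\tborn_in\\tUlm'."""
    doc_id, subject, relation, obj = line.rstrip("\n").split("\t")
    return RelTuple(doc_id, subject, relation, obj)

t = parse_tuple_line("doc42\tEinstein\tborn_in\tUlm")
```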
- Week 3:
- I downloaded a full copy of current Wikipedia articles. I'm going to start by
working with the featured articles, which comprise approximately 4000 well
developed articles representing a broad range of topics. Wikipedia allows articles
to be downloaded in XML format, with Wiki markup on the actual text. I learned
how to use sed, and used it together with a program I wrote to extract plain
text from the XML and to mark the articles with appropriate metadata. I then
worked with Nir Grinberg's code to extract relational tuples from the Wikipedia
text using Stanford CoreNLP and MaltParser.
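A simplified sketch of the kind of markup stripping involved; the real pipeline used sed and had to handle many more Wiki constructs than these few:

```python
import re

def strip_wiki_markup(text):
    """Remove a few common Wiki markup constructs.

    A rough sketch only -- real Wikipedia articles need much more
    handling (nested templates, tables, references, and so on).
    """
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                     # templates {{...}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # links [[target|label]]
    text = re.sub(r"'{2,}", "", text)                              # bold/italic quote runs
    text = re.sub(r"<[^>]+>", "", text)                            # residual HTML tags
    return text

clean = strip_wiki_markup("'''Ulm''' is a city in [[Germany]].")
# → "Ulm is a city in Germany."
```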
- Week 4:
- I discussed how to categorize the tuples; we chose the manual categorization
used on Wikipedia for featured articles. Associating categories with the tuples
was more difficult than expected, but it worked out in the end. I did some
preliminary visualization of the data in Weka.
- Week 5:
- Nir created more visualizations of the data in R. We discussed the data's
distribution, whether further transformation would help overcome its sparsity,
and methods of removing noise. Dr. Pottenger
suggested adapting a character n-gram approach from another project; I looked at
that code and began to adapt it for our dataset, although we may not use that
particular approach. We decided to aggregate some of the category labels for the
data and to group sparse features together.
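The character n-gram features and sparse-feature grouping mentioned above can be sketched roughly as follows; the count threshold, boundary marker, and bucket name are illustrative assumptions, not the project's actual settings:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts for one string (padded with boundary marks)."""
    padded = f"#{text}#"  # '#' is an assumed boundary marker
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def group_sparse(counts, min_count=2, misc="<MISC>"):
    """Collapse features seen fewer than min_count times into one
    miscellaneous bucket, mirroring the sparse-feature grouping
    described above (threshold is illustrative)."""
    grouped = Counter()
    for feat, c in counts.items():
        grouped[feat if c >= min_count else misc] += c
    return grouped

grams = char_ngrams("banana")
grouped = group_sparse(grams)
```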
- Week 6:
- I represented each category as a vector whose components are the counts of
relation types ("NER pair" types), and used the angle between category vectors
to aggregate the categories into broader ones; these broader categories
seem to be better distinguished than the old ones. I also added a filter to the
code that runs SemRel to group sparse features into a miscellaneous category. I
ran SemRel on the dataset with 10-fold cross-validation over a variety of category
counts. The results were difficult to interpret. I discussed them with Dr.
Pottenger and Nir; it appears we may need to train the model using bootstrapping
or by repeating instances from the training set of documents. In addition, I will
try to find the source of a possible bug in the output statistics (too many
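The vector-angle aggregation of categories described above can be sketched like this; the angle threshold and greedy merging strategy are assumptions for illustration, not the method actually used:

```python
import math

def cosine_angle(u, v):
    """Angle (radians) between two category count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def merge_close_categories(vectors, threshold=0.3):
    """Greedily merge categories whose angle to an existing group's
    representative vector falls below the threshold.

    Illustrative only -- the threshold and greedy strategy are assumptions.
    """
    merged = []  # list of (member_names, summed representative vector)
    for name, vec in vectors.items():
        for members, rep in merged:
            if cosine_angle(vec, rep) < threshold:
                members.append(name)
                rep[:] = [a + b for a, b in zip(rep, vec)]
                break
        else:
            merged.append(([name], list(vec)))
    return merged

vecs = {"A": [1.0, 0.0], "B": [1.0, 0.05], "C": [0.0, 1.0]}
groups = merge_close_categories(vecs)
```

Here "A" and "B" point in nearly the same direction and get merged, while "C" is nearly orthogonal and stays separate.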
- Week 7:
- I realized the "possible bug" mentioned in last week's entry was not, in fact,
a bug, but rather an artifact of how the program outputs data. No problems here.
I ran a 10-fold cross-validation on a variety of category counts for both SemRel
and Type-LDA. The results confirmed with high confidence the hypothesis that
SemRel outperforms Type-LDA. However, some anomalies in the output gamma matrix
(i.e., documents-to-metatopics) suggest certain documents are not being modeled
correctly, or else are too isolated from other documents for useful information
to be extracted. Repeating instances from the training set (as suggested in last
week's entry) does not seem to impact performance on these documents. This effect
merits further examination. At the end of the week, I presented a summary of my
research this summer
(slides available here).
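One simple way to screen a gamma (documents-to-metatopics) matrix for anomalous documents is to look at the entropy of each document's topic mixture; this is a hypothetical diagnostic sketch, not the analysis actually carried out:

```python
import math

def row_entropy(row):
    """Shannon entropy of one document-to-metatopic row."""
    total = sum(row)
    probs = [x / total for x in row if x > 0]
    return -sum(p * math.log(p) for p in probs)

def flag_anomalous_docs(gamma, z=2.0):
    """Flag documents whose topic-mixture entropy is more than z standard
    deviations from the mean. A diagnostic sketch only; the anomalies seen
    in the SemRel gamma matrix may call for different criteria."""
    ents = [row_entropy(r) for r in gamma]
    mean = sum(ents) / len(ents)
    sd = math.sqrt(sum((e - mean) ** 2 for e in ents) / len(ents))
    if sd == 0:
        return []
    return [i for i, e in enumerate(ents) if abs(e - mean) > z * sd]

# Nine sharply peaked documents and one uniform (poorly modeled) one.
gamma = [[0.97, 0.01, 0.01, 0.01]] * 9 + [[0.25, 0.25, 0.25, 0.25]]
flagged = flag_anomalous_docs(gamma)
```

A document assigned a near-uniform mixture over all metatopics is one plausible signature of "not being modeled correctly."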
- Week 8:
- The final week—hard to believe it's here already. I organized the code
I've written or modified during the program and wrote a readme to document its
use. I also drafted a final report of my project.