Twitter offers a window into the minds of millions. What are people talking about? Are there any emerging trends? In this project we will collect and analyze Twitter streams, cluster the tweets as well as the graph of relations between people, and summarize and visualize the results to get a big picture of what's being talked about. The eventual goal of the project is a website that allows users to enter a search term and get a dynamic, clustered, graphical view of tweets related to the term.
- Week 1:
- I read a paper about time-varying graphs and familiarized myself with MALLET, a software package for topic modeling that I'm planning to use on Twitter data. I've also found a good tokenizer for Twitter.
- Week 2:
- I changed my project from "Visualization of Time Varying Graphs" to "Visualizing Twitter Trends", as that seems more related to what I'm working on. I'm working on improving the Twitter tokenizer. There are a few challenges here, including removing common stop words, detecting URLs and emoticons, and separating punctuation from words. I've also found a Google spellchecker program for Java which might be of some use in the tokenizer.
- Week 3:
- I wrote a program to take Twitter data and input it into MALLET. MALLET outputs the document-topic distribution, and from that distribution I calculated the similarity between the tweets using the Hellinger distance and the KL divergence.
- Week 4:
- I fed the data I obtained into a program called GraphView, which clusters the data based on similarity. Each node is a tweet, and each edge is weighted by the similarity between the two tweets it connects. The issue is that the graph is complete, so n tweets produce n(n-1)/2 edges, and the program is unable to handle a large number of tweets. I'm currently working on pruning the graph.
- Week 5:
- I've tried a new method of calculating the similarity between the tweets, based on term frequency, and it seems to work better.
- Week 6:
- I've decided to use Graphviz instead of GraphView to cluster the graph, as it's easier to work with. I'm visualizing the graph with gvmap.
- Week 7:
- I improved my tokenizer by adding a language detector to keep only English tweets and by shortening elongated words (such as "soooo"). I also worked on my final presentation.
- Week 8:
- I worked on my paper and documented my code.
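
The tokenizer work from Weeks 2 and 7 (URL and emoticon detection, punctuation separation, stop-word removal, shortening elongated words) could look roughly like the sketch below. It is not the project's actual tokenizer: the regular expressions, the tiny `STOP_WORDS` set, the `keep_punct` flag, and the choice to collapse runs of three or more repeated characters down to two are all assumptions made for illustration. Language detection is not covered.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "it", "this"}

# Alternatives are tried in order, so URLs win over emoticons such as ":/".
TOKEN_RE = re.compile("|".join([
    r"(?P<url>https?://\S+)",                       # URLs
    r"(?P<emoticon>[:;=][-o*']?[)(\]\[dDpP/\\|])",  # simple Western emoticons
    r"(?P<tag>[@#]\w+)",                            # @mentions and #hashtags
    r"(?P<word>\w+(?:'\w+)*)",                      # words, allowing apostrophes
    r"(?P<punct>[^\w\s])",                          # any other single symbol
]))

# Three or more repeats of the same character, e.g. the "oooo" in "soooo".
ELONGATED_RE = re.compile(r"(.)\1{2,}")


def tokenize(tweet, keep_punct=False):
    """Split a tweet into tokens, dropping stop words (and punctuation by default)."""
    tokens = []
    for match in TOKEN_RE.finditer(tweet):
        kind, text = match.lastgroup, match.group()
        if kind == "punct" and not keep_punct:
            continue
        if kind in ("word", "tag"):
            # Lowercase and collapse elongations to two characters (an assumption).
            text = ELONGATED_RE.sub(r"\1\1", text.lower())
        if kind == "word" and text in STOP_WORDS:
            continue
        tokens.append(text)
    return tokens
```

URLs and emoticons are kept verbatim so that lowercasing and elongation handling cannot corrupt them.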
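
The Week 3 similarity step compares the document-topic distributions of two tweets. A minimal sketch of the two measures is shown below, assuming each distribution is a plain list of probabilities of equal length. The function names, the `eps` smoothing term, and the symmetrized variant are assumptions added here, not part of the original program.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats; eps guards against log(0) (an assumption)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def symmetric_kl(p, q):
    """Average of D(p || q) and D(q || p), since KL divergence is not symmetric."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

The Hellinger distance is bounded and symmetric, so it can be used directly as an edge weight. KL divergence is neither, which is why a symmetrized form is sketched alongside it.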
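
The log does not say which term-frequency measure was adopted in Week 5. One common choice is cosine similarity over raw term counts, sketched below purely as an assumption; `tf_cosine` is a hypothetical name, and its inputs are token lists such as a tokenizer would produce.

```python
import math
from collections import Counter


def tf_cosine(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty tweet is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Unlike the topic-model approach, this works directly on the tokens and needs no training step, which may matter for very short texts such as tweets.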
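
The Week 4 entry mentions pruning the complete similarity graph but not how. One possible approach, sketched below as an assumption, keeps only each tweet's `k` most similar neighbours and drops edges below a threshold, then writes the result in Graphviz's DOT format. The names `prune_graph` and `to_dot` and the parameters `k` and `min_sim` are hypothetical.

```python
def prune_graph(sim, k=2, min_sim=0.1):
    """Keep, for each node, its k most similar neighbours with similarity >= min_sim.

    sim is a symmetric n x n matrix (list of lists) of pairwise similarities.
    Returns a dict mapping undirected edges (i, j), with i < j, to their weight.
    """
    n = len(sim)
    kept = {}
    for i in range(n):
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )
        for j in neighbours[:k]:
            if sim[i][j] >= min_sim:
                kept[(min(i, j), max(i, j))] = sim[i][j]
    return kept


def to_dot(edges, n):
    """Serialize n nodes and the pruned edges as an undirected DOT graph."""
    lines = ["graph tweets {"]
    lines += [f"  {i};" for i in range(n)]
    lines += [f"  {i} -- {j} [weight={w:.3f}];" for (i, j), w in sorted(edges.items())]
    lines.append("}")
    return "\n".join(lines)
```

With this scheme the edge count grows at most linearly in the number of tweets (no more than `n * k` edges) rather than quadratically, which is the property the pruning step needs.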