DIMACS
DIMACS REU 2019

General Information

Student: Erica Cai
Office: 444 CoRE
School: Rutgers University
Major: Computer Science
Minor: Political Science
E-mail: erica.cai@rutgers.edu
Project: Ephemeral Messaging

Project Goals

Ephemeral messaging is messaging to a public phone number or email account, where all messages are publicly displayed on a website but disappear after a short period of time. Anyone can access these messages. We are interested in exploring them because they give us free, large-scale access to real text and email messages, and the content of these messages seems to be very different from the content of everyday, conventional messages. After exploring the uses of ephemeral messaging, we aim to explore the security and privacy issues that these messages present.


Weekly Log

Week 1:

On Wednesday, after orientation, I worked on writing Python code using the Beautiful Soup package to scrape data about ephemeral messages from websites. However, my mentors and I agreed that another Python tool, Scrapy, would be a more effective and comprehensive web scraping solution, especially for dealing with more complex websites. I spent most of the week learning the Scrapy framework, which involves designing a Spider that understands how a website stores its data and can extract it. By the end of the week, I had finished designing four Spiders that scrape data from four websites and store it in a tab-delimited file.
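As a rough illustration of what such a Spider looks like, here is a minimal sketch; the site URL, CSS selectors, and field names are hypothetical placeholders rather than the ones actually used in the project.

    # Minimal sketch of a Scrapy Spider for ephemeral-message pages.
    # The URL and CSS selectors below are illustrative assumptions.
    import scrapy

    class MessageSpider(scrapy.Spider):
        name = "message_spider"
        start_urls = ["https://example-sms-site.com/messages"]

        def parse(self, response):
            # Assume each message sits in its own row on the page.
            for row in response.css("div.message-row"):
                yield {
                    "sender": row.css("span.sender::text").get(),
                    "time": row.css("span.time::text").get(),
                    "body": row.css("p.body::text").get(),
                }

Scrapy's feed exports (configured separately) can then write the yielded items out to a delimited file such as the tab-separated output described above.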

During the first week, I also completed several administrative tasks. One was getting a CITI certification for human subjects studies, and another was checking the terms of agreement (TOA) of the websites that I planned to scrape, because some TOAs prohibit scraping data from the site.
Week 2:

This week, I finished building the tools necessary for scraping data about ephemeral text messages and emails from websites. These tools are Python programs built on the Scrapy framework and on Selenium, which drives a headless browser. Although some websites present data in a straightforward HTML format that is easier to scrape, most websites render data with JavaScript instead of serving it as static HTML, and many websites have mechanisms to prevent programs from scraping their data directly. For most of this week, I learned how to use the Scrapy framework in more depth and how to use Selenium to work around the mechanisms that websites use to block traditional scraping. I also worked on formatting and grouping the data that the Python programs collected, and on scheduling the programs to collect data from these websites periodically over the course of a week. Next week, I will begin data collection and move on to exploring techniques for data analysis.
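For the JavaScript-heavy sites, the general pattern looks roughly like the sketch below; the URL, element classes, and output file are hypothetical placeholders, and the real scrapers combine this with Scrapy and more careful waiting logic.

    # Sketch of driving a headless Chrome browser with Selenium so that
    # JavaScript-rendered messages can be extracted. URLs and selectors
    # are illustrative assumptions.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")      # run the browser without a window
    driver = webdriver.Chrome(options=options)

    try:
        driver.get("https://example-sms-site.com/inbox")
        driver.implicitly_wait(10)          # give the JavaScript time to render

        rows = driver.find_elements(By.CSS_SELECTOR, "div.message-row")
        with open("messages.tsv", "a", encoding="utf-8") as out:
            for row in rows:
                sender = row.find_element(By.CSS_SELECTOR, "span.sender").text
                body = row.find_element(By.CSS_SELECTOR, "p.body").text
                out.write(sender + "\t" + body + "\n")
    finally:
        driver.quit()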
Week 3:

This week, I had three major goals: to confirm that the scrapers were collecting as much data as possible in a format that is easy to analyze, to officially begin the data collection, which would last two weeks, and to learn how to analyze the data and its implications by reading research papers. For the first goal, my PhD student mentor Gradeigh and I agreed that the scrapers should collect more data, which means collecting the body of each email message in addition to its subject. This task presented many challenges because the websites stored the body of each email in a separate link, inside a separate iframe, and in a cluttered format instead of a clean block of text. In addition, the email body could be an image rather than text, and could be in a different language or character set. Scraping the email message body also increased the running time of two of the scrapers from 4 minutes to over 40 minutes, presenting a possible resource issue for the computer running the scraper.

Next, beginning data collection on a computer in the Lindqvist lab also had many challenges. For example, differences between my computer and the lab's computer meant that a website's buttons and text boxes appeared in different screen positions on each machine, which my program needed to accommodate. Another major issue with running the scrapers was their slowness, so I worked on improving code efficiency this week. The biggest challenge was that one website blocked my computer after it had been scraped too many times, but I have resolved that issue for now by running the scraper in incognito mode. I was surprised to face all of these issues, but was able to officially begin the automated data collection. After completing those steps, I started reading papers this week to understand the current research in ephemeral messaging, text messaging, and security, and to make connections between those papers and our project. On Tuesday, I will present my takeaways and ideas about all of the papers I read.
Week 4:

This week, my major goals for the project were to monitor the data collection process closely and resolve any issues that came up, to finish the literature review of research papers related to this project, and to ask and answer questions about the data by performing data analysis. Last week, the data collection process presented many challenges, and this week there were more issues. One site changed the way it stored data, so my code could no longer collect data from it and I had to update it. One computer that performed automated data collection ran out of resources. I also noticed that my program was not storing data correctly for one website and had to adjust the code and reformat the data. Despite these challenges, the computers still collected all of the data necessary for data analysis, at the expected time intervals. Regarding the next goal of the project, I presented the major points, implications, and related work from the research papers to Gradeigh, and then we discussed some of the questions and data analysis techniques that these papers used, which we could apply to the data from this project. Finally, I started the data analysis part of the project. Most of the analysis I have done so far involves organizing the data, but I'm really interested in a new part of the analysis: automatically labelling the more than 50,000 text messages and 4,000 email messages that we have so far into 5 categories. This step involves machine learning. So far, I have been able to write code that predicts labels for text messages with 90% accuracy, and I hope to improve the predictions next week. I also plan to do much more data analysis next week, to answer more of the questions about the data that Gradeigh and I discussed, and to discover more topics that the data may give us insight into, such as security.
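The log does not pin down the exact model, but a plausible minimal sketch of this kind of supervised labelling with scikit-learn looks like the following; the example messages, category names, and the choice of a Naive Bayes classifier over TF-IDF features are assumptions for illustration only.

    # Sketch of supervised labelling of text messages with scikit-learn.
    # Training data, categories, and model choice are illustrative assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # A tiny hand-labelled sample; in practice this would come from the scraped data.
    train_texts = [
        "Your verification code is 123456",
        "Use code SAVE20 for 20% off your next order",
        "Your package has been delivered",
    ]
    train_labels = ["verification", "promotion", "notification"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    # Predict a category for each new, unlabelled message.
    new_messages = ["Your one-time passcode is 987654"]
    print(model.predict(new_messages))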
Week 5:

This week, I planned to finish the process of automatically generating labels to categorize text messages, to finish the second round of literature review, and to answer some of the questions that we asked about the data by performing data analysis and presenting the results in a clear visual format. The first goal, generating labels for text messages, took the most time because it presented many challenges. To label the messages, I planned to use a machine learning algorithm, and I spent a lot of time experimenting with different tools and algorithms and improving the ones that performed best. For example, I tried simpler machine learning algorithms such as Gaussian Naive Bayes and kNN, as well as a more complex approach, designing a multilayer neural network. Although I got the neural network to predict labels with an average accuracy of 95%, and a highest accuracy of 98% on a set of 1,100 messages, the accuracy dropped to between 75% and 80% as soon as I added more messages for it to label. Since this was not in the range that we wanted, and I could not debug how the neural network was training itself, I decided not to use a machine learning algorithm to categorize the messages and instead wrote an algorithm using a heuristic approach. This new algorithm is able to categorize the 147,000 text messages that we have so far into 4 categories with almost 100% accuracy, while leaving behind 200 messages that it cannot assign a category to. Gradeigh and I discussed these techniques for categorizing messages in detail and agreed that the heuristic approach would be the best. This week, I also presented the major points and data analysis methods from more research papers to Gradeigh, and we decided to do a third round of literature review focused only on research about extracting information from large quantities of string data, such as text messages or search queries. Lastly, I began performing data analysis on the text messages by making bar graphs, grouping the data in a tabular format, and performing aggregation operations on it. The way I am working with Gradeigh on this step is by making graphs and tables, showing them to him, and receiving feedback about assigning new labels and exploring new relationships. Next week, I plan to delve much deeper into the data analysis step.
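A minimal sketch of the heuristic idea is below; the keyword lists and category names are invented here for illustration, since the real rules were developed by hand against the collected messages.

    # Sketch of a rule-based (heuristic) categorizer for text messages.
    # The keyword lists and category names are illustrative assumptions.
    CATEGORY_KEYWORDS = {
        "verification": ["verification code", "security code", "passcode"],
        "promotion": ["% off", "sale", "coupon", "discount"],
        "notification": ["delivered", "shipped", "appointment", "reminder"],
        "account": ["password", "login", "sign in"],
    }

    def categorize(message):
        """Return the first category whose keywords appear in the message,
        or None if no rule matches (those messages are set aside)."""
        text = message.lower()
        for category, keywords in CATEGORY_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                return category
        return None

    messages = [
        "Your verification code is 482913",
        "FLASH SALE: 30% off everything today only",
        "Your order has shipped",
    ]
    labels = [categorize(m) for m in messages]
    uncategorized = [m for m, label in zip(messages, labels) if label is None]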
Week 6:

This week, I planned to finish refining the categories that my algorithm assigned to text messages, assign subcategories to the text messages, assign categories to email messages, extract insights from all of the data and categories by running queries on them, and document those insights in a report and presentation. Categorizing and subcategorizing the messages presented many challenges because the more than 143,000 messages that we want to categorize come in almost any possible format. For much of this week, I worked on designing algorithms to extract company names and message purposes from the data. By implementing these algorithms, I was able to automate the process of labelling each message with one company name out of a list of 1,000 company names, and of labelling messages with their purposes or the possible security issues they could present. Although the algorithms made mistakes in labelling the data, I worked on continuously improving them, and they now categorize the data as expected. Another major challenge that I faced this week was grouping and analyzing the data and presenting the results in a clear table format. Since my Python environment could not display the tables nicely, and I was more familiar with writing MySQL queries, I decided to import the data into MySQL and write queries there. However, answering the questions that we had about the data required writing complex queries that were 20 lines long, and I spent time learning how to write them. Next, I worked on analyzing the tables that resulted from the queries and finding patterns and issues in that data. From the queried data, I was able to discuss possible security issues with Gradeigh, and we have decided to continue exploring the extent of those issues and other security issues that text messages may present. Finally, I started working on my presentation to DIMACS, which will focus on some of the emerging security issues that we have discovered from the data.
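As a rough idea of what this step looks like from Python, here is a sketch of running one aggregation query against the imported data; the connection details, table name, column names, and the query itself are placeholders rather than the project's actual schema or questions.

    # Sketch of querying the imported message data in MySQL from Python.
    # Credentials, table, and column names are illustrative placeholders.
    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="reu", password="reu_password", database="ephemeral"
    )
    cur = conn.cursor()

    # Count messages per assigned category and company, most frequent first.
    cur.execute("""
        SELECT category, company, COUNT(*) AS num_messages
        FROM text_messages
        GROUP BY category, company
        ORDER BY num_messages DESC
    """)
    for category, company, num_messages in cur.fetchall():
        print(category, company, num_messages, sep="\t")

    cur.close()
    conn.close()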
Week 7:

This week, I planned to focus on preparing for and delivering the presentation, while writing more queries and creating more graphs to strengthen the data analysis part of the project. The biggest challenges in making the presentation were summarizing the entire project in 12 minutes and deciding which parts of the project to emphasize. After exchanging PowerPoint versions and feedback with Professor Lindqvist, and running through the slides daily with Gradeigh, I was able to design a compact, clear presentation, which I delivered on Friday morning. Through the discussions about presentations, I learned that focusing more on the results of the project and less on the methodology helps keep the audience engaged. This week, I also began to wrap up the data analysis part of the project, completing a document that contains the questions we asked about the data and how we answered them. Further, I was able to support the results with analysis of more data than the 2 1/2 weeks of data analyzed in the presentation. I plan to put the finishing touches on the data analysis next week to see if there are any other interesting findings, and then begin writing documentation, commenting the code, and annotating references.
Week 8:

This week, my priority was to finish a rough draft of the final report in the Institute of Electrical and Electronics Engineers (IEEE) format. After getting a lot of advice from Gradeigh about the report writing process, I wrote all of the content of the rough draft except for the conclusion and bibliography. The main challenge that I faced in writing the report was the results section, because I planned to convert the results of every query into a graph or nicely formatted table, and there were many queries to do this for. Another difficult task was formatting the visuals to make sure that the text in them was large enough and that all of the information needed to describe them was present. To finish the rough draft, I will convert the report into the IEEE format and complete the conclusion and bibliography sections. Next week, I plan to finish the final draft of the paper to submit to DIMACS, and to fully wrap up my part of the project because next week is the last week. This involves writing more documentation to give an overview and structure to the files on GitHub, and fixing any issues with the scrapers in case anyone wants to continue data collection. I have finished the research part of this REU and am now focusing on the reflection part.
Week 9:

This week, I planned to finish my final project report for DIMACS, another reflection report for DIMACS about my REU experience, and the project documentation, which will help others understand how to replicate and continue the project. The biggest challenge was making my ideas clear, because I have become so familiar with the project after working on it every day for the past two months, while others see it from a different perspective. I think that I was successful in writing understandable reports and documentation to wrap up the REU experience. Overall, I am glad that we accomplished this much. I met with Gradeigh almost every day again and reflected on the past two months while also discussing future plans for the project. Gradeigh is debating whether to submit a short paper about the project for review and will let me know what he decides next month. Other than that, we are done with the project.

Presentations


Definitions

Data Scraping: Extracting and storing data that is displayed on a website.
Spider: An object that you design in Scrapy to crawl a website and extract its data.
Selenium: A framework that lets your code control a web browser.
Scrapy: A Python framework for designing Spiders that crawl and scrape websites.
Ephemeral Messages: Messages sent to a public phone number or email account that are publicly displayed on a website but disappear after a short period of time.
Framework: A tool that helps you perform a task so you don't have to start from scratch.


Additional Information

Acknowledgements

Thanks to my mentors Dr. Janne Lindqvist and Gradeigh Clark for their guidance, to the NSF for its funding, and to DIMACS for providing this REU opportunity.