General Information
Student: Erica Cai
Office: 444 CoRE
School: Rutgers University
Major: Computer Science
Minor: Political Science
E-mail: erica.cai@rutgers.edu
Project: Ephemeral Messaging
Project Goals
Ephemeral messaging is messaging to a public phone number or email account: all messages are publicly displayed on a website but disappear after a short period of time, and anyone can access them. We are interested in exploring these messages because we have free, large-scale access to them, and their content seems very different from that of everyday, conventional messages. After exploring the uses of ephemeral messaging, we aim to investigate the security and privacy issues that these messages present.
Weekly Log
- Week 1:
On Wednesday, after orientation, I worked on writing Python code using the Beautiful Soup package to scrape data about ephemeral messages from websites. However, my mentors and I agreed that another Python tool, Scrapy, would be a more effective and comprehensive web scraping solution, especially for dealing with complex websites. I spent most of the week learning the Scrapy framework, which involves designing a Spider that understands how a website stores its data and is capable of extracting it. By the end of the week, I had finished designing four Spiders that can scrape data from four websites and store it in a tab-delimited file.
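The extraction step that a Spider performs can be sketched with the standard library's HTML parser (the page structure, CSS class names, and sample data below are hypothetical; a real Spider would subclass `scrapy.Spider` and use its selectors instead):

```python
from html.parser import HTMLParser

# Hypothetical page layout: each message sits in a <div class="msg">,
# with the sender in <span class="from"> and the text in <span class="body">.
SAMPLE = """
<div class="msg"><span class="from">+1-555-0100</span>
<span class="body">Your code is 1234</span></div>
<div class="msg"><span class="from">+1-555-0101</span>
<span class="body">Meeting at 5</span></div>
"""

class MessageExtractor(HTMLParser):
    """Collects (sender, body) pairs from pages shaped like SAMPLE."""
    def __init__(self):
        super().__init__()
        self.rows, self._field, self._row = [], None, {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "span" and cls in ("from", "body"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._row[self._field] = data.strip()
            if "from" in self._row and "body" in self._row:
                self.rows.append((self._row["from"], self._row["body"]))
                self._row = {}
            self._field = None

def scrape(html):
    parser = MessageExtractor()
    parser.feed(html)
    # One tab-delimited line per message, matching the output file format
    return ["\t".join(row) for row in parser.rows]
```

The same field-by-field logic carries over to a Scrapy `parse` callback, which yields one item per message instead of a tab-delimited line.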
During the first week, I also completed several administrative tasks. One was getting CITI certification for human subjects research, and another was checking the terms of agreement (TOA) of the websites I planned to scrape, because some TOAs prohibit scraping data from the site.
- Week 2:
This week, I finished building the tools necessary for scraping data about ephemeral
text messages and emails from websites. These tools are Python programs
that use the Scrapy framework and Selenium, a browser automation tool that can
drive a headless browser. Although some websites present data in straightforward
HTML, which is easier to scrape, many render their data with JavaScript
instead, and many have mechanisms to prevent programs
from scraping their data directly. For most of this week, I learned to
use Scrapy in more depth and to use Selenium to circumvent
the mechanisms that websites use to block traditional scrapers.
I also worked on formatting and grouping the data that the Python programs collected, and on
scheduling the programs to collect data from these websites periodically, over
the course of a week. Next week, I will begin data
collection and move on to exploring techniques for data analysis.
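The periodic scheduling part can be sketched with the standard library's `sched` module (the interval and the task are placeholders; in the project the task would launch a Scrapy or Selenium scraper, and the clock functions are injectable here only so the schedule can be simulated):

```python
import sched
import time

def run_periodically(task, interval_s, runs,
                     timefunc=time.monotonic, delayfunc=time.sleep):
    """Run `task` every `interval_s` seconds, `runs` times in total.

    With the default timefunc/delayfunc this blocks in real time;
    passing fake versions lets the schedule be tested instantly.
    """
    scheduler = sched.scheduler(timefunc, delayfunc)
    for i in range(runs):
        # All events are registered up front, spaced interval_s apart
        scheduler.enter(i * interval_s, 1, task)
    scheduler.run()
```

A real deployment might instead rely on `cron` or a systemd timer; the sketch just shows the spacing logic.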
- Week 3:
This week, I had three major goals: to confirm that the scrapers
were collecting as much data as possible in a format
that is easy to analyze, to officially begin the two-week data collection,
and to learn how to analyze the data and its implications by
reading research papers. After reviewing the first
goal, my PhD student mentor Gradeigh and I agreed that the scrapers should collect more data, which meant collecting the body of email
messages in addition to their subjects. This task presented many challenges
because the websites stored the body of each email at a separate link,
in a separate iframe, and in a cluttered format rather than a clean block of text.
In addition, an email
body could be a picture or text, and could use a different language or character
set. Scraping the message bodies also increased the running time of two of the scrapers from 4 minutes
to over 40 minutes, presenting a possible resource issue for the computer running them.
Next, beginning data collection on a computer in the
Lindqvist lab also had its challenges. For example, because of differences between my
computer and the lab's, a website's button or text box
appeared at a different position on the lab machine,
which my program needed to accommodate. Another major issue
with the scrapers was their slowness, so I worked on improving code efficiency
this week. The biggest challenge was that one website blocked my
computer after too many scraping attempts, but I have resolved that
issue for now by running the scraper
in incognito mode. I was surprised to face all of these issues, but was able
to officially begin the automated data collection. After completing those steps, I started reading
papers this week to understand current research on ephemeral messaging,
text messaging, and security, and to make connections between those papers
and our project. On Tuesday, I will present my takeaways and ideas about the
papers I read.
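Turning a cluttered email body into a clean block of text can be sketched with the standard library's HTML parser (the sample markup is hypothetical; in the project the body first had to be reached through a separate link and iframe, e.g. via Selenium's frame switching, before cleaning):

```python
import re
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect visible text, skipping <script>/<style> blocks and
    treating block tags and <br> as whitespace so words don't run together."""
    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "div", "td", "tr"):
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def clean_email_body(html):
    parser = TextOnly()
    parser.feed(html)
    # Collapse the leftover layout whitespace into single spaces
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()
```

Image-only bodies and other character sets would still need separate handling; this sketch only covers the HTML-to-text step.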
- Week 4:
This week, my major goals were to monitor the data collection process closely and resolve any issues that came up, to finish the literature review by reading and analyzing research papers related to this project, and to begin asking and answering questions about the data through data analysis. Last week, the data collection process presented many challenges, and this week there were more. One site changed the way it stored data, so my code could no longer collect from it and I had to update it. One computer performing automated data collection ran out of resources. I also noticed that my program was not storing data correctly for one website and had to adjust the code and reformat the data. Despite these challenges, the computers still collected all of the data necessary for analysis, at the expected time intervals. Regarding the next goal, I presented the major points, implications, and related works of the research papers to Gradeigh, and we discussed questions and data analysis techniques from those papers that we could apply to our own data. Finally, I started the data analysis part of the project. Most of my analysis so far involves organizing the data, but I'm especially interested in a new part of the analysis: automating the process of labelling the over 50,000 text messages and 4,000 email messages we have collected so far into 5 categories. This step involves machine learning. So far, I have written code that predicts labels for text messages with 90% accuracy, but I hope to improve the predictions next week. I also plan to do much more data analysis next week, to answer more of the questions that Gradeigh and I discussed, and to discover more topics that the data may give us insight into, such as security.
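A machine learning labeller for short messages can be sketched as a from-scratch multinomial Naive Bayes over bag-of-words features (the categories and training examples below are made up for illustration; the project's actual classifier, features, and categories may have differed):

```python
import math
from collections import Counter, defaultdict

class TinyNB:
    """Minimal multinomial Naive Bayes text classifier with
    add-one smoothing, a stand-in for a library implementation."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # per-class word frequencies
        self.class_counts = Counter(labels)       # per-class document counts
        vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            vocab.update(words)
        self.vocab_size = len(vocab)
        self.total = sum(self.class_counts.values())
        return self

    def predict(self, text):
        best, best_lp = None, -math.inf
        for label, n in self.class_counts.items():
            lp = math.log(n / self.total)         # class prior
            denom = sum(self.word_counts[label].values()) + self.vocab_size
            for w in text.lower().split():
                # Laplace-smoothed per-word likelihood
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

On a real corpus of tens of thousands of messages, a library implementation with better feature engineering would be preferable; the sketch only shows the counting-and-smoothing idea.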
- Week 5:
This week, I planned to finish the process of automatically generating labels to categorize text messages, to finish the second round of literature review, and to answer some of the questions we asked about the data by performing data analysis and presenting the results in a clear visual format. The first goal, generating labels for text messages, took the most time because it presented many challenges. To label the messages, I planned to use a machine learning algorithm, and I spent a lot of time experimenting with different tools and algorithms and improving the ones that performed best. For example, I tried simpler machine learning algorithms such as Gaussian Naive Bayes and kNN, as well as a more complex approach, designing a multilayer neural network. Although I got the network to predict labels with an average accuracy of 95%, and a highest accuracy of 98% on 1,100 messages, as soon as I added more messages for it to label, the accuracy dropped to between 75% and 80%. Since this was not in the range we wanted, and I could not easily diagnose how the neural network was training itself, I decided not to use a machine learning algorithm to categorize the messages and instead wrote an algorithm using a heuristic approach. This new algorithm categorizes the 147,000 text messages we have so far into 4 categories with almost 100% accuracy, while leaving 200 messages that it cannot assign a category to. Gradeigh and I discussed these categorization techniques in detail and agreed that the heuristic approach would be best. This week, I also presented major points and data analysis methods from more research papers to Gradeigh, and we decided to do a third round of literature review focused only on research about extracting information from large quantities of string data, such as text messages or search queries.
Lastly, I began performing data analysis on the text messages by making bar graphs, grouping the data in a tabular format, and performing aggregation operations on it. My workflow with Gradeigh for this step is to make graphs and tables, show them to him, and receive feedback about assigning new labels and exploring new relationships. Next week, I plan to delve much deeper into the data analysis.
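A heuristic, keyword-based categorizer of the kind described above can be sketched as an ordered list of rules (the category names and keywords here are hypothetical; the real rules came from inspecting the collected messages):

```python
# Ordered rules: the first category whose keywords appear in the
# message wins. Keywords and categories are illustrative only.
RULES = [
    ("verification", {"code", "verify", "verification", "pin"}),
    ("promotion",    {"sale", "off", "offer", "discount"}),
    ("alert",        {"alert", "warning", "suspended"}),
]

def categorize(message):
    """Return the first matching category, or None if no rule fires
    (mirroring the small residue of messages left for manual review)."""
    words = set(message.lower().split())
    for category, keywords in RULES:
        if words & keywords:
            return category
    return None
```

Unlike the neural network, every decision this makes is traceable to a specific rule, which is what makes the heuristic approach easy to debug and refine.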
- Week 6:
This week, I planned to finish refining the categories that my algorithm assigned to text messages, to further assign subcategories to the text messages, to assign categories to email messages, to extract insights from all of the data and categories by running queries on them, and to document those insights in a report and presentation. Categorizing and subcategorizing the messages presented many challenges because the over 143,000 messages that we want to categorize can take any possible format. For much of this week, I worked on designing algorithms to extract company names and message purposes from the data. By implementing these algorithms, I was able to automate the process of labelling messages with one company name out of a list of 1,000 company names, and of labelling messages with their purposes or the possible security issues they could present. Although the algorithms made mistakes in labelling the data, I worked on continuously improving them, and they now categorize the data as expected. Another major challenge this week was grouping and analyzing the data and presenting the results in a clear table format. Since my Python environment could not display the tables nicely, and I was more familiar with writing MySQL queries, I decided to import the data into MySQL and write queries there. However, answering our questions about the data required complex queries of up to 20 lines, and I spent time learning how to write them. Next, I analyzed the tables that resulted from the queries, looking for patterns and issues in the data. From the queried data, I was able to discuss possible security issues with Gradeigh, and we have decided to continue exploring the extent of these issues and any other security issues that text messages may present. Finally, I started working on my presentation to DIMACS, which will focus on some of the emerging security issues that we have discovered in the data.
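The grouping-and-aggregation step can be sketched with a SQL query run from Python (the project used MySQL; SQLite is shown here only so the sketch is self-contained, and the table schema, column names, and sample rows are hypothetical):

```python
import sqlite3

# In-memory database standing in for the project's MySQL instance
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (company TEXT, category TEXT)")
conn.executemany("INSERT INTO messages VALUES (?, ?)", [
    ("AcmeBank", "verification"),
    ("AcmeBank", "verification"),
    ("ShopCo",   "promotion"),
    ("AcmeBank", "alert"),
])

# Example question: which companies send the most verification messages?
rows = conn.execute("""
    SELECT company, COUNT(*) AS n
    FROM messages
    WHERE category = 'verification'
    GROUP BY company
    ORDER BY n DESC
""").fetchall()
```

The real queries joined more columns and ran to many lines, but they followed this same filter/group/order pattern.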
- Week 7:
This week, I planned to focus on preparing for and delivering the presentation,
while writing more queries and creating more graphs to strengthen the data analysis part of the project. The biggest challenges in making the presentation were
that it was hard to summarize the entire project in 12 minutes, and that it was
hard to decide which parts of the project to emphasize. After exchanging PowerPoint versions and feedback with Professor Lindqvist, and running through
the slides daily with Gradeigh, I was able to design a compact, clear presentation, which I delivered on Friday morning. Through the discussions
about presentations, I learned that focusing more on the results of the project
and less on the methodology helps keep the audience engaged. This week,
I also began to wrap up the data analysis part of the project, completing a
document that contains the questions we asked about the data and how
we answered them. Further, I was able to support the results with analysis of
more data than the 2 1/2 weeks' worth analyzed in the presentation. I plan
to put the finishing touches on the data analysis next week to see if there are any other
interesting findings, and then begin documentation: writing comments in the code
and annotating references.
- Week 8:
This week, my priority was to finish a rough draft of the final report in the
Institute of Electrical and Electronics Engineers (IEEE) format. After getting
a lot of advice from Gradeigh about the report writing process, I wrote all of
the content in the rough draft except for the conclusion and bibliography. The
main challenge I faced was writing up the results
section, because I planned to convert the results of every query into a graph or
nicely formatted table, and there were many queries to do this for. Another
difficult task was formatting the visuals to make sure that the words
in them were big enough and that all of the information necessary to describe
them was present. To finish the rough draft, I will convert the report
into the IEEE format and complete the conclusion and bibliography sections.
Next week, I plan to finish a final draft of the paper to submit to DIMACS,
and to fully wrap up my part of the project, since next week is the last week.
This involves writing more documentation to give an overview and structure
to the files on GitHub, and fixing any issues with the scrapers in case anyone
wants to continue data collection. I have finished the research part of this REU
and am now focusing on the reflection part.
- Week 9:
This week, I planned to finish my final project report for DIMACS, a
reflection report for DIMACS about my REU experience, and the documentation that will help others replicate and
continue the project. The biggest challenge was making my ideas clear:
I was deeply familiar with the project after working on it every day for the
past two months, but others see it from a different perspective. I think I was
successful in writing understandable reports and documentation to wrap up the
REU experience. Overall, I am glad that we accomplished this much. I met with
Gradeigh almost every day again, reflecting on the past two months while
also discussing future plans for the project. Gradeigh is debating submitting a
short paper about the project for review and will let me know what he decides
next month. Other than that, we are done with the project.
Presentations
Definitions
Data Scraping: Extracting and storing data that is displayed on a website.
Spider: A class that you design to tell a crawler how to navigate a website and extract its data.
Selenium: A framework that lets your code control a web browser, including headless browsers.
Scrapy: A Python framework for designing Spiders and running web crawls.
Ephemeral Messages: Messages sent to a public phone number or email account, where all messages are publicly displayed on a website but disappear after a short period of time.
Framework: A tool that you can build on to perform a task, so that you don't have to start from scratch.
Additional Information
Acknowledgements
Thanks to my mentors Dr. Janne Lindqvist and Gradeigh Clark for their guidance, to the NSF for its funding, and to DIMACS for providing this REU opportunity.