Student: Ross T. Sowell
School: The University of the South
Department of Mathematics and Computer Science
Email: rsowell@dimax.rutgers.edu
sowelrt0@sewanee.edu
Home Page: http://arthur.sewanee.edu/rsowell
Research Areas: Information Retrieval, Robotics, Motion Planning
Project Name: Authorship Identification
Faculty Advisor: Dr. Paul B. Kantor, Professor II
School of Communication, Information, and Library Studies
Rutgers University

Project Description

Given a piece of text and a list of possible authors, can it be automatically determined which author wrote the text?

For example, Alexander Hamilton, James Madison, and John Jay each wrote some of the eighty-five essays known collectively as the Federalist Papers. It is known who wrote 73 of them, but the authorship of the other 12 is in dispute. Much work has been done by historians as well as statisticians on this authorship problem, yet it remains unsolved. It is likely that no one will ever know for sure who wrote the disputed papers, but perhaps we could find evidence that supports a probable author with a particular degree of certainty.

Let's examine another application of this work. Suppose we have a collection of messages known to have been written by Osama Bin Laden. Now the government receives a terrorist threat that was supposedly written by Osama Bin Laden. Can we gather information from our database of authentic messages and use it to estimate the likelihood that the new message is authentic?

Numerous other applications exist, but how do we go about solving these types of problems? First, we must determine what information would be useful to gather from the texts of known authorship. We look for "style markers." These could be a number of things: for example, average sentence length, average paragraph length, or the number of times an author uses the word "the." Perhaps someone is crazy about semicolons or certain abbreviations, or is prone to certain spelling errors. All of these things might define someone's writing style, and we can process the texts of known authorship in order to gather such information. Then we can process the text of unknown authorship and see whether or not it has characteristics similar to those of the known texts.
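
To make the idea of style markers concrete, here is a minimal Python sketch. It is only an illustration and not the method used in this project: the function names (style_markers, distance, closest_author), the sentence and paragraph splitting rules, and the choice of a plain Euclidean distance are all assumptions made for the example. A real system would use many more markers, would rescale them so that no single marker dominates, and would likely use a statistical model rather than a nearest-match rule.

```python
import re
from collections import Counter


def style_markers(text):
    """Return a few illustrative style markers for a text (hypothetical helper)."""
    stripped = text.strip()
    # Rough heuristic: a sentence ends with ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", stripped) if s]
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", stripped) if p.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words) or 1  # avoid dividing by zero on empty text
    counts = Counter(words)
    return {
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "avg_paragraph_length": len(words) / (len(paragraphs) or 1),
        "the_rate": counts["the"] / n_words,
        "semicolon_rate": text.count(";") / n_words,
    }


def distance(a, b):
    """Euclidean distance between two marker dictionaries with the same keys."""
    return sum((a[k] - b[k]) ** 2 for k in a) ** 0.5


def closest_author(known_texts, unknown_text):
    """Return the author whose known text has the most similar markers.

    known_texts maps an author name to one sample of that author's writing.
    """
    unknown = style_markers(unknown_text)
    return min(
        known_texts,
        key=lambda author: distance(style_markers(known_texts[author]), unknown),
    )
```

For instance, calling closest_author with a dictionary that maps each candidate author to a sample of known writing, together with the disputed text, returns the candidate whose markers lie nearest to those of the disputed text. Because the markers here are not rescaled, the sentence and paragraph lengths dominate the comparison.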


Work In Progress

Some tables, graphs, and scripts that we have been working on can be found here.

Presentations

We delivered this presentation to the other DIMACS REU participants and their mentors on July 2, 2004.

We delivered our final presentation of the summer to the other DIMACS REU participants and their mentors on July 21, 2004.

Useful Links

Diana's DIMACS Home: Diana is a fellow REU student collaborating with me on this project.

A_ID Project Page: The official DIMACS KDD Author Identification Project Page.

MMS Project Page: The DIMACS Monitoring Message Streams Project Page.

REU Calendar: Calendar of events for the Summer 2004 DIMACS REU.

DIMACS Calendar: DIMACS weekly/monthly calendar of events.

Corney Thesis: Malcolm Corney's Master's thesis, "Analysing E-mail Text Authorship for Forensic Purposes" (2003).

Listserv E-mail Archives: Includes an archive of some 2 million e-mail messages taken from 70 different listserv digests.

BBR Software: Download and read an overview of the Bayesian Logistic Regression Software.