Student: Diana Michalek
School: University of California at Berkeley
Mathematics and Computer Science
Email: dianam@dimax.rutgers.edu
Research Areas: Information Retrieval, Stylistics
Project Name: Authorship Identification
Faculty Advisor: Dr. Paul B. Kantor, Professor II
School of Communication, Information, and Library Studies
Rutgers University

Project Description

I am working on the following problem:
Given a piece of text, a list of possible authors, and samples of text from each of the authors, determine who wrote the text.

Authorship is to be determined only by examining the content and structure of the text, not from external clues such as a name written at the end or, in the case of email, an IP address.

The general method of identifying the author of a text is:
1) build a style model for each of n possible authors based on texts written by these n authors and
2) determine the author of the unknown text by applying some distance measure or by using a toolkit such as LEMUR (a toolkit for language modeling and information retrieval).
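
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the style model here is a single feature (the word-length distribution), the distance measure is the L1 distance, and the names word_length_profile, build_models, and attribute are hypothetical.

```python
import re
from collections import Counter

MAX_LEN = 10  # assumption: words longer than this are pooled into one bucket


def word_length_profile(text):
    """Return the relative frequency of word lengths 1..MAX_LEN."""
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(min(len(w), MAX_LEN) for w in words)
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return [counts[k] / total for k in range(1, MAX_LEN + 1)]


def l1_distance(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def build_models(samples):
    """Step 1: build one style model per author from that author's texts."""
    return {author: word_length_profile(" ".join(texts))
            for author, texts in samples.items()}


def attribute(unknown_text, models):
    """Step 2: return the author whose model is closest to the unknown text."""
    profile = word_length_profile(unknown_text)
    return min(models, key=lambda author: l1_distance(profile, models[author]))


# Toy data, invented purely for illustration.
samples = {
    "A": ["I am at it. We go on. It is so."],
    "B": ["Considerable deliberation preceded magnificent constitutional arrangements."],
}
models = build_models(samples)
print(attribute("He is up. Go to it.", models))  # prints "A"
```

A single feature and two short samples are far too little for real attribution; the sketch only shows how a model per author and a distance measure fit together.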

The style model for an author is based on the stylometric features found in the author's text. The following features can all be used to model an author's style: word length distributions, sentence length distributions, function word frequencies, email structural features such as greetings and HTML tags, fraction of whitespace, punctuation, and fraction of capitalized text.
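
As a rough illustration, several of these features could be computed as follows. The sketch makes simplifying assumptions that are not part of the project: word and sentence length distributions are reduced to their means, the function-word list is a tiny hypothetical one, email structural features are left out, and the name style_features is invented for this example.

```python
import re
import string

# Assumption: a tiny illustrative list; a real study would use a much longer one.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "upon")


def style_features(text):
    """Map a text to a dictionary of simple stylometric features."""
    n_chars = len(text) or 1
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words) or 1
    # Crude sentence split on terminal punctuation; good enough for a sketch.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_sentences = len(sentences) or 1
    lowered = [w.lower() for w in words]

    features = {
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "mean_sentence_length": len(words) / n_sentences,  # in words
        "whitespace_fraction": sum(c.isspace() for c in text) / n_chars,
        "punctuation_fraction": sum(c in string.punctuation for c in text) / n_chars,
        "capital_fraction": sum(c.isupper() for c in text) / n_chars,
    }
    # Relative frequency of each function word among all words.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = lowered.count(fw) / n_words
    return features


example = "The cat sat. The dog ran to the park!"
for name, value in style_features(example).items():
    print(f"{name}: {value:.3f}")
```

Each text becomes a fixed-length vector of numbers, which is what makes it possible to compare an unknown text against each author's samples.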

It is worth noting that context-dependent features (such as occurrence of the word "halibut") do not make good stylometric features because they say more about the topic of a specific text than about an author's ingrained and subconscious writing style.
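
A small made-up example shows the difference. The two sentences below are about unrelated topics but share the same skeleton, so a profile built only from function words is identical for both, while their topical vocabulary has nothing in common. The word list and the helper names are assumptions made for this illustration.

```python
import re
from collections import Counter

# Assumption: a short illustrative function-word list.
FUNCTION_WORDS = ("the", "is", "in", "and", "on", "of", "to")


def tokens(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def function_word_profile(text):
    """Relative frequency of each function word; topic words are ignored."""
    toks = tokens(text)
    counts = Counter(toks)
    n = len(toks) or 1
    return [counts[w] / n for w in FUNCTION_WORDS]


def content_words(text):
    """The set of words that are not function words (the topical vocabulary)."""
    return {t for t in tokens(text) if t not in FUNCTION_WORDS}


fish = "The halibut is in the pan and the oven is on."
politics = "The senate is in the hall and the vote is on."

print(function_word_profile(fish) == function_word_profile(politics))  # prints True
print(content_words(fish) & content_words(politics))                   # prints set()
```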

This project has many applications, from determining who wrote the 12 disputed Federalist papers and deciding whether or not Shakespeare wrote all of his plays, to determining which criminal wrote an email detailing plans to bomb a building.

I am working on this project with fellow REU-er Ross Sowell. More on these applications can be found at Ross's page.


Work

Our final presentation, which was delivered at DIMACS on July 21, 2004, can be found here.

Some tables, graphs, and scripts that we have been working on can be found here (Note: these files are probably only useful to us, as they are generally not documented).

Here is the presentation delivered to the DIMACS REU participants on July 2.


Links

A_ID Project Page: The official DIMACS KDD Author Identification Project Page.
MMS Project Page: The DIMACS Monitoring Message Streams Project Page.
Listserv E-mail Archives: Includes an archive of some 2 million e-mail messages taken from 70 different listserv digests.

References

F. Mosteller and D. L. Wallace. Applied Bayesian and Classical Inference. The Case of The Federalist Papers. Springer-Verlag New York Inc., 1984.

M. Corney. Analysing E-mail Text Authorship for Forensic Purposes. 181 pages, 2003.