Entity resolution is a common problem in everyday life. We are presented with a set of characteristics (shaggy brown hair, glasses, ugly tie, rumpled shirt) and wish to determine what we are looking at (hey, it's Bob from Accounting). The average person does this sort of thing hundreds of times a day, doing everything from identifying people to figuring out what’s being served for lunch, all without giving the problem much conscious thought.
This project focuses on a computerized form of entity resolution known as Author Identification. Given written documents by a set of authors, we wish to match each document to its author. Ideally, we want to identify an author’s work across a variety of topics, rather than being limited to a single field.
Author Identification has applications ranging from checking for plagiarism in student assignments to tracking conspirators across posting boards scattered around the Internet. Automated entity resolution has also been shown to be useful for tasks like identifying the gender of a writer or telling fiction from non-fiction.
Check out the latest progress here
All of my scripts, as well as the Compass corpus, are available here