Welcome to Kathryn Davidson's REU 2005 Project Page

Mentor: Endre Boros

Research Area: Machine Learning
My Home School: University of Pennsylvania
My email: kadavids AT sas.upenn.edu
or kadavids AT dimax.rutgers.edu

What does it mean for a computer to "learn"? For example, if a computer is given the medical data of a large number of both sick and healthy patients, could it learn to produce a formula for diagnosing future patients as either healthy or sick? Could it tell us what factors are the most important in making that distinction?

Previously, doctors diagnosed patients based on a relatively small amount of data: a few test results, experience with former patients, and whatever outside knowledge they had acquired. Now, large laboratory experiments and genetic testing provide us with data that is simply too large for the human brain to analyze. Instead, we turn to computers to analyze the data for us.

Data such as medical test results are subject to plenty of error, whether from the humans who read the results or from the limited precision of the instruments. We have to allow for this error in any formulas a computer creates from the data. I spent the summer investigating ways for the computer to tolerate as much error as possible while still creating the simplest and most useful formulas for making diagnoses.

You can read our paper, Feature Selection and Error Tolerance for the Logical Analysis of Data, which describes the work we did this summer.


Here is a more basic outline of how we solved the problem.

Here are my initial presentation (6/23/05) and
final presentation (7/20/05) to the DIMACS REU.


    In June I used Perl to write a program to compute the minimal distances between pairs of positive and negative patients. These distances can then be fed into an existing dualization algorithm to provide maximal error tolerances. (A rough sketch of this step appears after this list.)
    In July I wrote a program with my new partner, Craig Bowles, that produces, for a given maximal error tolerance, a table of the attributes that can be used with that error tolerance to differentiate the positives from the negatives. (See the second sketch below.)
    Finally, we analyzed the results from running our programs on the Wisconsin Breast Cancer Database, which can be found at the UC Irvine Machine Learning Repository.
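
Here is a rough Perl sketch of the June step. The patient values are made up, and the definition of a pair's distance used here (the largest attribute-wise gap between a positive and a negative patient) is an assumption for illustration only, not necessarily the exact measure our program used.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Each patient is a reference to an array of numeric attribute values
    # (invented for this example).
    my @positives = ( [5.1, 0.7, 2.3], [4.8, 0.9, 2.0] );
    my @negatives = ( [3.0, 0.2, 1.1], [2.9, 0.4, 0.8] );

    my $min_distance;
    for my $p_idx (0 .. $#positives) {
        for my $n_idx (0 .. $#negatives) {
            my ($p, $n) = ($positives[$p_idx], $negatives[$n_idx]);

            # Assumed pair distance: the largest attribute-wise difference.
            my $pair_gap = 0;
            for my $i (0 .. $#{$p}) {
                my $diff = abs($p->[$i] - $n->[$i]);
                $pair_gap = $diff if $diff > $pair_gap;
            }
            print "pair ($p_idx, $n_idx): gap $pair_gap\n";

            # Keep track of the smallest gap over all pairs.
            $min_distance = $pair_gap
                if !defined($min_distance) || $pair_gap < $min_distance;
        }
    }

    print "Minimal distance over all positive/negative pairs: $min_distance\n";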
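And here is a similarly rough sketch of the idea behind the July program. The separation rule below (an attribute is kept at tolerance t only if every positive/negative pair differs on it by more than 2t, so that perturbing both measurements by up to t cannot make them coincide) is again an illustrative assumption. The attribute names are borrowed from the Wisconsin Breast Cancer Database, but the values are invented.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @attribute_names = ('clump thickness', 'uniformity of cell size', 'mitoses');
    my @positives = ( [5.1, 0.7, 2.3], [4.8, 0.9, 2.0] );
    my @negatives = ( [3.0, 0.2, 1.1], [2.9, 0.4, 0.8] );

    my $tolerance = 0.5;    # the given maximal error tolerance

    print "Attributes usable at tolerance $tolerance:\n";
    ATTRIBUTE:
    for my $i (0 .. $#attribute_names) {
        for my $p (@positives) {
            for my $n (@negatives) {
                # If measurement errors of up to $tolerance on each patient
                # could make the two values coincide, this attribute cannot
                # separate the pair on its own, so skip it.
                next ATTRIBUTE if abs($p->[$i] - $n->[$i]) <= 2 * $tolerance;
            }
        }
        print "  $attribute_names[$i]\n";
    }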

I am deeply indebted to Dr. Boros, DIMACS, and the National Science Foundation for this wonderful summer opportunity. Thank you! Also, thanks to Dennis and Craig for their patience in teaching me their helpful and elegant programming habits!



DIMACS Home Page