Welcome to Kathryn Davidson's REU 2005 Project Page
My Home School:
University of Pennsylvania
kadavids AT sas.upenn.edu
kadavids AT dimax.rutgers.edu
What does it
mean for a computer to "learn"? For example, if a computer is given the medical data of a
number of both sick and healthy patients, could it learn to produce a formula for diagnosing future patients as either
healthy or sick? Could it tell us what factors are the most important in making that distinction?
Previously, doctors diagnosed patients based on a relatively small
amount of data: a few test results, experience with former patients, and
whatever outside knowledge they had acquired. Now, large laboratory
experiments and genetic testing provide us with data that is simply too
large for the human brain to analyze. Instead we turn to computers to
analyze the data for us.
Data such as medical test results are subject to plenty of error
(introduced, perhaps, by the people who took the readings, or by
mechanical estimates). We have to allow for error in any formulas
created by a computer using this data. I spent the summer investigating
ways for the computer to allow the most error tolerance while still creating
the simplest and most useful formulas for making diagnoses.
You can read our paper, Feature Selection and Error Tolerance
for the Logical Analysis of Data, about the work that we did this summer.
Here is a more basic general outline of how
we solved the problem.
Here are my initial presentation (6/23/05) and
a later presentation (7/20/05) to the DIMACS REU.
In June I used Perl to write a program to compute the
minimal distance between pairs of positive and negative
patients. These can then be fed into an already existing
Dualization Algorithm to provide maximal error tolerances.
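The original program was written in Perl and is not reproduced here. As a rough illustration of the idea only, here is a minimal Python sketch. It assumes each patient is a tuple of numeric test results and that the distance between two patients is the largest gap in any single attribute; both the metric and the names `pair_distance` and `minimal_distance` are my assumptions, not necessarily what the project used.

```python
from itertools import product


def pair_distance(pos, neg):
    """Largest per-attribute gap between two patients (assumed metric)."""
    return max(abs(p - n) for p, n in zip(pos, neg))


def minimal_distance(positives, negatives):
    """Smallest distance over every (positive, negative) pair of patients."""
    return min(pair_distance(p, n) for p, n in product(positives, negatives))


# Hypothetical toy data: each tuple holds one patient's test results.
positives = [(5.0, 1.0, 3.0), (4.0, 2.0, 6.0)]
negatives = [(1.0, 1.0, 2.0), (2.0, 2.0, 2.5)]

print(minimal_distance(positives, negatives))
```

The closest positive/negative pair limits how much measurement error can be tolerated before the two groups become indistinguishable, which is why these distances are a natural input for finding maximal error tolerances.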
In July I wrote a program with my new partner, Craig Bowles,
that produces, for a given maximal error tolerance, a table of the attributes that can be used
with the given error tolerance to differentiate the positives from the negatives.
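To give a feel for what such a table might look like, here is a small Python sketch (not our actual program). It assumes that an attribute can differentiate a positive patient from a negative one when their values differ by more than the tolerance; that rule, the function name `separating_attributes`, and the toy data are all illustrative assumptions.

```python
def separating_attributes(positives, negatives, tolerance):
    """Map each (positive, negative) pair to the attribute indices
    whose values differ by more than the tolerance (assumed rule)."""
    table = {}
    for i, pos in enumerate(positives):
        for j, neg in enumerate(negatives):
            table[(i, j)] = [
                k for k, (p, n) in enumerate(zip(pos, neg))
                if abs(p - n) > tolerance
            ]
    return table


# Hypothetical toy data: each tuple holds one patient's test results.
positives = [(5.0, 1.0, 3.0), (4.0, 2.0, 6.0)]
negatives = [(1.0, 1.0, 2.0), (2.0, 2.0, 2.5)]

table = separating_attributes(positives, negatives, tolerance=2.0)
for (i, j), attrs in sorted(table.items()):
    print(f"positive {i} vs negative {j}: attributes {attrs}")
```

A pair with an empty list cannot be told apart at that tolerance, so raising the tolerance eventually makes some pairs inseparable; attributes that never appear in the table are candidates for being dropped from the formula.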
Finally, we analyzed the results from running our program on the Wisconsin Breast Cancer
Database, which can be found at the UC Irvine Machine Learning Repository.
I am deeply indebted to Dr. Boros, DIMACS, and the National Science Foundation for this
wonderful summer opportunity. Thank you! Also, thanks to Dennis and Craig for their
patience in teaching me their helpful and elegant programming habits!