General Information

Student: Bridgette Findley
Office: CoRe 450
School: University of Texas at San Antonio
E-mail: vge991 (commercial at) my.utsa.edu
Mentor: Dr. Trefor Williams
Project: Text Mining and Analysis in Railroad Safety

Project Description

This project runs several text mining algorithms on a collection of transportation safety reports to determine whether there are trends in the factors behind major railroad accidents. Specifically, I intend to explore Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing.

Weekly Log

Week 1:
-Orientation and settling into the apartments
-Reviewed Dr. Williams' papers on his applications of text mining
-Researched text mining algorithms we intend to use
-Prepared for presentation on Friday
Week 2:
-Began reading the text given to me by Dr. Williams (Data Mining for the Masses, by Matthew North)
-Put together code for my project page
-determined the details I need to start using LDA
-began familiarizing myself with RapidMiner and Jigsaw
-Began importing data to RapidMiner
Week 3:
-ran files through RapidMiner to determine the most common terms that occur within the files
-ran files through Jigsaw and sorted through the entity sets
-Ran files through Overview. It also gave me the most common terms, but in a graphical format rather than as raw data
-began reading a book recommendation from Dr. Williams, "Predictive Analytics and Data Mining" by Vijay Kotu and Bala Deshpande
-converted all pdf/image files to txt, although I might have to do it again
Week 4:
-copied data into an Excel file in order to run through Stanford Topic Modeler
-ran the basic topic modeler and started the Stanford Topic Modeler
-produced graphs and entities in Jigsaw which support the idea that truck accidents are a major problem at grade crossings
Week 5:
-saved the RapidMiner wordlist for the NTSB reports
-added Canadian accident reports to the Jigsaw and RapidMiner datasets
-got Stanford's topic modeler to work on the NTSB's reports
-used the simple topic modeler to find the most common and meaningful 'topics' within the Canadian and American reports
Week 6:
-ran all of the data through RapidMiner (combined and separate, in different increments) and retrieved clusters from it
-determined which clusters pertained to which data and which were shared
Week 7:
-reported back to UTSA about what's going on up here
-started on the presentation we're supposed to give
-figured out tagging in Overview after uploading Canadian Reports
Week 8:
-worked to retrieve emails from a Microsoft SQL database file
-worked with Richard to write a script to correctly format the NTSB files for Stanford's Topic Modeler
-successfully ran all of my data through Stanford's Topic Modeler
-presented my results to CCICADA Faculty
Week 9:
-get ready to return home
-finish reports, proofread a ridiculous number of times
-read more papers to cite in reports
-return materials as needed
-get on a plane on Friday