|Email:||ryanaponte00 (at) gmail (dot) com|
|Home Institution:||University of Florida|
|Project:||Homeland Security and Policing Data Analysis|
1. How can text-based machine learning and keyword analysis of criminal court case data provide insight on the changing landscape of digital forensics?
2. How can we more effectively train law enforcement agencies for digital forensics positions?
This week I started on my project. I completed a literature review, focusing on Latent Dirichlet Allocation (LDA). LDA is a topic model that I will use to better understand what employers require for careers in cybersecurity, digital forensics, computer forensics, and cyber forensics.
I am learning more about LDA. It requires preprocessing, such as the removal of stopwords. One library with a built-in stopword list is scikit-learn, but that list appears to include "computer," which is a problem for this domain. I may use Python to do web mining to obtain the text for LDA. I also looked at Burning Glass Labor Insights; the only digital forensics related field it covers is computer forensics. I tried to contact CyberSeek, but my email did not go through. Burning Glass only allows the last 30 days of job postings to be downloaded, and it prohibits web mining, so another source may be necessary.
I started getting Latent Dirichlet Allocation running in a Jupyter Notebook, following a Medium article. I had not used a Jupyter Notebook before, but it is really nice not to have to re-run an entire script after making only a small change. I was able to get a topic model, but there were some other problems. I will need to do more work to obtain the data for LDA, though I do have several options for data sources.
I considered several criminal case databases. Thomson Reuters Westlaw appears to be the most effective for downloading a large number of cases; it prohibits web scraping, but up to 100 cases can be downloaded at a time. Public Access to Court Electronic Records (PACER) also allows cases to be downloaded, but it would be costly and potentially more difficult to use. I also drafted a plan: first, an occupational analysis will be conducted using Burning Glass Labor Insight; after that, I will work with the Westlaw data.
I looked more into CompTIA courses; they have three specifically for cybersecurity. I also looked into LexisNexis, a paid legal database, but it seems that Westlaw would be a more effective choice. Finally, I examined the NIST NICE Cybersecurity Workforce Framework, which will be used to help answer the career pathways research question.
I started on the industry pathways portion of my presentation. Currently, I have four graphics to aid understanding of what is required of a cybersecurity professional. I also found the NICE work role IDs for the jobs I am looking at: IN-INV-001, IN-FOR-001, and IN-FOR-002. I have not downloaded any cases yet, but I have confirmed the resource I will use. FindLaw, Cornell LII, and Justia do not include court cases, so they cannot be used. Casetext, described as an alternative to Westlaw and LexisNexis, would work, but it only allows 10 cases to be downloaded at a time. Given that LexisNexis is paid and I already have access to Westlaw, Westlaw was the easiest choice.
This week I started looking at Thomson Reuters Westlaw cases. We are limiting the scope to federal criminal cases since January 2020, as the service is not intended for downloading large numbers of cases. It takes about five hours to download a single month of cases. I also ran into a download limit that is going to be a problem. I contacted someone at Rutgers to see if the limit can be increased or removed, as it would prevent me from obtaining enough data. As of now, I am downloading under 2,000 cases per day.
Another way to remove the court case download limit might be to contact Thomson Reuters directly. There is good news: I finished some career pathways graphics, the other part of my project. I continued to download cases and was able to complete January 2020 through June 2020. It appears that I ran into a different limit than the daily one. I plan to use Excel and Python to analyze the current text data.
I started on my final presentation, which is due next week, and on extracting text data from Westlaw. I tried a tool that combines RTF files; once combined into a single file, they would be imported into Excel. Unfortunately, the free version combines only two files at a time. It also does some strange things with punctuation, but that is not a problem for the research. At this point, I am not sure how to get the text data into a usable format. The presentation will focus primarily on the research question about how to better train law enforcement agencies. My mentor showed me how to get some keyword statistics without importing the RTF files, so there is work to show for both questions.
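Once case text is available as plain text, keyword statistics of the kind mentioned above can be gathered with the standard library alone. A minimal sketch, where the keyword phrases and the sample excerpt are illustrative, not drawn from the actual Westlaw data:

```python
# Count occurrences of digital-forensics keyword phrases in a text.
# Phrases and the sample excerpt are hypothetical placeholders.
import re
from collections import Counter

keywords = ["digital forensics", "computer forensics", "cell phone", "metadata"]

sample = (
    "The examiner performed a digital forensics analysis of the seized "
    "cell phone. Metadata recovered during the computer forensics review "
    "corroborated the cell phone records."
)

text = sample.lower()
counts = Counter({kw: len(re.findall(re.escape(kw), text)) for kw in keywords})
print(counts)  # "cell phone" appears twice; the others once each
```

Run over many case files, the same counting loop would give per-case keyword frequencies that can be exported to Excel for charting.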
I would like to thank my mentor, Dr. Christie Nelson; my CCICADA mentors, Dr. Roberts and Dr. Egan; the DIMACS REU program; the Intelligence Community Center for Academic Excellence; and the Rutgers Externship Exchange program.