About
A data scientist and social scientist walk into a bar... and, well, that's not far off from how this project started.
It was becoming ever more important for me to display my skillset. Was I a data scientist? A data engineer? A machine learning engineer? The punchline here plays on the linguistic conundrum that both plagues and is exacerbated by our own creations.
Whatever we may be, we're leaders in integrating tech with society and culture. This project has been featured in community meetups and upskilling events around Python, SQL, web scraping, tech ethics, and cloud architecture. Its full potential is yet to be discovered.
The CuyaCourts project is an ETL pipeline, and it engages only with the data acquisition and understanding step of the Data Science Life Cycle. It's a scraper that picks up one case at a time from the Cuyahoga County Criminal Court Case Docket and stores the information in an analyzable format: a relational database, available for download.
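As a rough illustration of that "store one case at a time" step, here is a minimal sketch. It is not the project's actual code: it swaps in Python's built-in sqlite3 module for the real database layer, and the table names, column names, the `load_case` helper, and the example record are all hypothetical.

```python
import sqlite3

# Hypothetical schema: the real CuyaCourts tables and columns are not described here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cases (
    case_number TEXT PRIMARY KEY,
    status      TEXT
);
CREATE TABLE IF NOT EXISTS docket_entries (
    id          INTEGER PRIMARY KEY,
    case_number TEXT NOT NULL REFERENCES cases(case_number),
    entry_date  TEXT,
    description TEXT
);
"""


def load_case(conn, case):
    """Store one scraped case and its docket entries in a single transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        # Clear old docket rows first so re-scraping a case does not duplicate them.
        conn.execute(
            "DELETE FROM docket_entries WHERE case_number = ?",
            (case["case_number"],),
        )
        conn.execute(
            "INSERT OR REPLACE INTO cases (case_number, status) VALUES (?, ?)",
            (case["case_number"], case["status"]),
        )
        conn.executemany(
            "INSERT INTO docket_entries (case_number, entry_date, description) "
            "VALUES (?, ?, ?)",
            [
                (case["case_number"], entry["entry_date"], entry["description"])
                for entry in case["docket"]
            ],
        )


# Made-up placeholder record, only to show the shape of the data.
example_case = {
    "case_number": "EXAMPLE-0001",
    "status": "CLOSED",
    "docket": [
        {"entry_date": "2000-01-01", "description": "placeholder entry one"},
        {"entry_date": "2000-01-02", "description": "placeholder entry two"},
    ],
}

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load_case(conn, example_case)
```

Splitting a case into a parent table and a child table of docket entries is what makes the result relational: one case row, many docket rows, joined on the case number. Loading the same case twice leaves the tables unchanged, which is convenient when a scrape has to be retried.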
Similar projects make electronic court records available by CLI (Stanford's Big Local News), by API (CourtListener, Harvard Law School's Caselaw Access Project), or by paid subscription (the Public Access to Court Electronic Records, PACER).
So if others have already done this, did I just duplicate their efforts? (No.) Is this really the most efficient way? (Today, yes.) I address these questions, and more, in depth in the FAQs below. Please let me know if you're curious about anything else, or have any additional information to share, by using this form.
One thing I am not is a front-end engineer. However, I'm closer now than I was yesterday because of this project. Thank you for your patience with this website.
Frequently Asked Questions
At the Cuyahoga County Clerk of Courts website, an individual can obtain all the same information about each case at a rate of ~500 cases per day before their IP address is blocked. This consideration shaped the design of this project.
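To make that constraint concrete, here is a minimal sketch of how a daily request budget translates into a pause between requests. The ~500 figure comes from the text above; the `min_interval_seconds` helper and the `Throttle` class are hypothetical names, and the project's actual pacing logic is not described here.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def min_interval_seconds(max_requests_per_day):
    """Smallest average gap between requests that stays within a daily budget."""
    return SECONDS_PER_DAY / max_requests_per_day


class Throttle:
    """Blocks so that successive calls to wait() are at least `interval` seconds apart.

    The clock and sleep functions are injectable so the pacing can be checked
    without actually waiting.
    """

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


# Assumption for illustration: pace requests right at the ~500-per-day limit.
interval = min_interval_seconds(500)
throttle = Throttle(interval)
# A scraper loop would call throttle.wait() before each case request.
```

At 500 cases per day the gap works out to 172.8 seconds, a little under three minutes per case. A real scraper would likely leave extra margin below the limit, since the threshold is approximate.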
Python Libraries: Selenium, boto3, SQLAlchemy ORM
Compute: AWS Lambda
Deployment: GitHub Actions, Docker, pgAdmin 4
The information contained in the CuyaCourts database is provided “as-is” without any warranties or guarantees of accuracy. Please do not rely on this data set to solve personal legal problems.
Another angle for ethics may be: did the process of obtaining these records harm the host website? It did not. Each record was scraped one by one, at a gentle pace so as not to overwhelm the host server, and only after a FOIA request and an email inquiry asking for access to the data both went unanswered.