Data extraction from PDFs

I learned more about regex that I probably wanted to know
Author

Nenad Bozinovic

Published

May 15, 2024

Repo

The goal of the project was to extract all sorts of data from 11’000+ pdf files that contain California Transportation contracts, available publicly, as part of a large research project. Considering well structured text the first decision was to extract data using Regural expressions i.e. regex, and I relied heavily on regex101.com for prototyping. Initial EDA showed that there were dozens of formatting variations and final logic was a mix of regex and Python as regex alone was not able to cover all the use cases. I learned more about regex that I probably wanted to know, and realized its power and limitations.

One of 100K+ pages to extract:

One of 100K+ pages to extract: Sample contract snapshot

Sample of extracted data: Example of extracted data

Follow this link to run the main.ipynb in Google Colab.

See details at:

Repo