Parse 12 Months Of Credit Card Statements In 3 Minutes
How to use Python to read multi-page PDFs and transform unstructured data, then use SQL to format the final result in BigQuery.
One of the first insights a data source yields is its structure. With rare exceptions, you can determine the type and scope of the data you’re dealing with as soon as you can return it in raw form. When building API-based connections, as nearly all data engineers will, successfully returning raw output such as JSON is the light bulb moment that signals forward progress.
And while JSON structures can quickly grow complex, with nested records and messy data, in my opinion the most difficult data to deal with comes in one of the oldest formats: text. Text data presents unique challenges and forces data engineers to channel their most niche SQL skills for operations like string matching and multi-step regular expressions (regex).
One of the most challenging, nuanced and highest-impact sources of text data you’ll encounter? Your credit card statement.