Member-only story

Extract, Transform, and Enhance: Using Snowflake and Azure OpenAI for PDF Data

Enhancing PDF Data Pipelines with Python, Snowflake, and Open AI

I’ve primarily focused on building data pipelines, creating data models, and developing Power BI dashboards. One of my recent projects involved extracting data from PDF files, loading it into Snowflake, and using OpenAI to rewrite specific columns, with the final results saved into new columns. While this approach isn’t entirely new, it was my first time integrating OpenAI into my work, and I’m excited to share my experience and insights through this blog.

Image by Author

Use Case

Received approximately 2,500 PDF files in blob storage, containing various project details related to tenders across different segments. The goal is to extract key fields from these PDF files and load the data into a Snowflake table within the Staging database. After cleaning the data, the project scope for each entry will be rewritten in a simpler form using Azure Open AI, and this simplified scope will be saved as a new column in a Snowflake table within the Reporting database. This reporting table will then be connected…

--

--

Dhilip Subramanian
Dhilip Subramanian

Written by Dhilip Subramanian

Business Intelligence Consultant | Data Engineer

No responses yet