
Building a Modern Data Pipeline: A Journey from API to Insight

Andy Sawyer
4 min read · Feb 19, 2024


The buzz around ‘Big Data’ has passed. Terabytes of data are the new normal, and efficiently managing and processing that data is more critical than ever. Companies across industries strive to harness the power of their data, turning raw numbers into actionable insights. This pursuit has driven the evolution of data engineering practices, with an emphasis on building scalable, flexible, and cost-effective data pipelines.

I’ve often seen people ask ‘What is a data pipeline?’, so I set out to build a project that demonstrates the capabilities of a modern data pipeline, using a blend of powerful technologies: #Airflow, #MinIO, #Polars, #DeltaTable, and #Jupyter. This project, available on my GitHub repository, serves as a practical example of architecting a data pipeline from the ground up. It’s complete overkill for the data being loaded, but it provides an end-to-end example of a pipeline.

Project Overview

The goal was to create a demo pipeline that illustrates the end-to-end process of data ingestion, processing, and analytical modeling. Here’s a snapshot of the technology stack and its purpose:

  • Airflow for Orchestration: Automates the workflow, ensuring that data moves seamlessly through the pipeline stages with task scheduling and monitoring. While newer orchestration tools such as Dagster or Mage.ai exist, Airflow remains the de facto standard, and it’s the one we use at #LendiGroup.
  • MinIO for Object Storage: Provides a scalable, S3-compatible storage solution, hosting our data in a set of buckets. I like MinIO because it gives you a local setup for experimenting with object stores; LocalStack is another popular tool for this.
  • Polars for Data Processing: Enables lightning-fast data processing to efficiently transform and clean data for analytical purposes. While Pandas is the default choice for many, it can struggle with larger datasets. Polars is written in Rust and built on Apache Arrow, and it was designed for speed and scale from the outset.
  • Delta Lake for Data Storage: Uses the Delta table format for its robustness in handling both small and large datasets, enabling better data management and query performance. I could have simply stored the data in Postgres, but I wanted a Data Lakehouse approach for this project. A minimal sketch of writing a Polars DataFrame to a Delta table in MinIO follows this list.
  • Jupyter for Interactivity: Facilitates data analysis and pipeline prototyping, providing an interactive platform for development and testing. Jupyter notebooks will be familiar to data scientists as a convenient way to interact with data.
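To make the Polars + Delta Lake + MinIO combination concrete, here’s a minimal sketch of appending a small exchange-rate frame to a Delta table in the ‘bronze’ bucket. The endpoint, credentials, table path, and column names are illustrative placeholders rather than the repo’s actual configuration, and it assumes the polars and deltalake packages are installed.

```python
# Minimal sketch: write a Polars DataFrame to a Delta table stored in MinIO.
# Credentials, endpoint, and table path below are placeholders for a local setup.
import polars as pl

# Storage options are passed through to the underlying delta-rs writer.
storage_options = {
    "AWS_ACCESS_KEY_ID": "minioadmin",             # placeholder credentials
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_ENDPOINT_URL": "http://localhost:9000",   # local MinIO endpoint
    "AWS_ALLOW_HTTP": "true",                      # MinIO runs over plain HTTP locally
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",          # needed when no locking provider is configured
}

# A toy exchange-rate frame standing in for the API response.
rates = pl.DataFrame(
    {
        "currency": ["AUD", "EUR", "JPY"],
        "rate": [1.53, 0.93, 150.2],
        "as_at_date": ["2024-02-19"] * 3,
    }
)

# Append today's rates to a Delta table in the 'bronze' bucket.
rates.write_delta(
    "s3://bronze/exchange_rates",
    mode="append",
    storage_options=storage_options,
)
```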

The Pipeline in Action

The pipeline operates in distinct stages, designed to process exchange rate data from an API, load currencies from a CSV, and generate a date table on the fly:

  1. Ingestion: Data is first ingested from the API and stored in the ‘bronze’ bucket within MinIO, using the Delta table format for its reliability and efficiency.
  2. Transformation: The raw data then undergoes transformation, where Polars plays a crucial role in cleaning and preparing it for further analysis; the refined data is stored in the ‘silver’ bucket.
  3. Modeling: In the final stage, we apply the Kimball methodology to structure the data into a fact and dimension model within the ‘gold’ bucket, optimizing it for insightful analysis and decision-making. A minimal sketch of this three-stage flow as an Airflow DAG follows this list.
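To show how Airflow stitches these stages together, here’s a minimal sketch of the bronze → silver → gold flow as a DAG using Airflow’s TaskFlow API. The DAG id, schedule, and task bodies are illustrative placeholders, not the actual DAG from the repo.

```python
# Minimal sketch of the bronze -> silver -> gold flow as an Airflow DAG.
# Task names and bodies are illustrative placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def exchange_rate_pipeline():
    @task
    def ingest_to_bronze():
        # Call the exchange-rate API and land the raw response
        # as a Delta table in the 'bronze' bucket.
        ...

    @task
    def transform_to_silver():
        # Clean and type the raw data with Polars, then write
        # the refined tables to the 'silver' bucket.
        ...

    @task
    def model_to_gold():
        # Build the Kimball-style fact and dimension tables
        # and store them in the 'gold' bucket.
        ...

    ingest_to_bronze() >> transform_to_silver() >> model_to_gold()


exchange_rate_pipeline()
```

Each task maps onto one of the three stages above, and Airflow handles the scheduling, retries, and monitoring between them.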

Overcoming Challenges

Every project presents its own set of challenges, and this one was no exception. I’ve not had a great deal of experience with either the Delta table format or Polars, so there were some false starts, and I’m sure it’s a less-than-perfect deployment. But by addressing these challenges head-on, I gained valuable insights, which I’ll share in upcoming detailed posts.

Looking Ahead

This post serves as an introduction to the data pipeline project. In future posts, I’ll dive deeper into each component, covering the following over the next few days:

  • Part 2: Running the DAG: What good is this pipeline if I haven’t shown you how to build and use it? The next post will do exactly that.
  • Part 3: Docker Configuration: Making sure that everything runs the way it should, with no need for an engineer to say ‘but it works on my machine’.
  • Part 4: DAG Walk-through: Airflow pipelines are written in code. This post will look at the specifics of the pipeline for this demo, and how it triggers the actual Python code needed to process the data.
  • Part 5: Pipeline Codebase: A deep dive into the code that runs under the hood to move data around and transform it to the point where it is ready for an analyst.
  • Part 6: Data Insights: A look at the included .ipynb file, used by Jupyter to perform ad-hoc analysis on the underlying data.

Stay tuned, and let me take you through the repo over the next week or so. Your feedback and questions are very welcome. Follow me for updates on this series and more insights into the world of data engineering.
