
Building a Modern Data Pipeline Part 6: Data Insights

Andy Sawyer
4 min read · Mar 3, 2024

--

This is the final part of a six-part series titled ‘Building a Modern Data Pipeline: A Journey from API to Insight’, related to this GitHub repo. The series steps through setting up a data pipeline and running it end-to-end on your local machine.

Part 1 was a high-level overview, while part 2 stepped through how to download and run the pipeline. Part 3 looked at the configuration of Docker, and part 4 reviewed the Python file used by Airflow to orchestrate our data pipeline. Part 5 looked at the code that runs within the pipeline, and this final post uses Jupyter notebooks to access the data that has been saved in our tables.

What is Jupyter?

It’s not the planet! Jupyter Notebook is an open-source web application that lets you create and share documents containing live code, equations, visualizations, and narrative text. Notebooks are widely used in data science, scientific computing, and machine learning for data cleaning and transformation, data visualization, and exploratory analysis, among other applications.

The key feature of Jupyter Notebooks is the ability to combine executable code, rich text, multimedia resources, and visualizations in a single, interactive document. This makes them a great tool for exploratory data analysis, documentation, teaching, and sharing research results. Notebooks can be easily shared between users or published online.

How do I access it again?

Assuming you have run docker-compose up and triggered the pipelines through Airflow, your data should be available in the gold bucket in MinIO, ready for you to have a look at.

Fire up another tab in your browser, and go to `http://localhost:8888?token=easy`. You’ll see something like this:

Jupyter Lab

If you see a screen with options to open a new Python notebook, click on the ‘work’ folder in the file browser on the left-hand side, and then open the connect_to_gold.ipynb file.

Stepping through the notebook

I’ve already run the notebook on my machine, so when it loads you’ll see both the code and its output. You can of course run the code again, as well as trying different code. I’ve even included a copy of the currency CSV file in the Jupyter Docker container so you can try loading it directly rather than through the pipeline.

The first cell is our imports. Just as in our Python files, we need to start by importing the packages we intend to use. Here we’re importing Polars and deltalake. The second cell then defines our storage options. These include the connection details for the S3 object store: the username and password, as well as the IP address. This is the IP address on the Docker network that was specified in docker-compose.yml, so you don’t need to worry about changing it to access the data. A sketch of both cells is shown below.
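To make that concrete, here’s a minimal sketch of what those first two cells might look like. The credentials, endpoint address, and region below are illustrative placeholders, not the repo’s real values; those live in docker-compose.yml and the notebook itself.

```python
# Sketch of the notebook's first two cells: imports and storage options.
# Credentials and endpoint are placeholders, not the repo's real values.
import polars as pl
from deltalake import DeltaTable

storage_options = {
    "AWS_ACCESS_KEY_ID": "minio_user",             # placeholder username
    "AWS_SECRET_ACCESS_KEY": "minio_password",     # placeholder password
    "AWS_ENDPOINT_URL": "http://172.19.0.2:9000",  # MinIO's address on the Docker network
    "AWS_REGION": "us-east-1",
    "AWS_ALLOW_HTTP": "true",                      # local MinIO is served over plain HTTP
}
```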

We then run through a process of connecting to our three gold tables:

  • dim_currency
  • dim_date
  • fct_rates

Connecting to dim_currency is shown below:

dim_currency

Note that an additional hash column has been added to the original data from the CSV. This was done in the transformation step of the pipeline.
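For reference, a minimal sketch of what that cell might look like, reusing the storage_options defined above. The s3://gold/dim_currency path is an assumption; check the notebook for the exact table URI.

```python
# Read the dim_currency Delta table straight into a Polars DataFrame.
# The bucket/table path is assumed; the notebook has the exact URI.
dim_currency = pl.read_delta(
    "s3://gold/dim_currency",
    storage_options=storage_options,
)
print(dim_currency.head())  # includes the hash column added during transformation
```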

How do the tables join?

Ah, the Entity Relationship Diagram. Yes. You’ll find that below :-)

Entity Relationship Diagram
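If you’d rather explore the relationships in code than in the diagram, a join might look roughly like the sketch below. It continues from the earlier snippets (reusing storage_options and dim_currency), and the key column names are assumptions for illustration rather than values copied from the repo.

```python
# Join the fact table to its dimensions. Key column names are assumed;
# use the actual columns shown in the ERD and the notebook output.
fct_rates = pl.read_delta("s3://gold/fct_rates", storage_options=storage_options)
dim_date = pl.read_delta("s3://gold/dim_date", storage_options=storage_options)

rates_enriched = (
    fct_rates
    .join(dim_currency, on="currency_key", how="left")  # assumed key name
    .join(dim_date, on="date_key", how="left")          # assumed key name
)
print(rates_enriched.head())
```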

Next Steps

This is the end of this series of blog posts. I really hope it has been useful and informative. If you liked what you saw, please put a star on the repo to make it easier for others to find, and share this or any of the previous articles.

I’d be really keen to understand what else you’d be interested in seeing, so drop me a comment and let me know.

--


Andy Sawyer

Bringing software engineering best practices and a product driven mindset to the world of data. Find me at https://www.linkedin.com/in/andrewdsawyer/