
Building a Modern Data Pipeline Part 2: Running the DAG

Andy Sawyer
4 min read · Feb 20, 2024


This is the second part of a six-part series titled ‘Building a Modern Data Pipeline: A Journey from API to Insight’, based on this GitHub repo. It steps through setting up a data pipeline and running it end-to-end on your local machine.

This post assumes you have access to Git (to clone the repo) and Docker, as the pipeline runs in containers. If not, downloading and installing these is the first step. The steps below work on a Mac and should work on Linux too, but may need tweaking on Windows.

For Part 1, click here.

Downloading the Repository

First things first, you need a copy of the repository on your local machine. Open your terminal of choice, browse to a location you want to download the code to, and type:

git clone https://github.com/nydasco/data-pipeline-demo.git

Git Clone output

Once done, you should be able to cd into that folder:

cd data-pipeline-demo

There is a README.md file in the root of the repository that steps through what to do next, but it is presented below for completeness.

Get an API Key

There is an API key embedded in the repository. Don’t worry, this isn’t a mistake. But since it’s a free key that only allows 25 calls per day, I’d recommend you take 30 seconds to get your own.

Go to https://www.alphavantage.co/support/#api-key and fill in the details for a free key of your own. You’ll be provided with an alphanumeric key. You will need to replace the key on line 3 of the pipelines/params.py file with your own key. That’ll make sure you have a full 25 calls, and can run the pipeline a few times yourself.
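If you want to double-check which line you’re about to change before pasting in your key, a quick optional check from the terminal is:

# print line 3 of the params file to see the placeholder key you'll be replacing
sed -n '3p' pipelines/params.py

Whatever variable name appears on that line is the one to update; just swap the value for your own key.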

Build Custom Docker Images

I’ve created custom Docker images for both Airflow and Jupyter. This was to make sure the dependencies I wanted were baked in. You’re going to need to build these before you can run the environment. Don’t worry though! I’ve created a small shell script that does it all for you. Simply run:

./build.sh

from the main repository folder.

./build output
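If you’d like to confirm the build worked before moving on, one optional check is to list your local images. The exact image names depend on the tags used in build.sh, so the filter below is just a guess:

# list local images and filter for the custom Airflow and Jupyter builds
docker images | grep -iE 'airflow|jupyter'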

Once that’s done, you should be able to bring up the containers:

docker-compose up

Expect to see lots of scrolling text.
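If the wall of logs gets in the way, you can optionally run the stack in the background and pull logs per service instead. The service name below is an assumption; check docker-compose.yml for the real ones:

# start the stack detached, then confirm everything came up
docker-compose up -d
docker-compose ps

# follow the logs for a single service (use a name from docker-compose.yml)
docker-compose logs -f airflow-webserver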

So, what now?

Go to the UI

There are three web apps now available to you. You can get to them through your browser at the following URLs:

Airflow: http://localhost:8080

MinIO: http://localhost:9001

Jupyter: http://localhost:8888?token=easy

You can log into Airflow using the username airflow and password airflow.

You can log into MinIO using the username minio and the password minio123.

You shouldn’t need a password to get into Jupyter.
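If any of the pages don’t load, a quick optional sanity check is to confirm each port is answering before you start digging through logs:

# any HTTP response at all means the port is up and the service is listening
curl -I http://localhost:8080   # Airflow
curl -I http://localhost:9001   # MinIO console
curl -I http://localhost:8888   # Jupyter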

Airflow

When you log into Airflow, you’ll be presented with a screen showing all your DAGs (Directed Acyclic Graphs). There is only one:

Airflow main menu

You can click on the little arrow in the Actions menu to run the DAG. You can also click on the DAG name (data-pipeline-demo) to open the DAG and have a look at it in more detail.

DAG overview

We’ll go into this in more detail in the next post.
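If you prefer the command line, the same DAG can also be triggered through the Airflow CLI inside the container. The service name here is an assumption, so substitute whatever docker-compose.yml calls the webserver:

# trigger the DAG without touching the UI (service name is an assumption)
docker-compose exec airflow-webserver airflow dags trigger data-pipeline-demo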

MinIO

After logging into MinIO you’ll see three buckets: bronze, silver, and gold. This follows the Medallion Architecture recommended by Databricks.

MinIO overview

Currently the buckets are empty, but this will change once the Airflow DAG has run.
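You can browse the buckets from the console, but if you’d rather check them from the terminal, the MinIO client (mc) works too. The API usually listens on a different port to the console (typically 9000 rather than 9001), but that’s an assumption here, so check docker-compose.yml:

# point the MinIO client at the local server (API port is an assumption)
mc alias set local http://localhost:9000 minio minio123

# list everything currently in the bronze bucket (empty until the DAG has run)
mc ls --recursive local/bronze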

Jupyter

Finally, Jupyter is where we can interact with the final models created in the demonstration. From the folder structure on the left, click into the work folder and open the included file. This notebook steps through exploring the data that the DAG has created. Note that it won’t work unless you have first triggered the pipeline in Airflow and it has completed successfully.

Jupyter overview

Next Steps

That’s all for this post. The next post will be coming shortly, and will go into more detail on the configuration of the custom Docker images and the docker-compose.yml that runs the stack. Stay tuned, and please feel free to share your thoughts. Your feedback and questions are highly welcome. Follow me for updates on this series and more insights into the world of data engineering.

