
Building and Managing Data Pipelines with Python, PostgreSQL & Airflow — Part One

Aaron Schlegel
8 min read · Aug 3, 2021

Introduction

This series of posts will walk through an example ETL project: extracting pet and animal data from several sources, such as the Petfinder API, and storing it in a database and other formats for querying and analysis. We will design data pipelines to ingest, transform, and move the data to build an analytics-ready pet and animal database. The series begins with introductory examples of getting started with PostgreSQL on your local machine and with Amazon Web Services S3 for storing the raw extracted data as part of our data pipelines. Afterward, we will explore how to call an API to get data, transform it, and store it in a PostgreSQL database and S3. Once our data pipeline is working, we will introduce Apache Airflow for scheduling and managing our data pipeline jobs.

Before extracting and storing data, we first need to install PostgreSQL and create a database. As a beginning example, we will install PostgreSQL on our local machine to get familiar with it before moving to a cloud solution. This post will walk through how to install PostgreSQL and get started with AWS to begin storing the data extracted in our data pipelines.

Installing PostgreSQL on Mac OS

Installation Methods

There are several methods for installing PostgreSQL on Mac OS, including:

  • PostgreSQL Installer (Also available for multiple other operating systems)
    — Graphical and source code archives of PostgreSQL provided directly by the PostgreSQL organization.
  • Homebrew
    — The Mac OS package manager Homebrew includes a version of PostgreSQL.
  • Postgres.app
    — Another installation approach packaged as a standard Mac OS app.

It is up to the reader to choose which method they prefer. In this post, we will use the first option and install PostgreSQL using the graphical installer provided from PostgreSQL directly. To begin, download the installer for the latest version of PostgreSQL (13.3 at the time of this writing).

1. Open the downloaded file and follow the setup-wizard prompts. PostgreSQL is typically installed by default in `/Library/PostgreSQL/13`.
2. On the Select Components step, you can leave everything checked or uncheck Stack Builder if you do not require its graphical interface for installing extensions and additional modules for PostgreSQL.
— Note: PostgreSQL makes it simple to install and manage extensions, so it is generally not required to include Stack Builder.
3. A superuser with the username postgres is created when installing PostgreSQL. Create a password for the superuser and remember it as you will need it to connect to the database server later.
4. The default port that PostgreSQL listens on is 5432 and shouldn’t need to be changed.
5. The last step is to choose the locale used by PostgreSQL. It is recommended to select the locale of your operating system.
6. After choosing your installation options, begin and finish the installation wizard to start working with PostgreSQL!

Connecting to the PostgreSQL Server and Creating the Pet Database

There are several methods to connect to the newly installed PostgreSQL server and create a database. The PostgreSQL installation includes pgAdmin, a fully-featured GUI for managing the PostgreSQL server and the databases contained within. However, the more common approach to working with PostgreSQL is its command-line tool, psql. Using psql to work with PostgreSQL and databases will get us more familiar with the tool before moving to a cloud environment.

Open a terminal and enter the following command:

which psql

If the command prints a path to the psql executable, then you are ready to go!

If you receive a message similar to psql: command not found, then there may have been an issue setting psql in your PATH. To remedy this and get psql on your PATH, enter the following command:

locate psql

If this is the first time you’ve installed PostgreSQL, you should only have one returned result specifying where the `psql` command-line tool lives. My machine has several old installations of PostgreSQL, but the typical location is /Library/PostgreSQL/13/bin/psql.

To add psql to your PATH, copy the directory that contains psql and enter the following command in your terminal. In this example, I would use /Library/PostgreSQL/13/bin; your path may differ slightly if you installed PostgreSQL in another location.

export PATH=/Library/PostgreSQL/13/bin:$PATH

This adds the psql location to your PATH for the current terminal session. To make the change permanent, add the same export line to your shell profile (~/.zshrc for zsh, the default shell on modern macOS) and restart your terminal for it to take effect.

After working out any PATH issues, run the following command in your terminal and, when prompted, enter the password you created for the postgres user during installation.

psql -U postgres

After connecting to the PostgreSQL server, your terminal should show the psql prompt, which looks like postgres=#.

We can now create a database that will store the animal and pet data. The PostgreSQL command to create a database is:

CREATE DATABASE pethub;

Running the meta-command \conninfo, we see the database we are currently connected to is the default database postgres. To switch to the newly created pethub database, we run another meta-command:

\connect pethub

You should then receive an output similar to You are now connected to database “pethub” as user “postgres”.

Lastly, we create two schemas to organize the tables to store the data we will extract and transform.

CREATE SCHEMA animals;
CREATE SCHEMA organizations;

Create a Postgres User for the Pet Database

It is recommended to create users with specific permissions depending on need rather than using the superuser. Therefore we will create a new user that will be used for the pet database project. To create a new user pethub with a password, run the following statement (replacing the string within the single-quotations as the desired password).

CREATE USER pethub WITH PASSWORD '*********';

The created user will only have the default privileges associated with it, so we must grant the privileges the pethub user needs to interact with the pethub database: usage on the two schemas, plus select, insert, update, delete, and more on their tables. We first grant usage on the existing schemas, then alter the default privileges (which apply to tables later created in those schemas by the current user, postgres) so those tables are accessible to pethub.

GRANT USAGE ON SCHEMA animals, organizations TO pethub;
ALTER DEFAULT PRIVILEGES IN SCHEMA animals, organizations GRANT ALL PRIVILEGES ON TABLES TO pethub;
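Later posts will connect to the pethub database from Python. As a small, stdlib-only sketch (the host and port below are the local defaults, and the password is a placeholder for the one you chose for the pethub user), a libpq-style connection URI can be assembled and later handed to a driver such as psycopg2:

```python
from urllib.parse import quote_plus

def postgres_uri(user: str, password: str, host: str, port: int, dbname: str) -> str:
    """Build a libpq-style connection URI, URL-encoding the password."""
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{dbname}"

# Illustrative values; substitute the password you created for the pethub user.
uri = postgres_uri("pethub", "s3cret!", "localhost", 5432, "pethub")
print(uri)
```

URL-encoding the password matters because characters such as `@` or `:` in a raw password would otherwise break the URI.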

In this section, we installed PostgreSQL on our local machine. We created the database and the database user used in our project! The following section will walk through the basics of getting started with Amazon Web Services (AWS) by creating an account, a new S3 bucket, and a new IAM user for our pet and animal database project.

Getting Started with AWS (Optional)

Note: The following section is optional. Using S3 incurs a real (though small) cost, and new AWS users must enter credit card information to use the service. S3 will be used to store the raw data extracted from online sources. Your local hard drive would be just as sufficient for the time being, as the data volumes are small (a few megabytes per job). S3 standard storage costs $0.023 per gigabyte per month for the first 50 TB, so any cost incurred should be minor.

For brevity, please refer to Amazon’s official documentation on creating and activating an AWS account. After completing the process to create and activate a new AWS account, click on the ‘Services’ dropdown in the top-left corner of the AWS console and select ‘S3’ from the ‘Storage’ section.

Create an S3 bucket

To create a new S3 bucket, click the ‘Create bucket’ button in the top-right of the S3 console and follow the steps on the next screen. The bucket name must be unique across all bucket names in S3, so choosing something specific to the application is recommended. For example, I named my bucket ‘pethub-data’. You can leave the region as the default or select another region if desired. Ignore the remaining options and click the ‘Create bucket’ button at the bottom of the screen. The bucket to store the data we will extract and transform is now ready to go!
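Once extraction jobs start writing to the bucket, it helps to give raw files predictable object keys. One possible convention (the layout and names here are my own illustration, not an AWS or Petfinder requirement) partitions keys by source and run date:

```python
from datetime import date

def raw_key(source: str, run_date: date, filename: str) -> str:
    """Build an S3 object key for a raw extract, partitioned by source and run date."""
    return f"raw/{source}/{run_date:%Y/%m/%d}/{filename}"

print(raw_key("petfinder", date(2021, 8, 3), "animals.json"))
```

Date-partitioned prefixes like this make it easy to list or reprocess a single day's extracts later.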

As the last step, we will create a new user in the AWS account used specifically for our pet and animal data application.

Create a New IAM User

It is a general best practice to create new users and partition AWS services as required rather than using the root account. To create a new user in your AWS account, go to the IAM console by selecting Services in the top-left portion of the screen and selecting ‘IAM’ from the ‘Security, Identity, & Compliance’ section. On the IAM dashboard, select ‘Users’ from the navigation bar on the left side of the screen, then click the Add users button in the top-right of the Users console.

Name the user something related to the application and select ‘Programmatic access’ as the access type. Programmatic access gives the user an access key ID and secret access key for interacting with AWS programmatically; the user cannot sign in to the AWS Management Console.

Click the ‘Next: Permissions’ button to move to the next step. This step is where we will set the permissions for what the new user can access within the AWS account. The new user only needs access to S3, so select ‘Attach existing policies directly’ and filter policies by ‘s3’. Select ‘AmazonS3FullAccess’ then select the ‘Next: Tags’ button.

The following two steps can be skipped if desired. On Step 4, select the ‘Create user’ button and save the generated credentials, as this is the only time the secret access key is accessible.
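Rather than pasting those keys into code, a common pattern is to export them as the standard AWS environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, which SDKs such as boto3 pick up automatically) and read them at runtime. A minimal sketch, with a helper name of my own choosing:

```python
import os

def load_aws_credentials(env=os.environ):
    """Return (access_key, secret_key) from the standard AWS environment
    variables, with None for any variable that is unset."""
    return env.get("AWS_ACCESS_KEY_ID"), env.get("AWS_SECRET_ACCESS_KEY")

access_key, secret_key = load_aws_credentials()
```

Keeping credentials out of source code means they never end up in version control by accident.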

In this section, we got started with AWS by creating a new user and a bucket that will store the data we extract!

Conclusion

In this post, we walked through setting up a local PostgreSQL server instance and creating a new database, user, and schemas used in our pet and animal data project. We also introduced Amazon Web Services by creating a new account and a new bucket to store raw extracted data and a new user that will be used to interact programmatically with S3 in our application. In the next post, we will start extracting animal welfare organization data from the Petfinder API using Python and storing it in our newly created database and S3 bucket!


Aaron Schlegel

Seeker of knowledge, truth pilgrim, builder of things, science fiction lover, math and stats aficionado, data scientist. https://aaronschlegel.me