Building An ETL Pipeline With Python and PostgreSQL

David K
3 min read · Apr 27, 2023


Data Engineering and ETL Pipelines

An Overview Of Data Engineering and Pipelines

Data Engineering is the practice of using data technologies and programming concepts to collect and store data, making it accessible to Data Scientists and Data Analysts. Ideally, any company venturing into the data space should have both a Data Engineer and a Data Scientist. This setup keeps data operations running smoothly and gives each role a clear part to play in reaching the end goal.

An ETL pipeline is, at its simplest, a piece of code written to extract data from one or more sources, transform it (clean the data or change the form it comes in), and load it into a data lake, database, or data warehouse. The pipeline we will build in this article pulls jokes from an API, transforms the data, and loads it into a PostgreSQL database. Building data pipelines is a critical skill for data professionals, since analytics teams consistently need data made accessible to them in a timely fashion. Without further ado, let's jump right in, shall we?
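Before we get to the real code, here is a minimal, generic sketch of that extract-transform-load flow. The function names here are placeholders for illustration only, not the functions we will actually write below.

# a generic sketch of the ETL pattern (placeholder names, not the final code)
def extract():
    # pull raw data from one or more sources, e.g. an API
    ...

def transform(raw_data):
    # clean the data or change the form it comes in
    ...

def load(clean_data):
    # save the result to a data lake, database, or data warehouse
    ...

load(transform(extract()))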

API: https://official-joke-api.appspot.com/random_ten
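If you want a quick peek at what the endpoint returns before building the pipeline, a couple of lines of Python will do. The fields noted here (id, type, setup, punchline) are the ones we rely on later when inserting into the database.

# optional: inspect the API payload before building the pipeline
import requests

response = requests.get("https://official-joke-api.appspot.com/random_ten")
jokes = response.json()
print(len(jokes))        # the endpoint returns a list of ten joke objects
print(jokes[0].keys())   # expect fields such as id, type, setup, punchline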

Required libraries for this project: json and configparser (both ship with Python's standard library), plus pandas, requests, and SQLAlchemy. SQLAlchemy also needs a PostgreSQL driver such as psycopg2 to use a postgresql:// connection string.
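If you don't have the third-party packages yet, they can be installed with pip; psycopg2-binary is assumed here as the PostgreSQL driver, since the original setup doesn't name one explicitly.

# third-party dependencies (psycopg2-binary is an assumed driver choice)
pip install pandas requests sqlalchemy psycopg2-binary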

You will need to download and install PostgreSQL for this project as well. You can download PostgreSQL via this link. During the setup process, remember to use a database password you can easily recall.
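Note that the pipeline below creates a table but not the database itself, so the database named in your credentials file must already exist. You can create it with pgAdmin or psql; here is a rough SQLAlchemy sketch that does the same thing (the credentials and the jokes_db name are placeholders).

# the pipeline creates a table but not the database, so create the database first
# (all names and credentials here are placeholders)
from sqlalchemy import create_engine, text

# connect to the default "postgres" maintenance database to issue CREATE DATABASE
admin_engine = create_engine(
    "postgresql://username:password@localhost:5432/postgres",
    isolation_level="AUTOCOMMIT",  # CREATE DATABASE cannot run inside a transaction
)
with admin_engine.connect() as conn:
    conn.execute(text("CREATE DATABASE jokes_db"))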

Let’s jump right in!!

# importing relevant packages
import json
import configparser

import pandas as pd
import requests
from sqlalchemy import create_engine


# function to extract data from the API and convert it into a DataFrame
def get_data():
    url = r"https://official-joke-api.appspot.com/random_ten"
    response = requests.get(url)
    data = json.loads(response.text)

    # this normalizes the semi-structured data and turns it into a DataFrame
    dataframe = pd.json_normalize(data=data)
    return dataframe


# function to save the data into a PostgreSQL database
def commit_to_postgres():

    # creating a ConfigParser object
    config = configparser.ConfigParser()
    # reading the configuration file
    config.read('postgres_db_credentials.txt')

    # reading credentials from the file
    username = config.get('Credentials', 'username')
    host = config.get('Credentials', 'host')
    password = config.get('Credentials', 'password')
    port = config.get('Credentials', 'port')
    db_name = config.get('Credentials', 'db_name')

    # building the SQLAlchemy engine from the credentials
    engine = create_engine(
        'postgresql://{0}:{1}@{2}:{3}/{4}'.format(
            username,
            password,
            host,
            port,
            db_name
        )
    )

    # SQL statement to create the table that will hold our data
    create_table_query = """
        CREATE TABLE jokes_data(
            type text,
            setup text,
            punchline text,
            id integer primary key
        )
    """

    # a raw database connection that allows direct interaction with the database
    connection = engine.raw_connection()

    # the cursor allows us to execute queries and retrieve results from the database
    cursor = connection.cursor()

    # creating the table using the cursor
    cursor.execute(create_table_query)

    # storing the result of the extraction function in a variable
    dataframe = get_data()

    # pushing the data into the database row by row
    for _, row in dataframe.iterrows():
        cursor.execute(
            "INSERT INTO jokes_data (id, type, setup, punchline) VALUES (%s, %s, %s, %s)",
            (
                row["id"],
                row["type"],
                row["setup"],
                row["punchline"],
            ),
        )

    # committing the current transaction to the database
    connection.commit()

    # closing the cursor
    cursor.close()
    # closing the connection
    connection.close()


# running the pipeline; commit_to_postgres() calls get_data() internally,
# so a separate get_data() call here would only make an extra API request
commit_to_postgres()

You can find the code in my GitHub repository here.

Kindly note that there is a configuration file in the provided repository, which you’d have to edit to include your database credentials.
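If you are not cloning the repository, you can create postgres_db_credentials.txt yourself. It is a plain INI file read by configparser, with a [Credentials] section matching the keys used in the code; the values below are placeholders to replace with your own.

[Credentials]
username = postgres
host = localhost
password = your_password_here
port = 5432
db_name = jokes_db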

Additionally, there is a folder named Airflow in the repository. That folder will be used in the next part of this article, where we'll automate this particular pipeline. Great, now that everything is clarified, let's keep going.

After cloning the GitHub repository and running the code above, you should see the data from the API reflected in your Postgres database, as shown below.
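If you'd rather verify the load from Python than from a database client, you can read the table straight back into pandas (the connection string below uses placeholder credentials).

# optional sanity check: read the freshly loaded table back into a DataFrame
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://username:password@localhost:5432/db_name")
print(pd.read_sql("SELECT * FROM jokes_data LIMIT 5;", engine))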

In my next article here, we'll go through how to automate this data pipeline with Apache Airflow. Sounds like some cool stuff! Hang around and we'll catch up on it later. Until then, congratulations on making it this far. Connect with me on LinkedIn here and feel free to leave any questions in my inbox.

Thank you!
