A Beginner’s Guide to ETL Using Shell Scripts
As an aspiring data engineer, one of the skills that will make you a top talent is knowing how to build ETL data pipelines. In this article, I will walk through a simple, beginner-friendly ETL process using shell scripting.
Overview
The resulting script from this walkthrough is designed to extract specific information from the /etc/passwd file, transform it into CSV format, and then load it into a table called ‘users’ in a PostgreSQL database.
The script first extracts the user name, user id, and home directory path of each user account defined in the /etc/passwd file using the ‘cut’ command. The extracted data is saved to a text file named “extracted-data.txt”.
Next, the script transforms the extracted data into CSV format by replacing the ‘:’ delimiter with a ‘,’ delimiter using the ‘tr’ command. The transformed data is then saved to a CSV file named “transformed-data.csv”.
Finally, the script loads the transformed data from the CSV file into a table called ‘users’ in the PostgreSQL database. This is done using the ‘psql’ command and the COPY command to copy the data from the CSV file into the table.
Extract data: Introducing ‘cut’ command
Extracting data using ‘cut’ command
The ‘cut’ command is a Linux command-line utility used to extract specific sections of text from a file or a stream of text based on a defined delimiter, such as a space, tab, or comma.
echo "database" | cut -c1-4
echo "database" | cut -c5-8
echo "database" | cut -c1,5
The commands above extract characters from a string using the ‘cut’ command. Here’s what each command does:
- echo "database" | cut -c1-4: This command extracts the first four characters of the string “database” and prints them to the terminal. The ‘-c’ option specifies the character range to be extracted, here from the first to the fourth character.
- echo "database" | cut -c5-8: This command extracts the characters from the fifth to the eighth position in the string “database” and prints them to the terminal.
- echo "database" | cut -c1,5: This command extracts the first and the fifth characters of the string “database” and prints them to the terminal. Positions separated by a comma are extracted individually rather than as a range.
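Besides character positions, ‘cut’ can also split a line on a delimiter and select whole fields with the ‘-d’ and ‘-f’ options. This is the form we will use on /etc/passwd later; here is a quick illustration on a comma-separated string:

```shell
# '-d' sets the delimiter, '-f' picks the field numbers.
echo "one,two,three" | cut -d"," -f2
# prints: two

# Fields can also be given as a comma-separated list.
echo "one,two,three" | cut -d"," -f1,3
# prints: one,three
```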
Transform text data: Introducing ‘tr’ command
Transforming data using the ‘tr’ command:
‘tr’ is a filter command used to translate and/or delete characters. For example, it can translate all lowercase characters in a string to uppercase using either of the following commands:
echo "Shell Scripting" | tr "[a-z]" "[A-Z]"
echo "Shell Scripting" | tr "[:lower:]" "[:upper:]"
To convert the characters from uppercase to lowercase, we simply interchange the two sets:
echo "Shell Scripting" | tr "[A-Z]" "[a-z]"
We can also use ‘tr’ with the ‘-d’ option to delete specified characters. The command below takes the string “My login pin is 5634” and returns it without the digits:
echo "My login pin is 5634" | tr -d "[:digit:]"
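The ‘-d’ option works with any set of characters, not just the ‘[:digit:]’ class. For example, deleting all spaces from a string:

```shell
# Delete every space character from the input string.
echo "Shell Scripting" | tr -d " "
# prints: ShellScripting
```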
Connecting to a PostgreSQL Database
Start your database. For this lab we will use PostgreSQL, a free and open-source relational database management system emphasizing extensibility and SQL compliance. To start the PostgreSQL server, run the following command in the terminal.
start_postgres
Note down the access information presented towards the end of these messages, especially the CommandLine:
A sample command line looks like this:
psql --username=postgres --host=localhost
Running this command from the shell prompt will start the interactive ‘psql’ client, which connects to the PostgreSQL server.
Create a table
In this exercise we will create a table called ‘users’ in the PostgreSQL database. This table will hold the user account information.
The table ‘users’ will have the following columns:
- username
- userid
- homedirectory
Connect to your database, using this command:
\c <your-database-name>
We are using a database called template1 in this lab so our command for connecting to this database is:
\c template1
Once we are inside the database, our prompt will change from ‘postgres=#’ to ‘template1=#’.
We will use the following statement to create a table named ‘users’:
create table users(username varchar(50), userid int, homedirectory varchar(100));
Now we have a table into which we can load data. To exit the ‘psql’ client and return to the Linux shell, run the following command:
\q
Load data into a database using shell commands
Here, we will create a shell script which does the following.
- Extract the user name, user id, and home directory path of each user account defined in the /etc/passwd file.
- Save the data into a comma separated (CSV) format.
- Load the data in the csv file into a table in PostgreSQL database.
Create a shell script in a new file named “csv2db.sh” and paste in the following comments, which state the objective of the script:
# This script
# Extracts data from /etc/passwd file into a CSV file.
# The csv data file contains the user name, user id and
# home directory of each user account defined in /etc/passwd
# Transforms the text delimiter from ":" to ",".
# Loads the data from the CSV file into a table in PostgreSQL database.
Extract required user information from /etc/passwd
In this step, we will extract the user name (field 1), user id (field 3), and home directory path (field 6) from the /etc/passwd file using the ‘cut’ command.
The /etc/passwd file uses the ‘:’ symbol as its field separator.
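Before editing the script, you can sanity-check the command on a single sample line. The entry below is a typical /etc/passwd line used purely for illustration; the entries on your system will differ:

```shell
# A /etc/passwd entry has seven ':'-separated fields;
# fields 1, 3 and 6 hold the user name, user id and home directory.
echo "root:x:0:0:root:/root:/bin/bash" | cut -d":" -f1,3,6
# prints: root:0:/root
```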
Copy the following lines and add them to the end of the script.
# Extract phase
echo "Extracting data"
# Extract the fields 1 (user name), 3 (user id) and
# 6 (home directory path) from /etc/passwd
cut -d":" -f1,3,6 /etc/passwd
Save the file and run the script using the following command in your terminal:
bash csv2db.sh
Replace the ‘cut’ command at the end of the script with the following command:
cut -d":" -f1,3,6 /etc/passwd > extracted-data.txt
Save the file and run the script again. Now you have the data in a text file named “extracted-data.txt”. You can open the file to verify that the data is still separated by ‘:’.
Transform the data into CSV format
The extracted columns are separated by the original “:” delimiter.
We need to convert this into a “,” delimited file.
Add the lines below at the end of the script:
# Transform phase
echo "Transforming data"
# read the extracted data and replace the colons with commas.
tr ":" "," < extracted-data.txt
Save and run the script.
After verifying that the output contains ‘,’ as the delimiter in place of ‘:’, replace the ‘tr’ command at the end of the script with the command below.
tr ":" "," < extracted-data.txt > transformed-data.csv
After saving and running the new version of the script, run the command below to verify that the file ‘transformed-data.csv’ has been created and contains the transformed data.
cat transformed-data.csv
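The extract and transform phases can also be chained into a single pipeline, which is handy for a quick end-to-end check. The sample entry below is illustrative, not taken from your actual /etc/passwd:

```shell
# Extract fields 1, 3 and 6, then replace ':' with ',' in one pipeline.
echo "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin" | cut -d":" -f1,3,6 | tr ":" ","
# prints: daemon,1,/usr/sbin
```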
Load the data into the table ‘users’ in PostgreSQL
To load data from a shell script, we will use the ‘psql’ client utility in a non-interactive manner. This is done by sending the database commands to ‘psql’ through a command pipeline with the help of the ‘echo’ command. Add the following lines, which use the ‘COPY’ command, to the end of the shell script and save the file.
# Load phase
echo "Loading data"
# Send the instructions to connect to 'template1' and
# copy the file to the table 'users' through command pipeline.
echo "\c template1;\COPY users FROM '/home/project/transformed-data.csv' DELIMITERS ',' CSV;" | psql --username=postgres --host=localhost
After running the final version of the script, run the command below to verify that the table “users” is populated with the data.
echo '\c template1; \\ SELECT * FROM users;' | psql --username=postgres --host=localhost
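For reference, here is a consolidated sketch of how the finished csv2db.sh might look once the three phases are assembled. The guard around ‘psql’ is my addition so the extract and transform phases can be tried without a running database; the lab script calls ‘psql’ unconditionally:

```shell
#!/bin/bash
# csv2db.sh : sketch of the full ETL script assembled from the steps above.

# Extract phase: user name, user id and home directory from /etc/passwd.
echo "Extracting data"
cut -d":" -f1,3,6 /etc/passwd > extracted-data.txt

# Transform phase: change the ':' delimiter to ','.
echo "Transforming data"
tr ":" "," < extracted-data.txt > transformed-data.csv

# Load phase: copy the CSV into the 'users' table (skipped if psql is absent).
echo "Loading data"
if command -v psql >/dev/null 2>&1; then
    echo "\c template1;\COPY users FROM '/home/project/transformed-data.csv' DELIMITERS ',' CSV;" | psql --username=postgres --host=localhost
else
    echo "psql not found; skipping load phase"
fi
```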
Thank you for following along. You can find the entire script here.