Data Engineering — ETL Project with Amazon S3, Snowflake and Power BI.

Musili Adebayo
Art of Data Engineering
6 min read · Jan 16, 2024
ETL Projects of UK Licensed Sponsors.

This is an end-to-end project with two main goals.

  1. To extract, transform, and load (ETL) a flat file from its source through Amazon S3 storage into Snowflake, with a final visualization in Power BI.
  2. To build a website-style visualization in Power BI so that end users can easily search for companies with Tier 2 visa sponsorship in the United Kingdom (UK).

So, let’s get started.

Step 1: First, we set up a free trial account for both Snowflake and AWS. You can get the flat file used in this project from the GOV.UK website.

Starting with AWS:

Step 2: Creating an S3 Bucket in AWS.
Create a uniquely named S3 bucket, musilisnowflakes3bucket, where all our files will be stored in AWS.

Creating S3 Bucket.

Step 3: Creating a Connection between Snowflake and AWS.
This involves creating a role (snowflakesponsor_access_role) in AWS Identity and Access Management (IAM), which lets us securely grant Snowflake access to AWS services. To read data from or write to an S3 bucket, the security and access management policies on the bucket must allow Snowflake to access it. For this project, AmazonS3FullAccess was the permission policy attached to the role.

Creating a Role in AWS Identity and Access Management (IAM).

More about how I created the snowflakesponsor_access_role above:
When you create the role, you will see the interface below. Click Another AWS account, enter your AWS account ID, and tick the Require external ID box. Enter any dummy value for now; we will replace it later with the value Snowflake provides. When you click Next: Permissions, select AmazonS3FullAccess as the permission policy.

Image from Snowflake Documentation.

Back to Snowflake:

Step 4: Creating Integration Objects.
At this point, we should be up and running in Snowflake after successfully signing up for the free trial account. The first step is to create a storage integration object in Snowflake, which lets us fetch data from our S3 bucket without passing explicit AWS credentials, because authentication is delegated to the IAM role we created. The result of the SQL script below provides the information we need to complete the connection.

/* Creating the storage integration object in order to have access to
credentials and connect to AWS S3. */

// Create storage integration object
create or replace storage integration s3_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::494962081781:role/snowflakesponsor_access_role'
STORAGE_ALLOWED_LOCATIONS = ('s3://musilisnowflakes3bucket/csv/')
COMMENT = 'Please note that access was only given to the CSV files';

/* Using the DESC INTEGRATION command to retrieve the STORAGE_AWS_IAM_USER_ARN
and STORAGE_AWS_EXTERNAL_ID so we can update the IAM role trust policy in AWS */

DESC integration s3_integration;
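If you later create stages with a role other than ACCOUNTADMIN, that role also needs usage on the integration. A minimal sketch, assuming the default SYSADMIN role:

// Optional: let another role (here SYSADMIN) reference the integration when creating stages
GRANT USAGE ON INTEGRATION s3_integration TO ROLE SYSADMIN;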

Step 5: Loading and Transforming Data from S3.
1. Before we load the data into Snowflake from AWS S3 storage, we need to create a table (sponsor_list) in which the data will be stored.

-- First, we create the table the data will be loaded into
CREATE OR REPLACE TABLE SPONSOR_LIST_DB.PUBLIC.sponsor_list(
organisation_name STRING,
town_city STRING,
county STRING,
type_rating STRING,
route STRING);
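Note that this statement assumes the SPONSOR_LIST_DB database already exists, and the next step assumes a MANAGE_DB database with file_formats and external_stages schemas. If you are following along on a fresh trial account, a minimal sketch to create them first:

-- Create the databases and schemas used throughout this project (run once)
CREATE DATABASE IF NOT EXISTS SPONSOR_LIST_DB;
CREATE DATABASE IF NOT EXISTS MANAGE_DB;
CREATE SCHEMA IF NOT EXISTS MANAGE_DB.file_formats;
CREATE SCHEMA IF NOT EXISTS MANAGE_DB.external_stages;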

2. Creating a file format object and external stage: When you load data files from an S3 bucket into a table, you must describe the format of the file and specify how the data in the file should be interpreted and processed. As a personal preference, I created a database named manage_db to keep all file formats and stages in a separate database.

-- Create file format object
CREATE OR REPLACE file format MANAGE_DB.file_formats.csv_fileformat
type = csv
field_delimiter = ','
skip_header = 1
null_if = ('NULL','null')
empty_field_as_null = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"' ;

/* FIELD_OPTIONALLY_ENCLOSED_BY = '"' was used to handle the error
"Field delimiter ',' found while expecting record delimiter '\n'"
while loading CSV data*/

---------------------------------------------------------------------

/* Create a stage (csv_folder) object that references the storage
integration object and the file format object */
CREATE OR REPLACE stage MANAGE_DB.external_stages.csv_folder
URL = 's3://musilisnowflakes3bucket/csv/'
STORAGE_INTEGRATION = s3_integration
FILE_FORMAT = MANAGE_DB.file_formats.csv_fileformat
COMMENT = 'I only gave access to the CSV files in the S3 bucket.';

3. Loading the data from the external stage we created earlier into the target table sponsor_list.

-- Load data from staged files into the target table with the Copy command.      
COPY INTO SPONSOR_LIST_DB.PUBLIC.sponsor_list
FROM @MANAGE_DB.external_stages.csv_folder;

The result shows that all our data was successfully parsed and loaded, with no errors reported for the table.
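If you also want to confirm the load in SQL, Snowflake's COPY_HISTORY table function reports rows parsed, rows loaded, and any errors per file. A minimal sketch, assuming the load ran within the last 24 hours:

-- Check recent load activity for the sponsor_list table (last 24 hours)
SELECT file_name, row_parsed, row_count, error_count, status
FROM TABLE(SPONSOR_LIST_DB.INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'SPONSOR_LIST',
    START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())
));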

Data successfully loaded from the S3 Bucket.

4. Let us see a preview of the data.

-- View the data
SELECT * FROM SPONSOR_LIST_DB.PUBLIC.sponsor_list;

5. A preview result of the data in the sponsor_list table.

A preview of the data in the sponsor_list table.
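Since the second goal is to let end users search for sponsors, it is worth sanity-checking that use case directly in SQL. The search term below is only illustrative:

-- Example: search for sponsors by (partial) organisation name
SELECT organisation_name, town_city, route
FROM SPONSOR_LIST_DB.PUBLIC.sponsor_list
WHERE organisation_name ILIKE '%university%';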

Step 6: Setting up a Snowpipe
Snowpipe automatically fetches data once it is available in an AWS S3 bucket. Once the event notification is set up in S3, it triggers Snowpipe to load new files from the stage into the target table as soon as they arrive.

-- Creating our Pipe Schema for storing all pipes
CREATE OR REPLACE SCHEMA MANAGE_DB.pipes;

-- Define pipe
CREATE OR REPLACE pipe MANAGE_DB.pipes.sponsor_pipe
auto_ingest = TRUE
AS
COPY INTO SPONSOR_LIST_DB.PUBLIC.sponsor_list
FROM @MANAGE_DB.external_stages.csv_folder;

/* Describe the pipe to get the notification channel needed to connect
Snowpipe to the event notification in the S3 bucket */
DESC pipe MANAGE_DB.pipes.sponsor_pipe;
  1. The result of the DESC pipe command includes the notification_channel ARN needed to connect Snowpipe to the S3 event notification.
A Snowpipe notification_channel for event notification in Amazon S3.

2. Next, create the event notification in the S3 bucket so that it triggers Snowpipe to load data into Snowflake automatically once new files are available. Select the notification type you prefer (for example, object creation or deletion), make sure SQS Queue is the destination type, and paste the SQS queue notification_channel value from the previous step into the destination, as seen below.

Event Notification Successfully Set up.
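To confirm the pipe is wired up correctly, and to backfill any files that were already in the bucket before the notification existed, two commands are handy. A minimal sketch:

-- Check that the pipe is running and listening on its notification channel
SELECT SYSTEM$PIPE_STATUS('MANAGE_DB.pipes.sponsor_pipe');

-- Optionally ask the pipe to load files already sitting in the stage
ALTER PIPE MANAGE_DB.pipes.sponsor_pipe REFRESH;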

Step 7: Power BI Visualization.
The analytical part of this project is to connect Snowflake to Power BI and analyze the data, giving our end users a website-like experience through the visualization tool.
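Rather than pointing Power BI at the raw table, one option is to expose a curated view so the report keeps a stable set of columns even if the table changes. This is only a sketch; the view name is illustrative and not part of the original setup:

-- A curated view the Power BI report could read from (illustrative name)
CREATE OR REPLACE VIEW SPONSOR_LIST_DB.PUBLIC.v_sponsor_search AS
SELECT organisation_name, town_city, county, type_rating, route
FROM SPONSOR_LIST_DB.PUBLIC.sponsor_list;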

  1. The successful connection between Power BI and Snowflake.
A successful connection between Power BI and Snowflake.

2. Finally, the interactive website dashboard of the Register of Licensed Sponsors, the organisations that can offer visa sponsorship to workers on the Worker and Temporary Worker immigration routes in the UK.

Website Dashboard of the UK Register of Licensed Sponsors.

You can interact with the website dashboard here.

Conclusion.
This project extracted, transformed, and loaded the list of UK Tier 2 sponsors using Amazon S3 and Snowflake, and used Power BI to visualize the loaded data.

Thank you for reading.

Remember to subscribe and follow me for more insightful articles, and you can reach me via my socials below if you want to discuss further.
LinkedIn: Musili Adebayo
Twitter: Musili_Adebayo

P.S. Enjoy this article? Support my work directly.
