End to End Data Engineering Project #3 — Pt 1/4: Production-Level Migration from S3 to Snowflake using Docker, DBT and AWS
That title is quite a tongue twister, huh? Well, don’t let it overwhelm you. This project (like many other mainstream projects) is by no means a tough one. If you opened this link in hopes of getting started with a data engineering project, then I assure you it will be worth your time and effort. You will learn a LOT if you follow along with the steps rather than just skimming through the article. I have always found trial and error to be my best teacher; making mistakes is the fastest way to learn something, and I have personally learnt more through my mistakes than by sticking to tutorials. I will try to make this as much fun as possible for you, and in return all I ask is that you provide me with feedback (so that I can write better articles in the future!).
NOTE: If you want to jump straight to the instructions for the project, then feel free to scroll to the “Instructions” section and get started. For the next few paragraphs, I will briefly go over a few concepts that are fundamental to this article, give a little context on the “why”, “what” and “how”, and then jump straight into execution. With that said, let’s begin!
Why would you want to learn this project?
Let’s say you are working at a company like Uber. As a data engineer, one of your main tasks would be to constantly move data around in the most efficient manner possible. So what is the reason for data migration?
Data crunching
If you have ever worked with an S3 bucket, then you would know that it is pretty tough to analyse user data there through simple queries. You could do it, but it is going to be painful for sure. You would have much better luck doing that in a data warehouse like… drumroll… Snowflake!
And that is exactly what we will be doing in this project. We will bring data from an S3 bucket into Snowflake, performing data transformations along the way (using the DBT tool).
What is the DBT tool?
DBT (Data Build Tool) is a crucial tool for performing data transformations. Unlike other ETL tools, DBT doesn’t focus much on extraction or loading.
Its main advantage lies in performing transformations using queries. With DBT, you can write SQL queries to extract, clean, and transform raw data into structured, analysis-ready datasets. It enables analysts and data engineers to build and manage data transformation pipelines efficiently.
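To make that a little more concrete, here is a minimal sketch of what a DBT model looks like. A model is just a SELECT statement saved in a .sql file, and DBT materializes it as a table or view for you. The table and column names below (raw_orders, order_id, customer_email, amount) are hypothetical, purely for illustration.
-- models/orders_cleaned.sql (hypothetical model name)
-- DBT turns this SELECT into a table or view inside your warehouse.
select
    order_id,
    lower(customer_email) as customer_email,  -- light cleaning
    amount
from raw_orders                               -- hypothetical raw table
where amount is not null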
How are we using DBT for this project?
We are using DBT to stage data from the external stage (external table) into an internal table in Snowflake. That loading-and-transforming step inside Snowflake is the part handled by the DBT tool.
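To give you a feel for what that means in practice, here is a rough, hedged sketch of the kind of DBT model we will build in Part 2. It assumes the external stage AWS_STG_OP_A that we create later in this article; the model and column names are hypothetical.
-- models/stg_from_s3.sql (hypothetical model name; the real one comes in Part 2)
-- Materialized as a table, this copies the staged S3 data into an internal Snowflake table.
{{ config(materialized='table') }}

select
    $1 as column_one,   -- hypothetical column names
    $2 as column_two
from @AWS_STG_OP_A      -- the external stage we create further down in this article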
What is Docker?
Imagine you have a house. In your house you have multiple rooms, each with its own contents inside.
Docker works in a similar way. You have a main OS sitting on your desktop (in my case it is Windows), and within this main OS we can run multiple small containers, each with its own isolated slice of resources while sharing the host machine underneath.
Within each of these containers, you can work as if you were in a completely different environment (separate from your own).
How is Docker useful for my application?
We will be using Docker to package the DBT tool (along with its external dependency libraries) on a Linux-based OS and build them into a Docker image. Docker will maintain a separate, isolated space for executing the DBT-related code.
AWS Services
The AWS services will mainly be used for orchestration purposes.
1. ECR (Elastic Container Registry): Once the Docker image is built, we will use ECR to store it. This makes it easy to run containers directly from AWS. A good analogy: think of ECR as a mailbox. You place an order for a new video game like God of War on Amazon; once the package is delivered to your mailbox, you can fetch it from there at any moment.
It is the same with ECR. Your ECR repository acts like a set of boxes (images), and you can pull the image that you want from it at any moment.
2. ECS (Elastic Container Service): Continuing our earlier analogy, once we receive our video game, we want to try it out by playing it on our new PS5 console. Think of ECS as the PS5 console that allows you to play the video game.
ECS is responsible for pulling the image and creating containers with their own memory space, which allows us to run the DBT package.
3. Step Functions: Think of Step Functions as our schedule. Our schedule dictates how our day goes: instead of us having to decide what we want to do, it tells us what needs to be done and when. It is the same with Step Functions. Instead of having to start an ECS task manually, we can create a Step Functions workflow that executes a particular task based on custom logic (which usually depends on business use cases). You can’t keep playing the PS5 the whole day now, can you? You get to play it when you’re back from work, when you don’t have any other work, or on the weekends.
I hope that gave you a brief idea of what we will be working on. Now that we have a good understanding of the architecture, let’s go ahead and code this out!
A few Prerequisites
Make sure that you have the following items pre-configured on your system:
1) Python and pip configured. Make sure the version is 3.10, because I have had issues with Python 3.8 while attempting to install dbt. Link for reference: https://www.youtube.com/watch?v=dYfKJMPNMDw&pp=ygUlY29uZmlndXJpbmcgcHl0aG9uIGFuZCBwaXAgb24gd2luZG93cw%3D%3D
2) Visual Studio Code editor (optional)
3) Docker Desktop (for Windows): You can find the download link right over here: https://www.docker.com/products/docker-desktop/
4) An AWS account
5) An active Snowflake account
1. Getting started with S3 and Snowflake
Before we start working on this, we need to create an external stage that exposes the data sitting in S3.
- Go to your AWS console and open your S3 bucket. Make sure that you have an S3 bucket with the relevant folder and file inside it. If you don’t have one, you can create one easily: create a bucket, a folder inside it, and a CSV file inside that folder.
Repeating what I just said, because this is crucial for the project: the structure is Bucket > Folder > File (CSV file). Make sure that you have this structure. If you don’t, create a CSV file inside a folder within a bucket.
5) Create a new IAM user so that Snowflake can get access to S3. Go to the IAM console.
6) Go to Users.
7) Click “Create user”.
8) Set a name for the user and then click “Next”.
9) Choose “Attach policies directly”.
10) Type S3FullAccess (or at least part of it) and “AmazonS3FullAccess” should appear in the search results. Select it and then click “Next”.
11) Then click “Create user”.
12) Select the user that you have created.
13) You should see the attached permissions.
14) If you don’t see those permissions, select “Add permissions” and repeat steps 10 to 13.
15) Go back to Users and select the user that you have created. In my case it is the snowflake admin user.
16) Once you have selected your user, we will create security credentials so that Snowflake can get access to your S3 bucket. Select “Security credentials”.
17) Scroll down to the “Access keys” section and click “Create access key”.
18) Select the option “Application outside AWS”.
19) (Optional) Provide a key tag.
20) Hit “Create access key”.
21) Once you do that, you should see your “Access key” and “Secret access key”.
22) Copy both of these into a notepad or text file, because we will be using them elsewhere. Also note that once you exit this page, you will never be able to view your secret access key again. So make absolutely sure that you have both details saved in a text file or a sticky note.
23) Navigate to your Snowflake account and type in the following query. Please feel free to replace AWS_STG_OP_A with whatever name you are comfortable with, but make sure to use that same name consistently through to the end.
CREATE OR REPLACE STAGE AWS_STG_OP_A
    url = 's3://demo-bucket-json-concatenator/aggregate_folder/'
    credentials = (
        aws_key_id = '<your access key saved in step 21>'
        aws_secret_key = '<your secret access key saved in step 21>'
    )
    file_format = (type = CSV);
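A quick, optional aside: my CSV did not have a header row, but if yours does (or uses a different delimiter), Snowflake lets you spell that out in the file format. The variation below is just a sketch of the same stage with a couple of commonly used CSV options; adjust or drop them to match your own file.
-- Same stage as above, but with a file format that skips one header row
-- and states the delimiter explicitly. Only needed if your CSV has these traits.
CREATE OR REPLACE STAGE AWS_STG_OP_A
    url = 's3://demo-bucket-json-concatenator/aggregate_folder/'
    credentials = (
        aws_key_id = '<your access key saved in step 21>'
        aws_secret_key = '<your secret access key saved in step 21>'
    )
    file_format = (type = CSV field_delimiter = ',' skip_header = 1);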
24) Then type in the following query to check that the data can be read properly through the stage:
select $1,$2 from @AWS_STG_OP_A;
25) (Informational) Notice the $1, $2 in the above query? They represent the columns of the staged external file. You need to reference exactly as many of them as your CSV file has columns.
For example: in my CSV file there are two columns, hence I have $1 and $2 (two columns, each represented by $1 and $2).
So if your CSV file has, let’s say, 5 columns, then you would need to write a query like this: ‘select $1,$2,$3,$4,$5 from @AWS_STG_OP_A;’
Bottom line: make sure that the number of positional columns ($1, $2, $3, …) you put into your query matches the number of columns in your CSV file. Capiche?
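One small extra that I find handy: you can alias the positional columns so the output is easier to read. The column names below are purely hypothetical, so swap them for whatever your own CSV actually contains.
-- Hypothetical column names; replace them with the ones in your CSV file.
select
    $1 as customer_id,
    $2 as order_total
from @AWS_STG_OP_A;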
Conclusion
If you were able to follow all the instructions and you understood the purpose of all the tasks, then congratulations 🎉🎉!
You just built your first staging setup, where you exposed data from S3 to Snowflake. While this may look easy in principle, you would be surprised how many organisations rely on this simple procedure. There are nuances to it, but the bottom line remains the same: data migration into warehouses is crucial for a lot of organisations.
Great work! In the next part we will get our hands dirty with the DBT tool, so stay tuned for Part 2!
I hope this article was informative. If you found it helpful, share it and follow for more such content. And reach out to me if you are stuck at any point. Until then
… To infinity and beyond.