Building a Data Pipeline Using AWS Glue Studio
A couple of months ago, I covered how I built a batch data pipeline from AWS RDS to Google BigQuery using AWS Data Pipeline. Today, I will cover building a different data pipeline using AWS Glue.
There are comprehensive articles out there comparing the pros and cons of AWS Glue and the AWS Data Pipeline service. But for the task I was working on, I ran into several limitations with the Data Pipeline service.
First, if there is any schema change in our RDBMS table (for example, a new column is added), we have to redefine the pipeline in the Data Pipeline service and rerun it. Second, the Data Pipeline service does not support the Parquet format, which is the format I want, since our aggregation queries in the data warehouse engine run much faster on Parquet than on CSV.
So what exactly is AWS Glue Studio?
AWS Glue Studio provides a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. It helps us visualize data transformation workflows and seamlessly run them on AWS Glue’s Apache Spark-based serverless ETL engine. You can inspect the schema and data results at each step of the job.
Before using Glue Studio, we have to set up the Glue Data Catalog. Since there are a lot of tutorials covering this in depth, I will not cover it much here. Here is the guide to set one up: you need to set up a Glue Crawler first, set an IAM role, set the bucket for your Data Catalog, and so on.
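If you prefer to script this setup, the same thing can be done with boto3. The snippet below is only a minimal sketch: it assumes an IAM role (my-glue-role) with Glue permissions and an existing Glue connection to the RDS instance (my-rds-connection); the crawler, database, and table names are all hypothetical.

```python
import boto3

glue = boto3.client("glue", region_name="ap-southeast-1")  # hypothetical region

# Create a crawler that reads the RDS table through an existing Glue connection
# and registers its schema in the Data Catalog database "my_catalog_db".
glue.create_crawler(
    Name="rds-orders-crawler",        # hypothetical crawler name
    Role="my-glue-role",              # IAM role with Glue and source permissions
    DatabaseName="my_catalog_db",     # Data Catalog database to populate
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "my-rds-connection",  # existing Glue connection to RDS
                "Path": "mydb/public/orders",           # database/schema/table to crawl
            }
        ]
    },
)

# Run the crawler once so the table shows up in the Data Catalog.
glue.start_crawler(Name="rds-orders-crawler")
```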
In this article, I will walk through the approach I used to build a Glue pipeline inside AWS.
Click on the Glue Studio tab
Click Create New Job
In the Data Source node, select the database that you have defined in the Data Catalog, then choose the table
When you click the ApplyMapping node, you can change a field's data type or drop fields inside the Transform tab
After you finish defining your transformation, click the Data Target node. Here, you will have to define the format of your dataset, the type of compression you want for that format, and lastly the location of your target.
Once finished, click Script to see the Spark script produced based on the parameters you selected. You can edit this Spark script later to fit your transformation goal.
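For reference, the generated script roughly follows the shape below. This is a hand-written sketch rather than the exact output of Glue Studio: the database, table, column names, and S3 path are placeholders, and I assume Parquet with Snappy compression as the target format.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Data source: read the table that the crawler registered in the Data Catalog.
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db",   # placeholder Data Catalog database
    table_name="orders",        # placeholder table
    transformation_ctx="source",
)

# ApplyMapping: rename/retype columns; anything not listed here is dropped.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "int", "id", "int"),
        ("amount", "decimal", "amount", "double"),
        ("created_at", "timestamp", "created_at", "timestamp"),
    ],
    transformation_ctx="mapped",
)

# Data target: write Parquet with Snappy compression to the target S3 location.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/orders/"},  # placeholder path
    format="parquet",
    format_options={"compression": "snappy"},
    transformation_ctx="sink",
)

job.commit()
```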
After creating the job in Glue Studio, we need to set up a trigger inside a Workflow in order to run the job on the schedule we want.
Go to the Workflows tab under ETL
Click Add Workflow
Give your workflow a name, then click Add Workflow; this will bring you back to the workflow dashboard. Click on the workflow that you just created
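The same workflow can also be created with boto3 if you want to script it; a minimal sketch with a hypothetical workflow name:

```python
import boto3

glue = boto3.client("glue")

# Create an empty workflow; the trigger and jobs are attached to it afterwards.
glue.create_workflow(
    Name="daily-rds-to-s3",                            # hypothetical workflow name
    Description="Runs the Glue Studio job once a day",
)
```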
Now, we need to add a trigger for our pipeline to run. Click Add Trigger
Since we haven't set up any trigger before, just click Add New
In our case, we want the trigger to be schedule-based and fire once every day, so we select Schedule as the trigger type, Daily as the frequency, and 2 pm as the start hour.
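The equivalent trigger can also be defined with boto3. The sketch below reuses the hypothetical workflow and job names from earlier; the cron expression fires daily at 14:00 UTC, so adjust it to your timezone.

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger inside the workflow, firing once a day at 2 pm (UTC).
glue.create_trigger(
    Name="daily-2pm-trigger",                    # hypothetical trigger name
    WorkflowName="daily-rds-to-s3",              # workflow created above
    Type="SCHEDULED",
    Schedule="cron(0 14 * * ? *)",               # every day at 14:00 UTC
    Actions=[{"JobName": "rds-to-s3-parquet"}],  # hypothetical Glue Studio job name
    StartOnCreation=True,                        # activate the trigger immediately
)
```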
Once finished, you will see a graph showing the trigger you selected, together with an empty node. From here you can add the job you created previously: click Add Node in the graph and select the job that you want to run on your schedule trigger.
If you have other jobs that need to run on the same trigger, you can simply add them by clicking the trigger node, which then pops up a new node.
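Programmatically, attaching more jobs to the same trigger amounts to extending its list of actions, roughly like this (job names are again hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Point the existing trigger at two jobs instead of one.
glue.update_trigger(
    Name="daily-2pm-trigger",
    TriggerUpdate={
        "Schedule": "cron(0 14 * * ? *)",
        "Actions": [
            {"JobName": "rds-to-s3-parquet"},
            {"JobName": "another-glue-job"},   # second job on the same trigger
        ],
    },
)
```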
With that, we have covered how to create a job in Glue Studio and then schedule it. To check the status of your job, you can go to the workflow dashboard.
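If you want to check the status without opening the console, the workflow's run history is also available through boto3; a small sketch using the hypothetical workflow name from above:

```python
import boto3

glue = boto3.client("glue")

# Fetch the recent runs of the workflow and print their status.
response = glue.get_workflow_runs(Name="daily-rds-to-s3", IncludeGraph=False)
for run in response["Runs"]:
    print(run["WorkflowRunId"], run["Status"], run.get("StartedOn"))
```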
That brings me to the end of this article. Thank you for reading up to this point.