Orchestrate an ETL pipeline using AWS Glue Workflows

Rishabh Sahrawat
Machine Learning Reply DACH
6 min read · Jul 19, 2022

In our previous post, you learned why you should care about ETL pipelines. In this post, I will show you how to orchestrate an ETL pipeline so that it processes data automatically as soon as the data is available.

Building an ETL pipeline is not a simple process: it involves several components, all of which must be wired together in a particular fashion to perform a variety of operations. Many tools can do this, but to keep this blog short and straightforward, I will focus on AWS services like Glue jobs, Glue crawlers, and Glue workflows, which are serverless and make processing any amount of data seamless. You can configure these services based on your requirements, i.e., how you want to process the source data. However, running them manually is tedious and error-prone. The challenge, then, is to automate the ETL pipeline so that all of its steps run whenever required, without any additional manual input.

To build such an orchestrated ETL pipeline, cloud computing comes to the rescue. It offers advantages such as automatic scaling, minimal management overhead, and virtually unlimited storage, and these services are reasonably priced. With services like AWS Glue jobs and crawlers, building ETL pipelines is simple and straightforward.

In this blog, we will look at a simple way of orchestrating ETL using AWS Glue workflows. Glue workflows provide a visual and programmatic tool to manage data pipelines consisting of Glue jobs and crawlers. A Glue workflow consists of one or more task nodes arranged as a graph, where a node can be a Glue job, a crawler, or a trigger. As the workflow runs, it records the execution progress and status of each node.
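The console is the easiest way to follow along, but everything in this post can also be scripted. As a minimal sketch (assuming your AWS credentials and region are already configured, and using my-workflow as a placeholder name), creating a workflow with boto3 looks roughly like this:

```python
import boto3

glue = boto3.client("glue")

# Create an empty workflow; trigger, crawler, and job nodes are attached later.
glue.create_workflow(
    Name="my-workflow",                      # placeholder name
    Description="Daily raw-to-staging ETL",  # optional
    MaxConcurrentRuns=1,                     # optional: at most one run at a time
)
```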

Triggers within Glue workflows are used to start Glue jobs or crawlers. A trigger can also start the workflow itself, in which case it is referred to as a start trigger. There are four types of triggers:

1. Schedule: This type of trigger uses a cron expression, so the workflow can be started on an hourly, daily, weekly, or monthly basis.

2. Event: This type of trigger watches the statuses of previous steps and fires only when the watched events report a status of Succeeded, Failed, Stopped, or Timeout.

3. On-demand: The trigger is fired manually whenever you want to start the workflow.

4. EventBridge: The workflow starts when an Amazon EventBridge event occurs, for example, when an object is uploaded to an S3 bucket.

We can select any of these trigger types according to our requirements.
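If you script triggers with boto3 instead of the console, these console labels correspond (as far as I can tell from the Glue API) to the Type values of create_trigger. A small sketch with placeholder names:

```python
import boto3

glue = boto3.client("glue")

# Console label  ->  API trigger type
# Schedule       ->  Type="SCHEDULED"   (with a cron Schedule)
# Event          ->  Type="CONDITIONAL" (with a Predicate on watched jobs/crawlers)
# On-demand      ->  Type="ON_DEMAND"
# EventBridge    ->  Type="EVENT"       (optionally with an EventBatchingCondition)

# Example: an on-demand start trigger that you fire manually.
glue.create_trigger(
    Name="manual-start",         # placeholder name
    WorkflowName="my-workflow",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "user-data-crawler"}],  # node(s) the trigger starts
)
```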

Now, let’s get our hands dirty! We will create a Glue workflow that will orchestrate a crawler and a Glue job to run daily at 10:11 AM. For this, I am assuming you have already set up the following:

1. AWS account

2. A Glue Job

3. A Glue Crawler

Log in to your AWS Console and search for the AWS Glue service. Once you are on the Glue dashboard, go to Workflows under ETL and click on Add workflow to open the creation page.

Give your workflow a name. Optionally, you can add a Description, a Max concurrency value, and Tags for this workflow. When done, click on Add workflow in the bottom right.

You will see that the workflow, which I named my-workflow, has been created.

Now select the workflow you just created, and you will see a graph with an Add trigger button. Click on it, choose Add new, give your trigger a name like trigger-my-workflow, and set the Trigger type to Schedule, running every day at 10:11 AM. You can also choose another time of your choice.
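The console builds this trigger for you; as a hedged boto3 equivalent, the sketch below does the same thing. Note that Glue cron expressions put the minute field first and are evaluated in UTC, so daily at 10:11 AM becomes cron(11 10 * * ? *). The API also wants the trigger's action up front, so the crawler we attach in the next step is already included here:

```python
import boto3

glue = boto3.client("glue")

# Schedule trigger: fires every day at 10:11 AM (UTC).
glue.create_trigger(
    Name="trigger-my-workflow",
    WorkflowName="my-workflow",
    Type="SCHEDULED",
    Schedule="cron(11 10 * * ? *)",                  # minute, hour, day, month, ...
    StartOnCreation=True,                            # activate the trigger immediately
    Actions=[{"CrawlerName": "user-data-crawler"}],  # the node added in the next step
)
```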

Congratulations, you have created the first trigger. Now, we want this trigger to start a crawler. After creating the first trigger, you will see a graph with the trigger as its first node.

Click on Add node. A new window will appear with two tabs listing Jobs and Crawlers.

Under the Crawlers tab, choose the crawler that you want to run (I am using user-data-crawler) and click on Add. If you do not have a crawler yet, first create one that crawls your data.

After following these steps, the graph shows the trigger connected to the crawler node.

The next step is to add the Glue job. Logically, we want the Glue job to run only if the crawler run has been successful, so we need another trigger that fires once the crawler has finished successfully. To add one, click on Add trigger in the graph.

Give the trigger a name (I am using glue-job-trigger) and choose Event as the Trigger type. The options under Trigger type matter when the previous step contains more than one event: for example, with two crawlers we would select Start after ALL watched events, meaning the trigger waits until both crawlers have finished running. As we have only one crawler, we can keep the default. Now, click Add. Note that since this trigger is not the start trigger, selecting EventBridge event as its type would result in an error. The new trigger now appears in the graph after the crawler.

To add the Glue job that we want this trigger to start, click on Add node, select the Glue job (I am using raw_to_staging_job) from the list in the window, and click Add. This completes our final graph: the schedule trigger starts the crawler, and the event trigger starts the Glue job once the crawler succeeds.
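Under the hood, this Event trigger corresponds to a CONDITIONAL trigger with a predicate on the crawler's state. A hedged boto3 sketch of the same step, using the names from this walkthrough:

```python
import boto3

glue = boto3.client("glue")

# Event trigger: start the Glue job only after the crawler succeeds.
glue.create_trigger(
    Name="glue-job-trigger",
    WorkflowName="my-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",  # "Start after ALL watched events"; "ANY" also exists
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": "user-data-crawler",
            "CrawlState": "SUCCEEDED",
        }],
    },
    Actions=[{"JobName": "raw_to_staging_job"}],
)
```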

Congratulations! You now have an automated ETL pipeline that gets triggered daily at 10:11 AM and runs your Crawler and Glue Job.

You can also add more Glue jobs or crawlers to this workflow in a similar fashion, as your requirements grow.

Note: If you want this workflow to run once a particular number of objects have been uploaded to an S3 bucket, use the EventBridge type when defining the start trigger (the first trigger). In that case, the start trigger waits until a specified number of events have been received, or until a specified amount of time (max. 900 seconds) has passed after the arrival of the first event, whichever comes first.
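A hedged sketch of such a start trigger via boto3; the trigger name and BatchSize value are examples, and you would additionally need an EventBridge rule that routes the S3 events to this workflow (not shown here):

```python
import boto3

glue = boto3.client("glue")

# EventBridge start trigger: fires after 50 events arrive, or 900 seconds
# after the first event, whichever happens first.
glue.create_trigger(
    Name="s3-upload-trigger",   # hypothetical name
    WorkflowName="my-workflow",
    Type="EVENT",
    EventBatchingCondition={
        "BatchSize": 50,        # example threshold (the API caps this at 100)
        "BatchWindow": 900,     # maximum wait in seconds (upper limit: 900)
    },
    Actions=[{"CrawlerName": "user-data-crawler"}],
)
```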

In this blog post, you learned how to orchestrate and automate an ETL pipeline in an efficient, fast, and reliable manner. We started by creating a workflow and added a start trigger that fires daily at 10:11 AM, followed by the Glue crawler. On a successful crawler run, the next trigger fires and starts the Glue job that performs the ETL process. When the Glue job finishes, the workflow run is also complete. All of this was achieved with very little management and code.
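If you prefer to check on workflow runs programmatically rather than in the console, a small boto3 sketch (assuming the workflow name from this post) might look like this:

```python
import boto3

glue = boto3.client("glue")

# List the most recent runs of the workflow and print their status.
runs = glue.get_workflow_runs(Name="my-workflow", MaxResults=5)
for run in runs["Runs"]:
    print(run["WorkflowRunId"], run["Status"], run.get("Statistics", {}))
```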
