ETL Data Pipeline In AWS

Chetanya-Patil · Published in Analytics Vidhya · Nov 24, 2020 · 7 min read

ETL (Extract, Transform, and Load) is an increasingly common topic across the IT industry. Companies often look for easy solutions and open-source tools and technologies to run ETL on their valuable data without spending much effort on anything else.

That is where AWS Glue comes in: an Amazon Web Services offering for building a simple ETL pipeline.

AWS Glue Introduction

AWS Glue is a serverless ETL (Extract, Transform, and Load) service in the cloud. It is a fully managed, cost-effective service to categorize your data, clean and enrich it, and finally move it from source systems to target systems.

AWS Glue consists of a centralized metadata repository known as the Glue Data Catalog and an ETL engine that generates Scala or Python code for the ETL job, and it also handles job monitoring, scheduling, and metadata management. There is no need to manage any infrastructure; it is fully managed by AWS.

AWS Glue works very well with structured and semi-structured data, and it has an intuitive console to discover, transform, and query the data. You can also use the console to modify the generated ETL scripts and execute them in real time.

Components of AWS Glue

  1. Data Catalog: It is the centralized catalog that stores the metadata and structure of the data.
  2. Database: This option is used to create a database that groups the tables describing your source and target data.
  3. Table: This option allows you to create tables in the database that can be used by the source and target.
  4. Crawler and Classifier: A crawler scans a data source, uses classifiers to infer its format and schema, and populates the Data Catalog with the corresponding tables.
  5. Job: A job is the application that carries out the ETL task. It is written in Scala or Python and runs on a managed Apache Spark environment (a minimal Python job sketch follows this list).
  6. Trigger: A trigger starts the ETL job execution on-demand or at a specific time.
  7. Development endpoint: A development environment backed by a cluster that processes ETL operations. You can connect a notebook to it to develop, test, and execute job scripts interactively.
  8. Notebook: A Jupyter notebook is a web-based IDE used to develop, test, and run the Scala or Python job scripts.
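To give a feel for what a Glue job looks like, here is a minimal sketch of a job script in Python (PySpark). The database, table, column mappings, and output path are illustrative placeholders, not values generated by the console.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # Glue passes the job name (and any custom arguments) on the command line.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Extract: read the source table registered in the Glue Data Catalog.
    source = glueContext.create_dynamic_frame.from_catalog(
        database="flights-db", table_name="csv"
    )

    # Transform: rename/cast columns (placeholder mappings for illustration).
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("year", "long", "year", "int"),
            ("carrier", "string", "carrier", "string"),
        ],
    )

    # Load: write the result to the target location in S3 as Parquet.
    glueContext.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-target-bucket/output/"},
        format="parquet",
    )

    job.commit()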

Key Features of AWS Glue

  1. AWS Glue automatically generates the code structure to perform the ETL after you configure the job (similar in shape to the job sketch above).
  2. You can modify the code and add extra features or transformations that you want to carry out on the data; a short example follows this list.
  3. An AWS Glue crawler connects to a data source, automatically infers the schema, and stores it as a table in the Data Catalog.
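As an example of such a modification, the snippet below adds a Filter transform to the generated script. The "mapped" DynamicFrame and the "year" column are the placeholder names from the job sketch above.

    from awsglue.transforms import Filter

    # Keep only recent records; "mapped" is the DynamicFrame produced by
    # ApplyMapping in the job sketch above (placeholder column name).
    recent = Filter.apply(frame=mapped, f=lambda row: row["year"] >= 2015)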

Building ETL Pipeline with AWS Glue

Pre-requisites:

  1. AWS Account
  2. A basic understanding of data and the ETL process.

AWS Glue is a great fit for performing ETL (Extract, Transform, Load) on source data and moving it to the target.

Steps to create ETL pipeline in AWS Glue:

  1. Create a Crawler
  2. View the Table
  3. Configure Job

First, create an S3 bucket, then upload the source CSV file into the newly created bucket.
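If you prefer the AWS SDK over the console for this step, a minimal boto3 sketch (with placeholder bucket and file names) looks like this:

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    # Create the bucket that will hold the source data (placeholder name).
    # Note: outside us-east-1 you must also pass a CreateBucketConfiguration.
    s3.create_bucket(Bucket="my-flights-data-bucket")

    # Upload the source CSV file into the bucket.
    s3.upload_file("flights.csv", "my-flights-data-bucket", "input/flights.csv")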

Create a Crawler

  1. Sign in to the AWS Console, search for AWS Glue, and open the AWS Glue page.
  2. On the Glue console, go to Crawlers and click Add crawler.
  3. Once you click Add crawler, specify the crawler name.
  4. After specifying the name, click Next; on the next screen, select the data source and click Next.
  5. On the next screen, select the data source as S3 and specify the path of the data.
  6. Once you have filled in all the information, click Next, and in the next section select No when asked to add another data source.
  7. Once you click Next, it will ask you to create an IAM role to access S3 and run the job. Provide the name of the role and click Next.
  8. Once you provide the IAM role, it will ask how you want to schedule your crawler.
  9. Once you select the schedule, click Next and create the output database.
  10. Once you create the database, the review page opens. Review all the settings and click Finish.
  11. Once you click Finish, the crawler is created instantly and is ready to run. Click Run crawler to start execution. (The same setup can also be scripted, as sketched below.)
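If you prefer to script this setup instead of clicking through the console, the following boto3 sketch creates and starts an equivalent crawler. The crawler name, IAM role, database name, and S3 path are placeholders and should match your own setup.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Create a crawler that scans the S3 path and registers tables in flights-db.
    glue.create_crawler(
        Name="flights-crawler",                 # placeholder crawler name
        Role="AWSGlueServiceRole-flights",      # IAM role created in step 7
        DatabaseName="flights-db",              # output database from step 9
        Targets={"S3Targets": [{"Path": "s3://my-flights-data-bucket/input/"}]},
    )

    # Run the crawler on demand (equivalent to clicking Run crawler).
    glue.start_crawler(Name="flights-crawler")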

View the Table

Once the crawler has executed successfully, you can see the table and its metadata created in the database you defined. Steps to explore the created table:

  1. At the bottom of the Glue tutorial section, click Explore table.
  2. This takes you to the Tables section; click the csv table created inside the flights-db database.
  3. Select the table and click View details.
  4. You can see all the information and properties of the table.
  5. If you scroll down further on the same page, you can see the metadata of the table, fetched automatically from the file. (The same metadata can also be read programmatically, as sketched below.)
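If you want to inspect the crawled schema from code rather than the console, here is a small boto3 sketch; the database and table names follow the example above.

    import boto3

    glue = boto3.client("glue")

    # Fetch the table definition the crawler wrote to the Data Catalog.
    table = glue.get_table(DatabaseName="flights-db", Name="csv")["Table"]

    # Print each column name and the type the crawler inferred from the file.
    for column in table["StorageDescriptor"]["Columns"]:
        print(column["Name"], column["Type"])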

Configure Job

  1. In this section, you configure the job that moves data from S3 to the target, using the table created by the crawler.
  2. Once you click Add job, a job configuration page opens. Fill in the required details such as the job name, the IAM role, the type of execution, and other parameters.
  3. After the configuration, click Next. On the next page, select the data source "csv" and click Next.
  4. Choose a transformation type: Change schema.
  5. Choose the data target and select the table "csv" created above.
  6. On the next page, you'll see the source-to-target mapping information. Add or delete columns as you wish; we suggest keeping the default mapping and clicking Next.
  7. Click Save job, and on the next page you can see the flow diagram of the job and edit the generated script.
  8. You can use the built-in transforms of AWS Glue to add predefined transformations to the data.
  9. Check the status of the job; once it has completed, head over to the Tables section to see the data. (The job can also be started and monitored from code, as sketched below.)
  10. Click View data; it opens Athena and previews a few records from the data.
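For completeness, here is a small boto3 sketch that starts the configured job and checks its run status from code; the job name is a placeholder.

    import boto3

    glue = boto3.client("glue")

    # Start the ETL job configured above (placeholder job name).
    run = glue.start_job_run(JobName="flights-etl-job")

    # Check the run state (RUNNING, SUCCEEDED, FAILED, ...).
    status = glue.get_job_run(JobName="flights-etl-job", RunId=run["JobRunId"])
    print(status["JobRun"]["JobRunState"])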

Conclusion

In this blog post, we explained AWS Glue and how you can create a simple ETL pipeline without any coding effort.
