AWS Glue | CSV to Parquet transformation | Getting started
AWS Glue is a fully managed serverless ETL service. It makes it easy to discover, transform and load data that would be consumed by various processes and applications. If you want to learn more about AWS Glue then please refer to the video on AWS Glue Overview
In this article, we will go through the basic end-to-end CSV to Parquet transformation using AWS Glue. We will use multiple services to implement the solution like IAM, S3 and AWS Glue. As a part of AWS Glue, we will use crawlers, Data Catalog including Database & Tables and ETL jobs.
Let’s understand the above flow.
- Create a crawler, which will connect to the S3 data store
- Post successful connection, it will infer or determine the structure of the CSV file using a built-in classifier
- The crawler will write the metadata in the form of a table in the AWS Glue Data Catalog
- After populating the data catalog, create the ETL job to transform CSV into parquet
- The data source for the ETL job will be the AWS Glue Data Catalog table and as a part of the transformation, we will apply the mappings. Post transformation, the data will be stored in S3 as a parquet file
Enough of text, let’s jump to the AWS Management Console
Create IAM role
As a first step, create the IAM role to provide the necessary permissions to AWS Glue to access various services. (For ex: S3)
- Go to IAM Management Console → Roles → Create role
- Select Glue as a service & add AWSGlueServiceRole permission
- Create the role
With the above permission, AWS Glue will be able to access the S3 buckets which contain aws-glue as a part of the bucket name for the defined permissions in this policy. Also, it contains additional permissions.
Create S3 Bucket
Create the S3 bucket with the name which contains aws-glue as a part of the bucket name Or else you have to modify the policy in the above step.
Post bucket creation, create the following folder structure
│ └── annual-reports/
│ └── csv_reports
We will create the database in Glue with the same name as annual_reports, and the table name would be csv_reports. However, it is not necessary that the database & table should have the same name as the folders. It’s just that, it is easy to map things with this kind of naming convention.
As a next step, upload the CSV file in the csv_reports folder. I have used the CSV data from here
Go to AWS Glue Console → Click Databases → Add database as follows
Create Crawler to populate Table
The table can be created manually or via a crawler. The crawler is a program that connects to the data store, infers structure/schema using a built-in or custom classifier and writes the Metadata as a table.
To create a crawler → Click on Crawlers from the left panel → Configure the details as follows and create the crawler
After creating the crawler, run the crawler and if everything is configured as above, it will populate the table under the selected database.
Create ETL job
Once the table is populated, create the ETL job to transform & load the CSV file as a parquet file.
To create an ETL job → Click on Jobs from the left panel
To create a job, there are a couple of options and we will select Visual with a blank canvas. Apart from that it also allows you to write & upload the python shell scripts, spark scripts and interactive development with jupyter notebook. So, you can try these options.
Here we have two side-by-side views, Visual on the left side and its respective options on the right side.
Click on Source under Visual and select the data source. In our case, it’s AWS Glue Data Catalog. Select the node and configure the respective properties on the right side which are Database & Table
The next step is the transformation, click on Transform → Select Apply Mapping or any other transformation based on the requirements. For the purpose of this article, Apply Mapping is selected.
Finally, click on Target → select S3 to load the data as a target data store. Configure the properties.
After this configuration, you can check the generated script under the Script tab. The script can be modified as per the requirement. As a next step, click on Job details and update the job name and select the IAM role. Save the job and click on Run.
Check the job Run details under Runs and check the CloudWatch logs.
At this point, the CSV to Parquet transformation is successful. Check the output in the S3 Target Location configured in the above step.
For a detailed step-by-step tutorial and the implementation of the above use case please refer to the below video.
If you have any questions, comments, or feedback then please leave them below. Subscribe to my channel for more.