Load Parquet File From AWS S3 to AWS Redshift Using AWS Glue

Muhammad Rivaldi Idris
4 min read · Oct 3, 2023


Amazon Redshift, a powerful data warehousing solution offered by AWS, is a popular choice for organizations looking to store and analyze their data. To facilitate the seamless transfer of data into Redshift, AWS Glue, a fully managed extract, transform, and load (ETL) service, comes into play. In this article, I will explain how to load Parquet files from AWS S3 into Amazon Redshift using AWS Glue, helping you get the most out of your data warehouse.

Why Use Parquet?

In a previous article, I already prepared a source bucket (containing some files in Parquet format) as the data source to be loaded into AWS Redshift. Before diving into the technical details, here are a few reasons why I chose the Parquet format (a short snippet after the list illustrates the embedded schema):

  • Parquet is a columnar storage format designed for efficient data storage and processing.
  • It’s highly compressed and optimized for query performance, making it suitable for big data workloads.
  • Parquet is a self-descriptive format that stores schema information, enabling schema evolution without breaking backward compatibility.
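
To illustrate the last point, here is a minimal sketch that reads the schema embedded in a Parquet file using pyarrow. The file name below is just a placeholder, not one of the files from the source bucket.

```python
# A minimal sketch: inspect the schema stored inside a Parquet file with pyarrow.
# "example.parquet" is a placeholder file name.
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("example.parquet")
print(parquet_file.schema_arrow)       # column names and types embedded in the file
print(parquet_file.metadata.num_rows)  # row count read from the footer metadata
```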

Using AWS Glue for ETL

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. It allows you to create and manage data pipelines that ingest data from various sources, transform it, and load it into AWS services like Redshift. Here’s how to load Parquet files from S3 to Redshift using AWS Glue:

  • Configure AWS Redshift connection from AWS Glue
  • Create AWS Glue Crawler to infer Redshift Schema
  • Create a Glue Job to load S3 data into Redshift

Create Connection

Let’s define a connection to the Redshift database in the AWS Glue service. AWS Glue will use this connection to perform operations such as running crawlers and ETL jobs. AWS Glue needs the Redshift cluster endpoint, database name, and credentials to establish a connection to the Redshift data store.
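
I create this connection in the console, but for reference the same step can be scripted. The sketch below uses boto3; the connection name, JDBC URL, credentials, region, and network settings are placeholders rather than values from this walkthrough.

```python
# A minimal boto3 sketch of defining the Glue connection (the article uses the console).
# All names, the JDBC URL, credentials, and network settings below are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

glue.create_connection(
    ConnectionInput={
        "Name": "redshift-connection",  # hypothetical connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://<cluster-endpoint>:5439/<database>",
            "USERNAME": "<redshift-user>",
            "PASSWORD": "<redshift-password>",
        },
        # Glue needs network access to the cluster's VPC
        "PhysicalConnectionRequirements": {
            "SubnetId": "<subnet-id>",
            "SecurityGroupIdList": ["<security-group-id>"],
            "AvailabilityZone": "<availability-zone>",
        },
    }
)
```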

Create a Glue Crawler

The Glue Crawler is a crucial component for automatically discovering the structure and schema of the Parquet files in S3 and the target table in Redshift. It scans both the S3 data and the Redshift table, extracts metadata, and stores this information in the Data Catalog.

  • Create a Glue Crawler for S3 Data

Configure a Glue Crawler to scan and catalog the metadata of Parquet files stored in Amazon S3. When setting up the crawler, specify the S3 path where your Parquet files are located.
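
A minimal boto3 sketch of this step is shown below; the crawler name, IAM role, Data Catalog database, and S3 path are placeholders.

```python
# A minimal boto3 sketch of the S3 crawler (the article creates it in the console).
# The crawler name, IAM role, catalog database, and S3 path are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

glue.create_crawler(
    Name="s3-parquet-crawler",
    Role="arn:aws:iam::<account-id>:role/<glue-service-role>",
    DatabaseName="parquet_catalog_db",  # Data Catalog database for the S3 tables
    Targets={"S3Targets": [{"Path": "s3://<source-bucket>/<parquet-prefix>/"}]},
)
glue.start_crawler(Name="s3-parquet-crawler")
```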

  • Create a Glue Crawler for Redshift Data

Set up another Glue Crawler to catalog the metadata of your Redshift table. Configure the crawler to connect to your Redshift cluster and scan the table. You’ll need to provide the Redshift cluster endpoint, database name, username, and password.
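
Again, here is a minimal boto3 sketch, this time pointing the crawler at the Redshift connection defined earlier. The crawler name, IAM role, catalog database, and include path are placeholders.

```python
# A minimal boto3 sketch of the Redshift crawler, reusing the Glue connection above.
# The crawler name, role, catalog database, and include path are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

glue.create_crawler(
    Name="redshift-table-crawler",
    Role="arn:aws:iam::<account-id>:role/<glue-service-role>",
    DatabaseName="redshift_catalog_db",  # Data Catalog database for the Redshift tables
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "redshift-connection",  # connection created earlier
                "Path": "<database>/<schema>/%",          # database/schema/table pattern
            }
        ]
    },
)
glue.start_crawler(Name="redshift-table-crawler")
```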

Create a Glue Job to load S3 data into Redshift

I’m using the UI-driven method to create this job. It will need the right permissions attached to the IAM role and the Redshift connection created earlier.

Select the source and the target table from the Glue Data Catalog in this job.

AWS Glue automatically maps the columns between the source and destination tables. Once we save this job, we can see the Python script that Glue generates and edit it to add any additional steps. Upon successful completion of the job, we should see the data in our Redshift database, which we can query using the Redshift Query Editor.
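
The generated script isn’t reproduced here, but a representative sketch of what such a Glue PySpark job looks like is below. The catalog databases, table names, column mappings, and temp directory are placeholders, under the assumption that both crawlers wrote their tables to the Data Catalog.

```python
# A representative sketch of the kind of PySpark script Glue generates for this job.
# Catalog databases, table names, column mappings, and the temp directory are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the Parquet data cataloged by the S3 crawler
source = glueContext.create_dynamic_frame.from_catalog(
    database="parquet_catalog_db",
    table_name="s3_parquet_table",
)

# Map source columns to the Redshift table columns (example columns only)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "long", "id", "long"),
        ("name", "string", "name", "string"),
    ],
)

# Write to Redshift through the cataloged table and connection;
# Glue stages the data in the temp S3 directory before loading it into Redshift.
glueContext.write_dynamic_frame.from_catalog(
    frame=mapped,
    database="redshift_catalog_db",
    table_name="dev_public_my_table",
    redshift_tmp_dir=args["TempDir"],
)

job.commit()
```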

Result:
