Using AWS Data Wrangler with AWS Glue Job 2.0 and Amazon Redshift connection

Anand Prakash
Published in Analytics Vidhya
6 min read · Nov 21, 2020


I will admit, AWS Data Wrangler has become my go-to package for developing extract, transform, and load (ETL) data pipelines and other day-to-day scripts. Its integration with AWS big data services such as S3, Glue Catalog, Athena, databases, and EMR makes life simpler for engineers. It also lets you import packages like Pandas and PyArrow to help write transformations.

In this blog post I will walk you through a hypothetical use case: reading data from a Glue Catalog table to obtain a filter value, then using that value to retrieve data from Redshift. I will create a Glue connection to Redshift, use AWS Data Wrangler with AWS Glue 2.0 to read data from the Glue Catalog table, retrieve the filtered data from the Redshift database, and write the result data set to S3. Along the way I will also cover troubleshooting Glue network connection issues.
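The flow described above can be sketched roughly as follows. This is a minimal illustration, not the exact job script: the table, column, and function names (build_filter_query, run_pipeline, "id") are hypothetical, the catalog table is assumed to be stored as Parquet, and the Redshift calls assume awswrangler 2.x or later, where wr.redshift.connect is available (older 1.x releases exposed database access through a different module).

```python
from typing import Sequence


def build_filter_query(table: str, column: str, values: Sequence[int]) -> str:
    """Build a SELECT that keeps only rows whose `column` is in `values`.

    Hypothetical helper: values are cast to int before being inlined,
    so this sketch only handles integer filter keys.
    """
    if not values:
        raise ValueError("need at least one filter value")
    in_list = ", ".join(str(int(v)) for v in values)
    return f"SELECT * FROM {table} WHERE {column} IN ({in_list})"


def run_pipeline(glue_database: str, glue_table: str, redshift_table: str,
                 glue_connection: str, output_path: str):
    # Imported lazily so the helper above can be used without the package.
    import awswrangler as wr

    # 1. Read the Glue Catalog table (assumed Parquet) to get the filter values.
    keys_df = wr.s3.read_parquet_table(table=glue_table, database=glue_database)
    filter_values = keys_df["id"].dropna().unique().tolist()  # "id" is an assumed column

    # 2. Retrieve the filtered rows from Redshift through the Glue connection.
    sql = build_filter_query(redshift_table, "id", filter_values)
    con = wr.redshift.connect(connection=glue_connection)
    try:
        result_df = wr.redshift.read_sql_query(sql, con=con)
    finally:
        con.close()

    # 3. Write the result data set to S3 as a Parquet dataset.
    wr.s3.to_parquet(df=result_df, path=output_path, dataset=True)
    return result_df
```

Keeping the query construction in its own function makes the filter logic easy to check locally, without a Glue job or a Redshift cluster.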

AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large volumes of data from various sources for analytics and data processing.

AWS Glue Connection

You will need a Glue connection to connect to the Redshift database from a Glue job.

AWS Glue > Data catalog > connections > Add connection
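The same connection can also be created programmatically instead of through the console. The sketch below uses the boto3 Glue client's create_connection call; the helper names and every value passed in (connection name, host, subnet, security group, availability zone) are placeholders, and in practice the password would come from a secret store rather than being passed in plain text.

```python
from typing import Dict, Sequence


def build_connection_input(name: str, host: str, port: int, database: str,
                           user: str, password: str, subnet_id: str,
                           security_group_ids: Sequence[str],
                           availability_zone: str) -> Dict:
    """Assemble the ConnectionInput structure for a JDBC connection to Redshift."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": f"jdbc:redshift://{host}:{port}/{database}",
            "USERNAME": user,
            "PASSWORD": password,
        },
        # Network settings let the Glue job reach the cluster inside its VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_glue_connection(connection_input: Dict) -> None:
    # Imported lazily; requires AWS credentials with Glue permissions.
    import boto3

    glue = boto3.client("glue")
    glue.create_connection(ConnectionInput=connection_input)
```

The subnet and security group chosen here are the usual place to look first when a Glue job cannot reach Redshift over the network.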

Anand Prakash

Avid learner of technology solutions around Machine Learning, Big-Data, Databases. 5x AWS Certified | 5x Oracle Certified.