5 Tips for Successfully Setting Up AWS Glue with Amazon Redshift

Don’t fall into the traps I did…

End of the year usually brings some extra time to work on things you want to do but don’t have to do right away. For me, that has largely meant AWS Glue. It’s a fairly new product that offers a fully managed Extract, Transform, Load (ETL) solution. It’s very dynamic: it even generates column mappings from the source data store to the new one, and it automatically provisions, scales, and runs ETL jobs for you without you having to designate resources or scale things up and down to handle different loads.

It also generates Python scripts that handle the execution of the jobs you set up. It’s a pretty slick tool, especially if you’ve ever tried to build a custom ETL solution, and it’s dirt cheap, as most AWS products seem to be.

As I set it up and started to play with it, I ran into a whole bunch of snags with permissions and configuration. It was a bit of a pain, so if you’re reading this and frustrated, I was too. The errors are fairly clear, but they aren’t always that helpful.

#1 AWSGlueServiceRole

This tripped me up more than anything, and of course it was an IAM role I had to figure out. I kept getting an IAM role error saying I didn’t have the proper permissions, even though I had the AWSGlueServiceRole and S3 permissions attached as policies to the role.

What I later found out is that you have to name the role exactly “AWSGlueServiceRole” for it to work properly. Something about that prefix is built into AWS so the service has permission to create and run things on your behalf. So when you see name fields, check and make sure you are naming things correctly.

You can skip the AmazonRedshiftFullAccess policy if you are not using Redshift.
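If you prefer the CLI to the console, the same setup looks roughly like this. This is a sketch, not the only valid policy combination: it assumes a configured AWS CLI, and the broad AmazonS3FullAccess / AmazonRedshiftFullAccess managed policies stand in for whatever scoped access your account actually needs.

```shell
# Trust policy that lets the Glue service assume the role.
cat > glue-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "glue.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

# The role name matters: it must start with "AWSGlueServiceRole".
aws iam create-role \
  --role-name AWSGlueServiceRole \
  --assume-role-policy-document file://glue-trust-policy.json

# Attach the Glue managed policy, plus S3 and (if using Redshift) Redshift access.
aws iam attach-role-policy --role-name AWSGlueServiceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam attach-role-policy --role-name AWSGlueServiceRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name AWSGlueServiceRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonRedshiftFullAccess
```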

#2 VPC Endpoints

In order to run, AWS Glue uses S3 for a number of things, like storing job progress, keeping bookmarks, and providing a place to store the databases it creates.

  • Note: “databases” in this case are just what Glue calls the output of a crawler, where it basically stores the database schema, i.e. all the columns and their corresponding data types.

You need to make sure you have an endpoint in your VPC settings that gives access to S3 so it can all tie together and work properly. It usually defaults to this at setup, but make sure you check:

Search in Services for VPC.
Select Endpoints in the left-hand column.
Check your specified endpoint to make sure S3 is enabled; you can check the policy too.
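The same check can be done from the CLI. A sketch, assuming placeholder IDs (vpc-xxxxxxxx, rtb-xxxxxxxx) that you’d replace with your own, and us-east-1 as an example region:

```shell
# List endpoints in your VPC and look for an S3 gateway endpoint.
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-xxxxxxxx \
  --query 'VpcEndpoints[].[VpcEndpointId,ServiceName]'

# If there is no S3 endpoint, create one. The service name is
# region-specific (com.amazonaws.<region>.s3).
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-xxxxxxxx
```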

#3 Add Your Subnet Groups To Your Route Table

This was the real kicker, the last thing I had to change to get it all to work. It was one little setting I had to dig through the docs to find and figure out how to change, and it’s quite simple too. I won’t bore you with ALL the details, but basically your VPC endpoint is reached through your subnets, and those subnets need to be associated with a route table that tells the VPC how to reach the requested resource, in this case S3.

Basically, you need to double-check these settings:

Select Subnets.
Navigate to the subnet in question, select Route Table, and make sure the subnet is associated with the route table referenced by your VPC and your VPC endpoint.
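Or, from the CLI, something along these lines (placeholder IDs again; swap in your own):

```shell
# Show the route tables in your VPC along with their subnet associations.
aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=vpc-xxxxxxxx \
  --query 'RouteTables[].[RouteTableId,Associations[].SubnetId]'

# If your subnet isn't associated with the route table your S3 endpoint
# uses, add the association.
aws ec2 associate-route-table \
  --route-table-id rtb-xxxxxxxx \
  --subnet-id subnet-xxxxxxxx
```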

#4 Create Self Referencing VPC Security Group

Next up, you need to make a VPC security group that is self-referencing. Basically, that means a new security group whose inbound rule references the group itself, which gives Glue access to your Redshift or database instance. It’s a weird technicality that I honestly don’t fully understand, but it’s in the AWS documentation and it’s required to make things work.

It’s pretty simple: you just create a new group, then add an inbound rule whose source is the security group itself.

Create the security group.
Edit the inbound rules so the source is Custom, with the security group’s own sg-id as the value.
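A CLI sketch of the same two steps. The group name, description, and the all-TCP port range are my assumptions here; the key part is that `--group-id` and `--source-group` point at the same group:

```shell
# Create the group in the same VPC as your Redshift cluster.
aws ec2 create-security-group \
  --group-name glue-self-ref \
  --description "Self-referencing security group for AWS Glue" \
  --vpc-id vpc-xxxxxxxx
# Suppose the command above returned sg-xxxxxxxx.

# Add an inbound rule whose source is the group itself.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 0-65535 \
  --source-group sg-xxxxxxxx
```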

#5 Add Self Referencing VPC Security Group to Cluster Properties

The last thing that tripped me up was getting that security group added in the right place. You have to attach it to your Redshift cluster to be able to connect and get the benefits of the self-referencing security group you just set up; otherwise it won’t do you any good.

It’s simple too, just takes a couple clicks.

Navigate to Redshift and select your cluster.
In the cluster settings, choose Modify.
In the modify settings, select the new security group you created. This field is a multi-select input, so hold down Command (Mac) or Control (Windows) to select every security group that needs access.
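The CLI equivalent is a one-liner, with one gotcha that mirrors the multi-select note above. Cluster name and group IDs are placeholders:

```shell
# --vpc-security-group-ids REPLACES the whole list, so include every
# group the cluster needs, not just the new self-referencing one.
aws redshift modify-cluster \
  --cluster-identifier my-cluster \
  --vpc-security-group-ids sg-aaaaaaaa sg-xxxxxxxx
```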

Once you’ve done that, you should be good to go! You’ll be ready to set up your instances and run crawlers and jobs in Glue. You’ll still have to add the connection in Glue and test it, but with the settings above in place, everything should work great. Happy ETL’ing!