My Top 10 Tips for Working with AWS Glue

Matt Gillard
Published in The Startup · Jun 25, 2020
Photo by Charles Deluvio on Unsplash

I have spent a significant amount of time over the last few months working with AWS Glue for a customer engagement. For those who don’t know, Glue is a managed Spark ETL service and includes the following main features:

  • Data catalogue (a place to logically organise your data into Databases and Tables)
  • Crawlers to populate the catalogue
  • Ability to author ETL jobs in Python or Scala and execute them on a managed cluster (different but I suspect related to the Amazon EMR service)
  • Workflows to orchestrate a series of Jobs with triggers
  • And an ability to have a development endpoint to test your Spark jobs.

Each feature has quite a bit of depth to it, and luckily for the most part the AWS Glue documentation is pretty good at explaining everything. If you, like me, would prefer not to fuss with setting up a full EMR environment and just need standard Spark components, then Glue is a great option. You only pay for the time your jobs and crawlers are running.

In this post I plan to cover the top 10 things I learnt and some practices that worked for me. It’s kind of an “I wish I knew that when I started” list.

Let’s dive in!

TIP # 1 — Make use of the development endpoint feature.

Let’s face it — you will not write your Spark ETL job error-free the first time (unless you are a freak). And running a job “just to see if it works” gets old fast when it takes up to 10 minutes to even begin executing (cluster provisioning time). A myriad of things can go wrong: syntax errors, incorrect IAM policies, incorrect S3 paths, and so on. To make things more efficient, use a dev endpoint.

A dev endpoint comes in two types: a public or private instance. In general, if it just needs to communicate with S3 buckets, a public endpoint will suffice. If you need to access databases on private subnets, or on-premises databases, then you need to set up a private endpoint. A private endpoint places an elastic network interface in a nominated private subnet, secured with a security group, so it can route to all the subnets that you normally have available to your VPC. This also means that to log in to it you need to ssh from an internal host in your VPC (eg: a bastion).

When provisioning you give it a name, an IAM role, an ssh public key and a size (and, if private, some extra networking parameters), and within 5–10 minutes it will be provisioned ready to use. Sizing is measured in DPUs (Data Processing Units). A DPU for a “standard worker” is 4 vCPUs and 16 GB of memory, and gives you 2 executors. I recommend a size of 2 DPUs for small testing use cases to keep the costs down, or more DPUs if you have many files to process — but remember to switch it off when you are finished! With a Spark cluster, 1 DPU is reserved as the master and 1 executor is used as the driver, with the remaining executors doing the actual work, so 2 DPUs gives you a single executor to run Spark jobs. Pricing for a development endpoint is the same as for a Glue job.

Note that you can also set up access to the Spark UI on a dev endpoint, either via a CloudFormation template or by running a docker container locally. AWS has a super helpful knowledge centre article on this very topic as well.

You can access the dev endpoint in one of three main ways:

  1. ssh to the Python REPL interpreter (I used this the most often)
$ ssh -i privatekey.pem glue@ec2-13-55-xxx-yyy.ap-southeast-2.compute.amazonaws.com -t gluepyspark3

You have a Spark context predefined, but not a GlueContext, which is pretty important and one of the early things I didn’t realise. So I got into the habit of starting my Glue scripts off with a standard preamble.
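A minimal sketch of that preamble (the exact imports depend on what your script uses; on a dev endpoint a SparkContext already exists, so reuse it rather than creating a new one):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions

# Reuse the existing SparkContext and wrap it in a GlueContext for the Glue APIs
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
logger = glueContext.get_logger()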

Also with this method, by default a bunch of Spark warning messages keep interrupting your session, so you can turn the logging down by entering this command:

>>> sc.setLogLevel("ERROR")

2. ssh into the dev endpoint and open a bash shell

$ ssh -i privatekey.pem glue@ec2-13-55-xxx-yyy.ap-southeast-2.compute.amazonaws.com

This method allows you to aws s3 cp your script file locally and execute it via spark-submit:

$ /usr/bin/spark-submit /usr/lib/spark/examples/src/main/python/pi.py

If you want, you can also access the history server with some port forwarding magic, but the supported way is documented here.

$ ssh  -L18080:localhost:18080 glue@ec2-13-55-xxx-yyy.ap-southeast-2.compute.amazonaws.com

3. Use a Zeppelin notebook. This is a little more involved but useful for lots of experiments.

Instructions are here. I ran it in a docker container using WSL 2 on Windows 10 successfully:

$ docker run -p 8080:8080 --rm --name zeppelin apache/zeppelin:0.8.1

Then point your browser at http://localhost:8080. If you use a docker container like I did, you need to make one alteration to the AWS instructions for setting up Zeppelin: where it says use localhost, use host.docker.internal instead. Within your docker container, localhost is localhost within the container, not the parent host you are running your docker container on. Then fire up your ssh tunnel to the remote interpreter:

$ ssh  -vnNT -L :9007:169.254.76.1:9007 glue@ec2-13-55-xxx-yyy.ap-southeast-2.compute.amazonaws.com

I actually find it handy to use Method 1 most of the time: ssh directly in, run the Glue API commands and check each step is as I expect, then when it is all working put it into a test ETL job to validate it works as expected in “real-life”.

TIP # 2 — Enable Continuous Logging and Job Metrics

The number one rule of working with data pipelines is: log everything! In 2019 there was an overhaul of logging in Glue, and it helped with debugging jobs immensely. Switch on continuous logging and job metrics in your job configuration.
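If you define your jobs in code rather than through the console, these settings map onto Glue’s special job parameters. A rough boto3 sketch, where the job name, role and script location are placeholders:

import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='my-etl-job',                     # placeholder job name
    Role='MyGlueServiceRole',              # placeholder IAM role
    GlueVersion='1.0',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/my-etl-job.py',
        'PythonVersion': '3',
    },
    DefaultArguments={
        '--enable-continuous-cloudwatch-log': 'true',  # continuous logging
        '--enable-continuous-log-filter': 'true',      # drop noisy Spark heartbeat messages
        '--enable-metrics': '',                        # job metrics (takes no value)
    },
)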

Note that in the first tip, the code extract includes this line:

logger = glueContext.get_logger()

The great thing about this is that when you call logger — the text goes into the driver logfile:

logger.info("partitionpredicate {}".format(partitionpredicate))

The docs give you some more examples.

Unfortunately, accessing the logs requires a little more work than clicking the Logs hyperlink in the job history. The default Logs hyperlink points at /aws-glue/jobs/output, which is really difficult to review. Go to your CloudWatch logs and look for the log group /aws-glue/jobs/logs-v2:

Then go in there and filter for your job id:

All your logger outputs will be in the JOB_RUN_ID-driver log stream. There is also a progress bar stream at JOB_RUN_ID-progress-bar, which is handy to review, especially for an active job, to see how far through execution it is.
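You can also pull the driver stream from the command line rather than the console; something along these lines, with the job run id as a placeholder:

$ aws logs filter-log-events \
    --log-group-name /aws-glue/jobs/logs-v2 \
    --log-stream-name-prefix <JOB_RUN_ID>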

Job metrics give you some valuable data for determining your cluster sizing. This page in the docs is very useful for this.

Sample metrics screen for a short job:

TIP # 3 — Understand the Glue DynamicFrame abstraction

A Glue DynamicFrame is an AWS abstraction over a native Spark DataFrame. In a nutshell, a DynamicFrame computes the schema on the fly, and where there are schema inconsistencies it can have multiple types for a field. More details on how they differ are here.

Normally the workflow is to read your source data as a DynamicFrame, ensure the schema is consistent, convert to a DataFrame using toDF(), then convert back to a DynamicFrame using fromDF() prior to writing the processed data out.
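As a rough sketch of that round trip (the database, table and output path are placeholders, and glueContext is created as in the Tip 1 preamble):

from awsglue.dynamicframe import DynamicFrame

# Read from the data catalogue as a DynamicFrame
dyf = glueContext.create_dynamic_frame.from_catalog(database="mydb", table_name="mytable")

# Do the transformations with the regular Spark DataFrame API
df = dyf.toDF()
df = df.dropDuplicates()

# Convert back and write out with the Glue writer
dyf_out = DynamicFrame.fromDF(df, glueContext, "dyf_out")
glueContext.write_dynamic_frame.from_options(
    frame=dyf_out,
    connection_type="s3",
    connection_options={"path": "s3://datalake/application/app_curated_parquet"},
    format="parquet",
)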

Also note that the DynamicFrameWriter always appends to existing partitions or creates new ones where they do not already exist.

With a Spark DataFrame you can append, overwrite all data (the default) or overwrite specific partitions only. Overwriting specific partitions is particularly handy if you have a daily job where you want to overwrite a partition date with newer data but keep all the other partitions untouched. The key to this is to set a Spark config item, partitionOverwriteMode, in your job, then write out as a Spark DataFrame rather than converting back to a DynamicFrame.

spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
df = df.coalesce(1)
df.write.mode('overwrite').partitionBy('date').parquet('s3://datalake/application/app_curated_parquet')

TIP # 4 — AWS Crawlers have their place but are not the only way.

Glue crawlers are handy for cataloguing your data into a table with minimal fuss, but sometimes it works better to use Athena to create your data catalogue table. A problem with crawlers is that they sometimes infer incorrect datatypes, which can break workflows. An alternative is to do an initial crawl, then jump into Athena and generate the CREATE TABLE statement automatically from the inferred schema, adjust it as needed, then put it into source control:

Clicking Generate Create Table DDL effectively runs this Athena query:

SHOW CREATE TABLE mytable;

That gives you the generated schema, which you can use to re-create the table.

Doing this means that after each Spark job that populates the dataset, you only need to refresh the partitions:

MSCK REPAIR TABLE mytable;

You can wrap this command into a workflow as a Python shell job (see below for tips on workflows and Python shell jobs).
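A minimal sketch of such a Python shell job, using boto3 to run the statement through Athena (the database, table and query result location are placeholders):

import time
import boto3

athena = boto3.client('athena')

# Kick off the partition refresh as an Athena query
query = athena.start_query_execution(
    QueryString='MSCK REPAIR TABLE mytable',
    QueryExecutionContext={'Database': 'mydatabase'},
    ResultConfiguration={'OutputLocation': 's3://datalake/athena-results/'},
)

# Poll until Athena finishes so the workflow step reflects success or failure
while True:
    state = athena.get_query_execution(
        QueryExecutionId=query['QueryExecutionId']
    )['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(5)

if state != 'SUCCEEDED':
    raise RuntimeError('MSCK REPAIR TABLE finished with state {}'.format(state))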

Or, as I found while researching this post, Glue ETL jobs can now automatically discover partitions for you!

Glue crawlers do not give you the option to define a table name; by default the crawler uses the top-level S3 key where the data it is crawling sits. However, creating the table with CREATE TABLE via the above method means you can give it the name you require and point it at a specific S3 location, eg:

CREATE EXTERNAL TABLE `table`(
<< .. table definition ..>> )
<<..other parameters..>>
LOCATION
's3://datalake/database/20200601_table/'

This allows you to create a test dump of the data as a new version to validate that it looks OK, then when you are ready to cut over you delete the old table definition in Glue and create a new one pointing at the new data.

Using a crawler you cannot switch a table to a different location like this.

TIP # 5 — Don’t think you need to have many layers of partitions.

Don’t over-partition. With hundreds or thousands of partitions there is an overhead that means your queries will probably end up slower than if you had just a single partition key. Remember that if you store your data in Parquet format it is super-efficient to query, so don’t be scared of bigger Parquet files. This talk is great for getting into the nuts and bolts of Parquet.

In fact the tips at this AWS Athena Blog are extremely valuable for optimising your Athena performance.

TIP # 6 — Use Glue Workflows.

Glue workflows are extremely powerful. A Glue workflow is a construct made up of ETL jobs, triggers and crawlers. This enables you to build up workflows with jobs that run based on the success or failure of previous steps. With appropriate monitoring you can see exactly where steps fail.

State can be passed between jobs by adding it to a dictionary keyed by the job id, which is then used as input for the next stage so it knows what to do next. See here for an example.
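One way to do this is with workflow run properties: a job triggered by a workflow receives the workflow name and run id as arguments, and can read and write a shared dictionary of properties via the Glue API. A rough sketch (the property key and value are just illustrative):

import sys
import boto3
from awsglue.utils import getResolvedOptions

# A job triggered by a workflow receives these two arguments automatically
args = getResolvedOptions(sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
glue = boto3.client('glue')

# Read the shared state written by earlier jobs in the workflow
run_properties = glue.get_workflow_run_properties(
    Name=args['WORKFLOW_NAME'], RunId=args['WORKFLOW_RUN_ID']
)['RunProperties']

# ... do some work, then record state for the next stage (the key is arbitrary)
run_properties['stage1_output_path'] = 's3://datalake/stage1/'
glue.put_workflow_run_properties(
    Name=args['WORKFLOW_NAME'],
    RunId=args['WORKFLOW_RUN_ID'],
    RunProperties=run_properties,
)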

TIP # 7 — Use Python Shell jobs.

Don’t be shy about using straight Python shell jobs for tasks where you don’t need a Spark cluster. There is a bunch of common Python libraries pre-installed, as per the documentation, but you can also add your own based on your needs.

They are great for code that checks the state of an asynchronously running job, does other AWS API work, or runs basic machine learning tasks. Startup time for Python shell code is almost instant, unlike a Spark job which can take up to 10 minutes to start. Pricing is based on 1-minute intervals per 0.0625 DPU rather than the 10-minute intervals for Spark jobs, so it is a lot more economical as well.

TIP # 8 — Make use of bookmarks if you can.

AWS Glue Job Bookmarks are a way to keep track of unprocessed data in an S3 bucket. As long as your data streams in with unique names, Glue will behind the scenes (as long as you are using DynamicFrames) only send the files that still need processing to the job. Note that you need to ensure a transformation_ctx="<<variablename>>" parameter is set on your calls to the Glue API. <<variablename>> can be arbitrary, but by convention you make it the name of the current dataframe, as per the example in the Job Bookmark documentation:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "database", table_name = "relatedqueries_csv", transformation_ctx = "datasource0")
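Bookmarks also rely on the job being initialised and committed, and on the bookmark option being enabled for the job (in the console, or via the --job-bookmark-option argument). A minimal sketch of the surrounding script:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... create_dynamic_frame / write_dynamic_frame calls, each with a transformation_ctx ...

# Bookmark state is only persisted when the job commits
job.commit()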

TIP # 9 — Store original data location with processed data.

If you are processing incoming files from S3, store the input filename as an additional column with the record. Your data analysts will thank you :-)

Eg:

from pyspark.sql.functions import input_file_name
<code...>
df=df.withColumn('datalake_source_file', input_file_name())

TIP # 10 — Consider streaming ETL jobs.

Streaming ETL Glue jobs were added in April 2020. If you have a need to continually stream data from a source like Kafka or Kinesis, this potentially saves a lot of pain and custom code, as Glue can update destination files in place for you (all file formats, including Parquet). Previously this was very difficult to do in a manner that allows near-instant querying of the updated data.

This feature also allows custom scripts to be written to perform any operations that the Apache Spark Structured Streaming engine supports.
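As a rough sketch of the shape of such a script, following the documented forEachBatch pattern (the catalogue table, S3 paths and window size are placeholders):

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

# Read a Data Catalog table that points at a Kinesis (or Kafka) stream
data_frame = glueContext.create_data_frame.from_catalog(
    database="mydb",
    table_name="my_kinesis_stream_table",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(batch_df, batch_id):
    # Each micro-batch arrives as a Spark DataFrame; write it out as Parquet
    if batch_df.count() > 0:
        dyf = DynamicFrame.fromDF(batch_df, glueContext, "from_stream")
        glueContext.write_dynamic_frame.from_options(
            frame=dyf,
            connection_type="s3",
            connection_options={"path": "s3://datalake/streaming/output/"},
            format="parquet",
        )

glueContext.forEachBatch(
    frame=data_frame,
    batch_function=process_batch,
    options={"windowSize": "100 seconds", "checkpointLocation": "s3://datalake/streaming/checkpoint/"},
)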

In later posts I will probably dive deeper into some of these topics, but I hope you learnt something from this list that you didn’t already know!

References:

Glue 1.0 (which you should be using) uses Spark 2.4.3

https://docs.aws.amazon.com/glue/latest/dg/release-notes.html

https://docs.aws.amazon.com/glue/index.html

https://spark.apache.org/docs/2.4.3/

https://www.jitsejan.com/developing-glue-scripts-on-mac-osx.html

https://gist.github.com/gwhitelaw/88095e01209b79b627a7ff7c8371b2cf
