AWS Glue: An ETL Solution with Huge Potential

Ariel Diamond
Capital One Tech
Apr 15, 2020
[Image: Your new friend, Glue.]

AWS Glue is a relatively new fully managed serverless Extract, Transform, and Load (ETL) service that has enormous potential for teams across enterprise organizations, from engineering to data to analytics. Glue combines the speed and power of Apache Spark with the lightweight data organization of Hive metastores to — you guessed it — glue together disparate data sources from across AWS. While AWS Glue was announced over two years ago, it is still being actively developed, with notable new features added every few months. Glue is not a silver bullet, and it might not be as seamless or accessible as you would like it to be, but with the right strategy and use case, it can be a great fit. I’d like to share my learnings from the last year of working with Glue, to help inform your decisions and also avoid some… sticky situations.

Why Choose Glue?

As a backend software engineer at Capital One, my team selected Glue as the solution for a serverless notification service that needed to process data from multiple sources across the company, including files and databases. While my team is not specifically a data engineering team, we recognized that our pattern was essentially an ETL pipeline: ETL jobs move data from a source, transform it to the form that is required, and load the resulting data and schema into a target system. We knew we wanted a serverless solution if at all possible, and when we found that AWS had a serverless ETL service, we decided to take the plunge and become the first team in the company to use Glue jobs in production.

Glue strives to address both data setup and processing in one place with minimal infrastructure overhead. The Glue data catalog can make both file-based and traditional data sources available to Glue jobs, including schema detection through crawlers. Glue’s data catalog can share a Hive metastore with AWS Athena, a convenient feature for existing Athena users like us.

To run jobs that process this data, Glue can use a Python shell, Spark, or, most recently, Spark Streaming, a beta feature that you have to enable. Glue jobs can be written in Python or Scala. My team used Spark with Python 3, so I will speak from that experience. When Glue jobs use Spark, a Spark cluster is automatically spun up as soon as a job is run. Instead of manually configuring and managing Spark clusters on EC2 or EMR, Glue handles that for you nearly invisibly.
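If you’re curious what that looks like in practice, here is the standard boilerplate that a Spark-based Glue job script starts from (this mirrors what the console autogenerates; treat it as a sketch):

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Glue passes --JOB_NAME (plus any custom arguments) to the script
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session  # a plain SparkSession, if you need one

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # ... extract, transform, and load steps go here ...

    job.commit()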

One of the most useful things about Glue is that its default timeout is two days — unlike Lambda’s maximum of 15 minutes. This means that Glue jobs can be used essentially like a Lambda for work that runs too long or too unpredictably for Lambda’s limit. While Glue was not intended for this purpose, AWS Solutions Architects have confirmed that this is a common and acceptable use case.
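For example, starting a run of an existing job is a single boto3 call. A sketch, assuming a hypothetical job name (Timeout is in minutes; omit it to fall back to the default of 2,880 minutes, i.e., two days):

    import boto3

    glue = boto3.client("glue")

    # Start a run of an existing Glue job; Timeout is in minutes
    response = glue.start_job_run(
        JobName="my-notification-job",  # hypothetical job name
        Timeout=120,
    )
    print(response["JobRunId"])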

This AWS video, “Getting Started with AWS Glue ETL,” is a good introduction if you are a visual learner like me.

A Short Tour of AWS Glue

Glue divides its main services into the Data Catalog and ETL. As Glue uses common data terms but slightly changes their definitions, I’ll translate them below.

AWS Glue Data Catalog

The Glue Data Catalog is where metadata must be stored for Glue jobs to access your data. When Athena and Glue are connected (Athena needs to be upgraded to do so), any Athena database or file-based table that exists in Athena is also available to Glue jobs.

  • Tables are not your typical relational database tables; they are metadata definitions of data sources, not the data itself. It’s a little like a link with a preview: a Glue table tells you where the data is located and what fields and types you should find there. Glue tables can describe file-based data stored in S3, such as Parquet, CSV, or JSON, as well as data in traditional datastores like RDS tables. The latter sources need to be connected and crawled in order to be accessible.
  • Databases are essentially a grouping of data sources that tables belong to. Creating a database in Glue only requires a name.
  • Connections create a verified link between Glue and RDS (Postgres, MySQL, etc.), Redshift, or JDBC instances, and allow Glue to access the data stored there.
  • For the metadata of data sources accessed via connections to be available in your Glue database, you need to set up and run a Crawler, which “connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your data catalog” (AWS Glue documentation). Crawlers are also great at determining the schema of complex unstructured or semi-structured data, which can save a ton of time. Parquet and Avro files are notoriously difficult to make human-readable, and crawlers can sort out those schemas in a matter of minutes. (A minimal boto3 sketch of creating and running a crawler follows this list.)
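Here is that sketch: create a database, point a crawler at an S3 prefix, and run it. All names, the role, and the path are hypothetical placeholders.

    import boto3

    glue = boto3.client("glue")

    # A database is just a named grouping for the tables the crawler creates
    glue.create_database(DatabaseInput={"Name": "my_catalog_db"})

    # Point a crawler at an S3 prefix; it infers the schema and writes
    # metadata tables into the database above
    glue.create_crawler(
        Name="my-s3-crawler",
        Role="MyGlueServiceRole",  # an IAM role that Glue can assume
        DatabaseName="my_catalog_db",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/events/"}]},
    )

    glue.start_crawler(Name="my-s3-crawler")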

ETL

The ETL section houses the scripts and tools that use the data sources set up in the Data Catalog to extract data, transform it, and load it into the location where it is needed.

  • The core of Glue ETL is Jobs. A job consists of a script that can load data from the sources in the data catalog and perform transformations on it. Glue can autogenerate a script, or you can write your own in Python (PySpark) or Scala. Glue also allows you to import external libraries and custom code into your job by linking to a zip file in S3. As we were developing before Glue could be run locally, we isolated the Glue-specific code in the job script, then moved the rest of our Python code into a more typical and testable application structure that we zip up and deploy with the script.
  • ML Transforms are a subcategory of jobs that, per the docs, “provides machine learning capabilities to create custom transforms to cleanse your data.” You can connect data stores from the catalog and “tune transform” to identify duplicate data, for example.
  • Triggers run jobs. They can fire on a schedule, on command, or upon a job event, and scheduled triggers accept cron expressions.
  • Workflows are a combination of triggers and jobs. For example, you may have one job that needs to complete before the next job runs. You can create a workflow with a trigger that starts Job 1 and a second trigger that waits for Job 1 to complete and then starts Job 2 (sketched in code after this list).
  • Glue also has Dev Endpoints and Notebooks, which let you develop and test scripts more efficiently by provisioning a dedicated Spark cluster that you can run jobs on continuously. They are not available in my development environment, so I cannot speak to their functionality.
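To make the trigger and workflow ideas concrete, here is a boto3 sketch of the Job 1 → Job 2 pattern described above; job and trigger names are hypothetical:

    import boto3

    glue = boto3.client("glue")

    # Scheduled trigger: run job_1 every night at 2am UTC
    glue.create_trigger(
        Name="nightly-job-1",
        Type="SCHEDULED",
        Schedule="cron(0 2 * * ? *)",
        Actions=[{"JobName": "job_1"}],
        StartOnCreation=True,
    )

    # Conditional trigger: start job_2 only after job_1 succeeds
    glue.create_trigger(
        Name="job-2-after-job-1",
        Type="CONDITIONAL",
        Predicate={"Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "job_1",
            "State": "SUCCEEDED",
        }]},
        Actions=[{"JobName": "job_2"}],
        StartOnCreation=True,
    )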

Getting Started in Glue

First and foremost: for Glue to be able to glue everything together, you’ll need to set up your IAM roles and S3 bucket policies to be accessible to Glue. For an existing IAM role to be usable in Glue, you’ll need to add Glue to that role’s trust relationships (a sketch follows below). To use Athena with Glue, you’ll need to upgrade Athena and also modify existing roles. There are more of these environment-specific gotchas, and you’ll probably discover others as you go.
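For the trust-relationship piece, a hedged boto3 sketch; the role name is hypothetical, and note that this call replaces the role’s whole trust policy, so merge this statement into any existing statements rather than dropping them:

    import json

    import boto3

    # Allow the Glue service to assume an existing role.
    # CAUTION: update_assume_role_policy replaces the entire trust policy;
    # merge with the role's existing statements in real use.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

    boto3.client("iam").update_assume_role_policy(
        RoleName="MyGlueServiceRole",  # hypothetical existing role
        PolicyDocument=json.dumps(trust_policy),
    )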

[Screenshot: step-by-step instructions for creating a crawler.]

After trying a few different learning strategies, I recommend starting with the interactive tutorials in the left sidebar in Glue. They guide you through making a crawler, a table, and a trigger, which gives you a sense of Glue’s capabilities faster than reading the docs.

I also recommend exploring the autogenerated scripts and code snippets. Add a new job and choose “This job runs proposed script generated by AWS Glue.” One of the easiest transforms is renaming and reordering columns, as in the sketch below.
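Under the hood, the generated script does this with the ApplyMapping transform. A trimmed sketch, assuming the glue_context from the boilerplate shown earlier and hypothetical database, table, and field names:

    from awsglue.transforms import ApplyMapping

    # Read a catalog table as a DynamicFrame
    source = glue_context.create_dynamic_frame.from_catalog(
        database="my_catalog_db", table_name="events"
    )

    # Each mapping is (source field, source type, target field, target type);
    # the output column order follows the order of this list
    renamed = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("event_ts", "string", "timestamp", "string"),
            ("userId", "long", "user_id", "long"),
        ],
    )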

[Screenshot: Glue mappings diagram.]

After a script has been generated, you can add a variety of autogenerated transforms as well.

Glue Sticking Points

On the whole, I’ve had a good experience with Glue. My team does not need to provision Spark clusters or rehydrate servers as we would have with a more traditional solution. We can also run code in a sandbox-style environment where we can test jobs by simply clicking “Run job.”

But with heavy management comes less control, and there are some significant challenges that may impact your use case. When a service is managed, it also isn’t always clear what additional knowledge you need to get the most out of the product — or to avoid major issues.

Development Speed: Hurry Up and Wait

Glue jobs still run on Spark clusters, and those take a while to spin up whether or not you are managing them yourself. Jobs can take up to 20 minutes to start — not counting the time it takes to run — especially during peak times. A confounding factor is that jobs can fail to start because of an error that Glue logs may not help you solve. I spoke to AWS support about this, and there are apparently times when they simply don’t have any clusters available, resulting in the error “Resource unavailable.” My understanding is that they are working to solve this, but I am unsure of the timeline.

Jobs can start in less than a minute if you happen to be running on an existing cluster, but there is not a surefire way of keeping a cluster running or keeping your job on a running cluster without using a Dev Endpoint, which can significantly impact your AWS costs.

AWS seems to be taking this seriously — they recently posted an announcement in the console that lets users sign up for a beta feature guaranteeing that jobs start in less than a minute. I have signed up, but have not yet been given access to this feature.

Testing Can Be Tough

We spent much of our early days running jobs iteratively in the AWS console, which was slow and not version-controlled, as each new save of the script deletes the previous file. At the time, it was not possible to run Glue locally, so we could not test components locally. We worked around this problem by factoring out the Glue-specific code and testing the remainder with existing libraries like pytest, and more recently, began developing in Databricks.
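As an illustration of that factoring, here’s a sketch with hypothetical names (not our actual code): keep transformation logic in plain Python functions with no Glue or Spark imports, then unit test those functions with pytest.

    # transforms.py: pure Python, no Glue or Spark imports
    def normalize_keys(record: dict) -> dict:
        """Trim and lower-case the keys of a record."""
        return {key.strip().lower(): value for key, value in record.items()}


    # test_transforms.py: runs under plain pytest, no cluster needed
    from transforms import normalize_keys

    def test_normalize_keys():
        assert normalize_keys({" Name ": "x", "ID": 1}) == {"name": "x", "id": 1}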

At the end of August 2019, AWS announced that “…you can now import the released Java binaries of Glue ETL libraries using Maven… locally.” This could be a big improvement, but it is unfortunately not usable in our specific development environment, though it may work for you.

It’s Changing All The Time

Since Glue is still fairly new and under active development, you may encounter problems that Stack Overflow and the docs don’t yet solve. Each improvement, though much appreciated, brought complexity that needed to be managed. Integrating with various Boto3 clients and other AWS services caused some thorny problems with access and configuration. I did not find Glue’s error messages to be as helpful as they could be, and there does not seem to be a way to reduce noise in the verbose logs in order to identify fatal issues more quickly.

As with any tool, Glue can be used as it was intended, but it is fairly easy to use as it was not intended. The docs and tutorials represent Glue as an accessible solution that is easy to use and autogenerates scripts and takes away the hassle of managing Spark clusters. That increased accessibility opens the door to users that don’t necessarily follow the expected path, and the docs are not prepared for them (er… us).

Hindsight is 2020

If I could go back and do it over with what I know now, here’s what I would advise:

  1. Either know Spark or don’t use Spark! We thought we could just write vanilla Python and run it in Glue. Nope — PySpark looks like Python but doesn’t act like Python. We spent a lot of time retroactively figuring out how to rewrite our code to be Spark-optimized so we wouldn’t run out of memory. In hindsight, considering our team had minimal Spark knowledge, we should have either done a crash course in Spark or used Glue essentially as a Spark-free Athena orchestrator or as a Lambda with a longer timeout.
  2. Use Databricks for faster development. While we were learning how to use Glue, we were running jobs that often took ten minutes to start, only to fail because of something like a syntax error. To fail faster, use Databricks to run code snippets (or your whole application) and confirm your syntax and logic without waiting for a new cluster to spin up in Glue. Databricks was founded by the same folks who created Spark, so it is a reliable sandbox environment to work in. Note that there are some syntax differences between Glue Spark and vanilla Spark to look out for, mainly in how the context is initialized. Running Glue locally is another option, but it was not feasible during our development period.
  3. Set up alerts to monitor job status changes. Considering how much time is spent checking to see whether a job is done, I’m glad I finally set up an email alert through AWS SNS that tells me when my job succeeds or fails. Instead of constantly context switching, I can monitor jobs without needing to log in to the AWS console. I recommend setting this up right away; a sketch of the setup follows this list.
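Here is that sketch, using boto3 to create a CloudWatch Events rule on Glue’s “Glue Job State Change” events and route them to an SNS topic. The rule name and topic ARN are hypothetical, and the topic’s access policy must allow events.amazonaws.com to publish:

    import json

    import boto3

    events = boto3.client("events")

    # Fire on the Glue job states worth alerting on
    events.put_rule(
        Name="glue-job-state-alerts",
        EventPattern=json.dumps({
            "source": ["aws.glue"],
            "detail-type": ["Glue Job State Change"],
            "detail": {"state": ["SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"]},
        }),
    )

    # Route matching events to an SNS topic with an email subscription
    events.put_targets(
        Rule="glue-job-state-alerts",
        Targets=[{"Id": "sns-email",
                  "Arn": "arn:aws:sns:us-east-1:123456789012:glue-alerts"}],
    )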

Glue and You

Despite these challenges, AWS Glue can be an efficient and useful tool that is great for a number of use cases that require moving or transforming data. My team’s transition to Glue has significantly reduced infrastructure and complexity, and I look forward to seeing how Glue can be used in the future. I hope that this introduction to Glue allows you to start experimenting with less friction to see if Glue is the right tool for your work.

Please feel free to reach out to me with your experiences with Glue.

DISCLOSURE STATEMENT: © 2020 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.
