AWS Glue Python Shell jobs — first impressions

Marek Šuppa
Slido developers blog
7 min read · Feb 18, 2019

A few weeks ago, Amazon introduced a new addition to its AWS Glue offering: the so-called Python Shell jobs. As our ETL (Extract, Transform, Load) infrastructure at Slido uses AWS Glue extensively (more on that in a separate article), we were eager to try it out. Here is a short summary of our first impressions.

Why Python Shell jobs?

Amazon Web Services (AWS) is known for being rather confusing — to such an extent that guides like “Amazon Web Services in Plain English” exist. AWS Glue is no exception. It is marketed as Amazon’s hosted solution for authoring and managing ETL jobs, which can be defined using Scala or Python. As Python was already supported, why would anyone be interested in this new ‘Python Shell’ thing?

Let’s see what the press release has to say about it:

Previously, AWS Glue jobs were limited to those that ran in a serverless Apache Spark environment. You can now use Python shell jobs, for example, to submit SQL queries to services such as Amazon Redshift, Amazon Athena, or Amazon EMR, or run machine-learning and scientific analyses.

In simple terms, the Python jobs on AWS Glue let you define a Spark job in Python. They are not really built for anything else (even things like parsing a YAML file), at least not out of the box. The Python Shell job tries to remedy this to some extent by providing the ability to run “quick” Python scripts that are out-of-the-box ready to interact with AWS (via the boto3 library), do some relatively simple data wrangling with numpy, pandas or sklearn.preprocessing, or even train some (relatively small) Machine Learning models using scikit-learn itself.

It is never wise to judge a book by its cover (or an AWS offering by its press release), so let us see what these Python Shell jobs are really like by taking a look at their internals.

Python Shell jobs, from the inside

One of the selling points of Python Shell jobs is the availability of various
pre-installed libraries that can be readily used with Python 2.7. The documentation mentions the following list:

  • Boto3
  • collections
  • CSV
  • gzip
  • multiprocessing
  • NumPy
  • pandas
  • pickle
  • PyGreSQL
  • re
  • SciPy
  • sklearn
  • sklearn.feature_extraction
  • sklearn.preprocessing
  • xml.etree.ElementTree
  • zipfile

Although the list looks quite nice, at least one notable detail is missing: the version numbers of the respective packages. Thankfully, Python’s pkg_resources module allows us to get them in a fairly simple manner by executing a short snippet along the following lines:
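
    import pkg_resources

    # Print every package visible in the job's environment together
    # with its version, sorted by (case-insensitive) package name.
    for package in sorted(pkg_resources.working_set,
                          key=lambda p: p.project_name.lower()):
        print("{} {}".format(package.project_name, package.version))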

And the full list can be seen below:

A list of pre-installed packages in AWS Glue Python Shell, along with their respective version numbers.

So what are we looking at?

If we take a closer look at the version numbers of numpy, pandas and scikit-learn — the three “pillars of data science” — and compare them with their respective release pages, we can see that at the time of writing these packages were more than 12, 13 and 18 months old, respectively. It therefore seems that Amazon took a conservative approach when deciding which package versions to support. One can only guess why they did so, but it would not be unreasonable to expect that these older versions reflect the needs of their enterprise customers. To such a customer, a reliable old(er) package is much more valuable than a cutting-edge one with a ton of potential to break in unexpected situations.

Amazon’s relationship with their (especially bigger) customers is a topic for a separate article, but it may be a plausible explanation for another library in the list above. The pygresql module provides an interface to the PostgreSQL database engine from Python. There is no question that some of Amazon’s customers make use of it (check, for instance, this list), and those customers also seem to have a pretty big impact on AWS Glue’s roadmap. Although PostgreSQL has certainly been gaining popularity in the past few years, it seems strange not to include a package for interacting with MySQL (such as pymysql), thus neglecting the sizable number of MySQL installations that still exist — especially since many of them run on AWS infrastructure.
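
For completeness, querying a database through the pre-installed pygresql could look roughly like this, via its pgdb (DB-API 2.0) module; the host, credentials and query below are made-up placeholders:

    import pgdb  # PyGreSQL's DB-API 2.0 interface

    # A hypothetical connection; every detail here is a placeholder.
    con = pgdb.connect(
        host="postgres.example.com:5432",
        database="analytics",
        user="etl_user",
        password="not-a-real-password",
    )
    cur = con.cursor()
    cur.execute("SELECT COUNT(*) FROM events")
    print(cur.fetchone()[0])
    con.close()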

Nearly all of the remaining packages on the list seem to support what looks like Amazon’s primary goal with AWS Glue Python Shell jobs: providing another way of executing AWS API commands from within AWS itself. Arguably, the simplest way of doing so up until now has been AWS Lambda, and some seem to use it exactly that way. The newly introduced Python Shell jobs let you achieve similar results without ever leaving the AWS Glue interface.
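
For instance, a Python Shell job can submit a query to Amazon Athena with nothing but the pre-installed boto3; the region, database, query and output location below are illustrative placeholders:

    import boto3

    # Start an Athena query straight from a Python Shell job.
    athena = boto3.client("athena", region_name="eu-west-1")

    response = athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM events",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={
            "OutputLocation": "s3://example-bucket/athena-results/"
        },
    )
    print("Started query: " + response["QueryExecutionId"])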

Having gone through the list of actually installed packages, one cannot help but wonder what Amazon’s goal is with listing support for various packages in the documentation. Not only is the list far from exhaustive, as we saw above, but more than half of the packages it mentions are built-in and ship with any standard Python installation. As such, when writing Python scripts these packages are taken for granted, along with the other features the Python language provides. Explicitly mentioning their support seems redundant at best.

Python 2? Really?!

While the previous section has discussed quite a lot of technical details, it did not mention the potentially most interesting one: the version of Python itself.

By Amazon’s own admission in the docs, we know that with an AWS Glue Python Shell job “you can run scripts that are compatible with Python 2.7”. Thanks to the listing above, we also know that it actually runs version 2.7.14, released 17 months before this article was written. Given all we learned in the previous section, this version of Python can actually be considered fairly recent! All in all, not a big deal, right?

Well, sort of.

The problem here is the future outlook. Although some doubted it at first, most of the relevant open-source packages in the data-science ecosystem have pledged to drop Python 2 support by 2020, which also happens to be the End of Life date of Python 2.7. Using Python 2 beyond that point means relying on obsolete and largely unsupported technology.

Looking at the “pillars of data science” — that is, numpy, pandas and scikit-learn — all three dropped Python 2.7 support for future releases last year, meaning that all newly released versions will require Python 3. In other words, any AWS Glue Python Shell script written today will most probably need to be updated to work correctly with Python 3 at some point in the future. This should not be too painful with tools like 2to3, but it is something worth considering when choosing to write a Python Shell job on AWS Glue today.
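
One way to soften the eventual migration is to write Python 3-style code that still runs on 2.7, for instance with __future__ imports; a minimal illustration:

    # Python 3 semantics on a Python 2.7 interpreter.
    from __future__ import absolute_import, division, print_function

    print("runs the same on Python 2.7 and Python 3")
    print(1 / 2)  # 0.5 on both, thanks to the division import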

Still, releasing a product that only supports a language version in its last year of official support has the potential to raise at least a couple of eyebrows. Hence the interrobang in the subtitle.

Price, possible use cases and other considerations

Similarly to other AWS Glue jobs, the Python Shell job is priced at $0.44 per Data Processing Unit (DPU) hour, with a 1-minute minimum. The term DPU has the potential to sound both cool and intimidating, but per the documentation it loosely translates to “4 vCPUs of compute capacity and 16 GB of memory”. A standard Python Shell job can use either a single DPU or 1/16 of one (Amazon keeps mentioning 0.0625 in their materials), with the price adjusted accordingly.

Doing some quick math with the current pricing, running a quick (up to 1 minute of execution time) Python script that, for instance, only uses boto3 to trigger some AWS operations is more than three times cheaper on AWS Lambda ($0.00012) than on AWS Glue ($0.00046). Once raw processing power starts to matter and close to 1 GB of memory is required, AWS Glue becomes a bit cheaper than AWS Lambda ($0.00046 vs. $0.00100). Plus, it comes with more than a few pre-installed libraries, which can be quite useful.
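
Here is a rough sketch of the arithmetic behind these numbers, assuming early-2019 prices and ignoring AWS Lambda’s per-request charge:

    # Back-of-the-envelope comparison of the two services.
    GLUE_DPU_HOUR = 0.44             # USD per DPU-hour
    LAMBDA_GB_SECOND = 0.0000166667  # USD per GB-second

    # AWS Glue: the smallest Python Shell job uses 1/16 DPU (~1 GB of
    # memory) and is billed with a 1-minute minimum.
    print("Glue, 1 min at 1/16 DPU: $%.6f" % (GLUE_DPU_HOUR * 0.0625 / 60))

    # AWS Lambda: one minute at the smallest (128 MB) memory setting...
    print("Lambda, 1 min at 128 MB: $%.6f" % (LAMBDA_GB_SECOND * 0.125 * 60))

    # ...and one minute at 1 GB, where Glue becomes the cheaper option.
    print("Lambda, 1 min at 1 GB:   $%.6f" % (LAMBDA_GB_SECOND * 1.0 * 60))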

This economic perspective may also suggest when it actually makes sense to use AWS Glue Python Shell jobs. If you are looking for a quick and relatively cheap way of training scikit-learn models in the AWS ecosystem, this new offering is probably the best answer there currently is. Despite the claims in the press release, however, it still seems economically more viable to submit queries to Amazon Redshift, Amazon Athena, or Amazon EMR using AWS Lambda.
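
As an illustration of that use case, a Python Shell job could read a training set from S3, fit a small scikit-learn model and pickle it back; the bucket, keys and label column below are made-up placeholders:

    import pickle
    from io import BytesIO

    import boto3
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    s3 = boto3.client("s3")

    # Load a training set from S3 into pandas...
    obj = s3.get_object(Bucket="example-bucket", Key="training/data.csv")
    df = pd.read_csv(BytesIO(obj["Body"].read()))

    # ...fit a small scikit-learn model...
    model = LogisticRegression()
    model.fit(df.drop("label", axis=1), df["label"])

    # ...and persist the pickled model back to S3 for later use.
    s3.put_object(
        Bucket="example-bucket",
        Key="models/model.pkl",
        Body=pickle.dumps(model),
    )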

Conclusions

AWS Glue Python Shell jobs are certainly an interesting addition to the AWS Glue family, especially when it comes to smaller-scale data wrangling or even training and then using small(er) Machine Learning models. At first sight it may look like Amazon is trying to provide a competitor to Apache Airflow, but that does not seem to be the case. Even the business case presented at re:Invent, where AWS Glue Python Shell jobs were introduced, utilized Airflow in the end (see the actual presentation slides for more details).

Though there is some functional overlap between this new product and the previously available AWS Lambda, Python Shell jobs seem to be the better option when more computational resources are required. The added benefit is that you do not need to deal with setting up the required components: they are readily available. There is a small caveat, though: you will be working with fairly old versions of the provided packages, in a language that will no longer be officially supported next year.

Despite all that, here at Slido we will most probably make use of AWS Glue Python Shell jobs whenever we need to train larger numbers of scikit-learn models at scale, as they currently seem to be the easiest and most cost-effective solution AWS has to offer for this use case.

Thanks to Katarína Mrvová, Peter Hraška and Jan Soltis for reading drafts of this.
