Professional AWS Glue PySpark Development — Local Development and Unit Tests

Dominik Schauer
10 min read · Dec 18, 2022

Introduction to this series of articles

This is the first part of a series of articles that will help anyone set up a professional AWS Glue PySpark development environment.

Topics to cover:

  • Local Development
  • Unit tests
  • Mocking AWS resources
  • CI/CD Pipeline
  • Integration Tests
  • Git Project Structure
  • Orchestrating multiple ETL Jobs
  • Infrastructure as Code
  • Complete project putting everything together

Often you will find posts that try to cover several of these topics at once. To keep things simple, I will separate them. The posts are structured so that the complexity of the project increases over time. I consider this the most user-friendly approach.

My recommendation is to read them from start to finish and to code along. There is definitely a steep learning curve, but in the long run you will save a lot of time by starting with the fundamentals. If you already have a good grasp, feel free to skip to the parts that interest you the most.

Introduction to this article

In case you’re new to AWS Glue you might be afraid of the cloud computing cost that comes with executing Glue jobs 1000 times just to test things out. That’s why we start this series by setting up a local development environment to get rid of this anxiety.

This article covers local development and local unit testing. First we take a look at options for developing AWS Glue jobs — local development and using interactive sessions.

Afterwards I guide you through setting up your local development environment. If you’re working on a team project, this is what each developer will have to do on their local machine.

Finally we will add a unit test to our PySpark code and run it locally.

Which options do I have for developing AWS Glue Jobs in general?

Before jumping into how to set up local development, I first want to mention alternative approaches.

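You have the following options for developing AWS Glue jobs:

  • Local development, Option A: the official AWS Glue Docker container
  • Local development, Option B: the AWS Glue libraries installed directly on your machine
  • Interactive Sessions on AWS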

What are their pros and cons?

So there are basically two options: Paying AWS for ease of use and processing power via Interactive Sessions or developing locally free of charge but with added complexity and possibly slower processing.

The choice depends on the timeline of your project and the cost of labor. If you’re short on time and want quick results, go for Interactive Sessions. If your developers’ time is precious, go for Interactive Sessions as well. Local development will save you cloud infrastructure cost, but the added cost of labor might be higher than the savings. This is a trade-off that needs to be considered when choosing between the two.

For me, the killer feature of local development is that it allows you to run unit tests. Interactive Sessions only offer a Jupyter Notebook development experience and thus don’t allow you to test your code in the same automated way.

Local Development — Introduction

As mentioned above, there are two distinct options for developing locally: you can use the official AWS Glue container maintained by Amazon on DockerHub, or install a set of libraries directly on your local machine. So this is another choice you need to make.

Running the container is resource-heavy. In my personal environment the running container takes up about 7–11 GB of RAM on Windows and about 4 GB on Ubuntu. Using the locally installed libraries would be a more lightweight approach. Personally I still prefer the Docker option. The advantages are 1. ease of setup and 2. less risk of errors caused by outdated libraries and particularities of individual developers’ environments. So Option A: Docker Container is what we’re going with here.

Step by Step Guide

The main source for this article is the official AWS Glue documentation on “Developing and testing locally”. On the AWS Big Data Blog you can also find another similar tutorial.

The documentation contains only code written for Unix users. Since I’m using Windows 10 most of the time I also made some slight modifications and provide them here.

Step 1: Install Visual Studio Code

This is a standard installation. If you need help, a quick web search will turn up the official installer.

Step 2: Docker

  • Install Docker
  • Start the Docker Daemon
  • In case you’re using Windows: Make sure you’re using Linux containers by right-clicking the Docker icon in the task bar and choosing Switch to Linux containers… if that option is shown. If the option you see is Switch to Windows containers… instead, you’re already good to go.
  • Pull the Docker Image from DockerHub. To do so, run the following in PowerShell or Unix Shell: docker pull amazon/aws-glue-libs:glue_libs_3.0.0_image_01
  • This image is meant for developing Glue 3.0 jobs. There might be a more recent version available when you’re reading this article, or your organization might require you to develop 2.0 jobs. Check https://hub.docker.com/r/amazon/aws-glue-libs/tags for other versions.
  • This step might take some time, as you’re downloading about 2 GB and then extracting it. You can continue with the next steps in the meantime.

Step 3: AWS Profile

  • Install the AWS CLI
  • Set up an AWS profile:
    In PowerShell: $PROFILE_NAME="glue-dev"
    In Unix/bash: PROFILE_NAME=glue-dev
  • You could also use another name for your profile, I simply chose mine to be glue-dev, nothing special about the name. We will need the name of the profile later when we pass credentials to the Docker container and tell it which profile to run. That’s why we store it in a variable.
  • In case it’s the first time you set up an AWS profile, read this.
  • Make sure that your AWS profile has exactly the same name that you chose in the variable above. Otherwise you will run into permission errors later.
  • You wouldn’t do this in production, but for demonstration purposes, set up your IAM user with admin privileges. In case that’s not possible, choose S3 read permissions on s3://awsglue-datasets/examples/us-legislators/all/persons.json. We will use this public bucket provided by AWS in the sample code in one of the following steps.
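
Because a mismatched profile name is a likely source of permission errors later, a quick check can help. The following is a minimal sketch, assuming the standard INI layout of the AWS files (a [glue-dev] section in credentials, a [profile glue-dev] section in config). The helper name profile_exists is hypothetical and not part of any AWS tool, and the key values are placeholders.

```python
import configparser


def profile_exists(ini_text, profile_name, is_config_file=False):
    """Return True if the given AWS profile is defined in the INI text.

    In ~/.aws/credentials the section is named after the profile itself;
    in ~/.aws/config non-default profiles carry a "profile " prefix.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if is_config_file and profile_name != "default":
        section = f"profile {profile_name}"
    else:
        section = profile_name
    return parser.has_section(section)


# Inline example text with placeholder values; in practice you would
# read the contents of your own ~/.aws/credentials file instead.
example_credentials = """
[glue-dev]
aws_access_key_id = EXAMPLEKEY
aws_secret_access_key = EXAMPLESECRET
"""
print(profile_exists(example_credentials, "glue-dev"))
```

If the check fails for the name stored in PROFILE_NAME, either rename the profile or change the variable so that both match.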

Step 4: VS Code Remote Containers

  • Open VS Code
  • Install the “Python” Extension maintained by Microsoft
  • Install the “Remote Development” Extension maintained by Microsoft
  • In case you want to know more about what’s happening behind the scenes when you use Remote Development, you can read this.
  • Open Settings (Ctrl + ,)
  • (In case the Settings UI opens, go to Workbench > Settings > Editor and set the value to json. Then close the settings tab and open it again.)
  • In case the settings.json opens, paste the following two key-value pairs in and save it:
{
    "python.defaultInterpreterPath": "/usr/bin/python3",
    "python.analysis.extraPaths": [
        "/home/glue_user/aws-glue-libs/PyGlue.zip:/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip:/home/glue_user/spark/python/"
    ]
}
  • You will probably have other settings there as well. That’s fine: in that case, paste in only the two key-value pairs, without the surrounding curly brackets, and separate them from the existing entries with a comma.
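
If your settings.json already contains other entries, the step above amounts to adding two keys to the existing object. Here is a minimal Python sketch of that merge. The function name merge_glue_settings is hypothetical, and the sketch assumes the existing file is plain JSON without comments (VS Code tolerates comments in settings.json, but Python's json module does not).

```python
import json

# The two key-value pairs from the step above.
GLUE_SETTINGS = {
    "python.defaultInterpreterPath": "/usr/bin/python3",
    "python.analysis.extraPaths": [
        "/home/glue_user/aws-glue-libs/PyGlue.zip:/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip:/home/glue_user/spark/python/"
    ],
}


def merge_glue_settings(existing_json_text):
    """Add the Glue-related keys to existing settings and return JSON text."""
    # An empty file is treated as an empty settings object.
    settings = json.loads(existing_json_text) if existing_json_text.strip() else {}
    settings.update(GLUE_SETTINGS)
    return json.dumps(settings, indent=4)


# Example: one pre-existing setting is preserved next to the two new keys.
print(merge_glue_settings('{"editor.fontSize": 14}'))
```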

Step 5: Prepare sample code
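
As an illustration, here is a minimal sketch of what sample.py could contain, loosely modeled on the sample in the AWS Glue documentation on developing and testing locally. It reads the public persons.json dataset mentioned in Step 3. The function names read_json, run and main are illustrative choices, and glue_context is assumed to be an awsglue GlueContext, which is only available inside the Glue container.

```python
PERSONS_PATH = "s3://awsglue-datasets/examples/us-legislators/all/persons.json"


def read_json(glue_context, path):
    """Read JSON files from S3 into a Glue DynamicFrame."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [path], "recurse": True},
        format="json",
    )


def run(glue_context):
    """Load the sample dataset and print its schema."""
    dyf = read_json(glue_context, PERSONS_PATH)
    dyf.printSchema()
    return dyf


def main():
    # Imported here because these modules only exist inside the Glue container.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    run(GlueContext(SparkContext.getOrCreate()))
```

To run the file as a script inside the container, you would call main() at the bottom of the file under the usual if __name__ == "__main__": guard. Keeping read_json separate from the context setup is a deliberate choice: it lets a unit test pass in its own context.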

Step 6: Prepare environment variables

  • In PowerShell:
$WORKSPACE_LOCATION="<your project location>"
$SCRIPT_FILE_NAME="sample.py"
$UNIT_TEST_FILE_NAME="test_sample.py"
$AWS_FOLDER_LOCATION="<your local AWS folder containing config and credentials>"
$PROFILE_NAME="glue-dev" # in case you didn't set it already as instructed in Step 3
  • Example in my case:
$WORKSPACE_LOCATION="C:\Users\domin\projects\aws-glue-local-dev-and-test"
$SCRIPT_FILE_NAME="sample.py"
$UNIT_TEST_FILE_NAME="test_sample.py"
$AWS_FOLDER_LOCATION="C:\Users\domin\.aws"
  • In Unix/bash:
WORKSPACE_LOCATION=/local_path_to_workspace
SCRIPT_FILE_NAME=sample.py
UNIT_TEST_FILE_NAME=test_sample.py
AWS_FOLDER_LOCATION=~/.aws # your local AWS folder containing config and credentials; adjust if yours differs
PROFILE_NAME=glue-dev # in case you didn't set it already as instructed in Step 3
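
Before starting the container, it can be worth confirming that these values point at real files, since a wrong path only surfaces later as a missing script or a credentials error. Here is a minimal sketch, assuming the file layout described in this step. The helper name missing_paths is hypothetical.

```python
from pathlib import Path


def missing_paths(workspace, script_name, test_name, aws_folder):
    """Return the expected files that do not exist, as a list of strings."""
    workspace = Path(workspace).expanduser()
    aws_folder = Path(aws_folder).expanduser()
    expected = [
        workspace / script_name,       # the Glue script
        workspace / test_name,         # the unit test file
        aws_folder / "config",         # AWS CLI configuration
        aws_folder / "credentials",    # AWS CLI credentials
    ]
    return [str(path) for path in expected if not path.is_file()]


# Example call with the placeholder values from this step; an empty list
# means everything is in place.
print(missing_paths("/local_path_to_workspace", "sample.py", "test_sample.py", "~/.aws"))
```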

Step 7: Run the Docker Container

This is the first of the two most crucial steps in this guide and probably what you’ve been waiting for. We run the official AWS Glue container in a separate process. You will need to keep a separate PowerShell/Unix process running for as long as you’re developing.

  • In PowerShell:
docker run -it -v ${AWS_FOLDER_LOCATION}:/home/glue_user/.aws -v ${WORKSPACE_LOCATION}:/home/glue_user/workspace/ -e AWS_PROFILE=${PROFILE_NAME} -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark
  • In Unix/bash:
docker run -it -v $AWS_FOLDER_LOCATION:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark

Here we run the image with three important settings:

  • -v ${AWS_FOLDER_LOCATION}:/home/glue_user/.aws
    The container expects to find its AWS credentials in /home/glue_user/.aws. To use our own credentials we mount our local folder containing our own AWS credentials into this location.
    In case you run into AWS permission errors while running the PySpark jobs later, the problem is likely to be found either here or with the permissions of the IAM user you used for the profile.
  • -v ${WORKSPACE_LOCATION}:/home/glue_user/workspace
    Similarly, the container expects the PySpark code to be found in /home/glue_user/workspace. To use our own code we mount the folder containing our code there.
    Note: Because the folder is mounted rather than copied, changes you make during development in VS Code are written to your local workspace location, so you can also version control the code from outside of the container.
  • -e AWS_PROFILE=${PROFILE_NAME}
    This tells the container which profile to use to access AWS. Of course we supply the AWS profile created in the beginning of this guide.
    This setting is another potential cause of permission errors.
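
To make the role of each flag easier to see, the same command can be assembled piece by piece. This is a sketch only: the function build_docker_command is hypothetical, the example paths are placeholders, and the code simply reproduces the flags of the command above as an argument list that could be passed to subprocess.run.

```python
import shlex

IMAGE = "amazon/aws-glue-libs:glue_libs_3.0.0_image_01"


def build_docker_command(aws_folder, workspace, profile_name, image=IMAGE):
    """Return the docker run command from Step 7 as a list of arguments."""
    return [
        "docker", "run", "-it",
        # Mount local AWS credentials where the container looks for them.
        "-v", f"{aws_folder}:/home/glue_user/.aws",
        # Mount the project folder as the container's workspace.
        "-v", f"{workspace}:/home/glue_user/workspace/",
        # Tell the container which AWS profile to use.
        "-e", f"AWS_PROFILE={profile_name}",
        "-e", "DISABLE_SSL=true",
        "--rm",
        "-p", "4040:4040",
        "-p", "18080:18080",
        "--name", "glue_pyspark",
        image,
        "pyspark",
    ]


# Placeholder paths for illustration; substitute your own locations.
command = build_docker_command("/home/me/.aws", "/home/me/project", "glue-dev")
print(shlex.join(command))
```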

In case you’re struggling with this setup, take a look at my Git repository. In the setup folder you will find one short script each for Windows and Unix environments. You can change the values and then run:

  • Windows: ./setup/windows_environment_setup.ps1
  • Unix: source ./setup/unix_environment_setup

Step 8: Connect VS Code and the Docker Container

Now that the container is running, it’s time to link VS Code. For this we needed to install the Remote Development Extension earlier. (I’m copying the following instructions straight from the AWS Glue documentation. That’s why you see Mac OS screenshots here):

  • Start Visual Studio Code.
  • Choose Remote Explorer on the left menu, and choose amazon/aws-glue-libs:glue_libs_3.0.0_image_01.
  • Right click and choose Attach to Container. If a dialog is shown, choose Got it.
  • Open /home/glue_user/workspace.

Step 9: Submit a PySpark Job to the container

  • Create a Glue PySpark script and choose Run.
  • (In case you don’t see the Run icon, first install the Python Extension again inside the container by clicking on the extension tab. Extensions might differ between your local VS Code environment and the one found in the container.)
  • You will see the successful run of the script.

Step 10: Run Pytest unit tests on the container

The documentation explains how to run Pytest from outside of a container but there is no need to switch environments. Of course you can also run tests directly in the container: /bin/python3 -m pytest
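
For orientation, here is a minimal sketch of what a Pytest unit test for a PySpark transformation can look like. The transformation filter_by_gender, the column names and the sample rows are hypothetical and only serve as an illustration. The test uses pytest.importorskip so that it is skipped rather than failing when PySpark is not installed, for example outside the container.

```python
def filter_by_gender(df, gender):
    """Keep only rows whose 'gender' column equals the given value."""
    return df.filter(df["gender"] == gender)


def test_filter_by_gender():
    # Skip this test when PySpark is unavailable (e.g. outside the container).
    import pytest

    pyspark_sql = pytest.importorskip("pyspark.sql")

    spark = (
        pyspark_sql.SparkSession.builder
        .master("local[1]")
        .appName("unit-test")
        .getOrCreate()
    )
    try:
        # A tiny in-memory dataset keeps the test independent of S3.
        df = spark.createDataFrame(
            [("Ann", "female"), ("Bob", "male"), ("Cleo", "female")],
            ["name", "gender"],
        )
        result = filter_by_gender(df, "female")
        names = sorted(row["name"] for row in result.collect())
        assert names == ["Ann", "Cleo"]
    finally:
        spark.stop()
```

Saved in a file whose name starts with test_, this is picked up automatically by the pytest command above. Building the input data in memory means the test needs neither AWS credentials nor network access.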

Remember the comment at the beginning about the increased development time of a local environment compared with Interactive Sessions? Here you can see it in action. My machine has 16 GB of RAM and a fairly powerful CPU, and still, running a very basic test on a very basic transformation on a small dataset took almost a minute. This waiting time adds up quickly.

Closing words

So there you go. You just ran your first AWS Glue PySpark script in a local development environment using VS Code and the official AWS Glue Docker container. You also ran a Pytest unit test on it locally. Now you can extend the logic to fit your own requirements.
