Professional AWS Glue PySpark Development — Interactive Sessions

Dominik Schauer
Dec 20, 2022


This is the second part of my series on developing AWS Glue jobs. Last time we took a look at local development in VS Code using a Docker container.

This time we will set up a local development environment using a Jupyter notebook and an AWS Glue Interactive Session.

Structure of this post:

  • First there’s some theory: Jupyter Notebooks, Interactive Sessions and when you would use them
  • Then we follow the documentation to set up our environment on Windows 10 (Linux and macOS are very similar)
  • And finally we will follow up with an example including sample code and screenshots. You will also find a GitHub repository when we get to the example.

Jupyter notebooks

Jupyter notebooks are interactive documents that contain a combination of text, code, and code output. They are often used for data analysis and scientific computing but also come in handy for prototyping data transformations.

Interactive Sessions

AWS Glue Interactive Sessions are a feature of the AWS Glue service that allows users to analyze and transform data stored in Amazon S3 using Python (or Scala). From a developer's perspective, they provide a serverless PySpark (or Spark) environment that can be linked to one of two options:

  • Interactive Sessions Notebooks. These are Jupyter Notebooks managed by AWS and accessed via the AWS Console in the browser.
  • Your own Jupyter Notebooks running locally or on a local server

Interactive Sessions is a serverless service that is billed for at least one minute each time you start a session. A session automatically shuts down when you stop using it; the default idle timeout is 48 hours, but my recommendation is to set it a lot lower. More on that later.

When to use this setup?

This setup is ideal for rapid prototyping as opposed to writing production code. This is for two reasons:

  • Jupyter notebooks give you immediate feedback and store cell outputs for later, so you can step away and pick up where you left off without re-executing everything. You can also share them with your team members, for example as part of your Git repository.
  • The interactive session takes load away from your local machine. Last time we ran a resource-hungry Docker container. This time the Spark code runs on the AWS side.

Why would I run Jupyter notebooks locally if AWS also supplies Jupyter notebooks in the console?

Of course you can also do so, but this makes it harder to version control your Jupyter Notebook. When your .ipynb is stored locally you can just commit your work and push it to a repository. When you’re working in the AWS Console you would instead have to download the notebook each time you want to do so.

The advantage of the console notebooks is, of course, that you don't have to set up your local development environment manually since everything is in the cloud, so there are situations where using them is justified as well.

Getting started

My main resource is once again the official AWS Glue documentation. I'm also still using Windows 10, but you will find equivalent instructions for Linux and macOS there. The following is more or less copied from the docs, but I also add my personal experience and recommendations.

To code along you can use this GitHub repository: https://github.com/dmschauer/aws-glue-interactive-sessions

Installing Jupyter and AWS Glue interactive sessions kernels

  1. Use pip to install Jupyter. Jupyter Lab is also compatible and can be installed instead.
    Personal remark: I edited the command below to pin the exact version of the aws-glue-sessions Python library. At the time of writing (December 2022) there seems to be a bug in the latest versions of the library that affects Windows users. You can follow this issue on GitHub and this question on Stack Overflow to stay up to date. Once the issue is resolved you can of course go ahead with the latest version.
    pip3 install --upgrade jupyter boto3 aws-glue-sessions==0.32
  2. (Optional) Run the following command to list the installed packages. If jupyter and aws-glue-sessions were successfully installed, you should see a long list of packages, including jupyter 1.0.0 (or later).
    pip3 list
  3. Install the sessions kernels into Jupyter by running the following commands. These commands will look up the installation location for aws-glue-sessions from pip and install the Jupyter kernels therein.
  4. Change the directory to the aws-glue-sessions install directory within Python's site-packages directory. In Windows PowerShell:
    cd ((pip3 show aws-glue-sessions | Select-String Location | % {$_ -replace("Location: ","")})+"\aws_glue_interactive_sessions_kernel")
  5. Install the AWS Glue PySpark and AWS Glue Scala kernels.
    jupyter-kernelspec install glue_pyspark
    jupyter-kernelspec install glue_spark
    Personal remark: Running the second line isn’t necessary for us because we only care about PySpark, not Spark (Scala).
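
To verify that the kernels were registered, you can list all installed kernel specs. This is standard Jupyter tooling rather than something from the AWS docs; glue_pyspark should show up in the output, along with glue_spark if you installed it:

jupyter-kernelspec list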

Configuring session credentials and region

AWS Glue interactive sessions require the same IAM permissions as AWS Glue jobs and dev endpoints. Specify the role used with interactive sessions in one of two ways:

  1. With the %iam_role and %region magics
  2. With an additional line in ~/.aws/config
    Personal remark: When you’re on Windows this will most likely be the following path: C:\Users\<your username>\.aws\config

Configuring a session role with magic

In the first cell you execute, type %iam_role <YourGlueServiceRole>.
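
For example, the first cell might look like this (the account ID and role name below are placeholders; use the ARN of your own Glue service role):

%iam_role arn:aws:iam::123456789012:role/<name of your role>
%region us-east-1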

Configuring a session role with ~/.aws/config

The AWS Glue service role for interactive sessions can either be specified in the notebook itself or stored alongside the AWS CLI config. If you have a role you typically use with AWS Glue jobs, this will be that role. If you do not have one, please follow the guide Setting up IAM permissions for AWS Glue to set one up.

To set this role as the default role for interactive sessions:

  1. With a text editor, open ~/.aws/credentials.
  2. Look for the profile you use for AWS Glue. If you don't use a named profile, use the [default] profile.
  3. Add a line in the profile for the role you intend to use like glue_role_arn=<AWSGlueServiceRole>.
  4. [Optional]: If your profile does not have a default region set, I recommend adding one with region=us-east-1, replacing us-east-1 with your desired region.
  5. Save the config.

Personal remark:

I went the config-file route. Let's take a look at an example of what your credentials file might look like. There is a default profile, and there is also a profile used for our Glue project. I called mine "glue-dev" but you can choose any name you like. Just keep the name in mind for later.

[default]
aws_access_key_id = <your id>
aws_secret_access_key = <your key>
[glue-dev]
aws_access_key_id = <your id>
aws_secret_access_key = <your key>
glue_role_arn = arn:aws:iam::<your account id>:role/<name of your role>
region = <your chosen region>

For more information, see Interactive sessions with IAM.

Running Jupyter

To run Jupyter Notebook, complete the following steps.

  1. Run the following command to launch Jupyter Notebook.
    jupyter notebook

Personal remark:
In case you're working with a Git repository, you will likely want to run this command from there. Aside from that, the path from which you run the command doesn't really matter.

You will find a browser window looking something like this:

  1. Choose New, and then choose one of the AWS Glue kernels to begin coding against AWS Glue.
    Personal remark:
    We’re choosing Glue PySpark of course.

What’s next?

At this point the AWS documentation leaves you on your own, but we will continue.

Use this GitHub repository to code along: https://github.com/dmschauer/aws-glue-interactive-sessions

You will get a new browser tab with an empty Jupyter notebook. I named mine "interactive-sessions-test" but again you're free to choose another name.

Notice that the Glue PySpark Kernel isn’t running yet. It will only start to run when you’re connected to an Interactive Session.

Connecting to an Interactive Session

AWS provides this list of magics to configure your session. You can either go with the defaults or run the magics you need before you start with your PySpark code.

I strongly recommend changing the defaults in order to save money.
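
A side note: if you ever forget the available magics, running %help in a notebook cell that uses the Glue kernel prints the full list with short descriptions:

%help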

Let’s take a look at the ones I consider to be most important.

%profile glue-dev     
# The profile we set up in .aws\credentials

%region us-east-1
# We already configured this in .aws\credentials but just to make sure

%worker_type G.1X
# G.1X costs half as much as G.2X (surprising, right?)

%number_of_workers 2
# 2 is the minimum number of workers allowed
# The default is 5 workers. We're already saving 60% of cost here.

%idle_timeout 5
# The default is 2880 minutes which is 48 hours.
# I doubt that most people want to run their on demand sessions for that long.
# So here we set 5 minutes

# I found that when you're using the 0.32 version of the aws-glue-sessions library
# you could also go with the even cheaper G.025X worker type
# but then you will need a minimum of 2 workers. This would still be cheaper though.

The output will look something like this:

Let's check whether we're already incurring costs by running an Interactive Session:

%list_sessions
%session_id

Nope. Nothing is running so far.
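
As a rough sanity check on the savings, here is a back-of-the-envelope estimate. Note that the numbers are my own assumptions: Glue's list price was $0.44 per DPU-hour at the time of writing (check the current pricing page for your region), and a G.1X worker corresponds to 1 DPU.

# Assumptions: $0.44 per DPU-hour, G.1X = 1 DPU per worker
price_per_dpu_hour = 0.44
dpu_per_worker = 1     # G.1X
workers = 2            # our %number_of_workers
minutes = 5            # our %idle_timeout

cost = price_per_dpu_hour * dpu_per_worker * workers * (minutes / 60)
print(f"Worst case per forgotten session: ~${cost:.3f}")  # ~$0.073

With the defaults (5 workers, 48-hour idle timeout) a forgotten session could instead cost around 5 × 48 × $0.44 ≈ $106, so the few magics above are well worth running.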

Add some sample code. I'm using a small JSON file located in an S3 bucket that is accessible to the role used by my glue-dev profile. You can find the file in the GitHub repository along with the rest of the code. You would first have to upload it to an S3 bucket of your own that your role can access, though.

Let’s run some basic code. What exactly it does isn’t important in the context of this post, but it simply loads the JSON file into a DynamicFrame, counts the number of rows and prints the schema.

Running this notebook cell will also create an Interactive Session before executing your code. From here on you’re paying:

# Standard AWS Glue boilerplate imports; not all of them are used in this snippet
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

# Load the JSON file from S3 into a DynamicFrame
test_dynamicframe = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://<your bucket name>/<perhaps with a subdirectory>']},  # your S3 bucket
    'json')

# Count the rows and print the inferred schema
print("Count:", test_dynamicframe.count())
test_dynamicframe.printSchema()

You can also check the AWS Console to see that a session has been created:

I immediately stop the session from the notebook since I don't need it anymore.
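
The magic for that is %stop_session:

%stop_session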

I also check the status again to make sure that it’s shut down.
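
The %status magic prints the details of the current session; after stopping, it should report a stopped state:

%status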

What you can see here is that the session still exists even when it's stopped. That's fine; you can check the AWS Glue Console again to confirm that it's stopped.

From there you can also delete it for good.

Closing words

What’s your experience with AWS Glue Interactive Sessions?

Next time we will look at mocking AWS services.
