Professional AWS Glue PySpark Development — Mocking AWS services
In this post we take a look at mocking AWS resources in a local development environment.
The code used for mocking is relatively short, but this post is aimed at those who are new to mocking. That’s why it also includes plenty of theory and context to make sure everyone is on board.
If you’re familiar with mocking in general, you can likely skip a lot or just take a look at this post instead; it is much shorter. If not, my advice is to read the following from top to bottom.
What is mocking?
In software development, mocking refers to the creation of a simulated version of a dependent system or component for testing purposes. Mocking is a useful technique for isolating the code under test and verifying its behavior in a controlled environment. It allows developers to test their code without relying on the actual implementation of the dependent system or component, which can be unreliable, unavailable, or costly.
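To make this more concrete, here is a tiny, self-contained example using Python’s built-in unittest.mock. The function and names are made up for illustration; the point is that the real dependency is replaced by a mock whose behavior we fully control:

from unittest.mock import MagicMock

# A hypothetical function that depends on an external service client
def count_objects(s3_client, bucket):
    response = s3_client.list_objects_v2(Bucket=bucket)
    return response.get("KeyCount", 0)

# In a test, the real client is replaced by a mock with a controlled response
mock_client = MagicMock()
mock_client.list_objects_v2.return_value = {"KeyCount": 3}

assert count_objects(mock_client, "some-bucket") == 3
mock_client.list_objects_v2.assert_called_once_with(Bucket="some-bucket")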
Why mocking?
Mocking is useful in the development of AWS Glue jobs because the underlying data sources and destinations are often hosted on AWS services such as S3, RDS, and Redshift. Testing the behavior of a Glue job with real data can be time-consuming and costly, as it requires setting up and maintaining the required infrastructure. Mocking the dependent services allows developers to focus on testing the logic and functionality of their Glue jobs without worrying about the underlying data sources and destinations.
The moto library
We are going to use moto for mocking AWS services. Its name is a play on “mock” and “boto3”. It is a Python library that allows for the mocking of various AWS services in a local environment. It is commonly used for testing, as it lets developers create simulated versions of these services that stand in for the actual AWS services.
This is particularly useful because:
- It saves costs
- It creates an isolated environment
- You can control the behavior of the mocked services, which can be useful for testing code under various conditions.
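To get a feel for what moto does, here is a minimal sketch using its in-process decorator mode (this assumes a moto 4.x-style mock_s3 decorator; newer releases use a single mock_aws decorator instead). Later in this post we will use moto’s standalone server mode rather than the decorator:

import boto3
from moto import mock_s3

@mock_s3
def test_bucket_starts_empty():
    # All boto3 calls inside this function are intercepted by moto
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="data-s3")
    response = s3.list_objects_v2(Bucket="data-s3")
    assert response["KeyCount"] == 0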
Options for mocking in local development
There are different approaches to mocking AWS services in your local development environment. You can read more about the pros and cons of these approaches in my other post.
Basically you could decide between:
- Running the mock environment on the local machine or a separate machine
- Running the mock environment permanently or temporarily
For this post we go with a temporary environment running on each developer’s local machine.
You can picture the setup like this.
In case you’re wondering where this Docker container is coming from or why we’re running the Development Container extension, read my previous post.
When you start the tests, a line in the pytest test files triggers a subprocess that spins up and runs the moto server. The piece of code to watch out for looks something like the following; there will also be a full example in the next section.
import os
import subprocess

# Start moto_server for S3 on port 5000 in its own process group
process = subprocess.Popen(
    "moto_server s3 -p5000",
    stdout=subprocess.PIPE,
    shell=True,
    preexec_fn=os.setsid
)
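The preexec_fn=os.setsid argument puts the server into its own process group, so it can be shut down cleanly once the tests are done. The corresponding teardown looks something like this:

import os
import signal

# Terminate the whole process group that moto_server runs in
os.killpg(os.getpgid(process.pid), signal.SIGTERM)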
For this to work, the moto library needs to be installed in the Docker container beforehand. We take care of that in the next section as well.
Example
In this section we first set up an example project, take a look at the code, and finally run a unit test against a mocked S3 bucket.
Setup
Once again there’s a repository for you to follow along:
git clone https://github.com/dmschauer/aws-glue-local-mocking
cd aws-glue-local-mocking
Create a virtual environment called .venv in the root of the project:
python3 -m venv .venv
Activate the virtual environment:
Windows: ./.venv/Scripts/activate.ps1
Unix: source ./.venv/bin/activate
Because moto isn’t a dependency for the actual Glue ETL job it goes only into the requirements-test.txt in the /tests folder. Run the following to install development dependencies in your virtual environment:
pip install -r ./tests/requirements-test.txt
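The file in the repository is authoritative, but a requirements-test.txt for this setup typically contains something like the following. Note the server extra, which pulls in the dependencies moto_server needs:

# tests/requirements-test.txt (illustrative; check the repository for exact contents and versions)
pytest
boto3
moto[server]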
Now start VS Code and attach an AWS Glue development container as described in my previous post.
Your environment will look like this:
Code explanation
This section explains the code — first the general idea, then the code itself and then the corresponding tests. This is necessary to understand the context in which resources are mocked.
The code processes JSON files that contain orders in a nested format. The orders are read into a DataFrame, flattened and stored in Parquet format. The date of the orders isn’t part of the file content but of the file name; it is added as new columns before the data is written to disk.
The source code for the Glue ETL job is in scr/glue_process_orders_json.py.
The Glue job takes three arguments:
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(
    sys.argv, ["JOB_NAME", "table_name", "source_path"]
)
- JOB_NAME: name of the Glue job to be displayed in the Spark UI
- source_path: where to find the JSON data to process
- table_name: where to write the transformed data in parquet format
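getResolvedOptions() simply returns a dictionary, so inside the job the arguments are read like this:

# getResolvedOptions() returns a plain dict of the resolved job arguments
table_name = args["table_name"]
source_path = args["source_path"]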
Then there are two functions:
- process_data() contains the main job logic. It reads a JSON file from a subdirectory of a mocked S3 bucket, calls the transform() function and writes the result back to another subdirectory of the same mocked S3 bucket.
- transform() contains the transformation logic itself. The transformations are described at the beginning of this section.
- It is a common best practice to separate your logic like this to enable unit testing. Transformations should be kept separate from read and write operations (a rough sketch of this follows after the list).
- You could also split the transform() into multiple transformation steps, each with their own function. I didn’t do this for brevity of this example, but then you could test each transformation step separately.
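The actual implementation is in the repository; as a rough sketch of this separation (the column names, parameter names and the exact flattening logic here are illustrative, not the repository’s):

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def transform(df: DataFrame, order_date: str) -> DataFrame:
    # Illustrative flattening: explode the nested order items and add the
    # order date (taken from the file name) as new columns
    return (
        df.withColumn("item", F.explode("items"))
          .select("order_id", "item.*")
          .withColumn("order_date", F.lit(order_date))
    )

def process_data(spark, source_path: str, target_path: str, order_date: str) -> None:
    # Read and write stay outside transform() so the logic can be unit tested
    df = spark.read.json(source_path)
    transform(df, order_date).write.mode("append").partitionBy("order_date").parquet(target_path)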
The corresponding tests are in tests/test_glue_process_orders_json.py
You will find:
- A set of variables for configuring the test environment
- SOURCE_NAME = "orders_1_2022-11-20T19-27-27.json"
- TABLE_NAME = "orders"
- S3_BUCKET_NAME = "data-s3"
- ENDPOINT_URL = "http://127.0.0.1:5000/"
- A function for creating the test environment called initialize_environment(). It starts a moto_server subprocess listening on ENDPOINT_URL, creates an S3 bucket, takes a file containing test data from ./tests/data and puts it into the bucket; on cleanup it deletes all files in the bucket and shuts down the subprocess
- A helper function called compare_schema() that, as the name suggests, compares two DataFrame schemas (a sketch follows after this list)
- Three tests, two of which use initialize_environment()
- test_transform_output_schema(): tests the function transform(), uses the helper function compare_schema() but doesn’t create a mocked environment
- test_process_data_partitioning(): tests the function process_data() regarding the partitioned data it created in the mocked S3 bucket. For a successful test the expected partitions with expected corresponding values must be present.
- test_process_data_count(): tests the function process_data() regarding the number of created records. We expect there to be 86 denormalized records.
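As promised above, here is what a minimal compare_schema() helper could look like (the repository’s version may differ in detail):

from pyspark.sql.types import StructType

def compare_schema(schema_a: StructType, schema_b: StructType) -> bool:
    # Compare field names and data types, ignoring nullability
    fields_a = [(field.name, field.dataType) for field in schema_a.fields]
    fields_b = [(field.name, field.dataType) for field in schema_b.fields]
    return fields_a == fields_b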
Mocking code explanation
The mocking itself is implemented in initialize_environment().
It starts moto_server in a subprocess on port 5000:
process = subprocess.Popen(
    "moto_server s3 -p5000",
    stdout=subprocess.PIPE,
    shell=True,
    preexec_fn=os.setsid
)
And it creates an S3 bucket using boto3 against ENDPOINT_URL, which points to the same local port 5000:
ENDPOINT_URL = "http://127.0.0.1:5000/"
This way the boto3 requests aren’t going to AWS but to the local moto environment where the bucket is created instead.
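In code, that boils down to passing endpoint_url when creating the boto3 client or resource. A sketch (it assumes dummy AWS credentials are configured, for example via environment variables in the container):

import boto3

# Requests made through this resource go to the local moto server, not to AWS
s3 = boto3.resource("s3", endpoint_url="http://127.0.0.1:5000/", region_name="us-east-1")
s3.create_bucket(Bucket="data-s3")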
The function is decorated with @contextlib.contextmanager. The following two paragraphs are a short summary of what this decorator does.
This is used to define a context manager. A context manager is an object that defines the methods __enter__ and __exit__, which are used to set up and tear down a block of code, respectively. When a context manager is used in a with statement, the __enter__ method is called at the beginning of the with block and the __exit__ method is called at the end of the block, even if an exception occurs. This can be useful for managing resources that need to be set up and cleaned up, such as file handles or database connections.
In this case, the context manager initialize_environment() is used to set up a local S3 bucket and load test data into it before running a test. The __enter__ method starts the moto server, creates an S3 bucket, and uploads a test data file to the bucket. The __exit__ method cleans up the AWS resources and shuts down the moto server. The yield statement is used to pause execution of the context manager and pass control to the code block in the with statement. When the code block finishes executing, control is returned to the context manager and execution continues after the yield statement, allowing the __exit__ method to clean up the resources.
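Putting the pieces together, a condensed sketch of such a context manager (simplified and reconstructed from the description above, so the repository’s version differs in detail) looks roughly like this:

import contextlib
import os
import signal
import subprocess

import boto3

# SOURCE_NAME, S3_BUCKET_NAME and ENDPOINT_URL as defined further up
@contextlib.contextmanager
def initialize_environment(spark):
    # "__enter__" part: start moto_server and prepare the mocked bucket
    process = subprocess.Popen(
        "moto_server s3 -p5000",
        stdout=subprocess.PIPE,
        shell=True,
        preexec_fn=os.setsid
    )
    # (the real implementation may wait here until the server is reachable)
    s3 = boto3.resource("s3", endpoint_url=ENDPOINT_URL, region_name="us-east-1")
    s3.create_bucket(Bucket=S3_BUCKET_NAME)
    s3.Bucket(S3_BUCKET_NAME).upload_file(f"./tests/data/{SOURCE_NAME}", SOURCE_NAME)
    try:
        yield process, s3
    finally:
        # "__exit__" part: empty the bucket and shut down the server
        s3.Bucket(S3_BUCKET_NAME).objects.all().delete()
        os.killpg(os.getpgid(process.pid), signal.SIGTERM)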
test_process_data_partitioning() and test_process_data_count() use this by calling the following:
with initialize_environment(spark) as (process, s3):
    # logic omitted for brevity
In this context they have access to the process and the S3 bucket for as long as the tests are running.
It is important to note that this also means the test environment is built up and torn down multiple times! The state of the mocked environment is therefore not shared between tests: each test starts with a fresh environment and deletes it once it is finished.
Another thing to note is that you might need different environments for different test scenarios: for example an empty bucket, one containing one input file and one containing two input files. The approach explained above allows you to set up each of those environments in its own @contextlib.contextmanager and call them as needed.
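For example, next to initialize_environment() you could add a variant that only creates an empty bucket (the name initialize_empty_environment() is made up for illustration; imports and variables as in the sketch above):

@contextlib.contextmanager
def initialize_empty_environment(spark):
    # Same server start-up as above, but no test file is uploaded
    process = subprocess.Popen(
        "moto_server s3 -p5000",
        stdout=subprocess.PIPE,
        shell=True,
        preexec_fn=os.setsid
    )
    s3 = boto3.resource("s3", endpoint_url=ENDPOINT_URL, region_name="us-east-1")
    s3.create_bucket(Bucket=S3_BUCKET_NAME)
    try:
        yield process, s3
    finally:
        os.killpg(os.getpgid(process.pid), signal.SIGTERM)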
Running the code
We will run the code as part of Pytest, same as last time.
python3 -m pytest ./tests/
When you do so, VS Code will give you multiple pop-ups. Two of them will tell you that there is a new process running on port 5000. This is because moto_server is started (and shut down) twice, as explained above.
Closing words
Congratulations! Now you not only know how to set up a local Glue PySpark development environment and how to run unit tests, you also know how to mock AWS resources locally.
This is a valuable skill to have for AWS developers in general and definitely not limited to Glue job development.
Next time we will take a look at deploying a locally developed Glue job in the AWS cloud using a CI/CD pipeline.