Getting started with PySpark, Pytest, and Amazon S3

A step-by-step guide for Data Quality Engineers

Likitha Lokesh
Slalom Build

--

Photo by Claudio Schwarz on Unsplash

This article has been written in collaboration with Taylor Wagner.

As a Quality Engineer Lead responsible for testing complex scenarios in data transformation projects, I know that creating and executing these tests is rarely straightforward. Setting everything up correctly is time-consuming, and getting it wrong leads to unforeseen delays.

My aim in writing this article is to help Data Quality Engineers who have some familiarity with data tools and frameworks learn how to set up Python test frameworks and integrate them with AWS to validate large datasets.

A bit of background

Now more than ever, it is essential to maintain quality and security throughout the development process of data-specific projects: the data must be reliable, accurate, and complete.

Recently, I was the QE lead for a data engineering project where I had the opportunity to build an automated testing framework to validate data transformations built with PySpark, AWS Glue, and Amazon S3.

Likitha Lokesh is a Quality Architect for Slalom Chicago and a DevOps enthusiast.