Integration Testing Azure Synapse Analytics Pipelines and beyond!

Kshitij Sharma
Published in Microsoft Azure
Nov 26, 2021 · 4 min read

Integration Testing is a type of testing in which software modules are integrated logically and tested as a group.

Integration Testing is often confused with End to End Testing, but the two are slightly different. You can read more about the distinction here.

In this article, we talk about testing Azure Synapse Analytics data pipelines. This kind of testing exposes defects in the interaction between pipelines and the different components (such as SQL and Spark) involved in the data lifecycle. It complements unit testing, where we mock all dependencies and test the pipeline notebook code in isolation. Integration tests also help us detect breaking changes in the pipeline code, and they can be made part of the CI/CD process and triggered through any CI tool.

Source Code

Current Setup

Zones in our data lake
  • We are using Azure Data Lake Storage as our Lake provider.
  • We have an Azure Synapse Analytics Pipeline that executes a Notebook, and for illustration, we have two zones, Raw and Processed. The pipelines move data from the source (Landing Zone) to the Raw zone and from the Raw zone to the Processed zone based on the parameters supplied during execution.
  • The data in the Raw zone is stored as Spark tables, whereas the data in the Processed zone is stored as SQL tables.
  • We are storing our secrets to connect with ADLS and Dedicated SQL Pool in Azure Key Vault.

Code Structure

Now let's walk through the code structure for the integration tests:

  • data — Contains a sample employee file in parquet format to be uploaded to ADLS.
  • data connector — A connector that helps us perform different operations on ADLS.
  • tests — The integration tests to run.
  • utils — Utilities for the various components of the integration tests.

Integration Testing Workflow

All our tests reside in the tests folder; for this article, our test method is test_source_to_processed_workflow.

Inside our test workflow, we have two methods: one for testing the components involved in moving data from source to raw, and one for the movement from raw to processed.

Test Workflow for our Integration Tests

We are using Python, with pytest as our testing framework.

We will execute the following workflow to test the different components involved in pipeline execution:

  1. Upload a sample parquet file containing some employee data to ADLS.

The sample parquet file contains a set of employee records.
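For illustration, here is a minimal sketch of the upload step using the azure-storage-file-datalake SDK. The account URL, container, and paths below are placeholders; in the actual setup the upload is handled by the data connector module and the secrets come from Azure Key Vault.

from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder values; the real ones are read from Azure Key Vault / configuration.
ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"
CONTAINER = "landing"                       # source (landing) zone container
REMOTE_PATH = "employee/employee.parquet"   # destination path in ADLS
LOCAL_FILE = "data/employee.parquet"        # sample file shipped with the tests

def upload_sample_file(credential) -> None:
    """Upload the sample employee parquet file to the landing zone in ADLS."""
    service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=credential)
    file_client = service.get_file_system_client(CONTAINER).get_file_client(REMOTE_PATH)
    with open(LOCAL_FILE, "rb") as data:
        file_client.upload_data(data, overwrite=True)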

2. Invoke a pipeline that moves data from the source (landing) zone in ADLS to the raw zone, with pipeline parameters specific to this layer migration.

Sample Params for Source to Raw Migration

https://gist.github.com/ca1fb396f300fa43acd1b1b4cf9c7895

We are using the Azure Synapse APIs to trigger pipeline runs. All the pipeline-related code is in pipelineutils.py.
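As a rough sketch, a pipeline run can be triggered with the azure-synapse-artifacts SDK along these lines. The workspace endpoint, pipeline name, and parameters are placeholders, not the values from the gist; the article's actual helpers live in pipelineutils.py.

from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

WORKSPACE_ENDPOINT = "https://<synapse-workspace>.dev.azuresynapse.net"  # placeholder

def trigger_pipeline(pipeline_name: str, parameters: dict) -> str:
    """Start a Synapse pipeline run and return its run id."""
    client = ArtifactsClient(credential=DefaultAzureCredential(),
                             endpoint=WORKSPACE_ENDPOINT)
    run = client.pipeline.create_pipeline_run(pipeline_name, parameters=parameters)
    return run.run_id

# Hypothetical source (landing) to raw migration with layer-specific parameters:
run_id = trigger_pipeline("source_to_raw_pipeline",
                          {"source_zone": "landing", "target_zone": "raw"})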

3. Check for pipeline success and perform assertions on data.

We check that the number of rows fetched is greater than 1 and that the data has been successfully moved from the source to the raw zone.
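A sketch of the polling step, again assuming the azure-synapse-artifacts SDK with a placeholder workspace endpoint; the row-count assertion itself depends on how the raw zone is queried.

import time

from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

WORKSPACE_ENDPOINT = "https://<synapse-workspace>.dev.azuresynapse.net"  # placeholder
TERMINAL_STATES = {"Succeeded", "Failed", "Cancelled"}

def wait_for_pipeline_run(run_id: str, timeout_seconds: int = 1800) -> str:
    """Poll a Synapse pipeline run until it reaches a terminal state and return its status."""
    client = ArtifactsClient(credential=DefaultAzureCredential(),
                             endpoint=WORKSPACE_ENDPOINT)
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = client.pipeline_run.get_pipeline_run(run_id).status
        if status in TERMINAL_STATES:
            return status
        time.sleep(30)
    raise TimeoutError(f"Pipeline run {run_id} did not finish within {timeout_seconds}s")

# In the test: assert the run succeeded before asserting on the row count in the raw zone.
# assert wait_for_pipeline_run(run_id) == "Succeeded"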

4. Invoke the pipeline that moves data from raw to processed, with pipeline parameters specific to this layer's data movement.

https://gist.github.com/a6047c645a794f370e4d455ada1b7972
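Reusing the hypothetical trigger_pipeline and wait_for_pipeline_run helpers from the sketches above, this step would look roughly like the following; the pipeline name and parameters are placeholders, not the values from the gist.

run_id = trigger_pipeline(
    "raw_to_processed_pipeline",                          # placeholder pipeline name
    {"source_zone": "raw", "target_zone": "processed"},   # placeholder parameters
)
assert wait_for_pipeline_run(run_id) == "Succeeded"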

5. Check for pipeline success and perform assertions on the migrated data, stored as a SQL table in the Azure Dedicated SQL Pool.

We are using the pyodbc library to connect to the dedicated SQL pool in Azure Synapse and then query the table.

Here we assert that the number of rows in our SQL table is the same as the number of rows in the uploaded sample parquet file.
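A minimal sketch of that check with pyodbc; the connection details and table name are placeholders, and in the actual setup they are fetched from Azure Key Vault.

import pyodbc

# Placeholder connection string; the real secrets come from Azure Key Vault.
CONNECTION_STRING = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<dedicated-sql-pool>.sql.azuresynapse.net;"
    "DATABASE=<database>;UID=<user>;PWD=<password>"
)

def processed_row_count(table_name: str) -> int:
    """Return the number of rows in the given processed-zone SQL table."""
    with pyodbc.connect(CONNECTION_STRING) as connection:
        cursor = connection.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        return cursor.fetchone()[0]

# Hypothetical assertion: the SQL table should match the sample parquet file's row count.
# assert processed_row_count("dbo.employee") == expected_row_count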

6. Invoke the cleanup pipeline, which removes the specified files and tables in Azure Data Lake Storage to ensure that the tests run from a clean state every time.

The cleanup runs after every test execution using pytest fixtures; it removes the files from ADLS as well as the Spark and SQL tables.
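One way to express such a teardown is an autouse pytest fixture. This is only a sketch, with run_cleanup_pipeline standing in for the article's actual cleanup logic.

import pytest

def run_cleanup_pipeline() -> None:
    """Placeholder for the cleanup step: it removes the uploaded files from ADLS
    and drops the Spark and SQL tables created during the test."""
    ...

@pytest.fixture(autouse=True)
def clean_state():
    """Ensure every test leaves a clean state behind."""
    yield  # the test body runs here
    run_cleanup_pipeline()  # teardown runs after each test, pass or fail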

Running the Tests

The integration tests can be run from inside the integration tests folder:

python -m pytest

Successful Test Run
Failed Test Run

Conclusion

We upload a parquet file and move the data across the different zones. In each zone, we assert that our notebook and pipeline logic behaves as expected. Along the way we interact with components such as Azure Synapse, the Azure Dedicated SQL Pool, and Jupyter Notebooks, and test how they all work together. These tests can be run on every CI build to ensure there are no breaking changes in how our code interacts with these components.
