Data quality checks with Apache Airflow, Soda-Core and Pandas dataframes

How to measure data quality in Airflow pipelines using open-source tool Soda-Core and apply it to Pandas dataframes. Bonus: How to do it on Azure Managed Airflow.

DataFairy
5 min read · Oct 9, 2023

Introduction

  • Why do we measure data quality?
  • When do we measure data quality?
  • What is Soda?

Why do we measure data quality?

Further reading: Concepts and Practices to Ensure Data Quality | Metaplane

Data quality has been named the hot new topic for data-driven organizations in 2023. It describes the process of checking the accuracy and validity of your data as it enters your organization and throughout the entire ELTL process. Measuring data quality is what makes the difference between good and bad data.

In general, you want to measure these five qualities:

Accuracy. The extent to which your data depicts real-world entities, occurrences, or trusted references.

Completeness. The extent to which data that can feasibly be captured is not null.

Consistency. The degree of alignment or uniformity between your dataset and a reference dataset.

Uniqueness. The absence of any duplicate objects or events within your data source.

Validity. How closely a data value aligns with expected or predetermined values.

According to: 5 criteria of data quality and how to test for them (tinybird.co)
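To make these five dimensions concrete, here is a rough sketch of how each one could be expressed as a SodaCL check. The dataset and column names (orders, country, order_id, customer_id, email) are made up for illustration, not taken from this article:

# Hypothetical dataset and columns, one check per quality dimension
checks for orders:
  # Accuracy: values match a trusted reference list
  - invalid_count(country) = 0:
      valid values: [NL, DE, FR]
  # Completeness: no nulls where data can feasibly be captured
  - missing_count(order_id) = 0
  # Consistency: values align with a reference dataset
  - values in (customer_id) must exist in dim_customer (customer_id)
  # Uniqueness: no duplicate objects or events
  - duplicate_count(order_id) = 0
  # Validity: values match the expected format
  - invalid_percent(email) < 1 %:
      valid format: email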

When do we measure data quality?

ETL with data quality checks

You want to run quality checks at multiple points in your ELTL pipeline:

  • When your raw data comes in, to check that it contains the information you expect, that values are not missing, and that the data is valid.
  • After you transform it, to check that your own data transformation gives you the result you expect.
  • After you have loaded it into the final destination of your pipeline, before you present it to the end user.
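As a minimal, hypothetical skeleton (task names invented for illustration), these checkpoints could map onto an Airflow DAG like this, with a Soda scan task gating each stage; the actual scan logic is shown later in this article:

from airflow.decorators import dag, task
import pendulum


@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1, tz="UTC"), catchup=False)
def eltl_with_quality_gates():
    """Hypothetical skeleton: a Soda scan after each stage of the pipeline."""

    @task
    def extract():
        ...  # ingest raw data

    @task
    def check_raw():
        ...  # Soda scan: expected columns, no missing values, valid data

    @task
    def transform():
        ...  # apply your transformations

    @task
    def check_transformed():
        ...  # Soda scan: the transformation produced what you expect

    @task
    def load():
        ...  # load to the final destination

    @task
    def check_final():
        ...  # Soda scan: last gate before the data reaches end users

    extract() >> check_raw() >> transform() >> check_transformed() >> load() >> check_final()


eltl_with_quality_gates()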

What is Soda / Soda Core?

Soda Core is a free, open-source Python library and CLI tool that enables data engineers to test data quality. It is accessible along with its documentation: you can download the CLI tool or import the Python library to prepare data quality checks.

  • Available for free
  • Compatible with basic SodaCL checks and configurations
  • Supports connections to 18+ data sources, same as Soda Library
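If you want to follow along with the Pandas examples in this article, the package used here is installed with pip (it should pull in soda-core itself as a dependency):

pip install soda-core-pandas-dask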

Getting it running locally (with astronomer.io)

  • The Python dependency hell
  • Writing quality checks for dataframes
  • Best practices

The Python dependency hell

At the time of writing, soda-core-pandas-dask, the package you need to run quality checks on Pandas dataframes, is still experimental. In practice that means it does not work with the latest (9.1) Astronomer Airflow runtime.

It does not even work with the 8.x runtimes: it complains about a missing Rust compiler, missing setuptools-rust libraries, and apparently pip needs an upgrade.

I tried to fix all of these by adding them as dependencies, changing the Dockerfile and even creating custom virtual environments for my local Airflow. Nothing worked.

The final attempt, and the eventual solution, was to downgrade to the older astro-runtime 6.0.2. Suddenly everything worked.

If you want to experiment with soda-core-pandas-dask on Astronomer locally, I recommend you downgrade or try the requirements.txt further down. Hopefully the issue gets fixed soon.
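For reference, here is roughly what that setup looks like in an Astronomer project, as a sketch rather than a definitive configuration. The base image tag is the runtime version that worked for me; pin the soda packages to whatever versions resolve in your environment:

# Dockerfile
FROM quay.io/astronomer/astro-runtime:6.0.2

# requirements.txt
soda-core
soda-core-pandas-dask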

Writing quality checks for dataframes

To get started with Soda you can use the very basic template below:

from soda.scan import Scan
import pandas as pd
import pendulum

from airflow.decorators import dag, task


@dag(
    schedule=None,
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
)
def tutorial_soda_pandas():
    """Template on how to run data quality checks on Pandas dataframes using soda-core."""

    @task()
    def soda_scan():
        # Create a Soda scan object
        scan = Scan()
        scan.set_scan_definition_name("test")
        scan.set_data_source_name("dask")

        # Create an artificial pandas dataframe
        df_employee = pd.DataFrame({"email": ["a@soda.io", "b@soda.io", "c@soda.io"]})

        # Add Pandas dataframe to scan and assign a dataset name to refer from checks yaml
        scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee)

        # Define checks in yaml format
        # Alternatively, you can refer to a yaml file using scan.add_sodacl_yaml_file(<filepath>)
        checks = """
        checks for employee:
          - row_count > 0
        """

        scan.add_sodacl_yaml_str(checks)

        scan.set_verbose(True)
        scan.execute()

    soda_scan()


tutorial_soda_pandas()
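Note that the template above only runs the scan and logs the results; the Airflow task will succeed even if a check fails. If you want failing checks to fail the task, Soda Core's programmatic scan API offers a few helpers you can append to the end of soda_scan(); a small sketch, based on the documented Scan methods:

# At the end of the soda_scan() task:
scan.execute()

# Write the full scan output to the Airflow task log
print(scan.get_logs_text())

# Raise an error (and thereby fail the Airflow task) if any check failed
scan.assert_no_checks_fail()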

The quality checks in Soda are defined using a domain-specific language (SodaCL) and are very easy to understand:

# Checks for basic validations
checks for dim_customer:
  - row_count between 10 and 1000
  - missing_count(birth_date) = 0
  - invalid_percent(phone) < 1 %:
      valid format: phone number
  - invalid_count(number_cars_owned) = 0:
      valid min: 1
      valid max: 6
  - duplicate_count(phone) = 0
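Applied to the employee dataset from the template above, a slightly richer check block could look like this (the email checks are illustrative additions, not part of the original template):

checks for employee:
  - row_count > 0
  - missing_count(email) = 0
  - duplicate_count(email) = 0
  - invalid_percent(email) < 1 %:
      valid format: email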

I recommend getting started with the open-source Soda Core library, which has the following properties:

✔ An open-source, CLI tool and Python library for data reliability
✔ Compatible with Soda Checks Language (SodaCL)
✔ Enables data quality testing both in and out of your data pipeline, for data observability and reliability
✔ Enables programmatic scans on a time-based schedule

Best practices

  • Get to know Soda’s features and the available database connections
  • Run tests using the other soda-core libraries if possible
  • First create your infrastructure/pipeline, then integrate the checks
  • Work with failures and warnings to manage the different outputs from the quality checks (see the sketch after this list)
  • Define the most important features of your dataset
  • The experimental version of soda-core-pandas-dask is due to change, so just chill
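On the failures-and-warnings point above: SodaCL lets you attach separate warn and fail thresholds to a single check, so a degrading metric can raise a warning before it breaks the pipeline. A small illustration, reusing the dim_customer example from earlier:

checks for dim_customer:
  - missing_count(birth_date):
      warn: when > 0
      fail: when > 10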

Bonus: Getting it working on Azure Managed Airflow

First of all, it is absolutely possible to get an ADF Airflow instance running with soda-core for Pandas. It just doesn’t work out of the box. Bummer!

How to get it working:

As I mentioned before, there are currently Python dependency issues (Oct. 2023), so what we need to get it running is a proper Python environment. Currently, Managed Airflow in Azure runs on Python 3.8.17, which is perfectly compatible with the soda-core library we use; the problem is the other packages and their versions. To get soda-core-pandas-dask running I used the following requirements.txt, which I added to the linked dags folder in blob storage:

https://github.com/datafairy-azure/airflow/blob/main/soda-requirements.txt

As you can see, soda-core is already included. If you want to make sure the base setup works first, remove the two soda-core packages from the requirements file and add them back once the instance is running with this configuration.

Adding dependencies in ADF Airflow manually

The Airflow instance might take some time to process the updates. Once it is running and shows no DAG import errors, you can run the code above to test Soda.

Summary

In this article we looked at data quality and how to integrate it into an ELTL pipeline using Apache Airflow and Soda for Pandas dataframes. As the Pandas support is still experimental, we also looked at how to get it running on Azure Managed Airflow using a custom Python requirements.txt.

If you found this article useful, please follow me.


DataFairy

Senior Data Engineer, Azure Warrior, PhD in Theoretical Physics, The Netherlands. I write about Data Engineering, Machine Learning and DevOps on Azure.