Data Reliability with Soda Core
Data is critical for all organizations, and yet many struggle to obtain quality data. The most common reasons are a lack of documentation on data reliability and a dependence on tribal knowledge to validate assumptions. Often there is no concrete solution established at the organization level, and the result is unreliable data. Unfortunately, this can have a significant financial impact on many organizations because they are not getting the right insights from their data.
There are plenty of solutions out there for mitigating data reliability and data observability issues. Soda Core is one of the lesser-known yet powerful tools available in the open-source market. In this article, I am going to briefly explain what Soda Core is and demonstrate how to leverage this open-source tool to improve data quality testing both in and out of your pipeline.
What is Soda Core?
Soda Core is a command-line interface (CLI) tool that enables you to scan the data in your data source to surface invalid, missing, or unexpected data. It supports data quality testing both in and out of your pipeline, as well as data observability and data monitoring. It works on both SQL tables and Spark DataFrames, and consists of several components such as metric collection, data classification, check evaluation, and alerting. Soda Core uses the Soda Checks Language (SodaCL), a human-readable, domain-specific language for data reliability.
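To give a sense of how readable SodaCL is, here is a minimal, illustrative check file (the CUSTOMERS dataset and email column are hypothetical; row_count and missing_count are built-in SodaCL metrics):
# illustrative SodaCL checks against a hypothetical CUSTOMERS dataset
checks for CUSTOMERS:
  - row_count > 0
  - missing_count(email) = 0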
Soda Core supports the following data sources; you can find configuration details for each in the official documentation.
- Amazon Athena
- Amazon Redshift
- Apache Spark DataFrames
- GCP BigQuery
- IBM DB2
- MS SQL Server
- MySQL
- PostgreSQL
- Snowflake
Get Started with Soda Core
To get started, you must have Python 3.8 or greater installed on your system. It is also recommended to use a Python virtual environment for installing Soda Core. So let’s first create a virtual environment and activate it using the following commands:
python -m venv .venv
source .venv/bin/activate
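If you are on Windows, activate the environment with the equivalent script instead (assuming the same .venv directory name):
.venv\Scripts\activate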
Now that your virtual environment is ready, let’s upgrade its Python package manager, pip, to the latest version:
pip install --upgrade pip
Lastly, we need to install the Soda Core package that matches the data source you want to work with. For this demo, I am going to use soda-core-snowflake.
pip install soda-core-snowflake
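Soda publishes one package per data source, following the pattern soda-core-<source>. For example, to work with PostgreSQL instead (verify the exact package name for your source on the official installation page):
pip install soda-core-postgres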
Now we need to create the following two YAML files:
- configuration.yml
This file contains the connection details for your data source; Soda Core uses it to access your data. It is recommended to leverage system variables to securely retrieve values, especially sensitive ones like passwords.
data_source your_datasource_name:
  type: snowflake
  connection:
    username: "YOUR_USER_ID"
    password: ${SNOWFLAKE_PASSWORD}
    account: your_account.us-east-1
    database: YOUR_DATABASE
    warehouse: YOUR_WAREHOUSE
    connection_timeout: 300
    role: YOUR_ROLE
    client_session_keep_alive: true
    session_parameters:
      QUERY_TAG: soda-queries
      QUOTED_IDENTIFIERS_IGNORE_CASE: false
  schema: YOUR_SCHEMA_NAME
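Because the password above references ${SNOWFLAKE_PASSWORD}, export that variable in your shell before scanning so Soda Core can resolve it at runtime (the value shown is a placeholder):
export SNOWFLAKE_PASSWORD='your-actual-password'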
- checks.yml
This file contains the tests that Soda Core performs when it scans a dataset in your data source. These checks are written using the Soda Checks Language (SodaCL). There are plenty of built-in metrics available for us to use, and you can even define your own using SQL queries (a sketch of this follows the checks file below). In this demo, we are performing two checks.
- Iterate through each table whose name ends with _DIMENSION in the YOUR_SCHEMA_NAME schema and check whether it is empty, excluding any table whose name begins with TEMP_.
- Ensure that the date_id column is not NULL in the DATE_DIMENSION table.
for each dataset T:
  datasets:
    - include %_DIMENSION
    - exclude TEMP_%
  checks:
    - row_count > 0

checks for DATE_DIMENSION:
  - missing_count(date_id) = 0
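As a sketch of the custom-metric capability mentioned above, the following hypothetical check defines its own duplicate_dates metric backed by a SQL query (the metric name and query are illustrative, following SodaCL's user-defined check syntax):
# hypothetical user-defined metric: counts duplicate date_id values
checks for DATE_DIMENSION:
  - duplicate_dates = 0:
      duplicate_dates query: |
        SELECT COUNT(date_id) - COUNT(DISTINCT date_id) AS duplicate_dates
        FROM DATE_DIMENSION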
Finally, you are ready to run a scan against the data source by issuing the following command. You can find the scan command options and their descriptions in the official documentation.
soda scan -d your_datasource_name -c configuration.yml checks.yml
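Since the scan command returns a non-zero exit code when checks do not all pass (see the docs for the exact code meanings), you can gate a pipeline step on its result; a minimal shell sketch:
# run the scan and stop the pipeline if any check warns, fails, or errors
if soda scan -d your_datasource_name -c configuration.yml checks.yml; then
  echo "Data quality checks passed."
else
  echo "Data quality checks did not pass; stopping the pipeline." >&2
  exit 1
fi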
Once the above command is executed successfully, you should see the output as shown below.
Soda Core 3.0.4
Instantiating for each for ['CUSTOMER_DIMENSION', 'DATE_DIMENSION', 'PRODUCT_DIMENSION']
Scan summary:
4/4 checks PASSED:
    DATE_DIMENSION in YOUR_SCHEMA_NAME
      missing_count(date_id) = 0 [PASSED]
    CUSTOMER_DIMENSION in YOUR_SCHEMA_NAME
      row_count > 0 [PASSED]
    DATE_DIMENSION in YOUR_SCHEMA_NAME
      row_count > 0 [PASSED]
    PRODUCT_DIMENSION in YOUR_SCHEMA_NAME
      row_count > 0 [PASSED]
All is good. No failures. No warnings. No errors.
Where to Go from Here?
Now that you have working data quality checks with Soda Core, you can integrate it with a data orchestration tool like Airflow to automate and schedule your search for bad data (see the Python sketch below). You can even extend the functionality by connecting it to Soda Cloud, which gives you a metric store, collaboration features, and an issue resolution workflow. SodaCL continues to grow, offering many built-in, human-readable checks along with the ability to define your own metrics and customize checks using SQL queries.
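For orchestration, Soda Core also exposes a Python API that you can invoke from, for example, an Airflow PythonOperator. Here is a minimal sketch, assuming the two YAML files above sit next to the script (method names per the soda-core 3.x docs; verify them against your installed version):
from soda.scan import Scan

# build a scan programmatically, mirroring the CLI invocation above
scan = Scan()
scan.set_data_source_name("your_datasource_name")
scan.add_configuration_yaml_file("configuration.yml")
scan.add_sodacl_yaml_file("checks.yml")

# execute() returns the scan exit code; non-zero indicates the scan did not fully pass
exit_code = scan.execute()
print(scan.get_logs_text())

# raise inside an orchestrator task so the pipeline run is marked failed
scan.assert_no_checks_fail()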
Conclusion
I feel this project has a lot of potential, especially since it allows you to run data quality checks against multiple data sources with minimal effort. It offers simple integration with orchestration tools, and it lets you check for a variety of known and unknown data issues with ease. It is lightweight in terms of dependencies, and the learning curve is gentle. I am excited about the future of Soda Core.