Ensuring Data Integrity with DBT Testing

Mor Dvash
Israeli Tech Radar
Feb 28, 2024

Maintaining the integrity and quality of data transformations is essential in data engineering. This article aims to simplify the testing landscape in DBT.

DBT offers a framework for testing data models, sources, etc., ensuring they meet the standards of quality and reliability. Two primary types of tests play a key role in this process: Generic Tests and Custom Tests.

Some of my colleagues describe Generic Tests as “Column tests” because they focus on validating the values in individual columns. Custom Tests, in contrast, are described as “Row tests” and are more business-oriented. Let’s delve into the details of each type.

Generic Tests

Generic Tests in DBT serve as a powerful mechanism to ensure data integrity and quality within your pipeline. By declaring these tests directly in your DBT project’s YML files, you can enforce strict validation rules across components such as models, sources, snapshots, and seeds. These tests monitor your data across key quality dimensions like accuracy, completeness, consistency, and reliability, ensuring that your data assets are trustworthy and dependable for decision-making processes.

The ease of configuration and application of these tests makes them essential to modern data engineering practices. By leveraging DBT’s Generic Tests, teams can automate the process of data validation, significantly reducing the risk of errors. This not only streamlines the data transformation process but also supports the overall robustness of your data ecosystem, enabling a more efficient and error-free environment.

Common Generic Tests include the following; a sample configuration is shown after the list.

  • not_null: Verifies the absence of NULL values in a column.
  • unique: Ensures all values in a column or a combination of columns are distinct.
  • accepted_values: Ensures that every value in a column belongs to a defined list of accepted values.
  • relationships: Validates foreign key constraints by ensuring column values match those in another table’s column.
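Applied in a schema YML file, these built-in tests look like the following sketch (the orders and customers models and their columns are illustrative, not taken from the examples later in this article):

version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id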

The dbt_expectations package provides many additional tests, expanding your data quality assurance toolkit. For example:

version: 2

sources:
  - name: org_books
    database: test_db
    tables:
      - name: books
        tests:
          - dbt_expectations.expect_table_column_count_to_equal:
              value: 3
        columns:
          - name: book
            tests:
              - dbt_expectations.expect_column_to_exist
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: date
          - name: channel
            tests:
              - dbt_expectations.expect_column_to_exist
          - name: author

Macros

In addition to these predefined tests, DBT offers the capability to create custom Generic Tests using macros. Macros in DBT are similar to functions in traditional programming languages. They enable you to reuse logic across your DBT project, making your code more modular and maintainable. In the context of testing, macros can be used to define custom Generic Tests that extend beyond DBT’s built-in capabilities.

The purpose of using macros for testing is to provide a flexible and powerful way to implement complex data validation that can be parameterized and reused across different models. This approach promotes the DRY (Don’t Repeat Yourself) principle, reducing redundancy and making your tests easier to maintain and update. For example, you could create a macro that performs a more complex data quality check, such as verifying that the values in a column conform to a specific format or pattern, and then apply this test across multiple sources or models in your project.

{# Defined once, e.g. in tests/generic/value_within_range.sql, and reusable across models #}
{% test value_within_range(model, column_name, min_value, max_value) %}

with validation as (

    select
        {{ column_name }}
    from {{ model }}
    where
        {{ column_name }} < {{ min_value }} or
        {{ column_name }} > {{ max_value }}

)

select * from validation

{% endtest %}
The test is then referenced from a model’s YML file like any built-in Generic Test:

version: 2

models:
  - name: my_model
    columns:
      - name: age
        tests:
          - value_within_range:
              min_value: 0
              max_value: 120
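The same approach covers the format-and-pattern checks mentioned earlier. Below is a minimal sketch of a generic test that flags values not matching a regular expression; it assumes a warehouse that supports the Postgres-style !~ operator, so adjust the predicate for your platform:

{% test value_matches_pattern(model, column_name, pattern) %}

select {{ column_name }}
from {{ model }}
where {{ column_name }} !~ '{{ pattern }}'

{% endtest %}

It would be applied like any other Generic Test, for example on a hypothetical email column:

models:
  - name: customers
    columns:
      - name: email
        tests:
          - value_matches_pattern:
              pattern: '^[^@]+@[^@]+$'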

Custom Tests

Custom Tests in DBT, sometimes called “unit tests” by engineers (dbt’s own documentation refers to them as singular tests), play an essential role in situations that involve complicated regular expressions, conditional logic, or transactional operations. These specialized SQL queries live in the tests directory of a DBT project, and their purpose is to verify specific conditions or business rules. For example, such a test might check whether the total daily sales recorded in one table match those documented in another table, or verify that the business logic applied during a data transformation was executed correctly.

By weaving these critical checks into its regular execution routines, DBT empowers development teams to handle potential issues at an early stage in the development cycle. This approach to data testing not only enhances the reliability of the data transformation process but also significantly reduces the risk of data inaccuracies and inconsistencies that could arise further down the line.

By allowing teams to tailor tests to the unique requirements of their specific data models and business logic, DBT enhances the precision and effectiveness of data validation efforts. These tests provide a safety net, ensuring that data transformations produce accurate and trustworthy datasets. Furthermore, integrating these tests into DBT’s automated execution cycles allows teams to maintain a high level of data quality and integrity throughout the project lifecycle. This continuous validation mechanism contributes to more informed decision-making based on data that has been rigorously tested and verified.
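To make this concrete, here is a sketch of a Custom Test for the sales reconciliation scenario described above. It is a plain SQL file in the tests directory, and it fails whenever the query returns rows; the orders and daily_sales_summary models are illustrative names, not part of the earlier examples:

-- tests/assert_daily_sales_match_summary.sql
-- Fails if the per-day totals derived from orders disagree with the reported summary
with order_totals as (
    select order_date, sum(amount) as total_amount
    from {{ ref('orders') }}
    group by order_date
)

select
    s.order_date,
    s.total_amount as reported_amount,
    o.total_amount as calculated_amount
from {{ ref('daily_sales_summary') }} s
join order_totals o using (order_date)
where s.total_amount <> o.total_amount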

dbt-unit-testing

The dbt-unit-testing package enriches the DBT ecosystem by introducing a structured approach to unit testing, allowing data teams to verify individual components of their DBT projects. With features like mocking and expectation setting, this package enables the isolation of specific DBT models or transformations. This isolation is crucial for testing complex logic or scenarios where production data might be inaccessible or unsuitable for testing purposes. The package’s support for user-friendly formats like CSV for mock data and expectations further enhances readability and maintainability, making it easier to understand, update, and communicate test cases. Overall, the dbt-unit-testing package offers an additional layer of confidence in the reliability and accuracy of data transformations.

{{ config(tags=['unit-test']) }}

{% call dbt_unit_testing.test('Author', '') %}

  {% call dbt_unit_testing.mock_source('org_books', 'books', {"input_format": "csv"}) %}
    book,channel,author
    'The Fellowship of the Ring','Amazon','Tolkien'
    'The Two Towers','Amazon','Tolkien'
    'The Return of the King','Amazon','Tolkien'
    'The Last Wish','eBay','Andrzej Sapkowski'
    'Sword of Destiny','eBay','Andrzej Sapkowski'
  {% endcall %}

  {% call dbt_unit_testing.expect({"input_format": "csv"}) %}
    author,books
    'Tolkien','The Fellowship of the Ring, The Two Towers, The Return of the King'
    'Andrzej Sapkowski','The Last Wish, Sword of Destiny'
  {% endcall %}

{% endcall %}
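Because the test file is tagged unit-test, these mocked checks can be run on their own with dbt test --select tag:unit-test, keeping them separate from the tests that run against real data.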

Implementing Tests in Your Workflow

The application of testing in DBT is not one-size-fits-all; it varies based on organizational needs and project specifics. However, three critical stages often require focused testing:

Development

Testing is a crucial step in the development cycle when building new source data or data transformation processes. During this phase, both categories of tests discussed above come into play: Generic Tests verify the correctness of a newly added data source, while Custom Tests evaluate the logic behind the data transformations. It’s important to note that there isn’t a predefined checklist for implementing Generic or Custom Tests; the selection depends on your raw data and business requirements. Nonetheless, understanding the fundamental principles of each test type is an excellent starting point for establishing a solid foundation in data quality management.
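During development you will usually run just the tests that touch what you are changing, and dbt’s node selection makes that convenient. For instance, reusing the names from the earlier examples:

# run only the tests defined on a single model
dbt test --select my_model

# run only the tests defined on a source
dbt test --select source:org_books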

Continuous Integration (CI)

CI practices are applied in projects to automate the testing and integration of data models as they evolve. When changes are committed to a version control system (like Git), a CI pipeline can automatically run DBT commands to test and build the models in an isolated environment. This includes running the project’s Generic and Custom Tests to ensure that new or updated models meet predefined quality standards and do not introduce regressions or inconsistencies.

CI aims to catch issues early, before changes are merged and deployed to production, ensuring that the data transformations remain reliable and accurate as the project evolves.

test.yml on GitHub Actions

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install DBT
        # dbt 1.x is installed as dbt-core plus an adapter for your warehouse;
        # dbt-postgres is shown here as an example
        run: pip install dbt-core dbt-postgres
      - name: Install dbt packages
        run: dbt deps
      - name: Run DBT tests
        run: dbt test

Production

Here’s a little insider tip: unexpected challenges and scenarios will inevitably arise in production, ones that you couldn’t have anticipated. The real skill lies in identifying these issues before they become apparent to others.

You will want to run automated tests to ensure that everything in your data and your logic is working as expected.

Monitoring and alerting are crucial in this phase to quickly identify and address any issues with the DBT runs, such as failures in data loading or transformations. The production environment is optimized for performance and reliability, ensuring that the transformed data is available and accurate for decision-making processes.
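dbt also exposes per-test configuration that helps here: a test can be downgraded to a warning instead of an error, or its failing rows can be persisted for inspection with store_failures. A minimal sketch, reusing the earlier age test:

models:
  - name: my_model
    columns:
      - name: age
        tests:
          - value_within_range:
              min_value: 0
              max_value: 120
              config:
                severity: warn
                store_failures: true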

In this phase, the main goal of the tests is to let you be the first to know and act.

Conclusion

Working with DBT and implementing both Generic and Custom Tests is a critical part, and not an easy one, of maintaining data integrity and quality in any data-driven organization. The variety of tests available in DBT, from the predefined Generic Tests to the more flexible Custom Tests, offers powerful tools to ensure that data transformations are reliable and meet specific business requirements.

Through testing and continuous monitoring, teams can proactively address issues, maintain high data quality, and support better decision-making processes.

