Simplifying Data Validation with dbt Data Audit
An overview of dbt-audit-helper
Validating and auditing data
Whenever data engineers refactor SQL queries or perform database migrations, the best way to verify, or validate, that the task was completed correctly is to compare data between the source and target. This data auditing process can be a hassle, and the amount of time and effort it takes depends on available debugging tools. Many hours can be spent tracking down where things went wrong (e.g., column mismatches). Having experienced this problem, a data analyst with the help of other community contributors packaged up the most used audit queries into a dbt package calleddbt-audit-helper
. It contains macros that simplify data auditing and make the process more robust and modular.
When auditing data, invocation_id
is just one of the simple fixes that greatly improves quality of life. For a more complete set of data auditing tools, we’ll be referencing dbt-audit-helper
.
Invocation ID
This dbt Jinja function is very valuable for incremental loads. For each compilation or run, {{invocation_id}}
will generate a unique UUID. Take the following macro as an example:
{% macro
%}
insert into {{ get_audit_run_relation() }} (
invocation_id,
runtime,
database,
warehouse,
user,
role,
completed,
dbt_version
)
values (
'{{ invocation_id }}',
{{ runtime() }},
'{{ target.database }}',
'{{ target.warehouse }}',
'{{ target.user }}',
'{{ target.role }}',
{{ completed }},
'{{ dbt_version }}'
);
commit;
{% endmacro %}
This was written before the days of dbt-audit-helper
, but you can now leverage the tools that it offers and not spend time writing basic custom functions.
Enter dbt-audit-helper
The common theme of data audits is comparisons, like that of test-driven development. While custom audit functions are useful, dbt-audit-helper
contains the essential functions required for any audit process.
A macro function such as the following compares the data types expected from a queried table. In the event that there are missing (or mismatching) columns (or values), the function generates an alternate table that essentially compares the differences between the expected and actual tables. This alternate table then can be used to demonstrate and diagnose the issue. The fundamental usage of audit functions, such as the one below, is to capture any errors, or bugs, from the data origin and queries.
{% macro default__create_audit_result_table() %}
{% set required_columns = [
["invocation_id", dbt_utils.type_string()],
["invocation_type", dbt_utils.type_string()],
["model_id", dbt_utils.type_string()],
["model_name", dbt_utils.type_string()],
["schema_name", dbt_utils.type_string()],
["model_tags", dbt_utils.type_string()],
["status", dbt_utils.type_string()],
["records_processed", "number"],
["execution_time", "number"],
["message", dbt_utils.type_string()],
["compilation_started_at", dbt_utils.type_timestamp()],
["compilation_completed_at", dbt_utils.type_timestamp()],
["execution_started_at", dbt_utils.type_timestamp()],
["execution_completed_at", dbt_utils.type_timestamp()],
["adapter_response", dbt_utils.type_string()]
] -%}
{% set audit_table = get_audit_result_relation() -%}
{{ create_alter_table_if_not_exists(required_columns, audit_table) }}
{% endmacro %}
The code above addresses column mismatches, but for mismatching rows you’d need additional logic to compare and list out what the mismatches are.
Summary
For future development, the dbt-audit-helper
package can provide macros that will diagnose any mismatching columns, types, and values. This is done by including the package in the packages.yml
file and running dbt deps
. The package currently covers the following:
compare_relations
(row-by-row validation of two relations)compare_queries
(comparison of two select statements)compare_column_values
(comparison of a column across two queries)compare_relation_columns
(comparison of the ordinal positions and data types of columns)