Data Warehouse Testing Strategies for Better Data Quality

Lily Chang
Published in Policygenius · May 24, 2023 · 10 min read

Policygenius is America’s leading online insurance marketplace. Our mission is to help people get insurance right by making it easy for them to understand their options, compare quotes, and buy a policy, all in one place.

Background

A previous post provided a high-level overview of the ELT (Extract, Load, Transform) architecture at Policygenius. As the number of transformation jobs has grown significantly over the last few years, an increasing number of data quality issues have emerged. These issues not only cost developers valuable firefighting time but also eroded data stakeholders’ trust in the data and their ability to make effective business decisions. In this post, we will discuss the challenges associated with these data quality issues and strategies for tackling them.

Generally speaking, Data Engineers are responsible for extracting and loading data from various sources into the data warehouse (BigQuery), while ownership of transformation jobs is shared between the Data Engineering and Data Science teams. The development workflow for these transformation jobs is relatively straightforward: add or modify a SQL query, push to GitHub, test in the development environment, request a code review, merge the pull request, and deploy to production (see the workflow diagram below). However, this workflow is not particularly quality-oriented for the following reasons:

  • Data product developers are required to manually validate queries by gut-checking production results, as the development environment has limited data.
  • The QA process is not consistent across the team. Some data product developers are more meticulous than others, with some copying and pasting validation queries into the pull request and others using an isolated view, which can make the results difficult to track down and reproduce.
  • The data that was validated at the time of development can drift over time due to the constantly evolving production data.
  • Time-to-detect for data issues in production is often very long, estimated at around one week, and these issues are usually reported by a data stakeholder who notices that a report is off.
  • Sometimes when data product developers believe they have found the root cause of a data issue, they push a patch to production and hold their breath, hoping that it will solve the problem. However, regressions often occur, with patching one issue causing another. This creates a vicious cycle that I am sure every data professional can relate to, at least to some degree.
Figure 1: Original Development Workflow.

Turning the Tide

In Q3 2022, we decided to address our data quality issues head-on. First, we needed to understand the types of data quality issues we commonly face:

  • Bad logic/bugs in the data warehouse views: Pretty common, ranging from a tiny typo to incorrect business logic. This problem is exacerbated by the fact that some transformation jobs contain hundreds of lines of legacy SQL code, and it can take hours to unravel what the code does. While we would prefer to write fewer lines of cleaner SQL code, sometimes we need to get the job done quickly to support business reporting.
  • Contextual change from upstream sources: Contextual data changes, also known as semantic changes, are changes to the meaning or content of the data rather than to its structure. These changes are difficult to detect because, unlike structural changes, they don’t immediately break anything. However, when unexpected data creeps into the data warehouse and violates the initial assumptions, datasets and reports can start to drift. For example, we experienced one incident where an unexpected duplicate record snuck into a low-level table seven months after the initial validation and was double-counted 320 times in the final table used for key business reporting.

There are two approaches we can take to tackle these quality issues, either from a process perspective or a technology perspective. This post will focus on the latter by introducing better tools with the following design requirements:

  1. Requirement 1: Data product developers should not write cumbersome custom scripts. We don’t want to add too much development overhead to the data product developers, so the tool needs to be generalized enough to cover various quality use cases.
  2. Requirement 2: Data product developers should be able to use an interface that is similar to their current workflow. Data product developers include Data Engineers, Data Scientists, and Data Analysts with varying skill sets, so we want to find a common set of tools. If new tools are introduced, they should be as easy as possible to pick up in order to reduce the cost of onboarding.
  3. Requirement 3: The cost of switching to other solutions, should it become necessary in the future, should be low. We want to abstract away the implementation or choice of tools behind the interface from Requirement 2.

With these requirements in mind, we introduced the two testing tools below to ensure that data in both the development and post-development stages adheres to certain standards. The figure below illustrates the new steps in the developers’ workflow for integrating these two tools.

Figure 2: New Data Development Workflow. Blue: Current workflow. Yellow: Additions to the existing workflow.

SQL Unit Test

Given an input row ABC, we expect to see a row DEF after 500 lines of SQL transformation.

Most of the current open source data quality tools/vendors that we looked at, such as Great Expectations and dbt, focus on checking the state of the end products in an aggregated form. These types of checks are certainly important for monitoring the quality of production data, which we will address in the Production Observability section below. However, there is a missing link in the SQL development phase. We want to “shift left” by moving testing as early as possible into the development phase (see the first two additional steps in the New Data Development Workflow above). Ideally, we would like to be able to test: “Given an input row ABC, we should expect to see a row DEF after 500 lines of SQL transformation.”

We believe that unit testing is a good fit here. It is a widely adopted testing method among software developers but has yet to gain traction in the data community due to a lack of mature DevOps culture and available tools. Many data product developers are used to diving straight into writing SQL, testing it, and rewriting it if issues are found. Instead, we want them to create mock-up test cases before they write any SQL code so they can focus on expected behaviors and outcomes rather than jumping straight into implementation. Unit testing is well suited for views with complex business logic and can help ensure that the code produces the expected results and reduce the likelihood of regression down the road.

For example, one of the transformation jobs uses a BigQuery user-defined function (UDF) to amortize fixed marketing spend based on a decay rate. UDFs open up many more possibilities for SQL-based transformation jobs, but they can become difficult for the Data team to understand and maintain. In this case, a unit test is set up with an input spend record of $100 below:

[
  {
    "report_date": "2022-09-01",
    "spend": 100
  }
]

Assuming a daily decay rate of 0.4, we expect the following output to be returned after the UDF transformation in BigQuery:

[
  {
    "amortized_spend": 40,
    "remaining_spend": 60,
    "amortized_spend_date": "2022-09-01",
    "day": 0
  },
  {
    "amortized_spend": 24,
    "remaining_spend": 36,
    "amortized_spend_date": "2022-09-02",
    "day": 1
  },
  {
    "amortized_spend": 14.4,
    "remaining_spend": 21.6,
    "amortized_spend_date": "2022-09-03",
    "day": 2
  },
  {
    "amortized_spend": 8.64,
    "remaining_spend": 12.96,
    "amortized_spend_date": "2022-09-04",
    "day": 3
  },
  {
    "amortized_spend": 5.184,
    "remaining_spend": 7.776,
    "amortized_spend_date": "2022-09-05",
    "day": 4
  },
  {
    "amortized_spend": 3.1104,
    "remaining_spend": 4.6656,
    "amortized_spend_date": "2022-09-06",
    "day": 5
  },
  {
    "amortized_spend": 1.86624,
    "remaining_spend": 2.79936,
    "amortized_spend_date": "2022-09-07",
    "day": 6
  },
  {
    "amortized_spend": 1.119744,
    "remaining_spend": 1.679616,
    "amortized_spend_date": "2022-09-08",
    "day": 7
  },
  {
    "amortized_spend": 1.679616,
    "remaining_spend": 0,
    "amortized_spend_date": "2022-09-09",
    "day": 8
  }
]
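
For illustration, the sketch below shows what a decay-based amortization UDF along these lines could look like as a BigQuery SQL UDF. The function name, signature, and nine-day cap are assumptions inferred from the example above, not our production code, but the arithmetic reproduces the expected output: each day amortizes the decay rate’s share of the remaining spend, and the final day pays out whatever is left.

-- Hypothetical sketch of a decay-based amortization UDF. The name, signature,
-- and nine-day cap are illustrative assumptions, not the production code.
CREATE TEMP FUNCTION amortize_spend(spend FLOAT64, decay_rate FLOAT64, max_days INT64)
RETURNS ARRAY<STRUCT<day INT64, amortized_spend FLOAT64, remaining_spend FLOAT64>> AS ((
  SELECT ARRAY_AGG(STRUCT(
    day,
    -- Amortize decay_rate of the remaining spend; the last day pays out the remainder.
    CASE WHEN day = max_days - 1
         THEN spend * POW(1 - decay_rate, day)
         ELSE spend * POW(1 - decay_rate, day) * decay_rate END AS amortized_spend,
    CASE WHEN day = max_days - 1
         THEN 0.0
         ELSE spend * POW(1 - decay_rate, day + 1) END AS remaining_spend
  ) ORDER BY day)
  FROM UNNEST(GENERATE_ARRAY(0, max_days - 1)) AS day
));

-- Applying the UDF to the $100 input record with a 0.4 daily decay rate
-- yields the nine expected rows shown above.
SELECT
  DATE_ADD(input.report_date, INTERVAL d.day DAY) AS amortized_spend_date,
  d.day,
  d.amortized_spend,
  d.remaining_spend
FROM UNNEST([STRUCT(DATE '2022-09-01' AS report_date, 100.0 AS spend)]) AS input
CROSS JOIN UNNEST(amortize_spend(input.spend, 0.4, 9)) AS d;

The unit test pins this behavior down: if someone later changes the decay logic or the UDF’s edge handling on the final day, the expected output above no longer matches and the test fails before the change reaches production.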

Production Observability

Given a table A, we expect column B to be unique or column C’s values to fall in the range from XXX to XXX.

In addition to the shift-left unit testing approach above, Data Observability focuses on post-deploy monitoring to ensure that the production data satisfies certain assumptions, such as “Given a table A, we expect column B to be unique or column C’s values to fall in the range from XXX to XXX.” Since production data is constantly changing, it is nearly impossible for developers to catch all edge cases during the development phase. Regularly executing data quality checks in production helps detect issues earlier and more proactively, rather than relying on stakeholders to report issues days later. Below are the common types of data quality checks we have enabled:

  • Unique constraint: For example, the table `application` should be unique at (`product_type`, `application_id`).
  • Cross-table check: Checks whether a column in the current table aggregates to the same value as a column in another table, within an XX% margin of tolerance. For instance, the difference between SUM(`table_1.spend`) and SUM(`table_2.spend`) should be within +/-5%.
  • Nullable check: Checks whether a column can have null values or not.
  • % nullable check: Sets a threshold for the maximum percentage of null values expected in a column.
  • Enum check: Checks if the value is one of the expected values. For example, the column `product` should be one of [“life”, “disability”, …].
  • Distribution check: Checks whether the value of numeric types, including integers and floats, falls within a range. For example, annual premium should be in the range of 0–1,000,000, and the median annual premium should be in the range of 1,000–10,000.
Figure 3: An example of the transformation jobs with the data quality checks.
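
To make these checks concrete, the snippet below is a minimal sketch of the kind of query a unique-constraint check boils down to. The table path is a placeholder rather than our production dataset; an empty result means the constraint holds, and any returned rows are duplicates that would cause the check to fail.

-- Minimal sketch of a unique-constraint check as a query (table path is a
-- placeholder). The check passes only when no (product_type, application_id)
-- pair appears more than once.
SELECT
  product_type,
  application_id,
  COUNT(*) AS duplicate_count
FROM `my-project.analytics.application`
GROUP BY product_type, application_id
HAVING COUNT(*) > 1;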

Each data quality check can be configured with one of the following alert types.

  • Hard-fail: Sends a PagerDuty incident to Engineer On-Call (EOC) from the Data team. It’s commonly used in strong integrity checks like unique constraints.
  • Soft-fail: Sends an alert to a designated Slack channel without creating an incident, but the EOC can declare an incident if need be. This type of alert is useful for failures that are worth keeping an eye on, for example, the % nullable check on a column.
  • Silent-fail: Logs to Datadog only and alerts if failures exceed a certain threshold, but it does not send to any Slack channel or create an incident. This is the most tolerant failure type and can be used during initial prototyping and iteration so it doesn’t create much noise.

Implementation

Data product developers use YAML for configuration (see the example below) and JSON for seeding the unit test cases. We chose YAML and JSON because they are declarative and commonly used among the Data team, which eliminates the need to write custom Python code for tests and satisfies the first two requirements mentioned earlier. We use pytest, bq-test-kit, and some lightweight custom scripts to dynamically generate and execute the tests. Unit tests are automated as part of the CI/CD pipeline, and Data Observability checks run periodically as part of the transformation jobs in production. In the future, we might consider migrating the testing framework to Great Expectations or other similar tools if there is a strong fit. However, that should not change the interface that data product developers currently use, as directed by the third requirement.

When we initially launched the tools, we noticed that developers were spending a significant amount of time on tedious tasks such as creating the folder structure and copying input data. This became time-consuming, especially for views that referenced 20 or more upstream tables. To address this issue, we developed a scaffolding tool that auto-fills the configuration files and input/output data, resulting in a significant reduction in development time. This tool received very positive feedback from users.

As expected, it took a while for the data product developers to adopt the test-first mindset. We scheduled several rounds of training and pairing sessions to get them used to the new workflow, and they began to appreciate it more once they realized, “Oh, it’s actually working.” For example, the unit tests immediately revealed a calculation error at the initial launch. There were also multiple occasions when an update deployed to production failed the observability checks; we were immediately notified, rectified the situation, and contained the blast radius before it became too widespread.

An example of the view config file with unit test and observability checks enabled:

---
description: "One row represents an application and its information."
priority: 1
frequency_in_cron: "3 13 * * 1-5"
fields:
  - name: application_id
    description: PK - This unique identifier represents a single application.
    type: INTEGER
  - name: created_at
    description: When the application was first created.
    type: DATETIME
  - name: customer_id
    description: The customer that the application belongs to. A customer can have multiple apps.
    type: INTEGER
  - name: product_type
    description: The business product (e.g. life, home and auto).
    type: STRING
test:
  input:
    - table: "{project}.policygenius.life_applications_view"
      schema:
        file_path: postgresql/schemas/policygenius/life_applications.json
    - table: "{project}.policygenius.users_view"
      schema:
        file_path: postgresql/schemas/policygenius/users.json
validation:
  table:
    - check_type: unique_constraint
      expects: application_id, product_type
      fail_type: hard_fail
  columns:
    product_type:
      - check_type: nullable
        expects: False
        fail_type: hard_fail
      - check_type: enum
        expects: ['life', 'disability', 'homeandauto', 'renters']
        fail_type: soft_fail
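
As a rough illustration of how such a config is executed, the enum check above could be compiled into a validation query along the following lines. This is a hedged sketch: the view name is a placeholder and the actual generated SQL may differ.

-- Illustrative sketch of the query an enum check could compile into (the view
-- name is a placeholder and the generated SQL may differ). A non-zero count
-- means unexpected product_type values and would raise a soft-fail Slack alert.
SELECT COUNT(*) AS violation_count
FROM `{project}.policygenius.application_view`
WHERE product_type IS NOT NULL
  AND product_type NOT IN ('life', 'disability', 'homeandauto', 'renters');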

Closing

It’s been over six months since we initially introduced the tooling. We now have a consistent and automated data testing process that covers about half of the transformation jobs. This has not only reduced the risk of data breakages and decreased time-to-detect from days to hours, but it has also increased data product developers’ confidence in deploying their code. Although we have not completely eradicated all data quality issues, we are empowering data product developers with more tools and a better development experience.

We are continuously expanding our test coverage within the Data team and exploring opportunities to expand the Data Observability framework upstream to Software Engineering, thereby increasing the coverage of our data end-to-end. Our work is not finished yet as we continue to innovate and identify ways to improve data quality.

Lily Chang
Policygenius

Currently Data Engineering Manager at Justworks. Love tinkering with data/software/platform and building teams.