A Case Study on Measuring Data Quality

Aaditya Bansal
Hippo Engineering Blog
Nov 29, 2022 · 5 min read

Authored by Tarrun Gunasekaran and Aaditya Bansal

About Hippo

Hippo is modernizing the $100 billion home insurance industry by putting customers at the center of everything we do, from the coverage we offer to the customer service we provide. If that wasn’t enough, it’s really just the beginning: Hippo’s true ambition lies in helping people prevent problems with their homes to begin with.

What is Data Quality?

If you have ever worked with data, you know that data quality is of paramount importance: a company needs trustworthy, accurate, and consistent data to analyze and to make business decisions. Data quality is not a one-time project or initiative but an ongoing effort to continuously make the data better and more useful.

Data Quality is an umbrella term that encompasses all of the factors that influence whether data can be relied upon for its intended use.

What are the different Data Quality Dimensions?

Why a Data Quality Metric?

It is easy to pay lip service to data quality, but how do we actually measure it? At Hippo, we came up with a measure, or score, that we call the Data Quality Metric.

It is calculated from several factors and assures us that we have high-quality, trustworthy data that business users can rely on for their analytics, Machine Learning, and Data Science projects.

Our Current Data Quality Mechanisms:

To cover the Data Quality Dimensions above, we have the following mechanisms:

DBT Tests

  • Generic Tests — unique, not_null, accepted_values, and relationships are the four generic tests that come with the dbt package. We define these tests in a .yml file, and dbt translates them into SQL using the model name and the corresponding column names. You can read more about them here.
  • Singular Tests — custom SQL written to check business logic, e.g., a check on whether policies are cancelled after a certain date relative to their effective date.

Monte Carlo Monitors — the tool we use for detecting deleted tables, schema changes, freshness anomalies, volume anomalies, and field health anomalies.

Our Journey for a Reliable Data Quality Metric

Our Data Quality Metric is continuously evolving. We went through several iterations of it until we were confident that it actually reflected our data quality.

Version 1.0

As our first step, we wanted to identify the most used tables and verify that we have sufficient tests defined on them. To do that, we took two major steps in this version.

  1. Get key assets from Monte Carlo — Monte Carlo assigns an importance score to each table in the data warehouse based on the number of times the table has been queried. We use this score to get the top 200 most queried tables.
  2. Parse the dbt graph.gpickle — the graph.gpickle file is a dbt artifact that contains metadata about all the models and the tests defined on them. We parse this file to collect which columns have tests defined on them and what types of tests are defined for each column (see the sketch after this list).
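A minimal sketch of this parsing step is below. It assumes graph.gpickle is a pickled networkx DiGraph whose nodes are dbt unique IDs and whose edges point from a resource to the resources that depend on it; the exact shape of the artifact can vary by dbt version, and the node IDs in the comments are made-up examples. dbt's manifest.json exposes similar metadata in JSON form if you prefer not to unpickle the graph.

import pickle
from collections import Counter

# Load the dbt DAG (a pickled networkx DiGraph of dbt unique IDs,
# e.g. "model.my_project.policies", "test.my_project.not_null_policies_id").
with open("target/graph.gpickle", "rb") as f:
    graph = pickle.load(f)

# Count how many test nodes hang off of each model node.
tests_per_model = Counter()
for node in graph.nodes:
    if not node.startswith("test."):
        continue
    # A test node's predecessors are the resources it depends on,
    # i.e. the model(s) it is testing.
    for parent in graph.predecessors(node):
        if parent.startswith("model."):
            tests_per_model[parent] += 1

for model, n_tests in tests_per_model.most_common(10):
    print(model, n_tests)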

Combining these two pieces of information gave us a rough estimate of a Data Quality Coverage Score for the top 200 most queried tables in our Data Warehouse.

Limitations of Version 1.0

  • Key Assets from Monte Carlo showed more than 100 tables with a score of 0.99. Without a way to distinguish between them, it is hard to tell whether one table is more important than another.
  • We arbitrarily picked the number of minimum tests required.
  • Although we got a score, we could not see which tables were the most important and whether they fulfilled the minimum-tests criteria.
  • We simply printed out the score, which isn't visually appealing.

Version 2.0

  • Instead of using Monte Carlo to define table importance, we use query logs from BigQuery.
  • Use column coverage (the number of columns in a table) to define the minimum number of tests.

Steps for Calculating Data Coverage:

Step 1: Get the important tables from the BigQuery INFORMATION_SCHEMA

  • Get the top 100 tables queried over the last 30 days from the INFORMATION_SCHEMA
WITH base AS (
    SELECT
        r.dataset_id,
        r.table_id,
        COUNT(*) AS total
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    CROSS JOIN UNNEST(referenced_tables) AS r
    WHERE
        creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
            AND CURRENT_TIMESTAMP()
        AND SPLIT(user_email, '@')[SAFE_OFFSET(1)] = '<company-email-domain>'
        AND LOWER(query) NOT LIKE '%dbt%'
        AND r.table_id NOT IN ('audit_log', 'utils_dates')
        AND r.dataset_id IN ('<schema 1>', '<schema 2>', '<schema 3>')
    GROUP BY 1, 2
    ORDER BY 3 DESC
    LIMIT 150
)
SELECT
    b.dataset_id,
    b.table_id AS table_name,
    b.total AS query_count,
    COUNT(*) AS column_count
FROM base b
LEFT JOIN `region-us`.INFORMATION_SCHEMA.COLUMNS ic
    ON ic.table_name = b.table_id AND ic.table_schema = b.dataset_id
GROUP BY 1, 2, 3
ORDER BY 3 DESC

Step 2: Define the Number of Tests Required Based on Column Count

  • If the number of columns is greater than 200, the minimum number of tests required is 20.
  • If the number of columns is 200 or fewer, we require coverage on at least 10% of the columns. E.g., if there are 120 columns, the table needs a minimum of 12 tests.
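As a quick illustration, this rule can be captured in a small helper like the sketch below; the 200-column threshold and 10% ratio are the values described above, and rounding up on the 10% branch is our own choice.

import math

def min_tests_required(column_count: int) -> int:
    """Minimum number of tests we expect on a table, per the rule above."""
    if column_count > 200:
        return 20
    # Otherwise require tests on at least 10% of the columns,
    # e.g. 120 columns -> 12 tests (rounding up here).
    return math.ceil(column_count * 0.10)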

Step 3: Calculate Individual Table Coverage

  • If the number of tests >= the number of tests required → 100% individual table coverage
  • If the number of tests < the number of tests required → individual table coverage = number of tests / number of tests required

Step 4: Calculate Total Quality Score

  • Individual points awarded for a table = (individual table coverage) × pct_total, where pct_total is the table's share of the total query count across the selected tables
  • Quality Metric Score = sum of the individual table points
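Putting Steps 3 and 4 together, the roll-up looks roughly like the sketch below. The table names and numbers are made up for illustration; in practice the query share and column counts come from the BigQuery query in Step 1, and test counts come from the dbt graph.

tables = [
    # tests = tests defined, required = minimum from Step 2,
    # pct_total = the table's share of total query count
    {"name": "policies",        "tests": 25, "required": 20, "pct_total": 0.30},
    {"name": "claims",          "tests": 6,  "required": 12, "pct_total": 0.25},
    {"name": "customer_events", "tests": 0,  "required": 8,  "pct_total": 0.45},
]

quality_score = 0.0
for t in tables:
    # Step 3: a table is fully covered once it meets its minimum test count.
    coverage = min(1.0, t["tests"] / t["required"])
    # Step 4: weight each table's coverage by how heavily it is queried.
    quality_score += coverage * t["pct_total"]

print(f"Data Quality Metric: {quality_score:.2%}")  # 1.0*0.30 + 0.5*0.25 + 0.0*0.45 = 42.50%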

The illustration below shows this calculation for our top 10 most used tables.

Our Next Steps

This blog post describes the test coverage strategy we have implemented for our most critical tables. As a next step, we will account for dbt test failures, data warehouse outages or delays, and other data quality alerts. This will give us a Data Quality Metric on a daily basis, and we will be able to measure how our Data Quality Score and Data Quality Coverage Score trend over time. Along with that, we are also looking to improve the following:

  • Distinguish tables queried directly by users from those used by Tableau dashboards and reports
  • Post results to BigQuery for further analysis
  • Build a trendline in Grafana
  • Automate the process in Airflow
  • Improve visualization using Data Studio

We have already started working on these next steps and will publish another blog post after releasing the new version.
