Data Quality Monitoring with SQL and dbt for Public Health KPIs

DTO Kemenkes RI
Aug 23, 2023

The concept of “data quality” has been around for as long as humans have gathered data, so the idea that data quality is vital is not novel. High-quality data is accurate, reliable, “fit for use,” and relevant to its purpose. Data quality is critical for effective service delivery, decision-making, and ongoing program monitoring: high-quality data yields accurate reporting, which in turn supports sound decisions.

Examples of Data Quality Problems

The Digital Transformation Office (DTO) of the Ministry of Health (MoH) has implemented the Data Quality Guidelines established by the World Health Organization (WHO) on 61 public health KPIs termed IKK/IKP (Indikator Kinerja Kunci/Indikator Kinerja Program). These indicators span various public health areas reported by different programs within MoH, such as first antenatal care (ANC) visits for maternal health, the tuberculosis treatment success rate, the number of confirmed malaria cases reported, health worker density and distribution, and immunization coverage. These 61 indicators form the backbone of policy making for MoH stakeholders.

As we migrated to automated data collection via ingestion into our data warehouse, often with highly granular data (as opposed to manually obtaining aggregate data via spreadsheets), the volume of data we receive grew significantly. Manual data quality assessment was once feasible, but it now takes far too long, consuming time that we could devote to other public duties.

This post will explain how we monitor data quality at scale in accordance with WHO guidelines.

Assessing Data Quality with the WHO DQA Framework

The WHO Data Quality Assurance (DQA) documents define numerous criteria for evaluating public health indicator data quality. However, because our use of routine data is new, we decided to concentrate on three primary criteria as a pilot project: accuracy, completeness, and consistency.

To evaluate the quality of data at the subnational level (e.g., province), we often compare it to data at the national level. Outlier subnational data may be considered poor quality and must be reported to its respective owners at the provincial/regency/city-level Health Offices (Dinkes).

Each data quality dimension is discussed in detail below.

However, before proceeding, we should note that meeting 100% of the data quality requirements with real-life data is almost impossible, so each dimension is assessed against a threshold. For example, a 90% completeness threshold still allows the data to be used when up to 10% is missing. The threshold depends, to some extent, on the business context: the higher the threshold, the stricter the quality requirement for your use case.

Implementing the Data Quality Monitoring

The IKK/IKP data was gathered from several databases and API services and then loaded into our data warehouse. These data vary in granularity, data model, and so on, which requires standardization before data quality can be determined. To further complicate matters, all 61 IKK/IKP indicators must be scored. We need a way to automate the scoring, because creating a new query for each indicator would be time-consuming and error-prone.

SQL Implementation

Before we get into the data quality assessment queries, keep in mind that we have 61 different IKK/IKP indicators. Some indicators were submitted by the provinces as aggregated numbers in spreadsheet files. Meanwhile, indicators derived from information systems that provide individual-level data must first be aggregated.

Therefore, we have to standardize the data model before we assess its quality. For each indicator, we aggregate the value by time (in this case, year and semester) and region (in this case, province). Once the data model is standardized, we can implement the data quality assessment per the WHO's guidelines.
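As a hedged illustration, the standardization step for one indicator might look like the following sketch, where the source table and column names are hypothetical:

```sql
-- Minimal sketch of the standardization step for a single indicator.
-- raw.maternal_health_visits and its columns are hypothetical names.
SELECT
    'anc_1st_visit'                               AS indicator_name,
    EXTRACT(YEAR FROM visit_date)                 AS year,
    CASE WHEN EXTRACT(MONTH FROM visit_date) <= 6
         THEN 1 ELSE 2 END                        AS semester,
    province_code,
    COUNT(*)                                      AS value
FROM raw.maternal_health_visits
GROUP BY 1, 2, 3, 4;
```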

For the accuracy assessment, we compare each province's value with the national mean and standard deviation for that year. If the province's indicator value is more than 2SD above or below the national mean, the data from that province is considered inaccurate.
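A minimal sketch of this check, assuming a standardized table named mart.standardized_indicators with one row per indicator, year, semester, and province (all names hypothetical):

```sql
-- Accuracy sketch: flag provinces outside the national mean ± 2SD.
WITH yearly AS (  -- roll semesters up to one value per province-year
    SELECT indicator_name, year, province_code, SUM(value) AS value
    FROM mart.standardized_indicators
    GROUP BY 1, 2, 3
),
national AS (     -- national mean and SD across provinces, per year
    SELECT indicator_name, year,
           AVG(value)    AS national_mean,
           STDDEV(value) AS national_sd
    FROM yearly
    GROUP BY 1, 2
)
SELECT
    y.indicator_name,
    y.year,
    y.province_code,
    y.value,
    CASE WHEN y.value BETWEEN n.national_mean - 2 * n.national_sd
                          AND n.national_mean + 2 * n.national_sd
         THEN 'accurate' ELSE 'inaccurate'
    END AS accuracy_flag
FROM yearly y
JOIN national n
  ON n.indicator_name = y.indicator_name AND n.year = y.year;
```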

In the consistency assessment, we expect most public health data to follow a trend/trajectory over the years similar to the national data. We check this by calculating the ratio between the current year's data and the average of the last 3 years at the national level. Then we calculate the same ratio for the provincial-level data. Finally, we compare each province's ratio with the national ratio.

If the province's ratio is more than 33% above or below the national ratio, we consider the data inconsistent.
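A sketch of the consistency check under the same assumptions as above (the ±33% band as written assumes positive ratios):

```sql
-- Consistency sketch: compare each province's year-over-trend ratio
-- with the national ratio; deviations beyond ±33% are inconsistent.
WITH provincial AS (
    SELECT indicator_name, province_code, year, SUM(value) AS value
    FROM mart.standardized_indicators
    GROUP BY 1, 2, 3
),
national AS (
    SELECT indicator_name, year, SUM(value) AS value
    FROM provincial
    GROUP BY 1, 2
),
province_ratio AS (  -- current year vs. average of the prior 3 years
    SELECT indicator_name, province_code, year,
           value / NULLIF(AVG(value) OVER (
               PARTITION BY indicator_name, province_code
               ORDER BY year
               ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING), 0) AS ratio
    FROM provincial
),
national_ratio AS (  -- the same ratio at the national level
    SELECT indicator_name, year,
           value / NULLIF(AVG(value) OVER (
               PARTITION BY indicator_name
               ORDER BY year
               ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING), 0) AS ratio
    FROM national
)
SELECT p.indicator_name, p.province_code, p.year,
       CASE WHEN p.ratio BETWEEN 0.67 * n.ratio AND 1.33 * n.ratio
            THEN 'consistent' ELSE 'inconsistent'
       END AS consistency_flag
FROM province_ratio p
JOIN national_ratio n
  ON n.indicator_name = p.indicator_name AND n.year = p.year;
```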

The completeness assessment is quite straightforward. We simply tally the number of cities submitting the data out of all cities in a province. If fewer than 80% of the cities in a province submit the data, we consider the data incomplete.
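A sketch of the completeness check, assuming a hypothetical city-level reporting table and a reference table with the number of cities per province:

```sql
-- Completeness sketch: share of cities in a province that reported.
-- mart.city_level_reports and ref.cities_per_province are hypothetical.
SELECT
    r.indicator_name,
    r.year,
    r.semester,
    r.province_code,
    COUNT(DISTINCT r.city_code)                       AS reporting_cities,
    t.total_cities,
    CASE WHEN COUNT(DISTINCT r.city_code) >= 0.8 * t.total_cities
         THEN 'complete' ELSE 'incomplete'
    END AS completeness_flag
FROM mart.city_level_reports r
JOIN ref.cities_per_province t
  ON t.province_code = r.province_code
GROUP BY r.indicator_name, r.year, r.semester, r.province_code, t.total_cities;
```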

Automation in dbt

There are 3 data quality metrics (accuracy, completeness, and consistency) and 61 IKK/IKP indicators that need to be assessed. In short, we would be running 3 × 61 = 183 queries to assess the data quality of all IKK/IKP. Manually copy-pasting the 61 indicators into the 3 data quality assessments tends to be tedious and can lead to several potential issues:

  • Complicated queries
  • Error-prone and difficult to debug
  • Time and resource consuming

Therefore, we need to automate these queries by creating a data processing flow that results in a final product, which is a data mart/table for each data quality assessment.

To achieve this, we are utilizing the data build tool (dbt). By using dbt, we establish task/data-flow dependencies, allowing the queries to be executed in stages. Below is an example diagram of the data processing flow for the accuracy measurement, which also applies to the completeness and consistency metrics.

dbt orchestration for automation

To minimize repetition, we leveraged Jinja templating to create loops over the 61 queries for each data quality measurement. Instead of writing and repeating a large number of error-prone, hard-to-debug SQL lines, we can condense hundreds of lines of SQL into just a few lines of code. Here is the kind of Jinja looping template we used in the project. (Note: for testing purposes, we divided the 61 indicators into 3 staging tables, hence the first loop only covers 21 indicators.)
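A minimal sketch of such a looping template in a dbt model, assuming a hypothetical indicator list and intermediate models named int_<indicator>:

```sql
-- models/staging/stg_indicators_part1.sql (hypothetical dbt model)
-- Loop over indicator models and UNION their standardized output.
{% set indicators = ['anc_1st_visit', 'tb_treatment_success', 'malaria_confirmed_cases'] %}

{% for indicator in indicators %}
SELECT
    '{{ indicator }}' AS indicator_name,
    year,
    semester,
    province_code,
    value
FROM {{ ref('int_' ~ indicator) }}
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
```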

In production, each such loop produces a staging table containing roughly 20 indicators. Since we have 61 indicators, we end up with 3 staging tables. We just need to UNION them to get the full list, and then run the data quality assessment on top of it (in this case, the accuracy assessment), as sketched below.
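A sketch of that final step, with hypothetical staging model names:

```sql
-- models/marts/accuracy_assessment.sql (hypothetical model names)
WITH all_indicators AS (
    SELECT * FROM {{ ref('stg_indicators_part1') }}
    UNION ALL
    SELECT * FROM {{ ref('stg_indicators_part2') }}
    UNION ALL
    SELECT * FROM {{ ref('stg_indicators_part3') }}
)

-- The accuracy logic shown earlier (national mean ± 2SD per indicator
-- and year) is then applied on top of all_indicators.
SELECT * FROM all_indicators
```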

Results

After the data quality measurements have been defined and the dbt automation established, the next step is to present the quality metrics and their trends through a quality monitoring dashboard, so stakeholders and data users can monitor them regularly.

The main page of the data quality monitoring dashboard (Note: confidential metrics are hidden)

The dashboard displays the percentage of health indicators that meet the criteria of accuracy, completeness, and consistency. Moreover, it shows the proportion of data falling into the “poor” and “good” categories for each metric across the 61 health indicators, aggregated by year and semester, giving users an at-a-glance view of data quality levels and trends. With this monitoring dashboard, we can follow up on any specific low-quality indicator with its respective owners at the Provincial Health Office level.

A dashboard example of accuracy at the provincial level shows low-accuracy data from two provinces (red), which can be followed up by the users. (Note: Province names are hidden)

Previously, validating data quality was manual work in a spreadsheet application, which was exhausting and time-consuming. Automated data quality assessment saves roughly 90% of the time previously spent on manual calculations. Now we can spend more time following up with the provinces and creating policies to improve data quality in the future.

The assessment of data quality is the most crucial step in delivering high-quality, meaningful work that can enhance the decision-making process as a whole. Measuring data quality takes substantial effort; however, it is a fundamental prerequisite for harnessing the full potential of data.

When logical rules are established to run the data quality monitoring process automatically, automation saves time and increases productivity. Furthermore, a data quality monitoring system can extend a sense of data ownership beyond the data team, so that data users become aware of and engaged with data quality.

The article was written by: Edwina Agustin, Aviandito, M Abdurrahman Shidiq, Dennis Ardiansya, Alfisyah Prisianda Pangestu Utomo, and Fariz Wakan
