An In-house Data Quality Project: Artemis

Murat Tonbuloğlu
ÇSTech
Jan 11, 2023
The name's resemblance to the NASA Artemis project is purely coincidental :)

There are many automated data quality tools in the field that profile your data and detect defects. But beyond generic checks, such as the percentage of null values in a column, we encounter erroneous data rooted in local business requirements. To illustrate, the coordinates of a store may not match its district, or a price may not equal the result of its calculation. These scenarios usually stem from manual updates, faulty software changes, or incorrect definitions. Such an issue is eventually resolved once someone who uses the data notices it; however, the same error can always recur. Our Artemis architecture helps us track these recurrences.
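To make this concrete, the price scenario above can be captured as a query that counts the rows violating the rule (the kind of "control script" described below). A minimal sketch, where the project, table, and column names (`products`, `net_price`, `tax_rate`, `gross_price`) are hypothetical:

```python
# A hypothetical business-rule control script: count rows where the stored
# gross price does not equal its calculation result. All names below are
# illustrative, not our actual schema.
CONTROL_SQL = """
SELECT COUNT(*) AS error_count
FROM `my-project.sales.products`
WHERE ROUND(gross_price, 2) != ROUND(net_price * (1 + tax_rate), 2)
"""
```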

Gathering Data Quality Findings

We do not expect to detect erroneous data automatically. Because the data is "wrong by business rules", it cannot be caught before someone defines the rule it violates, and we do not have a business rules handbook (yet). For now, we expect erroneous data to be detected during the business development phase.

We designed a data quality control input page where users can enter their findings.

There is also a confirmation page that only authorized employees can access. Here, they check the data quality inputs, update them if necessary, and then approve or reject.
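Conceptually, each finding submitted on the input page becomes a record like the sketch below; the exact fields and status values are assumptions, not our real schema:

```python
from dataclasses import dataclass

@dataclass
class DataQualityFinding:
    """One data quality finding from the input page (illustrative fields)."""
    finding_id: int
    table_name: str      # e.g. "sales.products"
    column_name: str     # e.g. "gross_price"
    control_sql: str     # query that returns the count of error records
    frequency_days: int  # how often the control should run
    status: str          # "pending" until approved/rejected on the confirmation page
    active: bool = True  # controls can be passivated and reactivated later
```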

Evaluating the Findings

These data quality findings and their approval status are then gathered in our Google BigQuery data warehouse. Records are filtered by approval status and control frequency; whenever a control is due on the current day, the related tables and columns are checked with its control script. Error record counts are also written back to BigQuery, so we can automatically open a Jira ticket with the data quality result details and the actions to be taken.
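Put together, the evaluation step might look like the sketch below, assuming a hypothetical `dq.controls` table in BigQuery and the `jira` Python client; the server, credentials, project key, and all table and field names are illustrative:

```python
from google.cloud import bigquery
from jira import JIRA

bq = bigquery.Client()
jira = JIRA(server="https://example.atlassian.net", basic_auth=("user", "api_token"))

# Pick the approved, active controls that are due today.
due_controls = bq.query("""
    SELECT finding_id, table_name, column_name, control_sql
    FROM `my-project.dq.controls`
    WHERE status = 'approved' AND active
      AND DATE_ADD(last_run_date, INTERVAL frequency_days DAY) <= CURRENT_DATE()
""").result()

for control in due_controls:
    # Each control script returns a single error_count value.
    error_count = next(iter(bq.query(control.control_sql).result())).error_count

    # Log the result for later analysis and for the passivation check.
    bq.query(f"""
        INSERT `my-project.dq.control_results` (finding_id, run_date, error_count)
        VALUES ({control.finding_id}, CURRENT_DATE(), {error_count})
    """).result()

    if error_count > 0:
        # Open a Jira ticket with the data quality result details.
        jira.create_issue(
            project="DQ",
            summary=f"Data quality errors in {control.table_name}.{control.column_name}",
            description=f"{error_count} error records found by control {control.finding_id}.",
            issuetype={"name": "Task"},
        )
```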

Moreover, a data quality control may find no error records (in fact, we hope so :)). Running the same control over and over with no findings would eventually put unnecessary load on the tables. Thus, we have another algorithm that checks the control result logs: if a control produces no findings for a period of time, it is passivated automatically.
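The passivation check itself can be a single statement over the result logs. A sketch, reusing the same illustrative tables; the 180-day window is an assumption (the period is configurable):

```python
from google.cloud import bigquery

bq = bigquery.Client()

# Passivate controls whose logged runs found no errors within the window.
# Table names and the 180-day window are assumptions for illustration.
bq.query("""
    UPDATE `my-project.dq.controls` c
    SET active = FALSE
    WHERE c.active
      AND NOT EXISTS (
        SELECT 1
        FROM `my-project.dq.control_results` r
        WHERE r.finding_id = c.finding_id
          AND r.error_count > 0
          AND r.run_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 180 DAY)
      )
""").result()
```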

Of course, the error a passivated control watches for may reappear in the related column over time. So, another mechanism re-runs passivated control scripts every 90 days; if erroneous data is found, the related control record is activated again.
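The 90-day re-check could be a similar job that reactivates any control whose error has reappeared; again, all names are illustrative:

```python
from google.cloud import bigquery

bq = bigquery.Client()

# Re-run passivated control scripts (scheduled every 90 days, e.g. as an
# Airflow DAG) and reactivate any control whose error has reappeared.
passive_controls = bq.query("""
    SELECT finding_id, control_sql
    FROM `my-project.dq.controls`
    WHERE status = 'approved' AND NOT active
""").result()

for control in passive_controls:
    error_count = next(iter(bq.query(control.control_sql).result())).error_count
    if error_count > 0:
        bq.query(f"""
            UPDATE `my-project.dq.controls`
            SET active = TRUE
            WHERE finding_id = {control.finding_id}
        """).result()
```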

Analyzing the Findings

Error record logs and the automatically opened Jira tickets give the Business Intelligence team significant analytical power. Many reports have been built to analyze error behavior, opened ticket counts, resolution durations, and so on. These reports help us distinguish errors that occur systematically from those caused by human error.
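For instance, a report on ticket volume and resolution durations could be driven by a query like this sketch (the synced Jira table and its fields are assumptions):

```python
# Monthly opened-ticket counts and average resolution time, assuming Jira
# ticket data has been synced into an illustrative dq.jira_tickets table.
REPORT_SQL = """
SELECT
  DATE_TRUNC(created_date, MONTH) AS month,
  COUNT(*) AS opened_tickets,
  AVG(DATE_DIFF(resolved_date, created_date, DAY)) AS avg_resolve_days
FROM `my-project.dq.jira_tickets`
GROUP BY month
ORDER BY month
"""
```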

Summary

With this architecture, we no longer need to manually re-check error scenarios that have happened before. Artemis checks them periodically and forwards the results to the related development teams by creating Jira issues. The more data quality findings Artemis is fed, the more errors will be corrected before they cause problems in the business. In time, the "business rules handbook" will write itself.

Contributions:

Furkan Yusuf Pek, for building the control input interfaces and automating Jira ticket creation with Python.

Kaan Özbudak, for developing all data transfer processes and orchestrating the project with Airflow.
