Data Quality Highlight System
Data has already become an integral and fundamental part of many projects. Businesses take advantage of data in many ways. For instance:
- Advertising — more accurate and tailored ad targeting
- finance — fraud detection and prevention
- Healthcare — predictive analytics and anomaly detection
- Manufacturing — defect detection, and more.
Regardless of the domain, data is key to precise and effective decision-making, and its quality always impacts the results. Obviously, it is quite challenging (or even impossible) to build strong processes on and get great results with data of low quality. Hence, data quality plays a major role in Project Flow.
In my article, I’m going to explore the pipeline for data quality assurance using one of the projects where I’m involved as a software tester as an example.
A few words about the project: The goal is to capture and process data from mobile devices to improve contextual advertising. The data moves through several ETL nodes and can be modified multiple times — new features (e.g. extract hours from timestamp or city from geolocation) are constantly added to the data.
There are several characteristics of Data Quality:
- Completeness — The records have zero emissions; all the cells required are filled.
- Uniqueness — There aren’t any identical records in the data.
- Credibility — Nobody wants to work with data that can not be trusted (and here’s why it needs to be tested and validated). The cells of tables with quality data should contain: IP address, telephone number, etc.
- Accuracy — Talking about digital data, we rely on the exact number of characters. For example, 12 decimal places.
- Consistency — Data must retain values, regardless of how it is measured.
- Timeliness — The data should be relevant, especially if it’s updated periodically. For example, the amount of data should increase monthly. Data should not be obsolete. If we talk about banking transactions, we’d want the data to be stored for at least six months.
To learn more about the levels of data testing, check out my article.
According to this article, we do cross validation for many data params as non-null fields in table, duplicates, format matching (e.g. date, coordinate etc.).
The high level processes are described below:
We have several ETL nodes. Each ETL has a “quality gate” (QG) with its own test suites for ETL and Data quality. We use fast checks at this stage because we don’t want to freeze the flow. After the check is complete, all results are stored in the “metric system”. We keep ETL and Data bugs in different dashboards. It may be not the way to go in most cases, but it works for us. In a case when QG finds bug in the data, we define severity for this issue (we have some rules for this) and make a data sample. After this, the sample is sent to Engineers to fix the bug.
Usually, there are no bugs in ETL because ETL is tested before the release, and we try to keep each node as high quality as possible. Should any bugs in ETL be found after the release, it will be problematic to locate all the spots with failed or corrupted data.
After the data is fixed, we return the sample to the data lake and look at monitoring.
An ETL node can use Spark to transform the data. We test Spark jobs using unit and integration tests as well.
Bear in mind that it’s not 100% necessary to stop the data flow if a QA finds a bug. By default, we assume that data may contain a few errors. However, Blocker and Critical bugs should be fixed ASAP. In the meantime, it makes sense to monitor the number of non-critical bugs and keep the list of those bugs around for reference.
As a rule, only fast tests are used in ETL, which are tests with execution time from several seconds to several minutes. Unfortunately, unit tests cannot ensure the good-enough coverage for Data Quality.
We use system tests and integration tests as described in this article .
For these tests, we should use as much data as we can. On the other hand, we cannot use all of the data because it will increase test time execution. As a compromise, we use TABLESAMPLE with different probability (usually 20–30%). It allows us to test different data samples.
For several system tests we are use so-called evergreen data. Evergreen data is data where we already know expected result, and because we can control each stage. Evergreen data is not used for performance or data quality. It’s used for testing ETL only. We have to add positive and negative cases to evergreen data. For me, it is the most difficult part. I have to keep in mind that data can be change and added data according to this.
In conclusion, I’d like to point out that I’ve only highlighted some of the ideas about data testing implementation so as you had basic understanding of the testing process. We continue to develop the system and add changes every day. The biggest challenge now is time — we are working to reduce the amount of time required for checks and to increase coverage.
At Provectus we have excellent Data QA expertise. Please visit the data quality page