Bringing Data Quality to Big Data

Dios Kurniawan
Published in Life at Telkomsel, Jan 28, 2021

Image source: Unsplash

On July 23, 1983, an Air Canada Boeing 767 jetliner had to make a hard emergency landing after it ran out of fuel in midair, halfway through its flight from Montreal. The pilots glided the plane without engine power for fifteen minutes before landing at a small airfield. The investigation following the incident found that the ground crew had entered the wrong fuel quantity into the onboard computer before the flight; the unit of measurement was pounds instead of kilograms. As a result, the fuel loaded onto the aircraft was only about half of what was needed for the trip. No one checked the data. Miraculously, no passenger was seriously injured, although the jet was damaged during the hard landing ¹.

Photo: Wayne Glowacki / Winnipeg Free Press

The story above is a glaring example of how skipping a data quality check can have severe consequences. Data is an invaluable asset, but only if it is of good quality. As data becomes increasingly important for businesses trying to keep up with the competition, assuring its quality is paramount. A lack of high-quality data undermines decision making, leading to lost opportunities and, worse, lost customers.

Telkomsel Big Data Platform

At Telkomsel, the big data platform is heavily used to support business operations and thus we take its data quality seriously. We are aware that quality issues would impair the value of the big data platform if not handled properly. Regular reports, business analysis, and data science models depend greatly on accurate data in large quantities.

For that reason, in the IT Business Intelligence and Analytics Group, we have devised our homegrown Data Quality (DQ) Framework. At the heart of this framework is a set of methods and metrics for detecting data quality problems as early as possible, while the data is still within the pipeline. Early detection prevents poor-quality data from reaching business users' hands.

Most data quality issues concern missing records, inconsistent values, incorrect reference data, or simply data not produced at the right time. Some of these issues can be easily detected, but most require a more well-rounded approach.

The sheer amount and complexity of the data in our big data platform prohibit us from performing too many data checks. Testing the entire dataset inside the platform, which holds hundreds of tables in petabytes of storage, would be an unrealistic and unsustainable strategy. Therefore, sampling techniques must be applied.
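The article does not say which sampling technique is used. As one possible illustration, the sketch below uses reservoir sampling, which draws a fixed-size uniform sample from a stream of records without knowing its length in advance. The function name and parameters are hypothetical.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable.

    Assumption: this is a generic technique (Algorithm R), not the
    sampling method of the DQ Framework, which the article does not name.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# Made-up stream of 10,000 record ids; only 5 are kept in memory.
picked = reservoir_sample(range(10_000), k=5, seed=42)
print(picked)
```

Because the reservoir never grows beyond k items, the memory cost stays constant no matter how large the table is.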

Our DQ Framework introduces the notion of a DQ Test Point. This is essentially a data probe in the data pipeline where samples are picked to compute quality metrics. Within each test point, a sampled data point is compared against a set of baselines, and if it falls outside a specified threshold, an alert is raised to notify users of an anomaly.
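A minimal sketch of that comparison, assuming the threshold is expressed as a relative tolerance around a single baseline value. The names `ProbeResult`, `check_test_point`, and `tolerance` are hypothetical and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    metric: str
    observed: float
    baseline: float
    deviation: float  # relative deviation from the baseline
    alert: bool


def check_test_point(metric, observed, baseline, tolerance):
    """Compare one sampled value with its baseline.

    `tolerance` is the allowed relative deviation (0.05 means 5%).
    """
    if baseline == 0:
        # Relative deviation is undefined here; treat any non-zero
        # observation as an infinite deviation.
        deviation = 0.0 if observed == 0 else float("inf")
    else:
        deviation = abs(observed - baseline) / abs(baseline)
    return ProbeResult(metric, observed, baseline, deviation,
                       alert=deviation > tolerance)


# Made-up example: 940,000 records observed where 1,000,000 were expected.
result = check_test_point("daily_record_count", observed=940_000,
                          baseline=1_000_000, tolerance=0.05)
print(result.alert)  # True, since a 6% deviation exceeds the 5% tolerance
```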

The DQ Framework specifies four major quality metrics inside the test points:

1. Completeness: measurement to validate if the data from the source is collected without loss. For example, the number of records is compared against the baseline extracted from the source. This metric is the first to be tested because completeness is a requirement that precedes all others.

2. Accuracy: measurement to indicate if the content reflects the correct value. Quality tests range from simple rules such as detecting malformed data format to more complex rules such as comparing KPIs with expected trends, validating data distribution stability, and outlier detection. Anything that fails the test would raise an alert.

3. Consistency: validation to check whether each KPI or individual data element is consistent across multiple datasets. It ensures there are no conflicting numbers. For example, a test checks whether weekly aggregates tally with the sum of daily aggregates.

4. Timeliness: measurement to see if the data is available at the right time and, if it is not, to measure how severe the delay is. For example, if a data element crucial to producing a daily sales report is not available at 8 AM as scheduled, an alert is triggered and the gap is measured.
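The four metrics above can each be reduced to a small check. The sketch below shows one simple rule per metric in plain Python; the function names, the phone-number pattern, and all sample values are assumptions made for illustration, not rules taken from the DQ Framework.

```python
import math
import re
from datetime import datetime, timedelta


def completeness_ratio(loaded_rows, source_rows):
    """Completeness: share of source records that arrived (1.0 = no loss)."""
    if source_rows == 0:
        return 1.0 if loaded_rows == 0 else 0.0
    return loaded_rows / source_rows


# Hypothetical format rule: "62" followed by 8 to 12 digits.
NUMBER_PATTERN = re.compile(r"62\d{8,12}")


def malformed_share(values, pattern=NUMBER_PATTERN):
    """Accuracy (simple rule): share of strings that break the format."""
    values = list(values)
    if not values:
        return 0.0
    bad = sum(1 for v in values if not pattern.fullmatch(v))
    return bad / len(values)


def weekly_matches_daily(weekly_total, daily_totals, rel_tol=1e-6):
    """Consistency: does the weekly aggregate tally with the daily sum?"""
    return math.isclose(weekly_total, sum(daily_totals), rel_tol=rel_tol)


def delay_after_deadline(available_at, deadline):
    """Timeliness: how late the data is; zero when it arrived on time."""
    return max(available_at - deadline, timedelta(0))


# Made-up values for illustration.
print(completeness_ratio(998_500, 1_000_000))
print(malformed_share(["628123456789", "0812-345"]))
print(weekly_matches_daily(700.0, [100.0] * 7))
print(delay_after_deadline(datetime(2021, 1, 28, 9, 30),
                           datetime(2021, 1, 28, 8, 0)))
```

Each function returns a measurement rather than a verdict, so the alerting threshold can be applied separately at the test point.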

The DQ Test Points are part of the larger Data Quality Management (DQM) system currently being built. There are currently fourteen DQ Test Points in our big data platform, spanning the ingestion process, the transformation process, and the data distribution channels. These tests are run by DQM at regular intervals, hourly or daily, depending on the need.

Figure 1: DQ Test Points in Telkomsel Big Data platform

Employing Data Science

How are those baselines produced? They are created using a forecast model on top of historical data, taking into account daily, weekly, and annual seasonality such as weekends, holidays, and high/low seasons. Time-series forecasting is mainly used here. A model is built for each KPI and for each individual data source. For example, one is created for the revenue KPI, another for the number of voice calls, another for internet traffic, and so forth.
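The article does not name the forecasting model. As a deliberately simple stand-in, the sketch below builds a baseline that captures only weekly seasonality by averaging historical values per weekday. The function names and the synthetic history are assumptions; a production model would also handle holidays and annual seasons.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def weekday_baseline(history):
    """Average KPI value per weekday (0 = Monday) over the history.

    `history` maps a date to the KPI value observed on that date.
    """
    buckets = defaultdict(list)
    for day, value in history.items():
        buckets[day.weekday()].append(value)
    return {weekday: mean(values) for weekday, values in buckets.items()}


def expected_value(baseline, day):
    """Baseline value for a given date, looked up by its weekday."""
    return baseline[day.weekday()]


# Synthetic history: four weeks where weekends run higher than weekdays.
start = date(2021, 1, 4)  # a Monday
history = {}
for offset in range(28):
    day = start + timedelta(days=offset)
    history[day] = 150.0 if day.weekday() >= 5 else 100.0

baseline = weekday_baseline(history)
print(expected_value(baseline, date(2021, 2, 6)))  # a Saturday
```

One such baseline would be kept per KPI and per data source, matching the one-model-per-KPI arrangement described above.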

Setting upper and lower thresholds correctly is also crucial: thresholds that are too narrow produce many false alarms, while thresholds that are too wide leave many issues undetected. Dynamic thresholds are added to the mix to cope with this. Machine learning techniques are used to learn each KPI's normal behavior, incorporating human feedback as well. As more data is analyzed over time, these thresholds gradually become more accurate. This helps to minimize false positives and allows an anomaly to be detected earlier than with simple static thresholds. The example in Figure 2 illustrates how dynamic thresholding triggered an alert many hours earlier than a static threshold would have.

Figure 2: illustration of dynamic thresholding for triggering alerts
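To make the contrast with static thresholds concrete, here is a minimal sketch in which the dynamic band is the mean of a trailing window plus or minus k standard deviations. This is a simple statistical stand-in, not the machine learning approach the framework uses, and the function names, window size, and traffic figures are all made up.

```python
from statistics import mean, stdev


def dynamic_band(recent, k=3.0):
    """Lower and upper thresholds from a window: mean +/- k * stdev.

    The window needs at least two points for stdev to be defined.
    """
    center = mean(recent)
    spread = stdev(recent)
    return center - k * spread, center + k * spread


def first_dynamic_alert(series, window, k=3.0):
    """Index of the first point outside the band of its preceding window."""
    for i in range(window, len(series)):
        low, high = dynamic_band(series[i - window:i], k)
        if not low <= series[i] <= high:
            return i
    return None


def first_static_alert(series, low, high):
    """Index of the first point outside fixed thresholds."""
    for i, value in enumerate(series):
        if not low <= value <= high:
            return i
    return None


# Made-up hourly KPI: stable around 100, then a gradual drop.
hourly_traffic = [100, 101, 99, 100, 102, 98, 100, 101, 90, 80, 70, 60]

print(first_dynamic_alert(hourly_traffic, window=8))    # 8
print(first_static_alert(hourly_traffic, 65, 135))      # 11
```

In this toy series the dynamic band flags the drop at hour 8, while the wide static band only reacts at hour 11, once the value has already fallen to 60.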

An outlier detection algorithm is also employed on select data artifacts to make sure we do not miss significant deviations or hidden trends that warrant further human attention. This check is carried out in DQ Test Points just before the data leaves the pipeline.
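The specific algorithm is not named in the article. One common and robust choice is the modified z-score based on the median absolute deviation (MAD), sketched below as an assumption. The function name and the sample values are hypothetical; 0.6745 and the 3.5 cutoff are the conventional constants for this method.

```python
from statistics import median


def mad_outliers(values, cutoff=3.5):
    """Indices of values whose modified z-score exceeds `cutoff`.

    Modified z-score = 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values equal the median; flag anything else.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > cutoff]


# Made-up KPI samples with one obvious spike at index 6.
samples = [10, 11, 9, 10, 12, 10, 48, 11]
print(mad_outliers(samples))  # [6]
```

Unlike a mean-and-standard-deviation rule, the median-based score is not inflated by the very spike it is trying to detect.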

Doing It at Scale

Telkomsel operates a big data platform that is arguably one of the largest on-premises Hadoop installations in Southeast Asia. Hundreds of terabytes of data are processed on this platform each day. Such a massive volume of data is an enormous challenge, making a high-performance architecture an obvious necessity.

Apache Spark is mainly used in our DQ Framework, taking advantage of its large-scale in-memory computing capabilities. We developed PySpark programs to apply the DQ Test Points. In addition, we are now setting up Apache Griffin, which itself runs on Spark, to calculate even more quality metrics.

Besides producing alerts, DQM also provides long-term quality monitoring. The results of the DQ Test Points are stored in a time-series database and displayed on a DQ Scorecard, which shows the "health" of each data domain across different periods of measurement (daily, weekly, or monthly).

Figure 3: an example of DQ Scorecard
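How the scorecard computes "health" is not described, so the sketch below assumes the simplest possible score: the share of test runs that passed, grouped by data domain and period. The function name, the record layout, and the sample results are hypothetical.

```python
from collections import defaultdict
from datetime import date


def pass_rates(results, period_of):
    """Share of passed tests per (domain, period).

    `results` is an iterable of (day, domain, passed) tuples and
    `period_of` maps a date to the period key it belongs to.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for day, domain, passed in results:
        key = (domain, period_of(day))
        totals[key][0] += int(passed)
        totals[key][1] += 1
    return {key: ok / total for key, (ok, total) in totals.items()}


# Made-up test results for two hypothetical data domains.
results = [
    (date(2021, 1, 4), "revenue", True),
    (date(2021, 1, 5), "revenue", False),
    (date(2021, 1, 5), "traffic", True),
    (date(2021, 1, 12), "revenue", True),
]

# Weekly view keyed by (ISO year, ISO week); monthly by (year, month).
weekly = pass_rates(results, lambda d: tuple(d.isocalendar())[:2])
monthly = pass_rates(results, lambda d: (d.year, d.month))
print(weekly)
print(monthly)
```

Passing the period as a function keeps the same aggregation reusable for the daily, weekly, and monthly views.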

We Do Not Stop Here

A constant supply of high-quality data is essential in our data-driven culture at Telkomsel. Neglecting data quality would negate the purpose of having a big data platform in the first place, because the business would obtain no value from the information asset.

However, data quality issues still occur frequently: hardware malfunctions, software crashes, network interruptions, people forgetting to update reference data, and 101 other things that can go wrong and disrupt the normal data flow. More often than not, we come across new quality problems that had not been identified before. We have to regularly update the breadth and depth of the DQ Test Points.

The business also persistently demands new data. New data sources, big or small, are added to the big data platform almost all the time. To meet changing business needs, we keep refining our data quality practices with new approaches, techniques, and tools. We are working at full throttle, as there is still a lot of work to be done. Data quality management is not a project; it is a journey.

¹) National Geographic TV Documentary “Mayday”, episode “Gimli Glider”, 2008.
