Metadata Driven Data Quality Engine

Sarath Chandran
Litmus7 Systems Consulting
7 min read · Jun 16, 2023

Organizations have evolved substantially, either becoming data driven or basing their entire business models on the capture and utilization of data. Yet, although data is crucial to the business of most companies, there are many issues with the quality of the data available for insights and decision making. According to Forbes Insights and KPMG, 84% of CEOs are concerned about the quality of the data they're basing their decisions on.

Importance of Data Quality

For businesses, bad data can have serious financial repercussions. Substandard data quality is frequently blamed for operational blunders, incorrect analytics, and poorly conceived company initiatives. Added costs when goods are delivered to the wrong customer addresses, missed sales opportunities caused by inaccurate or incomplete customer records, and fines for improper financial or regulatory compliance reporting are a few examples of the economic harm that poor data quality can bring about. Corporate leaders and business managers frequently cite a lack of trust in the data as a barrier to using business intelligence (BI) and analytics to improve decision making. The common observation that data scientists spend the bulk of their time cleaning data is itself a testament to how widespread data quality issues are.

According to a survey by Gartner, “Organizations believe poor data quality to be responsible for an average of $15 Million per year in losses.” Gartner also found that nearly 60% of those surveyed didn’t know how much bad data costs their businesses because they don’t measure it in the first place.

Why does the quality of data degrade?

The data being collected can be inaccurate in a variety of ways. Even when machines are used to collect data, we cannot anticipate 100% accuracy, and manual entry by people introduces further inaccuracies. Sometimes, because of the manner in which the data was originally collected and gathered, it already contains a number of inaccuracies by the time it is handed over for entry. This leads to incomplete or inaccurate information being captured.

There is also the situation of data that is unclear or that conflicts with other information gathered, as well as the problem of duplicate data, which, as the name implies, occurs when a record in a database shares data with another record. Using such data for analytics results in erroneous reporting, lost productivity, and unnecessary expenditure. All of this makes integration even more challenging when working with massive amounts of data (big data), and it also makes it difficult to assess the data's quality. Finally, no precise standards have been established for data quality, which is yet another factor making it challenging to guarantee.

Existing DQ Solutions

Today, there are various data engineering tools that can be used to evaluate the quality of data, with new versions and cutting-edge features released every year. Tools are also available with functionality for data matching, data privacy, and cataloging, and they make data integration straightforward. The key drawback of these solutions is that proper user training is necessary to ensure that the business requirements are implemented correctly; they will probably not function effectively if the user is not confident about the specifics of the requirement. These technologies also tend to require extensive programming and sometimes knowledge of specific programming languages. This is made worse by the fact that data quality evaluation tools are typically used by data analysts, almost at the far end of the data lifecycle, whereas ideally one should aspire to catch issues at the source or as the data enters the processing systems through data pipelines. Additionally, the outputs these tools produce after a data quality review are difficult to comprehend or interpret.

Metadata Driven Data Quality Engine

Figure 1: Data Quality Framework

The metadata driven data quality engine is a pluggable library that verifies the quality of data once plugged into an ETL pipeline. Incoming data is validated against the schema as well as other configurable checks. Users upload a sample extract of the file they would like to validate into the portal. The schema (data type of each column, date patterns, etc.) is inferred automatically and displayed to the user, who can edit the values as needed. This schema is received as a JSON file from the simple UI created for collecting the different checks to be performed on the data, and the JSON file is saved as a document keyed by a Unique File Identifier. The checks include, but are not limited to, data type checks, nullability, and specific patterns for particular columns.

There is also an option to configure a threshold for the proportion of valid records after the assessment. If the data contains more valid records than the configured threshold, the file is marked as DQ Certified and moved to a success folder. If it does not contain sufficient valid records, the file is marked as DQ Rejected and the ETL process is stopped; the failed records are saved to a failure folder. Detailed logs also show which checks each column failed. Valid records from a certified file are then persisted to the cleansed layer for further analytics.
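To make this concrete, the metadata document stored against the Unique File Identifier might look roughly like the Python sketch below. The field names (file_id, threshold_pct, nullable, pattern and so on) are assumptions made for illustration, not the engine's actual schema.

# Hypothetical metadata document for one file, keyed by its Unique File Identifier.
# All field names below are illustrative assumptions, not the engine's real schema.
metadata = {
    "file_id": "orders_2023_06",
    "threshold_pct": 95,  # minimum share of valid records needed for DQ certification
    "columns": {
        "order_id":   {"type": "int",    "nullable": False},
        "order_date": {"type": "date",   "nullable": False, "pattern": "%Y-%m-%d"},
        "amount":     {"type": "float",  "nullable": True},
        "country":    {"type": "string", "nullable": False, "pattern": r"^[A-Z]{2}$"},
    },
}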

Figure 2: UI Wireframe
Figure 3: Editable Metadata Wireframe

The engine reads the metadata by passing the unique metadata id, then reads the data and applies the quality checks to each record. A validation result is generated for every record, and valid records are filtered from invalid ones. Whenever a record is invalid, the audit information is stored in the format shown in Fig. 4.
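A minimal pandas sketch of this per-record validation is shown below, reusing a trimmed version of the hypothetical metadata above. The column names, check names, and audit entry layout are illustrative assumptions, not the engine's internals.

import pandas as pd
# Trimmed hypothetical metadata; see the earlier sketch for the full illustrative layout.
metadata = {
    "threshold_pct": 95,
    "columns": {
        "order_id": {"type": "int", "nullable": False},
        "country":  {"type": "string", "nullable": False, "pattern": r"^[A-Z]{2}$"},
    },
}
df = pd.DataFrame({
    "order_id": [1, 2, None, 4],
    "country":  ["US", "IN", "GB", "usa"],
})
invalid = pd.Series(False, index=df.index)
audit = []
for col, rules in metadata["columns"].items():
    if not rules.get("nullable", True):  # nullability check
        bad = df[col].isna()
        invalid |= bad
        audit += [{"column": col, "check": "nullability", "row": int(i)} for i in df.index[bad]]
    if rules.get("pattern"):  # pattern check on non-null values
        bad = df[col].notna() & ~df[col].astype(str).str.fullmatch(rules["pattern"])
        invalid |= bad
        audit += [{"column": col, "check": "pattern", "row": int(i)} for i in df.index[bad]]
valid_df, failed_df = df[~invalid], df[invalid]
pct_valid = 100 * len(valid_df) / len(df)
dq_certified = pct_valid >= metadata["threshold_pct"]  # drives success vs. failure folder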

Figure 4: Audit log structure

The audit file is loaded into a Postgres table, on top of which analysis queries are run to retrieve useful information such as success and failure counts per file, the most frequently failed column, the most frequently violated constraint, and so on. These derived insights are also persisted in Postgres tables that can be accessed via the UI.
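As an example, once the audit entries are in Postgres, an insight such as the most frequently failed column could be derived with a query along the following lines. The connection string, the table name dq_audit_log, and the column names are assumptions for illustration, not the actual schema.

import pandas as pd
from sqlalchemy import create_engine
# Hypothetical connection details and table/column names, for illustration only.
engine = create_engine("postgresql+psycopg2://dq_user:dq_pass@localhost:5432/dq_audit")
most_failed = pd.read_sql(
    """
    SELECT column_name, constraint_name, COUNT(*) AS failure_count
    FROM dq_audit_log
    GROUP BY column_name, constraint_name
    ORDER BY failure_count DESC
    LIMIT 10
    """,
    engine,
)
print(most_failed)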

The DQ engine supports both Python pandas and Spark dataframes. When the Data Quality framework function is invoked, the user can specify the engine to be used, either Python or Spark, and the corresponding execution framework is leveraged.
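Conceptually, invoking the framework might look like the sketch below, where the caller chooses the execution engine. The function name run_dq_checks and its parameters are hypothetical, not the library's published API.

def run_dq_checks(input_path, metadata_id, engine="pandas"):
    # Hypothetical entry point: reads the file with the chosen framework; the same
    # metadata-driven checks (looked up by metadata_id) would then be applied.
    if engine == "spark":
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("dq-engine").getOrCreate()
        return spark.read.option("header", True).csv(input_path)
    import pandas as pd
    return pd.read_csv(input_path)

# Example call, letting the user pick the execution framework:
# result = run_dq_checks("orders.csv", metadata_id="orders_2023_06", engine="spark")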

Wrap Up

One reason to use this pluggable library is that anyone can evaluate data quality through configurable checks. The DQ engine also has plenty of potential for the future. A richer user interface could be provided, enabling users to configure additional checks, and the DB integration could likewise be documented through it. The library can be enhanced with new functions and made available to users as a service. The fields that need to be ingested into the final destination could further be specified through the UI, and the service could include an ingestion facility so that data can be evaluated after it has been ingested.

In the modern world, where businesses rely on data for insights, data quality is a key component. From initial data collection to ingestion and storage, there are many ways in which data can become inaccurate; there is no escaping this. The best way to address the problem is to use tools that determine data quality and process the data in line with that quality. However, the tools and techniques currently available for evaluating data quality are too complicated for business users. Our Data Quality Engine is intended to be a straightforward pluggable library that makes it easy and quick to implement specific checks for evaluating the quality of the data. Depending on whether the proportion of valid records exceeds the user-specified threshold, the evaluation is marked as a success or a failure. The current DQ solution has a working Python engine and Spark engine for data quality assessment and report generation.

References

[1] https://www.forbes.com/sites/forbespr/2017/05/31/poor-quality-data-imposes-costs-and-risks-on-businesses-says-new-forbes-insights-report/?sh=7c906a94452b

[2] https://www.forbes.com/sites/forbestechcouncil/2021/10/14/flying-blind-how-bad-data-undermines-business/?sh=2eae9a9e29e8

[3] https://www.collibra.com/us/en/blog/what-is-data-quality

[4] https://www.confessionsofadataguy.com/data-quality-great-expectations-for-data-engineers/

[5] https://www.collibra.com/us/en/blog/the-7-most-common-data-quality-issues

[6] https://www.precisely.com/blog/data-quality/data-quality-problems-errors

[7] https://www.techtarget.com/searchdatamanagement/tip/Top-data-quality-management-tools

[8] Elouataoui, W.; El Alaoui, I.; El Mendili, S.; Gahi, Y. An Advanced Big Data Quality Framework Based on Weighted Metrics. Big Data Cogn. Comput. 2022, 6, 153. https://doi.org/10.3390/bdcc6040153

[9] Cai, L. and Zhu, Y., 2015. The Challenges of Data Quality and Data Quality Assessment in the Big Data Era. Data Science Journal, 14, p.2. DOI: http://doi.org/10.5334/dsj-2015-002
