Data quality, the secret of good analytics

Alex Souza
Published in blog do zouza
Jan 21, 2022 · 5 min read

Data Quality refers to how good the data stored in your organization is: the more complete, accurate, and consistent the data, the higher its quality.

Data quality assurance directly impacts the organization’s business: it lets the organization take advantage of all its existing data to obtain new insights, and it supports decision-making and the other benefits that tools such as Business Intelligence and Machine Learning bring to the business.

How to implement and maintain Data Quality?

There are several ways to adopt a Data Quality policy. One of them is using the Data Quality Lifecycle, which we will detail a little more in this article.

The Data Quality Lifecycle

The Data Quality Lifecycle is the sequence of processes that a data quality project goes through from inception to closure. Its steps are described below.

Data Discovery: Refers to requirements gathering, source application identification, data collection, and the organization and classification of data for the data quality report.

Data Profiling: Refers to initial examination, sample data quality check, rule suggestion, and final data quality rule approval.

Data Rules: Refers to the execution of the final business rules to examine the accuracy of the data and its fitness for purpose.

Data Distribution & Remediation: Refers to the process of distributing data quality reports to responsible parties and initiating the remediation process.

Data Monitoring: Refers to the ongoing monitoring of the remediation process and the creation of data quality dashboards and scorecards.

The PyDeequ Tool

There are tools that can assist in this process; here we will look at Deequ. Deequ is a library built on top of Apache Spark to define “unit tests for data”, which measure the quality of data in large datasets.

Dataset producers and/or data stewards can add and edit data quality constraints. The system calculates quality metrics regularly (with each new version of a dataset), checks the constraints set by the dataset producers, and publishes the dataset to consumers on success. In the event of an error, publication of the dataset can be stopped, and producers and/or data stewards are notified to act. Data quality issues therefore do not propagate to consumer data pipelines, reducing their blast radius.
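To make the idea of a “unit test for data” concrete, here is a minimal sketch assuming a local Spark session and a toy DataFrame; the column names, sample values, and the publication-gate logic are assumptions made for illustration, not part of Deequ itself.

```python
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult
from pyspark.sql import SparkSession

# Spark session with the Deequ jar on the classpath, as recommended by PyDeequ
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Toy stand-in for a new version of a dataset about to be published
df = spark.createDataFrame(
    [("Maria Silva", 15000.0), ("John Doe", None), ("Acme Corp", 300.0)],
    "customer_name string, outstanding_balance double",
)

# Constraints defined by the dataset producer / data steward
check = (Check(spark, CheckLevel.Error, "publication gate")
         .isComplete("customer_name")
         .isComplete("outstanding_balance")
         .isNonNegative("outstanding_balance"))

result = VerificationSuite(spark).onData(df).addCheck(check).run()
report = VerificationResult.checkResultsAsDataFrame(spark, result)
report.show(truncate=False)

# If any constraint failed, a pipeline could stop publication and notify the stewards
if report.filter(report.constraint_status != "Success").count() > 0:
    raise RuntimeError("Data quality checks failed; dataset not published.")
```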

Deequ is also used in Amazon SageMaker Model Monitor. Now, with the availability of PyDeequ, you can use it in a wider set of environments — Amazon SageMaker notebooks, AWS Glue, Amazon EMR, and more.

Let’s look at the main components of PyDeequ and how they relate to Deequ (a short sketch illustrating them follows the list):

– Metric calculation — Deequ calculates data quality metrics, that is, statistics such as completeness, maximum, or correlation. Deequ uses Spark to read sources like Amazon Simple Storage Service (Amazon S3) and calculate metrics through an optimized set of aggregation queries. You have direct access to the raw metrics calculated on the data.

– Constraint checking — As a user, you focus on defining a set of data quality constraints to check. Deequ takes care of deriving the necessary set of metrics to be calculated on the data. Deequ generates a data quality report that contains the constraint check result.

– Constraint suggestion — You can choose to define your own custom data quality constraints or use automated constraint suggestion methods that profile the data to infer useful constraints.

– Python wrappers — You can call each Deequ function using Python syntax. The wrappers translate the commands into the underlying Deequ calls and return their responses.
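The sketch below illustrates the metric-calculation and constraint-suggestion components on another toy DataFrame; only the PyDeequ calls come from the library, while the data and column names are assumptions.

```python
import json

import pydeequ
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness, Maximum, Size
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.createDataFrame(
    [("individual", 1200.0), ("commercial", 5300.0), ("individual", None)],
    "customer_type string, outstanding_balance double",
)

# Metric calculation: raw statistics computed by Spark aggregation queries
metrics = (AnalysisRunner(spark)
           .onData(df)
           .addAnalyzer(Size())
           .addAnalyzer(Completeness("outstanding_balance"))
           .addAnalyzer(Maximum("outstanding_balance"))
           .run())
AnalyzerContext.successMetricsAsDataFrame(spark, metrics).show(truncate=False)

# Constraint suggestion: profile the data and propose candidate constraints
suggestions = (ConstraintSuggestionRunner(spark)
               .onData(df)
               .addConstraintRule(DEFAULT())
               .run())
print(json.dumps(suggestions, indent=2))
```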

Example following the Data Quality Lifecycle

Let’s consider a multinational company that offers loans and is in the process of implementing a lakehouse whose main objectives are the centralization, quality, and analysis of data. We will focus on data quality, using the Data Quality Lifecycle.

Data Discovery — We’ll start with the data discovery phase.

  • One of the most important data sources, and the first to be ingested into the lakehouse, is the Global Customer Loan entity from the company’s ERP database (a hypothetical sample of it is sketched after this list). It contains:
  • Customer’s full name;
  • Customer type, with two possible values: individual or commercial;
  • Last four digits of the customer’s social security number;
  • Customer’s outstanding loan balance;
  • Customer loan interest income;
  • Income from customer loan fees;
  • Customer loan guarantee, by property type;
  • Customer’s country of residence.
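Purely for illustration, the sketches that follow assume the Global Customer Loan entity lands in the lakehouse roughly as the hypothetical Spark DataFrame below; the column names and sample rows are made up, and the third row deliberately contains problems for the later phases to catch.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical shape of the Global Customer Loan entity as ingested from the ERP
global_customer_loan = spark.createDataFrame(
    [
        ("Maria Silva", "individual", "1234", 15000.0, 820.5, 45.0, "residential", "BR"),
        ("Acme Corp", "commercial", "9876", 98000.0, 5400.0, 310.0, "commercial", "US"),
        # Deliberately problematic row: missing customer type, malformed SSN digits,
        # negative outstanding balance
        ("John Doe", None, "12AB", -100.0, 0.0, 0.0, "land", "US"),
    ],
    "full_name string, customer_type string, ssn_last4 string, "
    "outstanding_balance double, interest_income double, fee_income double, "
    "guarantee_property_type string, country_of_residence string",
)
global_customer_loan.show(truncate=False)
```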

Data Profiling — Once the data sources have been identified, the data steward conducts data profiling, which includes an initial examination of the data, a sample data quality check, rule suggestion, and approval of the final data quality rules. Example:

  • The data steward will select an initial set of data quality metrics to run on all new input files. In this example, we’ll take a subset of the Global Customer Loan entity and profile the data using the following set of metrics (a PyDeequ sketch follows this list):
  • Completeness of the data (i.e., are there fields with missing data?);
  • Distinct count of customer types;
  • Distinct count of countries of residence;
  • Distinct count of loan guarantees by property type;
  • Data type of the last four digits of the SSN;
  • Data types of the outstanding balance, interest income, and fee income;
  • Tools like PyDeequ have modules for suggesting data quality validations, such as checks on data types and acceptable values.
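A possible PyDeequ sketch of this profiling step follows, using the hypothetical columns assumed earlier; each analyzer corresponds to one of the metrics listed above.

```python
import pydeequ
from pydeequ.analyzers import (AnalysisRunner, AnalyzerContext, Completeness,
                               CountDistinct, DataType)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Small stand-in for the Global Customer Loan entity (hypothetical columns)
df = spark.createDataFrame(
    [("Maria Silva", "individual", "1234", 15000.0, "residential", "BR"),
     ("Acme Corp", "commercial", "9876", 98000.0, "commercial", "US"),
     ("John Doe", None, "12AB", -100.0, "land", "US")],
    "full_name string, customer_type string, ssn_last4 string, "
    "outstanding_balance double, guarantee_property_type string, country_of_residence string",
)

profiling = (AnalysisRunner(spark)
             .onData(df)
             .addAnalyzer(Completeness("customer_type"))              # missing data?
             .addAnalyzer(CountDistinct(["customer_type"]))           # distinct customer types
             .addAnalyzer(CountDistinct(["country_of_residence"]))    # distinct countries
             .addAnalyzer(CountDistinct(["guarantee_property_type"])) # distinct property types
             .addAnalyzer(DataType("ssn_last4"))                      # data type of SSN digits
             .addAnalyzer(DataType("outstanding_balance"))            # data type of the balance
             .run())

AnalyzerContext.successMetricsAsDataFrame(spark, profiling).show(truncate=False)
```

PyDeequ’s ColumnProfilerRunner and ConstraintSuggestionRunner can also profile every column and propose candidate rules automatically, which is what the last item in the list above refers to.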

Data Rules — Once the data quality profiling rules are finalized (in the previous phase), they are fed into the verification phase, where the actual data quality checks are run against the input files to examine the accuracy of the data and whether it is fit for purpose. Data quality exception reports must be generated.
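As a sketch of what this verification phase could look like with PyDeequ, here is an illustrative rule set for the hypothetical Global Customer Loan columns; the specific rules are assumptions, not the company’s actual business rules.

```python
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.createDataFrame(
    [("Maria Silva", "individual", "1234", 15000.0),
     ("Acme Corp", "commercial", "9876", 98000.0),
     ("John Doe", None, "12AB", -100.0)],
    "full_name string, customer_type string, ssn_last4 string, outstanding_balance double",
)

# Final data quality rules approved in the profiling phase (illustrative)
check = (Check(spark, CheckLevel.Error, "global customer loan rules")
         .isComplete("full_name")
         .isComplete("customer_type")
         .isContainedIn("customer_type", ["individual", "commercial"])
         .isNonNegative("outstanding_balance"))
# Further constraints (e.g., on the format of ssn_last4) could be added here.

result = VerificationSuite(spark).onData(df).addCheck(check).run()

# Data quality exception report: one row per constraint, with status and message
report = VerificationResult.checkResultsAsDataFrame(spark, result)
report.filter(report.constraint_status != "Success").show(truncate=False)
```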

Data Distribution & Remediation — Based on the reports generated in the previous phase, the data steward distributes the report to the responsible people or teams, who then make the adjustments (remediation). This could mean, for example, creating a rule in the ERP system for the Global Customer Loan entity that restricts Customer Type to its allowed values (i.e., not letting users type this information in manually). Another example would be adding input masks for the social security number.

Data Monitoring — This process takes care of the continuous monitoring of the remediation process (for example, tracking which adjustments were requested from the ERP vendor), as well as creating data quality dashboards for the most diverse needs. Here you can use data visualization tools such as Amazon QuickSight or Microsoft Power BI, for example.
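For the monitoring side, PyDeequ can persist each run’s metrics to a repository, building up a history that such dashboards can read. Below is a minimal sketch, assuming a local JSON file as the repository and the same hypothetical entity.

```python
import pydeequ
from pydeequ.analyzers import AnalysisRunner, Completeness
from pydeequ.repository import FileSystemMetricsRepository, ResultKey
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.createDataFrame(
    [("Maria Silva", "individual"), ("John Doe", None)],
    "full_name string, customer_type string",
)

# Each run appends its metrics to a JSON file keyed by timestamp and tags
metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, "loan_dq_metrics.json")
repository = FileSystemMetricsRepository(spark, metrics_file)
key = ResultKey(spark, ResultKey.current_milli_time(), {"dataset": "global_customer_loan"})

(AnalysisRunner(spark)
 .onData(df)
 .addAnalyzer(Completeness("customer_type"))
 .useRepository(repository)
 .saveOrAppendResult(key)
 .run())

# Load the accumulated history, e.g. to feed a QuickSight or Power BI dashboard
(repository.load()
 .before(ResultKey.current_milli_time())
 .getSuccessMetricsAsDataFrame()
 .show(truncate=False))
```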

I hope this article has given you an overview of data quality and what a data quality process can look like. It reinforces the idea that quality data generates quality analyses and insights, which really add value to the organization’s business.

References

Version in Portuguese (Compass.uol Blog)

How to Architect Data Quality on the AWS Cloud

Testing data quality at scale with PyDeequ Spark for Glue development

The importance of Data Quality in Companies

Great Expectations
Pandas Profiling
