The Data Analytics Platform’s Data Quality Framework

How do we approach data quality?

Aorszulska
inganalytics.com/inganalytics

--

ING WBAA’s Data Analytics Platform (DAP) has been designed as a self-service platform for anyone who wants to undertake analytics initiatives. DAP facilitates multiple use cases: Analytics and Business Intelligence, Experimentation, and Data Science Model Training and Deployment. The platform is fed with real production data only, meaning data that is produced by ING systems.

In DAP we hold a large amount of data from a variety of sources; most of it is raw data that is consumed by analytics projects.

The main reason and benefit of providing pure, unmodified, raw data is that data users prefer doing the cleansing, normalization, enrichment, and overall data preparation work on their own. The scope and approach for these operations vary between different projects. It is particularly important when developing machine learning (ML) models, especially if they use unsupervised learning algorithms like cluster analysis.

An additional benefit of supplying the complete dataset without any transformations is that it reduces the time users have to wait before they can access the data they need for their work.

Another option available in ING is to capture data from high-quality repositories which contain cleansed and unified data. However, these repositories usually contain only subsets of the original datasets, as they are used primarily for reporting purposes. DAP also offers this option to our users, but only when such datasets can be leveraged for their analytics and business objectives.

Raw data might mean low-quality, polluted information. As data quality might mean different things for different users, we do not fix quality issues when loading data on DAP. It is provided to users as-is. The way we support our users in their projects is by raising awareness about the data we have. We achieve this by providing data quality reports which can be analyzed before using the data. Thanks to these reports, users are able to learn about the condition and quality of the data in DAP.

The process of generating these reports will be extended in the near future with an automatic mechanism that detects when quality drops below a specified threshold and notifies the Data Stewards to take remediation actions on defective datasets provided to DAP. Once a data quality issue is corrected at the source side, it will be prevented from occurring again on DAP as well as in other places relying on that source. It is very important to have data with exactly the same characteristics in the environments where analytics models are trained and executed. Otherwise, the differences affect the outcome of deployed ML models, since their behavior will differ from what is expected.

Sometimes data quality issues can’t be discovered at the source side and you need to run cross-check tests against several data sources at once to spot an issue. The more data we have on DAP, the more tests of this nature we can run for our users.
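As a simple illustration of such a cross-check, here is a minimal sketch; the datasets, columns, and tolerance are hypothetical:

```python
import pandas as pd

# Hypothetical example: two sources that should agree on daily totals.
payments = pd.read_parquet("payments.parquet")
ledger = pd.read_parquet("general_ledger.parquet")

daily_payments = payments.groupby("booking_date")["amount"].sum()
daily_ledger = ledger.groupby("booking_date")["amount"].sum()

# Flag the dates on which the two sources disagree by more than a small tolerance.
diff = (daily_payments - daily_ledger).abs()
mismatches = diff[diff > 0.01]
print(mismatches)
```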

Ideally, any issues with the data would mean it is immediately rejected and corrected data is re-delivered by the data provider. However, we are realistic and understand this is not always possible: some applications are inaccessible for updates (especially legacy ones), or the waiting time is excessive. Sometimes we just need to accept a lower degree of accuracy in our data.

In DAP, the processes of checking the quality and calculating the profiling statistics of datasets are automated and based on the frequency of new data arrivals; in most cases, this is done daily.

After checking the reports, users can assess whether the level of data quality is acceptable to them and, if so, proceed with further analysis using their selected algorithm. The data discovery tool available on DAP provides information about the data owners and the frequent users of each dataset, so when in doubt, a user can reach out to them for further information about the data at hand.

The following reports are prepared on a regular basis:

Data Profiling

Data Profiling is the examination and statistical analysis of the ingested datasets, providing a clearer picture of the contents of the data. These types of statistics are well-defined, industry-standard metrics and can be deployed automatically without involving users. The default set of checks runs on every dataset: column statistics, value distribution, and pattern distribution.

After the data is characterized, the acquired information and associated inferences have a direct bearing on the way the analytical model is prepared.

For each column in the dataset the following statistics — if relevant for the column type — are presented in an HTML report:

  • Type inference: detect the types of columns in a table
  • Essentials: unique values, missing values
  • Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Distribution (histogram)
  • Correlations: highlighting of highly correlated variables; Spearman, Pearson, and Kendall matrices
  • Missing values: matrix, count, heatmap, and dendrogram of missing values
  • Text analysis: categories (Uppercase, Space), scripts (Latin, Cyrillic), and blocks (ASCII) of text data

The pandas-profiling library, developed by ING WBAA’s data scientist Simon Brugman, is used to help us characterize our data.
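As a minimal sketch of how such a report can be generated with pandas-profiling (the dataset path and report title below are hypothetical):

```python
import pandas as pd
from pandas_profiling import ProfileReport

# Hypothetical dataset: any ingested table loaded as a DataFrame.
df = pd.read_parquet("customers.parquet")

# Build the default profile: column statistics, value and pattern distributions,
# correlations, and missing-value overviews.
profile = ProfileReport(df, title="Customers dataset profile")

# Write the HTML report that is shared with the users.
profile.to_file("customers_profile.html")
```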

Data Quality

ING specified nine functional quality metrics which should be validated: Timeliness, Completeness, Validity, Accuracy, Uniqueness, Consistency, Traceability, Adaptability, Availability. Most of those metrics are required by ING’s regulators for reporting purposes. Uniqueness and Completeness are checked at the column level for all ingested datasets during the Data Profiling process.

Consumers’ viewpoint on the data quality of the same datasets varies depending on the use case. Therefore, in DAP the users are allowed to decide what validity checks are important to them and their projects. Defining the data quality rules requires business domain expertise — that’s why the users’ involvement is necessary.

There are several frameworks that you can use to validate data quality; Great Expectations (Introduction — great_expectations documentation) has been chosen for DAP. Great Expectations lets you define both simple and complex quality tests. It provides the ability to specify Suites of Expectations on data: the way we expect the data to appear. The list of checks that can be executed on data is really long; they relate to, for example, table shape, missing values, unique values, sets and ranges, string matching, aggregate functions, and multi-column relationships. Moreover, the tool allows the creation of customized expectations, meaning expectations specific to the user’s data domain.

The suite can include one or more rules, depending on the context and criticality of a certain attribute. Those rules can also be applied to multiple attributes if there are cross-column dependencies. It’s possible to run many different suites against the same dataset.

Some examples of expectations are shown in the sketch below.
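This is a minimal sketch using Great Expectations’ pandas interface; the dataset, column names, and value ranges are hypothetical:

```python
import great_expectations as ge

# Hypothetical dataset loaded through the Great Expectations pandas wrapper.
df = ge.read_csv("transactions.csv")

# Table shape and completeness checks.
df.expect_table_row_count_to_be_between(min_value=1)
df.expect_column_values_to_not_be_null("transaction_id")
df.expect_column_values_to_be_unique("transaction_id")

# Validity checks that encode business domain knowledge.
df.expect_column_values_to_be_between("amount", min_value=0)
df.expect_column_values_to_be_in_set("currency", ["EUR", "USD", "GBP"])
df.expect_column_values_to_match_regex("iban", r"^[A-Z]{2}\d{2}[A-Z0-9]+$")

# Validate the suite and inspect the overall result.
results = df.validate()
print(results)
```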

Data Stability

When ML models are being developed, certain data patterns are used for the models’ training. After deploying such a model in a production environment, the stability of the new incoming data should be constantly monitored for anomalies in the data fed to the model. Data drift usually leads to model performance degradation, reducing the model’s accuracy. After detecting data drift, remediation actions such as model tuning or retraining are necessary.

DAP is used for building and training models, and only real production data is ingested for projects; therefore, a report reflecting the stability of the incoming data is highly desirable.

For this reason, we use the Population Shift Monitoring (popmon) tool, developed by ING WBAA’s Chapter Lead Data Scientist Max Baak and data scientist Tomas Sostak (GitHub — ing-bank/popmon: Monitor the stability of pandas or spark data frame).

popmon checks the stability of a dataset over time. It does so by taking as input a DataFrame, either pandas or Spark, in which one of the columns represents the date, and then producing a report that indicates how stable all columns are over time.
For each column, the stability is determined by taking a reference (for example, the data on which you have trained your classifier) and comparing each time slot to this reference.
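A minimal sketch of how such a report can be produced with popmon; the dataset, date column, and time window below are hypothetical:

```python
import pandas as pd
import popmon  # registers the pm_stability_report accessor on DataFrames

# Hypothetical dataset with a date column used as the time axis.
df = pd.read_parquet("scored_transactions.parquet")

# Compare each weekly time slot against the reference (by default, the full
# dataset) and write the stability report to HTML.
report = df.pm_stability_report(time_axis="booking_date", time_width="1w")
report.to_file("stability_report.html")
```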

More info about Popmon can be found here: Popmon Open Source Package — Population Shift Monitoring Made Easy | wbaa (medium.com)

Usually, data quality checking or data profiling are steps included in the data ingestion pipeline, occurring during or just after loading the data. In DAP’s environment, data quality processes are decoupled from data ingestion and scheduled as separate executions. This way we achieve better utilization of computing and memory resources. Preparing the reports for very large datasets can consume quite a lot of resources. By splitting the data quality processes we avoid interfering with our users' tasks running during business hours.
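As a sketch of such decoupling, assuming an Apache Airflow scheduler (the scheduler choice, DAG name, and job scripts below are hypothetical and not part of DAP’s description):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: data quality jobs run on their own schedule, outside
# business hours, so they do not compete with ingestion or users' workloads.
with DAG(
    dag_id="dap_data_quality_reports",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 22 * * *",  # nightly, after the daily data arrival
    catchup=False,
) as dag:
    profile = BashOperator(
        task_id="profile_datasets",
        bash_command="spark-submit profile_datasets.py",  # hypothetical profiling job
    )
    validate = BashOperator(
        task_id="run_expectation_suites",
        bash_command="spark-submit run_expectation_suites.py",  # hypothetical validation job
    )
    profile >> validate
```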

An overview of DAP’s Data Quality Framework:

DAP’s Data Assets squad

DAP’s Data Assets squad is the force behind the automatic framework described above. We always listen to our users’ needs, and we know that with access to data quality metrics, insight into their data’s characteristics, and the ability to identify data drift, they can carry out their analytics projects in a timely and effective manner.

WBAA’s DAP is ING’s go-to platform combining the latest open source tooling, significant computational power, a highly secure and compliant environment, and all disciplines of analytics in one place. Petabytes of ING data are centralized in DAP, providing a workbench to discover, model, visualize and analyze big data.

If you wish to learn more about ING WBAA’s Data Analytics Platform, have a look at the blogs below written by my colleagues:

--
