Bad Data — The Virus Lurking in your Business

Mark Simmonds · Inside Machine Learning · 8 min read · Jun 11, 2018

There’s a virus in your business. It’s not the kind you would normally associate with suspect code from the internet that inadvertently crept through your firewalls. No. It’s a lot simpler than that but potentially far more devastating. It’s poor quality data.

You don’t know data is bad until you discover it’s inaccurate: perhaps a customer tells you it’s wrong, or false negatives and false positives lead you to that conclusion. By then it’s most likely too late. The data has been replicated and shared across multiple systems, other data has been extrapolated from it, and you may have passed it to other departments, even to third parties such as business and trading partners or regulatory organizations. Hundreds, thousands, possibly millions of transactions later, you discover the problem. Decisions and investments have been made based on that data. Worst of all, there is no way to undo everything that’s been done.

Prevention is better than cure, but just as humans can only do their best to stay healthy, organizations can only do their best to minimize the risk of bad data entering their systems and spreading like a contagious disease across the organization and everything that consumes it.

Technology alone cannot prevent bad data, but it can strengthen the processes users follow to collect and manage data over its lifecycle. The risk grows when organizations set targets for the quantity of data their staff process rather than for its quality.

If these factors don’t convince you to focus on data quality, the General Data Protection Regulation (GDPR) might. Data subjects have the right to know what data is stored about them, and an organization must satisfy such a request within a set period of time. If the data an organization holds is incorrect, it’s more than just an embarrassment: the organization may have to explain how that data came into its possession and, if it is inaccurate, why, particularly when dealing with personally identifiable information (PII).

Why Data Governance is Crucial for Data Quality

Organizations need to know that their data is correct and available to the users who have a right to view and process it. Data is used to deliver business efficiency and to drive business transformation and innovation, and organizations have a corporate responsibility to manage and protect that data to meet corporate, industry and government regulations. Data quality is a key element of data governance: there is a clear need to make good-quality, well-understood, governed data available to authorized users. In short, there are three key considerations:

  1. Know your data. This could mean building a 360-degree view of a particular focus area, the customer, for example. Organizations may need to gather internal data as well as external data from social media, clickstream, census or other relevant sources. Data must also be accessible to all users and applications that need it, on-premise or across hybrid cloud, which could mean making data globally accessible to many applications regardless of computing platform. A common access layer, ontologies and a business glossary that helps users understand data elements are all key parts of what an information and governance catalog should provide.
  2. Trust your data. Well-governed data provides confidence not just in the data itself, but in the outcomes of analytics, reports and other tasks based on that data. There are two key points to data governance: first, organizations must be able to ensure the data is secure and adheres to compliance regulations; second, they must be able to govern data so users can find and access information themselves, at the exact time they need it.
  3. Data as a source of insight and intelligence. This means having the right skills and tools in place to surface insights, as well as the right technology to learn from the data and improve accuracy each time that data is analyzed.

More than just a data quality platform

IBM InfoSphere Information Server (IIS) can help organizations integrate and transform data and content to deliver accurate, consistent, timely and complete information on a single platform unified by a common metadata layer. It provides common connectivity, shared metadata and a common execution engine to help facilitate flexible deployment across on-premise, grid, cluster, cloud or native Hadoop environments. It can help accelerate and automate data quality and governance initiatives by:

  • Automatically discovering data and data sources
  • Automating rules that trigger custom data quality actions based on business events
  • Using machine learning to accelerate metadata classification and auto-tagging (discussed in more detail later)
  • Automatically classifying data, including understanding PII risk
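
IIS delivers these capabilities through its own tooling and rule language. Purely as an illustration of what automated, pattern-based classification and PII tagging involve, here is a minimal Python sketch; the data classes, patterns and match threshold are hypothetical stand-ins, not IIS APIs.

```python
import re

# Hypothetical pattern-based data classes, loosely analogous to the
# out-of-the-box classes a platform like IIS ships with (not IIS APIs).
DATA_CLASSES = {
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "EMAIL":  re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_ZIP": re.compile(r"^\d{5}(-\d{4})?$"),
}

# Classes that imply personally identifiable information (PII).
PII_CLASSES = {"US_SSN", "EMAIL"}

def classify_column(values, min_match_ratio=0.8):
    """Assign a data class to a column when enough non-null values match."""
    non_null = [v for v in values if v]
    if not non_null:
        return None, False
    for name, pattern in DATA_CLASSES.items():
        hits = sum(1 for v in non_null if pattern.match(v))
        if hits / len(non_null) >= min_match_ratio:
            return name, name in PII_CLASSES
    return None, False

# A column that is overwhelmingly SSN-shaped is classified and flagged as PII.
print(classify_column(["123-45-6789", "987-65-4321", ""]))  # ('US_SSN', True)
```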

The Benefits of Industry Models

Getting started with data quality can be a daunting task. To help ease the initial burden and arrive at a standardized data model, industry models are available that provide pre-built content.

  • 200+ out-of-the-box Data Classes (clients can expand)
  • 200+ out-of-the-box Data Rule Definitions (clients can expand)
  • QualityStage Address Verification Interface (AVI) with 248+ country coverage
  • Stewardship Center and Business Process Management, which enable customized data quality exception records to be routed for notification and/or remediation

Data Profiling and Quality — Core Capabilities

A key capability of data quality is deep data profiling and analysis to understand the content, quality and structure of tables and files. This includes:

  • Column analysis: min/max values, frequency distributions, formats, data types and data classes
  • Data classification: all columns are measured against a set of 200+ pre-defined data classes, which clients can expand
  • Data quality scores: all data values in all columns are measured against 10 data quality dimensions (configurable and expandable)
  • Primary key and multi-column primary key analysis
  • Relationship analysis: discover and validate primary key to foreign key (PK->FK) relationships
  • Overlap analysis: measure the percentage of duplicate/same values across columns, within one or more data sets
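
As a rough illustration of what column analysis computes, here is a minimal, self-contained Python sketch; the statistics and the 9/A format masks are simplified stand-ins for what a real profiling engine produces.

```python
from collections import Counter

def profile_column(values):
    """Toy column analysis: row/null counts, cardinality, min/max values,
    and the most common generalized formats (digits -> 9, letters -> A)."""
    non_null = [v for v in values if v not in (None, "")]
    formats = Counter(
        "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in v)
        for v in non_null
    )
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top_formats": formats.most_common(3),
    }

# The stray four-digit value shows up immediately as a second format mask.
print(profile_column(["60601", "60614", "6061", "", "60601"]))
```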

This analysis pulls double duty: it uses the built-in data rules, combined with the organization’s business logic, to identify exception records for statistics and/or remediation.

In addition, users can define consistent, reusable data rules driven by the business. Rules are written in a language that is less technical than SQL, and a rule can be written once and applied in multiple places; for example, every Social Security number (SSN) column should comply with the same set of rules. Customers are able to run hundreds or thousands of data rules on a daily, weekly or monthly basis.
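
IIS expresses rules in its own rule language; purely as a sketch of the write-once, apply-many idea, here is a hypothetical Python analogue in which a single SSN rule is bound to differently named columns in different data sets.

```python
import re

# A reusable rule, defined once (a hypothetical analogue of an IIS data
# rule; IIS uses its own rule language, not Python).
def valid_ssn(value):
    return bool(re.fullmatch(r"\d{3}-\d{2}-\d{4}", value or ""))

def apply_rule(rule, dataset, column):
    """Apply one rule to one column, returning pass/fail statistics
    and the exception records for remediation."""
    exceptions = [row for row in dataset if not rule(row.get(column))]
    return {
        "passed": len(dataset) - len(exceptions),
        "failed": len(exceptions),
        "exceptions": exceptions,
    }

customers = [{"cust_ssn": "123-45-6789"}, {"cust_ssn": "123456789"}]
employees = [{"emp_ssn": "987-65-4321"}]

# The same rule definition is applied to different columns and data sets.
print(apply_rule(valid_ssn, customers, "cust_ssn"))
print(apply_rule(valid_ssn, employees, "emp_ssn"))
```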

The Role of Machine Learning and AI in Data Quality

One of the tasks a data scientist regularly faces is ensuring their training and test data are of suitable quality; the impact of deploying a model that was trained on poor-quality data needs no explanation. Machine learning can help find similar data across different silos and unify a variety of data. Once a model is deployed, algorithms can be used to learn about the data and help improve the quality of new data as the model encounters it.

Machine learning and neural networks are used in the IBM Unified Governance & Integration platform to identify probabilistic matches: multiple data records that are likely to represent the same entity, even if they look different. This makes analyzing master data for quality, as well as for business term relationships, both possible and faster, addressing a major pain point for many clients.

Feedback learning is applied: if the confidence score of a match falls below a certain threshold, the system can refer the candidate records to a human expert through the workflow. It is far more productive for those experts to deal with a small subset of weak matches than with an entire dataset.
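
Here is a minimal sketch of that threshold-based routing, using plain string similarity as a stand-in for the platform’s learned matcher; the thresholds and field names are hypothetical.

```python
from difflib import SequenceMatcher

AUTO_LINK = 0.90      # above this, records are linked automatically
NEEDS_REVIEW = 0.70   # between the thresholds, a data steward decides

def similarity(a, b):
    """Stand-in for a learned probabilistic matcher: string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def route_match(rec_a, rec_b):
    """Route a candidate pair: auto-link, human review, or no match."""
    score = similarity(rec_a["name"], rec_b["name"])
    if score >= AUTO_LINK:
        return "auto-link", score
    if score >= NEEDS_REVIEW:
        return "steward-review", score  # only weak matches reach an expert
    return "no-match", score

# A borderline pair lands in the steward's queue, not in master data.
print(route_match({"name": "Jonathan Smith"}, {"name": "John Smyth"}))
```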

Consider a new data scientist given the task of developing a machine learning model to detect customer churn for a specific product or service. While they have an idea of what needs to be accomplished, they have no idea which data sets to start with. Within IBM data governance technologies, machine learning can help the data scientist search for “customer retention” and get a graph view of all connected entities, including associated privacy information, with drill-down available to find more about the quality and authenticity of the data.

A classification or taxonomy is a way of understanding the world by grouping and categorizing. Many organizations use the Social Security number (SSN) to track a customer across various investment products, for example, but the same identifier may appear in various forms, such as a Tax Identification Number or an Employer Identification Number. Using traditional rule-based engines, it’s difficult to work out that these three terminologies essentially refer to the same entity. Conversely, one term may also have different meanings within the same organization. Machine learning models offer a new way to train the system to derive “domains” from the data, which helps find these relationships.

Traditional techniques of metadata matching and assignment are rule-based. It is important to understand that while machine learning models can do a better job with ambiguous data sets, they are not a replacement for all existing application rules. ML does not try to replace the existing application rules and regular expressions where they are proven to work well; rather, it augments them. This combined approach empowers users by automatically assigning terms with higher confidence while still drawing on domain expertise when it is required.
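
A minimal sketch of that combined approach, assuming a pre-trained classifier object; the StubModel, its predict interface and the 0.75 review threshold are all hypothetical, not part of any IBM product.

```python
import re

SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")

class StubModel:
    """Hypothetical stand-in for a trained term-classification model."""
    def predict(self, column_name, sample_values):
        # A real model would learn from names and value distributions;
        # this stub just guesses from the column name.
        if "tax" in column_name.lower():
            return "Tax Identification Number", 0.81
        return "Unknown", 0.10

def assign_term(column_name, sample_values, ml_model):
    """Rules keep authority where they are proven; ML handles the rest."""
    # 1. Deterministic rule: an exact SSN format needs no model.
    if sample_values and all(SSN_RE.match(v) for v in sample_values):
        return "Social Security Number", 1.0
    # 2. Fall back to the model; low-confidence answers go to an expert.
    term, confidence = ml_model.predict(column_name, sample_values)
    if confidence < 0.75:  # hypothetical review threshold
        return "needs-expert-review:" + term, confidence
    return term, confidence

print(assign_term("cust_ssn", ["123-45-6789"], StubModel()))
print(assign_term("tax_id", ["12-3456789"], StubModel()))
```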

Take GDPR (General Data Protection Regulation), for example: there are four steps to complete before GDPR terminologies can be related to your business terminologies and leveraged for privacy regulations:

  • Supportive content terms must be manually extracted from GDPR documents (from various articles and sections)
  • Hierarchies must be created for the key categories
  • Supportive content terms must be matched manually with the business terms by domain experts
  • And finally, these supportive content terms must be mapped to the business data model

Machine learning is used to create a neural network model that interprets a given regulation based on other, similar regulations. This not only extracts the supportive content terms from a raw document but also creates a well-formed taxonomy that can be more easily ingested into the governance catalog.

Looking at the Bigger Picture

While IIS provides very detailed analysis and control over many aspects of data quality, the broader unified governance and integration platform also provides end-to-end views of relationships, dependencies and lineage across multiple sources of data. Users can trace high-level policies down to granular policies, their relationship to individual process rules, and the rules that apply to data, including how and where those rules are applied to data and metadata. Put another way, users can see which rules operate on the data values in a given column and how, which governance rule describes each of those rules, and how governance rules are driven by governance policies and sub-policies.

By applying machine learning across data quality processes and the bigger information governance solution outlined above, organizations can significantly enhance their data quality initiatives with systems that learn and become progressively smarter about data quality across the enterprise. The takeaway: it is better to detect and prevent poor-quality data from ever being used by your applications than to try to clean it up later.

For more information on data quality visit ibm.com/analytics/data-quality.

To try machine learning for free visit datascience.ibm.com.

Mark Simmonds

Program Director, IBM Analytics Development.
