Data Cleansing in the Age of Big Data

Guy Fighel
SignifAI

Enterprises’ reliance on data, and the sheer quantity of it, is spreading like wildfire. In EMC’s The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things, analysts estimated that the digital universe is doubling in size every two years and predicted it will multiply 10-fold between 2013 and 2020, from 4.4 trillion gigabytes to 44 trillion gigabytes.

Unfortunately, deriving meaningful insights from all this data and converting it into action is easier said than done. The tremendous amount of data is one thing — there’s a reason it’s called big data. The bigger issue is bad data quality.

The fact is, data isn’t always usable as is, and preparing it so it can be used, a process known as data cleaning or data cleansing, is typically slow, difficult, and tedious. With more companies applying DevOps principles to big data projects, the delays inherent in the data cleaning process can have serious ramifications and negate the benefits that DevOps is supposed to provide.

That’s because DevOps depends on frequent, rapid deployments. That can’t happen when the data science teams and developers who rely on quick access to usable data must spend more time making “bad” data usable than actually using it.

The Need for Clean Data

By some estimates, so-called “dirty” data costs the US economy up to $3.1 trillion a year. That’s not surprising given that poor data quality can lead to inaccurate data analytics results and drive misguided decision making — both of which are detrimental to data scientists and developers alike. It can also expose companies to compliance issues since many are subject to requirements to ensure that their data is as accurate and current as possible.

Process architecture and process management can help reduce the potential for bad data quality at the front end, but can’t eliminate it. The solution, then, lies in making bad data usable by detecting and removing or correcting errors and inconsistencies in a data set or database — data cleansing.

The Challenge of Data Cleansing

Unfortunately, data cleansing is a time-consuming endeavor. A survey conducted by CrowdFlower, a provider of a data enrichment platform for data scientists, reported that data scientists spent 60% of their time on cleaning and organizing data and 19% of their time collecting data sets. That adds up to almost 80% of their time devoted to preparing and managing data for analysis.

Part of the problem is that data cleansing is a complex, multi-stage process. Best practice is to start with a detailed data analysis to detect which kinds of errors and inconsistencies must be removed. In addition to a manual inspection of the data or data samples, analysis programs are often needed to gather metadata about the data’s properties and detect data quality problems.
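To make that first analysis pass concrete, here is a minimal sketch in Python of the kind of profiling a team might run before cleansing anything, assuming a tabular data set loaded with pandas; the file name and column names are purely illustrative.

```python
# A minimal profiling pass over a tabular data set, assuming pandas is
# available. The file name and the "timestamp" column are hypothetical.
import pandas as pd

df = pd.read_csv("events.csv")  # illustrative input file

# Basic metadata about the data's properties
print(df.dtypes)                   # inferred type per column
print(df.describe(include="all"))  # ranges, cardinality, frequent values

# Common data quality problems to flag before cleansing
missing = df.isna().mean().sort_values(ascending=False)
duplicates = df.duplicated().sum()
print("share of missing values per column:\n", missing)
print("duplicate rows:", duplicates)

# Example domain rule: timestamps should parse and not lie in the future
ts = pd.to_datetime(df["timestamp"], errors="coerce", utc=True)
print("unparseable timestamps:", ts.isna().sum())
print("future timestamps:", (ts > pd.Timestamp.now(tz="UTC")).sum())
```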

Software that employs machine learning helps, but because data can come from any number of disparate sources, the data cleansing process also requires getting data into a consistent format for easier usability and to ensure it all has the same shape and schema. Depending on the number of data sources, their degree of heterogeneity, and how bad the quality of the data is, data transformation steps may be required as well. Then, the effectiveness of a transformation workflow and the transformation definitions must be tested and evaluated. Multiple iterations of the analysis, design, and verification steps may also be needed.
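As a rough illustration of what one transformation step and its verification might look like, here is a sketch that maps a hypothetical raw record onto an assumed target schema and checks the output before the workflow runs over a full data set; the field names and schema are assumptions, not taken from any particular tool.

```python
# A sketch of one transformation step and a check on its output, assuming
# records arrive as plain dicts. Field names and the target schema are
# illustrative only.
from datetime import datetime, timezone

TARGET_SCHEMA = {"source", "timestamp", "metric", "value"}

def transform(raw: dict) -> dict:
    """Map one raw record onto the common target schema."""
    return {
        "source": raw.get("origin", "unknown"),
        "timestamp": datetime.fromtimestamp(raw["ts"], tz=timezone.utc).isoformat(),
        "metric": raw["name"].strip().lower(),
        "value": float(raw["val"]),
    }

def verify(record: dict) -> None:
    """Evaluate the transformation output against the schema definition."""
    assert set(record) == TARGET_SCHEMA, f"unexpected fields: {set(record)}"
    assert isinstance(record["value"], float)

# A tiny test of the workflow before running it over the full data set
sample = {"origin": "web", "ts": 1493596800, "name": " CPU_Load ", "val": "0.73"}
cleaned = transform(sample)
verify(cleaned)
print(cleaned)
```

Iterating on checks like this is what the analysis, design, and verification loop above amounts to in practice: the transformation definitions change until the verification step stops finding surprises.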

After errors are removed, the cleansed data must replace the bad data in the original sources. This ensures that legacy applications have the updated data as well, minimizing potential rework for future data extractions.

There are other time-consuming challenges as well. For example, the available information on the anomalies is often insufficient to determine how to correct them, leaving data deletion as the only option. However, deleting the data means losing information. Then there’s the fact that data cleansing is not a one-time thing. The process must be repeated every time data is accessed or values change.

Data Cleansing for DevOps and Big Data

Data cleansing has important implications for DevOps teams as they take on big data projects. DevOps breaks down silos and promotes collaboration and communication among data scientists, developers, and others tasked with analyzing big data and using the insights to drive smart business decision making. Speed is what drives the demand for DevOps, but dealing with the volume, velocity, and variety of big data can’t help but slow down processes, whether they are specific to software development or to a data science project.

Big data is complex by nature, with its ever-increasing accumulation of unstructured and semi-structured data from a myriad of sources, including sensors, mobile devices, network traffic, web servers, custom applications, application servers, GPS systems, stock market feeds, and social media, as well as data from structured databases, logs, config files, messages, alerts, scripts, development feedback loops, and metrics; the list goes on. Disparate sources of data translate into equally disparate formats. Data science teams can’t make sense of the data until it’s transformed into a unified form. That creates a significant bottleneck that even DevOps, on its own, can’t overcome. For example, an online web application might send data in SOAP/XML format over HTTP, a feed might arrive as a CSV file, and devices might communicate over the MQTT protocol.
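Here is a sketch of what normalizing those three kinds of sources into a unified form might look like; the payloads, field names, and record shape are hypothetical, and only Python’s standard-library parsers are used.

```python
# A sketch of normalizing three of the source formats mentioned above into
# one unified record shape. Payloads and field names are hypothetical.
import csv
import io
import json
import xml.etree.ElementTree as ET

def from_soap_xml(payload: str) -> dict:
    """Web application sending an XML body over HTTP."""
    root = ET.fromstring(payload)
    return {"source": "webapp", "metric": root.findtext("name"),
            "value": float(root.findtext("value"))}

def from_csv(line: str) -> dict:
    """Feed arriving as a CSV file, one reading per line."""
    name, value = next(csv.reader(io.StringIO(line)))
    return {"source": "feed", "metric": name, "value": float(value)}

def from_mqtt(message: bytes) -> dict:
    """Device publishing a JSON payload over MQTT."""
    body = json.loads(message)
    return {"source": "device", "metric": body["metric"], "value": float(body["value"])}

records = [
    from_soap_xml("<reading><name>latency_ms</name><value>41.2</value></reading>"),
    from_csv("latency_ms,44.0"),
    from_mqtt(b'{"metric": "latency_ms", "value": 39.8}'),
]
print(records)  # every record now has the same shape and schema
```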

Consider the challenge facing DevOps teams who use metrics, presented as numbers, and logs, presented as text, to understand and improve the performance of code, services, and infrastructure environments in production. Effectively integrating log data and performance metrics within a network can greatly reduce the time it takes to resolve critical issues and simplify the distribution of data to customers and developers.

The problem is that logs and metrics come in varied forms, making correlation and analysis difficult within each type and next to impossible between the two. Metric data is short: beyond the measured value, it describes the measurement’s type, location, time, and grouping. Logs, generated by infrastructure or applications, are meant to give operational teams as much specific detail as possible to help them analyze a particular operational or security event. As a result, they tend to be longer than metrics and can come in a variety of shapes and forms. While some logs are standardized, their formats are often defined by the developers who write them.

Data cleansing not only removes errors from both data types; it also transforms log data and metrics into a common format, providing teams with shared views and insights across the entire application environment. That helps speed up issue remediation and the frequency of production code updates, and it helps teams understand the impact their code is making at any production stage and scale.
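As an illustration, the sketch below brings a metric sample and a free-form log line into one assumed common event shape; the log pattern and field names are assumptions made for the example, not a standard.

```python
# A sketch of mapping a metric sample and a free-form log line into one
# common event shape. The log pattern and field names are assumptions.
import re
from datetime import datetime, timezone

def metric_to_event(name, value, tags, ts):
    """Metrics are already structured: a value plus type, location, time, grouping."""
    return {"time": ts, "kind": "metric", "name": name, "value": value, "tags": tags}

# Assumed log layout: "<timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)$")

def log_to_event(line, host):
    """Logs are free text; extract what we can into the same event shape."""
    m = LOG_PATTERN.match(line)
    return {"time": m.group("ts"), "kind": "log", "name": m.group("level").lower(),
            "value": None, "tags": {"host": host, "message": m.group("msg")}}

now = datetime.now(timezone.utc).isoformat()
events = [
    metric_to_event("cpu.load", 0.92, {"host": "web-1"}, now),
    log_to_event("2017-05-01T12:00:03Z ERROR payment service timed out", "web-1"),
]
print(events)  # both types now share one schema, so they can be correlated
```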

Automated Data Cleansing to the Rescue

Fortunately, as both big data and DevOps become “business as usual,” the need to overcome the data cleansing hurdle is driving the development of technologies that can automate data cleansing processes and help accelerate business analytics. IDC predicts that through 2020, spending on data preparation tools will grow two and a half times faster than spending on traditional IT-controlled tools for similar functionality.

Less time spent on data cleansing will bring the analysis of incoming data closer to real time, regardless of its original format, which in turn will produce actionable information faster. Ensuring data is clean and presented in a common format will also help eliminate rework, enabling DevOps teams to perform rapid root cause analyses so problems can be addressed quickly. That’s essential, because fast problem resolution keeps data pipelines flowing continuously.

Best,

Guy

Originally published at blog.signifai.io on May 1, 2017.

