Disrupting Data Preparation & Integration

Rohit Gupta
Rohit's Perspectives
Aug 7, 2018

What is Data Preparation?

At a high level, data preparation involves combining different types of data residing in disparate data sources and unifying them into a single view. Data preparation is “the most time-consuming task in analytics and BI,” and Gartner estimated the market at $780 million in 2016, growing to $1.5 billion by 2021 at an 18.5% CAGR. Factors driving adoption of data preparation tools include automation and the operationalization of the integration process. Ultimately, large enterprises are realizing that they need to innovate their data supply chains to make digital transformation initiatives successful.
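To make the “single view” idea concrete, here is a minimal sketch in Python using pandas. The file names, columns, and join key are hypothetical placeholders for disparate sources, not taken from any product discussed in this post.

```python
import pandas as pd

# Extract: customer records live in three separate systems, each in a different format.
crm = pd.read_csv("crm_export.csv")                    # e.g. name, email, region
billing = pd.read_json("billing_dump.json")            # e.g. email, plan, mrr
tickets = pd.read_parquet("support_tickets.parquet")   # e.g. email, open_tickets

# Transform: normalize the join key so records from different systems line up.
for df in (crm, billing, tickets):
    df["email"] = df["email"].str.strip().str.lower()

# Unify: merge everything into a single customer view keyed on email.
unified = (
    crm.merge(billing, on="email", how="left")
       .merge(tickets, on="email", how="left")
)
print(unified.head())
```

Even in this toy version, most of the effort sits in the “transform” step: reconciling keys, formats, and conventions across sources before any analysis can happen.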

Despite this awareness, companies are struggling to identify actionable insights because traditional rules-based methods cannot keep up with the growing volume and velocity of data. Further, data is often in different formats and effectively isolated across departments and regions. Organizations face so much difficulty that they “spend more than 60 percent of their time in data preparation” according to the same report. Before detailing modern solutions to these problems, it would behoove us to first take a look at the history of data preparation.

Brief History

The first generation of data preparation and integration systems were termed ETL (Extract, Transform, and Load) products and would combine data from a limited number of sources into a data warehouse. The data warehouse first arose in the 1990s in response to retailers’ need to analyze customer data (e.g. product sales) to make better buying decisions — only order the items that have strong sales and solid margins. However, most data warehouse projects at the time were unsuccessful because data integration challenges caused significant delays and budget issues.
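As a rough illustration of that first-generation ETL pattern, the sketch below uses SQLite as a stand-in for a data warehouse and the retailer example above. The file, table, and column names are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# Extract: pull raw order lines from a point-of-sale export.
orders = pd.read_csv("pos_orders.csv")  # columns: sku, qty, unit_price, cost

# Transform: aggregate to the grain the warehouse needs (sales and margin per SKU).
orders["revenue"] = orders["qty"] * orders["unit_price"]
orders["margin"] = orders["revenue"] - orders["qty"] * orders["cost"]
product_sales = orders.groupby("sku", as_index=False)[["revenue", "margin"]].sum()

# Load: write the conformed table into the warehouse for BI queries.
with sqlite3.connect("warehouse.db") as conn:
    product_sales.to_sql("fact_product_sales", conn, if_exists="replace", index=False)
```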

These difficulties led to a second generation of ETL systems, which included data cleaning modules that allowed the system to ingest other types of data. These second-generation systems generally followed the same architecture as their predecessors and were consequently still oriented as programmer-enhancing tools. They had two main weaknesses: limited scalability and a lack of expert input. Enterprises at the time had hundreds of data sources, ranging from public data on the web, such as real estate transactions, to company financials, and wanted to extract actionable insights from them. To accomplish this, the tools would need to extend their capabilities to handle hundreds to thousands of data sources. Furthermore, because these tools were targeted at programmers, who could not answer basic data curation questions, they had no visibility into the insights held by experts inside an organization.

The third generation of these products is designed to scale to thousands of data sources and to enlist business experts to resolve the curation questions that arise. To do this, systems incorporate statistical methods and machine learning to automate much of the process and reach out to experts only when needed. These products come in two main form factors: horizontal, designed to solve a variety of issues spanning multiple industries, and vertical, designed to address issues specific to a single industry.
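The sketch below illustrates that human-in-the-loop pattern: score candidate record pairs automatically and escalate only the uncertain ones to a business expert. The similarity function, thresholds, and sample records are illustrative assumptions, not any vendor’s actual algorithm.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap stand-in for a learned matching model's confidence score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical candidate pairs of company names from two source systems.
candidate_pairs = [
    ("Acme Corp.", "ACME Corporation"),
    ("Globex Inc", "Globex, Incorporated"),
    ("Initech", "Intertrode"),
]

AUTO_MATCH, AUTO_REJECT = 0.85, 0.40  # assumed confidence thresholds

for left, right in candidate_pairs:
    score = similarity(left, right)
    if score >= AUTO_MATCH:
        decision = "merge automatically"
    elif score <= AUTO_REJECT:
        decision = "keep separate"
    else:
        decision = "route to a domain expert for review"
    print(f"{left!r} vs {right!r}: {score:.2f} -> {decision}")
```

The point is the workflow, not the scoring function: automation handles the clear-cut cases, and expert attention is reserved for the ambiguous middle band.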

Horizontal Solutions

An example of this kind of third-generation product is Tamr, which was founded in 2013 by database management industry veterans Andy Palmer (CEO) and Mike Stonebraker (CTO). The two previously co-founded Vertica Systems, a database management company, and sold it to HP for $350 million. Tamr’s product capabilities include connecting data sources, cleaning the unified dataset, and classifying records using deduplication techniques and statistical methods. Tamr has developed a horizontal data preparation and integration application with many use cases, including clinical data conversion, in which Tamr converts clinical trial datasets to a specific standard, a process that is otherwise time-consuming and expensive. Tamr is also used to help scale customer journey analytics, optimize procurement processes, and improve the accuracy of demand forecasting. New-generation data preparation and integration products also have compelling business-model advantages: instead of charging customers for the volume of data prepared and integrated, they charge for the number of different data sources that are concurrently integrated, which does not punish customers for scale.

Overview of how Tamr’s platform works

Similarly, Unifi Software offers a self-service data preparation and discovery platform with solutions spanning cloud migration, e-commerce, and audience analytics.

Vertical Solutions

In addition to the rise of horizontal applications, a few startups have developed vertical-focused data preparation and integration solutions. Industries uniquely poised for disruption by such tools include healthcare and real estate, both of which involve large volumes of data residing in disparate sources and typically in different formats. Nunetz, a recent Alchemist Accelerator grad, is addressing this issue in ICUs, and ultimately the broader clinical health environment, where vital patient information is spread across physical and electronic medical records and different medical devices; it uses machine learning techniques to clean, normalize, and analyze that data.

Skyline AI is tackling the same problem in commercial real estate, where investors typically depend on Excel spreadsheets with limited data, outdated market information, and gut feeling to make investment decisions. The company’s artificial-intelligence-powered due diligence platform is built to leverage both static and time-series data from private and public sources to predict the risk, yield, and overall profitability of a real estate investment opportunity. It aims to give customers access to institutional-grade commercial real estate opportunities vetted by advanced technology and to streamline and digitize the investment process for investors.

Thoughts

I believe that horizontal and vertical-focused data preparation and integration tools can coexist. Vertical-focused companies may scale more efficiently than their horizontal-focused counterparts and possess moats in the form of proprietary integrations developed to extract data from specific sources. On the other hand, horizontal-focused companies can diversify their revenue and chase after larger market opportunities. Time will reveal which type of data preparation and integration solution ultimately wins out.
