Data preparation — why is it a big deal?

Norman Paton
The Data Value Factory
3 min read · Jul 8, 2019

Data preparation is important because obtaining insights from data is important. Such insights can often only be obtained after some data preparation. How much is typically required? The scale of the task depends on the application, but evidence suggests that significant effort is consumed in data preparation.

Image by StartupStockPhotos from Pixabay

The cost of data preparation

Reports in the press indicate that data scientists spend 50% to 80% of their time “collecting and preparing unruly digital data”. In addition, a recent survey found that data scientists spend an average of 79% of their time on data preparation activities, including collecting data sets (19%) and cleaning and organising data (60%).

Furthermore, work on data collection, integration and organisation may be considered the specialism of data engineers, so people in several different job roles may be devoting significant effort to data integration and data quality tasks. To put a price on that effort, Glassdoor reports that the average base data scientist salary in the US is $115K.

Data preparation tools

The need to prepare data for analysis is nothing new; enterprise data warehouses have been central to reporting and decision support in large corporations for decades. Data warehousing has given rise to technologies that support the Extract, Transform and Load (ETL) process, whereby warehouses are populated from local transactional systems. Techniques developed for data warehouses are potentially relevant to other analysis settings. However, the trend towards data-driven enterprises has been accompanied by a rising profile for self-service data preparation platforms, such as our DataPreparer, which aim to have a less steep learning curve than traditional ETL systems. These developments are taking place in a setting where enterprises have diverse data management practices, and such internal diversity means that even the internal information relevant to an analysis may be widely distributed.

Making the most of the available data

The potential for combining external and internal data to obtain new insights means that more rapid data preparation is likely to be important. This is especially the case in highly dynamic settings with rapidly changing requirements. For example, an e-commerce site may want to compare its prices or product range with those of numerous online competitors.

Such outside-insight examples potentially require access to numerous independently produced data sets. Complex organizations, for example those resulting from mergers or those whose divisions have considerable autonomy, also present highly heterogeneous data environments. For example, obtaining a clear position on customer contacts, suppliers or skills may involve integrating data from numerous sources.

As a result, data preparation is a necessary activity for many organizations. Although often viewed as a necessary evil, because data quality and integration problems must be overcome before insights can be obtained, it can also be seen as an enabler: many organizations report that they are not making the most of the available data. The scale and significance of data preparation is also reflected in the growing market for related products; for example, a 2017 survey by Grand View Research estimated the value of the data preparation tools market at $1.1 billion in 2017, and predicts a compound annual growth rate of 25.1% from 2017 to 2025.

Norman Paton is a Founder at The Data Value Factory, and a Professor of Computer Science at The University of Manchester. He works mostly on techniques for automating hitherto manual aspects of data preparation. Connect with him on LinkedIn, mentioning this story.

Originally published at https://thedatavaluefactory.com on July 8, 2019.
