Data preparation — is there a process to follow?

Norman Paton
The Data Value Factory
Sep 10, 2019

Data preparation preprocesses data for analysis. How should a data scientist or data engineer approach data preparation?


There is no widely accepted data preparation process. Where processes are discussed at all, it is often in rather abstract terms (as here). The documentation for tools sometimes suggests a process, but tools tend not to impose a specific way of working; indeed, the process presented here could be applied using a variety of data preparation platforms. Here we present an outline of an iterative process that could be followed during data preparation:

[Figure: The steps in the process]

This process involves the following steps (a small end-to-end sketch in Python follows the list):

  • Design/Refine Target: It is necessary to have some outcome in mind, even if this evolves over time. A target table definition pins down the intended result of the data preparation process, and informs the subsequent steps as they seek to populate this target. Many data preparation tools are bottom-up (i.e. they work forward from the sources) and do not require an explicit target. Nevertheless, the data scientist/engineer must at least have some idea of the intended outcome.
  • Discover Sources: Sources that can be used to populate the target need to be identified. Key sources may already be known and available, but different sources will have different roles. For example, some sources may provide the type of data to be analyzed (e.g., properties, companies, suppliers), whereas others may augment or validate such data (e.g., address lists, company registers, product catalogs).
  • Select Sources: In the proposed methodology, a subset of the potential sources should be selected for further investigation, for example on the basis of profiling results. Experience with these sources may inform subsequent iterations.
  • Repair Sources: The selected sources may have quality problems that are best dealt with at the source level. Deferring all cleaning until the target has been populated can turn several simple steps into a single challenging one; format transformation, for example, is easier when fewer different formats are present at the same time. In addition, sources that have been cleaned are easier to combine, for example because their join columns are more consistent.
  • Integrate Sources: Several sources may need to be combined to populate the target. This step may identify several different ways of populating the target; later stages can choose between these candidates on the basis of their quality or relevance.
  • Repair Result: Given the identified ways of populating the target, these can be reviewed on the basis of their quality to identify the most promising current result. This result may be passed on for downstream analysis, or issues with it may inform subsequent iterations.
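
To make this concrete, here is a minimal sketch of the first three steps in Python with pandas. The target columns, the file names, and the selection threshold are all illustrative assumptions rather than part of the methodology, which is deliberately tool-agnostic.

```python
import pandas as pd

# Design/Refine Target: pin down the intended result as an explicit
# target table definition (these column names are assumptions).
TARGET_COLUMNS = ["company_id", "name", "postcode", "revenue"]

# Discover Sources: candidate files that might populate the target
# (hypothetical file names standing in for a real discovery step).
candidate_sources = ["companies_a.csv", "registry.csv", "suppliers.csv"]

def profile(path):
    """A cheap profile of a candidate source: size and overall null rate."""
    df = pd.read_csv(path)
    return {"path": path, "rows": len(df), "null_rate": df.isna().mean().mean()}

# Select Sources: keep only sources whose profiles look usable, e.g.
# discard anything where more than half of the cells are missing.
profiles = [profile(p) for p in candidate_sources]
selected = [p["path"] for p in profiles if p["null_rate"] < 0.5]
```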
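
Continuing the sketch, source-level repair and integration might look as follows. Standardizing the (hypothetical) postcode join column in each source before the merge is exactly the kind of repair that is simpler while the sources are still separate:

```python
def repair(df):
    """Repair Sources: fix problems while each source is still simple.
    Here the join column is standardized so that the later merge behaves."""
    df = df.copy()
    df["postcode"] = (
        df["postcode"].astype(str).str.strip().str.upper()
        .str.replace(" ", "", regex=False)
    )
    return df

# Integrate Sources: one candidate way of populating the target, joining
# a subject source to a reference source on the repaired key. Other joins
# over the selected sources would yield alternative candidate results.
companies = repair(pd.read_csv("companies_a.csv"))
registry = repair(pd.read_csv("registry.csv"))
candidate = companies.merge(registry, on="postcode", how="left")
```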
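
Finally, the repair-result step reviews each candidate against simple quality measures. The completeness score and the 0.8 threshold below are arbitrary illustrations of how a weak result can send the process back for another iteration:

```python
def completeness(df):
    """Repair Result: score a candidate by how completely it fills the target."""
    return 1.0 - df.reindex(columns=TARGET_COLUMNS).isna().mean().mean()

# Review the candidate results and pass the most promising one forward;
# a poor score instead triggers another pass over selection and repair.
candidates = {"companies_join_registry": candidate}
best_name = max(candidates, key=lambda name: completeness(candidates[name]))
best = candidates[best_name]

if completeness(best) < 0.8:
    print("Quality too low; revisit source selection and repair.")
else:
    best = best.drop_duplicates(subset=["company_id"])  # target-level cleanup
```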

Summary

Although this methodology is not detailed, it manifests features that will often be important in practice. Iteration seems unavoidable: the best selection of sources is likely to depend on quality details that only become apparent later, and combining data may reveal features that were far from obvious beforehand. For example, the coverage or consistency of different data sets may not be evident while each is viewed in isolation. Increased understanding of the sources may lead to previously missed or passed-over sources being revisited. Finally, there may be limits on the time available for data preparation, and incrementally improving results reduces the risk of missed deadlines and unmanaged expectations.

Norman Paton is a Founder at The Data Value Factory, and a Professor of Computer Science at The University of Manchester. He works mostly on techniques for automating hitherto manual aspects of data preparation. Connect with him on LinkedIn, mentioning this story.

Originally published at https://thedatavaluefactory.com on September 10, 2019.
