5 Principles You Need to Know Before Using Google Cloud Dataprep for Data Preparation

Alina Zhang
Google Cloud - Community
3 min read · Aug 1, 2018


Google Cloud Dataprep is an intelligent data service on Google Cloud Platform for exploring, cleaning, and preparing structured and unstructured data.

There are 5 principles that are important to know before you start preparing data with Dataprep.

1. Create baseline dataset before profiling source data

Before you start cleaning your dataset, it is helpful to create a virtual profile of the source data. First, create a minimal recipe on the dataset after you have loaded it into the Transformer page. Then, click Run Job to generate a profile of the data. You can use this profile as a baseline for validation and for tracing the origin of data problems you discover later.
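To make the idea of a baseline profile concrete outside of Dataprep, here is a minimal Python sketch. It is not Dataprep's profiler. The function name profile_rows, the chosen statistics, and the sample rows are all assumptions made up for illustration.

```python
from collections import Counter


def profile_rows(rows, columns):
    """Return simple per-column statistics (hypothetical helper, not a Dataprep API)."""
    profile = {}
    for index, name in enumerate(columns):
        values = [row[index] for row in rows]
        # Treat None and the empty string as missing values.
        present = [v for v in values if v not in (None, "")]
        profile[name] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1),
        }
    return profile


# Made-up sample data standing in for the untouched source dataset.
columns = ["name", "city"]
source_rows = [
    ("Darren", "Toronto"),
    ("DARREN", ""),
    ("Alina", "Toronto"),
]

baseline = profile_rows(source_rows, columns)
for column, stats in baseline.items():
    print(column, stats)
```

Running the same function again after each cleaning step and comparing the result with baseline is one way to see whether a problem was already in the source or was introduced by a transform.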

2. Normalize data before applying Deduplicate Transform

Removing identical rows after a uniqueness check is a common step in data preparation. Google Cloud Dataprep provides a single transform, deduplicate, that removes identical rows from your dataset.

There are 2 limitations:

  • This transform is case-sensitive. So, if a column has values Darren and DARREN, the rows containing those values are not considered duplicates and cannot be removed with this transform.
  • Whitespace at the beginning and end of values is not ignored, so values that differ only by leading or trailing spaces are not considered duplicates either.
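The effect of both limitations, and why normalizing first helps, can be sketched in plain Python. This is only an illustration of the idea, not Dataprep's recipe syntax. The helpers normalize and deduplicate and the sample rows are assumptions made up for this example.

```python
def normalize(value):
    """Trim surrounding whitespace and lowercase strings; leave other types alone."""
    return value.strip().lower() if isinstance(value, str) else value


def deduplicate(rows):
    """Drop exactly identical rows, keeping the first occurrence.

    Like the limitations listed above, the comparison is case-sensitive
    and does not ignore leading or trailing whitespace.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up sample rows: three spellings of the same person plus one other row.
rows = [
    ("Darren", "Toronto"),
    ("DARREN", "Toronto"),
    (" Darren ", "Toronto"),
    ("Alina", "Toronto"),
]

# Without normalization every row looks unique, so nothing is removed.
raw_result = deduplicate(rows)

# Normalizing first lets the three Darren rows collapse into one.
normalized_rows = [tuple(normalize(v) for v in row) for row in rows]
clean_result = deduplicate(normalized_rows)

print(len(raw_result), len(clean_result))  # prints: 4 2
```

Note that this sketch overwrites the original values with their normalized form. If the original casing matters, one option is to normalize into a separate column and deduplicate on that column instead.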
