5 Principles You Need to Know Before Using Google Cloud Dataprep for Data Preparation

Published in Google Cloud - Community, Aug 1, 2018

Google Cloud Dataprep is an intelligent data service on Google Cloud Platform for exploring, cleaning, and preparing structured and unstructured data.

Here are 5 principles worth knowing before you start preparing data with Dataprep.

1. Create baseline dataset before profiling source data

Before you start cleaning your dataset, it helps to create a profile of the source data. First, create a minimal recipe on the dataset once you have loaded it into the Transformer page. Then, click Run Job to generate a profile of the data. You can use this profile as a baseline for validation and for tracing data problems you discover later back to their origin.
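To make the idea of a baseline concrete, here is a minimal Python sketch of what a profile might record and how it could be compared with a later one. It is not how Dataprep computes its profiles; the `profile` and `diff_profiles` helpers and the sample rows are hypothetical and only illustrate the principle.

```python
def profile(rows, columns):
    """Summarize each column: row count, missing values, distinct values."""
    summary = {}
    for i, name in enumerate(columns):
        values = [row[i] for row in rows]
        present = [v for v in values if v not in (None, "")]
        summary[name] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return summary


def diff_profiles(baseline, current):
    """Return the columns whose summary differs from the baseline."""
    return {
        name: (stats, current.get(name))
        for name, stats in baseline.items()
        if stats != current.get(name)
    }


columns = ["name", "city"]
source = [["Darren", "NY"], ["DARREN", ""], ["Alina", "SF"]]
cleaned = [["Darren", "NY"], ["Alina", "SF"]]

baseline = profile(source, columns)  # taken before any cleaning step
changes = diff_profiles(baseline, profile(cleaned, columns))
for name, (before, after) in changes.items():
    print(name, before, "->", after)
```

Keeping the baseline around means that when a later step produces an unexpected value, you can check whether the problem was already present in the source or was introduced by the recipe.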

2. Normalize data before applying Deduplicate Transform

Removing identical rows from your dataset after a uniqueness check is a common step in data preparation. Google Cloud Dataprep provides a single transform, deduplicate, which removes identical rows from your dataset.

The transform has 2 limitations:

This transform is case-sensitive. So, if a column has values Darren and DARREN, the rows containing those values are not considered duplicates and cannot be removed with this transform.
Whitespace at the beginning and end of values is not ignored, so values that differ only in leading or trailing spaces are not considered duplicates either.
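The effect of these two limitations, and why normalizing first helps, can be sketched in plain Python. This is an illustration of the principle under stated assumptions, not Dataprep's implementation: `exact_dedupe` and `normalize` are hypothetical helpers, and lowercasing plus trimming is just one possible normalization.

```python
def exact_dedupe(rows):
    """Drop rows identical to an earlier row, comparing values exactly."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def normalize(row):
    """Trim surrounding whitespace and lowercase every string value."""
    return [v.strip().lower() if isinstance(v, str) else v for v in row]


rows = [
    ["Darren", "NY"],
    ["DARREN", "NY"],     # differs only by case
    [" Darren ", "NY"],   # differs only by surrounding whitespace
    ["Alina", "SF"],
]

# Exact matching keeps all four rows; normalizing first leaves two.
without_normalizing = exact_dedupe(rows)
with_normalizing = exact_dedupe([normalize(r) for r in rows])
print(len(without_normalizing), len(with_normalizing))
```

Note that this sketch overwrites the original casing. If the original values need to be kept, one option is to normalize into a separate column and deduplicate on that instead.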

Written by Alina Zhang