5 Principles You Need to Know Before Using Google Cloud Dataprep for Data Preparation
Google Cloud Dataprep is an intelligent data service on Google Cloud Platform for exploring, cleaning, and preparing structured and unstructured data.
Here are 5 principles worth knowing before you start preparing data with Dataprep.
1. Create baseline dataset before profiling source data
Before you start cleaning your dataset, it is helpful to create a virtual profile of the source data. First, create a minimal recipe on the dataset after you have loaded it into the Transformer page. Then, click Run Job to generate a profile of the data. You can use this profile as a baseline for validating your results and for tracing data problems back to their origin.
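To make the idea of a baseline profile concrete, here is a minimal sketch in plain Python. It is not how Dataprep computes its profiles; the function name `profile_column` and the sample values are assumptions chosen for illustration. It only shows the kind of summary (row count, missing values, distinct values) that you can record before cleaning and compare against later.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: rows, missing values, distinct values, top values.

    Hypothetical helper for illustration; not part of Dataprep.
    """
    # Treat None and the empty string as missing.
    non_missing = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "missing": len(values) - len(non_missing),
        "distinct": len(set(non_missing)),
        "top": Counter(non_missing).most_common(3),
    }


# Made-up source column, profiled before any cleaning step.
source = ["Darren", "DARREN", " Darren", "", "Alice"]
baseline = profile_column(source)
print(baseline["rows"], baseline["missing"], baseline["distinct"])
```

If a later profile of the cleaned column shows a different row count or an unexpected drop in distinct values, the baseline tells you whether the problem was already in the source or was introduced by a recipe step.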
2. Normalize data before applying Deduplicate Transform
Removing identical rows from your dataset after a uniqueness check is a common step in data preparation. Google Cloud Dataprep provides a single transform, deduplicate, which removes identical rows from your dataset.
There are 2 limitations:
- This transform is case-sensitive. If a column has the values Darren and DARREN, the rows containing those values are not considered duplicates and are not removed by this transform.
- Whitespace at the beginning and end of values is not ignored.
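The effect of these two limitations can be sketched in plain Python. This is not Dataprep's implementation or recipe syntax; `normalize`, `deduplicate`, and the sample rows are hypothetical names and data used only to show why trimming whitespace and unifying case first lets an exact-match deduplication catch more rows.

```python
def normalize(value):
    """Trim surrounding whitespace and lowercase the value."""
    return value.strip().lower()


def deduplicate(rows):
    """Exact-match deduplication that keeps the first occurrence of each row.

    Like the transform described above, it is case-sensitive and
    does not ignore leading or trailing whitespace.
    """
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Made-up rows: three spellings of the same person plus one other row.
rows = [
    ("Darren", "NY"),
    ("DARREN", "NY"),
    (" Darren ", "NY"),
    ("Alice", "LA"),
]

# Without normalization, all four rows count as distinct.
raw_result = deduplicate(rows)

# Normalize every value first, then deduplicate.
normalized = [tuple(normalize(v) for v in row) for row in rows]
clean_result = deduplicate(normalized)

print(len(raw_result), len(clean_result))
```

Lowercasing is only one possible normalization. If the original casing matters downstream, you could normalize into a separate key column and deduplicate on that instead.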