Data Pre-Processing with Google Cloud Dataprep

Praveen kumar Arja
Jun 14, 2018 · 5 min read

One of the most important things while building an Machine Learning based model is to prepare/cleanse the raw data. Working on one such project gave me a chance to explore one of Google’s product — Cloud Dataprep. After a preliminary analysis of this dataset, my senior quickly realized that the dataset we are dealing with can leverage full capabilities of Google Cloud Dataprep for quicker data cleansing. I had just tested Cloud Dataprep at a high level before, and this was the opportunity to learn as much as I could about the application of Cloud Dataprep.

One of the main reasons to choose Dataprep is to reduce efforts of cleaning the dataset significantly. Another advantage of Dataprep is that it loads only sample of records from original dataset, increasing responsiveness while cleaning the dataset.

I thought I’d share my learning here using a sample dataset ‘USA Presidential Campaign Donations’ (link) and try to understand Dataprep in-depth.

DataPreP can be launched from the Big Data part of google cloud console.

There are some public datasets available here for practising. To create your own flow, click on Create Flow on the home page of Dataprep. Once the flow is created, I had to add datasets and import them on to this flow, which can be achieved in multiple ways:

  1. Upload datasets (CSV, JSON, or relational table formats) from your PC to the console
  2. Import the datasets from Google cloud Data storage
Fig 2: Import datasets and add to the flow with the available options

As I was learning, I have loaded the available flow USA Presidential Campaign Donations, which has two pre-loaded datasets. Each flow has a ‘recipe’ which has the steps to clean/process the dataset. Click on edit recipe option (as shown in below screenshot) to preview samples (fig:4) from the original dataset.

Dataset selection by Dataprep

Dataprep selects a sample of records from the original dataset proportionally from the original distribution of the dataset. For example, if the original dataset has 200,000 records, Dataprep samples approximately 10,000 to 20,000 records for adding/editing a recipe. The picture in below shows the sample dataset on the console of Dataprep.

Fig 3: User Interface of Dataprep once the dataset is loaded

Once the dataset is loaded, I was able to see the sample records as shown in above picture. Here we see the type of data for each column and a histogram of its distribution. Once I had an understanding of the dataset and data types, I’ve proceeded on adding the recipe to make changes on this sample of records. Below are the steps I have implemented to process the dataset.

I wanted to remove the unnecessary columns or rename columns or change the data type (like changing string to number). I have clicked on add step on Dataprep as shown in the screenshot below (the transformation is on the right side of the picture). I have selected ‘Rename columns’ transformation and selected the column8 to change it’s name to ‘party_Affliation’:

Fig 4: Renamed column8 to ‘Party_Affliation’

Another important step of data cleaning is to remove the columns with majority of missing values i.e. more than 50%. In order to check the percentage of missing values from each column, or analyse the column , I just had to click on any column dropdown menu (next to the column name) and click on column details. To go back to the previous screen, click on the grid tab on the top left. The screenshot in below shows the all details of column5.

Fig 4: Detailed description of a column, 59 % of missing records

The above picture shows about 59% missing records and it is ideal to drop these type of columns. I have dropped this column by adding a new step to the recipe and select the transformation ‘delete columns’ and select the columns as shown in the screenshot in below. I have removed up to 14 columns which are sparse and irrelevant for modelling.

Fig 6: Deleting columns

Next important step in cleaning the dataset, is filling the missing values in the columns. We can either fill them with ‘NULL’ or fill with custom values. For ex: I wanted to fill the missing values in column7 (numerical column) with the column’s average value as shown in the picture below:

Fig 7: Fill/replace missing values by apply formulae. The light orange column is a preview of the transformation.

Below picture shows processing of column 14. Column14 is a categorical column (string type) with 15,710 unique values. Using Dataprep, I could reduce them to 5 main categories with the help of conditional transformations. The detailed steps are provided after the below screenshot.

Fig 8: Original column14 had 15,000 plus categories. Post transformation it has come down to 5 categories — column14_new

Steps added to recipe to reduce categories in column 14:

Steps 11 to 16 has big regular expression queries which helped me to reduce the number of categories significantly. Here, I have implemented if/else condition to check for the matched pattern. The set function will transform the entire record if it finds single match pattern. Eg: one of the record has ‘fire station office’, I have queried for ‘fire’ and used set function to transform the record with ‘fire station office’ to ‘fire’ and subsequently transform all the records in the column which has word ‘fire’ in it.

Once the set of steps are completed, I have run the job on the entire dataset by clicking on ‘Run Job’ button on the top right corner of the console. The processed dataset is saved in Google Cloud Storage bucket after the job is completed.

Fig 10: The execution of Run job starts the job as seen on the left side of the image

To replicate the recipe on some other dataset, download the wrangle/csv for the recipe (which can be opened with any text editor) and copy the steps one by one into the recipe of new flow. Whilst copying the steps, one needs to carefully ensure the correct column names (from the source data set) are specified, to avoid errors.

Fig 11: Downloading recipe

Thank you for reading this and I hope this helped you get started with DataPrep. Check out the rest of interesting stuff from our engineering blog here

Happy ‘Dataprep’ing. :)

Searce Engineering

We identify better ways of doing things!

Praveen kumar Arja

Written by

Data Scientist , Searce | AI/ML practitioner | Masters in Chemical Engineering from a forest | Football enthusiast.

Searce Engineering

We identify better ways of doing things!

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade