One of the most important tasks while building a machine-learning model is preparing and cleansing the raw data. Working on one such project gave me a chance to explore one of Google’s products — Cloud Dataprep. After a preliminary analysis of this dataset, my senior quickly realized that the dataset we were dealing with could leverage the full capabilities of Google Cloud Dataprep for quicker data cleansing. I had only tested Cloud Dataprep at a high level before, and this was the opportunity to learn as much as I could about applying it.
One of the main reasons to choose Dataprep is that it significantly reduces the effort of cleaning a dataset. Another advantage is that it loads only a sample of records from the original dataset, which keeps the interface responsive while you clean.
I thought I’d share my learnings here using a sample dataset, ‘USA Presidential Campaign Donations’ (link), and try to understand Dataprep in depth.
Dataprep can be launched from the Big Data section of the Google Cloud console.
There are some public datasets available here for practising. To create your own flow, click Create Flow on the Dataprep home page. Once the flow is created, datasets can be added and imported into it in multiple ways:
- Upload datasets (CSV, JSON, or relational table formats) from your PC to the console
- Import the datasets from Google Cloud Storage
As I was learning, I loaded the available flow USA Presidential Campaign Donations, which has two pre-loaded datasets. Each flow has a ‘recipe’ containing the steps to clean/process the dataset. Click on the edit recipe option (as shown in the screenshot below) to preview samples (fig. 4) from the original dataset.
Dataset selection by Dataprep
Dataprep selects a sample of records from the original dataset in proportion to its original distribution. For example, if the original dataset has 200,000 records, Dataprep samples approximately 10,000 to 20,000 of them for adding/editing a recipe. The picture below shows the sample dataset on the Dataprep console.
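Dataprep performs this sampling internally, but conceptually it resembles a stratified random sample, where each category keeps roughly its share of the original distribution. A minimal pandas sketch of that idea, with an entirely hypothetical `state` column standing in for a real categorical field:

```python
import pandas as pd

# Hypothetical toy dataset: 'state' is a stand-in categorical column.
df = pd.DataFrame({
    "state": ["CA"] * 6 + ["TX"] * 3 + ["NY"] * 1,
    "amount": range(10),
})

# Draw a 50% sample per group, so each category appears in the
# sample roughly in proportion to its share of the original data.
sample = df.groupby("state", group_keys=False).sample(frac=0.5, random_state=0)
print(sample["state"].value_counts())
```

This is only an illustration of proportional sampling, not a description of Dataprep’s actual sampling implementation.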
Once the dataset was loaded, I was able to see the sample records as shown in the picture above. Here we see the data type of each column and a histogram of its distribution. Once I had an understanding of the dataset and its data types, I proceeded to add the recipe to transform this sample of records. Below are the steps I implemented to process the dataset.
I wanted to remove unnecessary columns, rename columns, or change data types (for example, from string to number). I clicked Add Step in Dataprep as shown in the screenshot below (the transformation is on the right side of the picture), selected the ‘Rename columns’ transformation, and chose column8 to change its name to ‘party_Affliation’:
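For readers more comfortable in code, the rename step above has a direct pandas equivalent. The frame below is a made-up stand-in for the sampled dataset; only the column names `column8` and `party_Affliation` come from the article:

```python
import pandas as pd

# Minimal stand-in for the sampled dataset; values are invented.
df = pd.DataFrame({"column8": ["REP", "DEM"], "column7": [250.0, None]})

# Equivalent of Dataprep's 'Rename columns' transformation.
df = df.rename(columns={"column8": "party_Affliation"})
print(df.columns.tolist())
```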
Another important data-cleaning step is removing columns with a majority of missing values, i.e. more than 50%. To check the percentage of missing values in a column, or to analyse the column in general, I just had to open any column’s dropdown menu (next to the column name) and click Column Details. To go back to the previous screen, click the Grid tab on the top left. The screenshot below shows all the details of column5.
The picture above shows about 59% missing records, and it is best to drop such columns. I dropped this column by adding a new step to the recipe, selecting the ‘Delete columns’ transformation, and choosing the columns as shown in the screenshot below. I removed up to 14 columns which were sparse and irrelevant for modelling.
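The same 50% rule can be sketched in pandas. The toy frame below mimics the situation in the article, with `column5` mostly missing (the actual data and threshold logic in Dataprep are driven through the UI, not code):

```python
import pandas as pd

# Toy frame: column5 is 60% missing, similar to the article's column.
df = pd.DataFrame({
    "column5": [None, None, None, "x", "y"],
    "column7": [10.0, 20.0, None, 40.0, 50.0],
})

# Drop every column whose missing-value ratio exceeds 50%.
missing_ratio = df.isna().mean()
df = df.drop(columns=missing_ratio[missing_ratio > 0.5].index)
print(df.columns.tolist())
```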
The next important step in cleaning the dataset is filling in the missing values. We can fill them either with ‘NULL’ or with custom values. For example, I wanted to fill the missing values in column7 (a numerical column) with the column’s average, as shown in the picture below:
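A quick pandas sketch of this mean-imputation step, using invented values for `column7`:

```python
import pandas as pd

# column7 stands in for the article's numerical column with gaps.
df = pd.DataFrame({"column7": [100.0, None, 300.0, None]})

# Fill missing values with the column's mean, as in the Dataprep step.
df["column7"] = df["column7"].fillna(df["column7"].mean())
print(df["column7"].tolist())  # mean of 100 and 300 is 200
```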
The picture below shows the processing of column14, a categorical column (string type) with 15,710 unique values. Using Dataprep, I was able to reduce them to 5 main categories with the help of conditional transformations. The detailed steps are listed after the screenshot.
Steps added to recipe to reduce categories in column 14:
Steps 11 to 16 contain large regular-expression queries, which helped me reduce the number of categories significantly. Here I implemented an if/else condition to check for a matched pattern; the set function rewrites the entire record when it finds a single matching pattern. For example, one record contained ‘fire station office’: I queried for ‘fire’ and used the set function to change that record’s value from ‘fire station office’ to ‘fire’, and subsequently transformed every record in the column containing the word ‘fire’.
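The match-then-set pattern described above can be sketched in pandas as well. The sample values and the `categories` list here are hypothetical; the article’s real recipe uses Dataprep’s set function with regular expressions:

```python
import pandas as pd

# Hypothetical column14 values; the real ones come from the dataset.
df = pd.DataFrame({
    "column14": ["fire station office", "police dept", "city fire hall"],
})

# Mimic the recipe's if/else + set logic: if a record matches a
# keyword, replace the whole value with that category.
categories = ["fire", "police"]

def categorise(value):
    for cat in categories:
        if cat in value.lower():
            return cat
    return "other"

df["column14"] = df["column14"].map(categorise)
print(df["column14"].tolist())  # ['fire', 'police', 'fire']
```

This collapses thousands of raw strings into a handful of categories, which is the same effect the recipe steps achieve on column14.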
Once the set of steps was complete, I ran the job on the entire dataset by clicking the ‘Run Job’ button in the top-right corner of the console. After the job completes, the processed dataset is saved to a Google Cloud Storage bucket.
To replicate the recipe on another dataset, download the wrangle/CSV file for the recipe (it can be opened with any text editor) and copy the steps one by one into the recipe of the new flow. While copying the steps, carefully ensure the correct column names (from the source dataset) are specified, to avoid errors.
Thank you for reading, and I hope this helped you get started with Dataprep. Check out the rest of the interesting stuff on our engineering blog here.
Happy ‘Dataprep’ing. :)