Introduction to Cloud Dataprep
A quick introduction to Google Cloud Dataprep as a tool for preparing and analyzing datasets. By Alan Kohler
Cloud Dataprep is an intelligent data preparation web application on the Google Cloud Platform, offered in partnership with Trifacta, a privately held software company headquartered in San Francisco.
The tool lets data analysts and data scientists visually explore, clean, and transform a dataset through a simple drag-and-drop workflow in a user-friendly interface, instead of writing complicated code.
It is a serverless, fully managed service that can drastically reduce and simplify the work required to get data ready for analysis, reporting, and machine learning.
In this blog post, I'll give an introduction to this tool. The target audience is beginner- and intermediate-level users.
Part A — Create Flows and Import Datasets
The first step is to create a flow, a container object used to access and manipulate datasets.
Multiple datasets can be imported from a staging Google Cloud Storage bucket, by uploading a local file, or directly from BigQuery.
The supported file formats are CSV, JSON, plain text, LOG, TSV, Parquet, Avro, Google Sheets, and Excel (XLS/XLSX).
Once the data has been imported, Dataprep automatically detects the schema, data types, value distributions, and outliers.
Below is a summary of the steps required to create a flow and import a dataset:
Step 1 — Create and name a flow (Blank or from Template)
Step 2 — Click on Connect your data
Step 3 — Click on Import a dataset and select the appropriate source
Step 4 — Click on Import and add to flow
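To make the automatic schema detection more concrete, here is a rough local sketch of what happens after import, using pandas as an analogy (the data and column names are hypothetical; this is not Dataprep's own engine):

```python
import io
import pandas as pd

# Hypothetical sample data standing in for an imported CSV file.
csv_data = io.StringIO(
    "order_id,amount,order_date\n"
    "1,19.99,2021-03-01\n"
    "2,5.50,2021-03-02\n"
    "3,12.00,2021-03-02\n"
)

df = pd.read_csv(csv_data, parse_dates=["order_date"])

# Inferred schema and data types, analogous to Dataprep's automatic detection.
print(df.dtypes)

# Distribution of values, similar to Dataprep's column profiling.
print(df["amount"].describe())
```

Dataprep performs this kind of type inference and profiling automatically for every imported column and displays the result as histograms above each column.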
Part B — Apply Recipe Transformations
By clicking on Edit transformation, Dataprep loads an initial sample of the dataset: the first rows, up to 10 MB (or the complete dataset if it is smaller than 10 MB).
Additional samples based on other criteria can be created later for deeper analysis of specific features, by clicking on "Initial data" in the top left corner and then collecting a new sample.
Samples can be created at random, from the first rows, or based on a filter, anomaly, stratification, or cluster condition. Columns can also be filtered by data type in order to analyze all columns of a specific type (for example, all the timestamp columns).
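As a rough local analogy (not Dataprep's implementation), the main sampling strategies can be sketched in pandas; the dataset and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical dataset standing in for a large imported table.
df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE", "FR", "FR"],
    "revenue": [100, 250, 80, 120, 60, 300],
})

first_rows = df.head(3)                          # first-rows sample
random_sample = df.sample(n=3, random_state=42)  # random sample
filtered = df[df["revenue"] > 100]               # filter-based sample

# Stratified sample: an equal number of rows drawn from each group.
stratified = df.groupby("country").sample(n=1, random_state=42)
print(stratified)
```

In Dataprep the same choices are made through the sampling dialog rather than code, and the sample is collected by a background job on the full dataset.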
It is possible to select a specific interval of data by directly clicking on the distribution graph of a particular column.
Once you select a column, a column category, and optionally a specific value range, a suggestion box appears on the right with a list of potentially relevant transformations to apply.
The available transformations can be found on the top bar and include, for example, the following main functionalities:
Data cleaning:
- Rename columns
- Change the data type of a column
- Remove duplicates
- Modify, extract, or concatenate strings
- Format dates
Data Discovery:
- Filter data according to specified conditions
- Identify outliers
- Calculate aggregations like count or sum
Data Enrichment:
- Generate primary keys
- Join multiple tables
- Add metadata and lookup tables
Advanced Operations:
- Split columns
- Pivot tables
- Manipulate arrays and JSON files
- Identify missing and mismatched data
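Most of these transformations have direct equivalents in code. The following is an illustrative pandas sketch of a few of them (hypothetical column names; Dataprep applies the same kinds of operations through its visual recipe, not through pandas):

```python
import pandas as pd

# Hypothetical raw data with a duplicate row and string-typed numbers.
df = pd.DataFrame({
    "Name": ["alice", "bob", "bob"],
    "signup": ["2021-01-05", "2021-02-10", "2021-02-10"],
    "amount": ["10", "20", "20"],
})

# Data cleaning: rename a column, change a data type, remove duplicates.
df = df.rename(columns={"Name": "name"})
df["amount"] = df["amount"].astype(int)
df = df.drop_duplicates()

# String manipulation and date formatting.
df["name"] = df["name"].str.capitalize()
df["signup"] = pd.to_datetime(df["signup"]).dt.strftime("%d.%m.%Y")

# Data discovery: filtering and aggregation.
big_orders = df[df["amount"] > 15]
total = df["amount"].sum()
print(df)
print(total)
```

In Dataprep, each of these steps would appear as one entry in the recipe, which can be reordered or removed at any time before the job is run.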
A reference explanation of all the transformations available in Dataprep can be found here.
In the screenshot above, the red box highlights the available functions and the blue box the column filter.
Below is a summary of the steps required to apply transformations to a dataset:
Step 5 — Click on Edit recipe
Step 6 — Select a column or a column group from the filter icon
Step 7 — Optionally select a range of values from the distribution graph or by editing the transformation wrangler code
Step 8 — Apply multiple recipe transformations and/or filter columns
Step 9 — Create a new sample of the data and repeat the steps above until you are satisfied
Part C — Run Dataflow Job
Once you are satisfied with the current transformation recipe (i.e., the list of transformations to be applied), possibly refined across multiple samples, you can consolidate the dataset into a new file by running a Dataflow job.
The output location is set by default to a new Cloud Storage bucket, but the result can also be written directly to BigQuery if necessary. You can monitor the job in the Dataflow console. The data can be exported as a compressed CSV file, a compressed JSON file, or in Avro format. It is also possible to use the Automator to run the job in Dataflow periodically or according to a specific schedule.
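To illustrate what the compressed export formats look like, here is a small local sketch using pandas (the file names are hypothetical; in Dataprep the actual writing is performed by the Dataflow job, not by pandas):

```python
import pandas as pd

# Hypothetical transformed dataset ready for export.
df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.5, 30.1]})

# Compressed CSV and JSON outputs, comparable to Dataprep's export options.
df.to_csv("output.csv.gz", index=False, compression="gzip")
df.to_json("output.json.gz", orient="records", lines=True, compression="gzip")

# Read the CSV back to verify the round trip.
restored = pd.read_csv("output.csv.gz", compression="gzip")
print(restored)
```

Compressed outputs like these are convenient for loading into BigQuery or for archiving in Cloud Storage.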
Below is a summary of the steps required to export the transformed dataset:
Step 10 — Click on Run on the top right corner
Step 11 — Inside Publishing actions, select the destination file name and type (either in Cloud Storage or BigQuery). It needs to be in the same location as the source file
Step 12 — Optionally add a SQL script
Step 13 — Click on Run on the bottom right corner
Step 14 — Wait for the Dataflow Job to complete and you will find your new transformed dataset in the specified location
Limitations
Sample sizes are limited to 10 MB, and directly uploaded files may be at most 1 GB.
Although the import supports a variety of file encodings, the application only displays data in UTF-8 (UTF-32 is not supported).
Furthermore, the maps generated in visual profiling with a State and a Zip Code column only display data for the United States.
Depending on the edition and the Dataprep subscription selected, there may be further limitations.
Conclusion
In this blog post, the Cloud Dataprep tool was briefly introduced, along with the basic steps to use it, some examples of transformations that can be applied, and its main limitations. The tool is particularly useful for cleaning and analyzing a dataset before doing any other complicated ETL transformations.
The best way to familiarize yourself with Cloud Dataprep is by trying it out directly. Have fun with Data Analysis!
More detailed information about specific Dataprep functions might be published in future blog posts from datadice.
Further Links
This post is part of our new Google Cloud series in the Data School of datadice. We will be sharing the best insights that we’ve learned through years of creating data solutions.
If you want to learn more about how to use Google Data Studio and take it to the next level in combination with BigQuery, check our Udemy course here.