Introduction to Cloud Dataprep

A quick introduction to Google Cloud Dataprep as a tool for preparing and analyzing a dataset. By Alan Kohler



Cloud Dataprep is an intelligent data preparation web application on the Google Cloud Platform, offered in partnership with Trifacta, a privately owned software company headquartered in San Francisco.

This tool lets data analysts and data scientists visually explore, clean, and transform a dataset through a simple drag-and-drop workflow in a user-friendly interface, instead of writing complicated code.

It is a serverless, fully managed service that can drastically reduce and simplify the work required to get data ready for analysis, reporting, and machine learning.

In this blog post, I'll give an introduction to this tool. It is aimed at readers at the beginner and intermediate levels.

Photo by Trifacta

Part A — Create Flows and Import Datasets

In the first step, you need to create a flow, a container object used to access and manipulate a dataset.

Multiple datasets can be imported from a staging Google Cloud Storage bucket, by uploading a local file, or directly from BigQuery.

The supported file formats are CSV, JSON, plain text, LOG, TSV, Parquet, Avro, Google Sheets, and Excel (XLS/XLSX).

Once the data has been imported, Dataprep automatically detects the schema, the data types, the distribution of values, and outliers.

Below is a summarized sequence of the steps required to create a flow and import a dataset:

Step 1 — Create and name a flow (Blank or from Template)

Step 2 — Click on connect your data

Step 3 — Click on Import a dataset by selecting the appropriate source

Step 4 — Click on import and add to flow
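
If you prefer to stage a file from code rather than through the upload dialog, you can first copy it into a Cloud Storage bucket and then import it into the flow from there. Below is a minimal sketch using the google-cloud-storage client library; the bucket name and file paths are placeholders, not values from this post.

# Minimal sketch: stage a local CSV in a Cloud Storage bucket so it can be
# imported into a Dataprep flow. Bucket name and paths are placeholders.
from google.cloud import storage

BUCKET_NAME = "my-dataprep-staging-bucket"  # assumed staging bucket
LOCAL_FILE = "sales_2021.csv"               # hypothetical local file
DESTINATION = "staging/sales_2021.csv"      # object path inside the bucket

client = storage.Client()                   # uses your default GCP credentials
bucket = client.bucket(BUCKET_NAME)
bucket.blob(DESTINATION).upload_from_filename(LOCAL_FILE)
print(f"Uploaded to gs://{BUCKET_NAME}/{DESTINATION}")

In the import dialog, the file will then show up when browsing the selected Cloud Storage bucket.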

Part B — Apply Recipe Transformations

By clicking on Edit recipe, Dataprep loads an initial sample of the dataset. This initial sample consists of the first rows of the dataset, up to roughly 10 MB (or the complete dataset if it is smaller than 10 MB).

Additional samples based on specific criteria can be created later for a deeper analysis of certain features, by clicking on “initial data” in the top left corner and then collecting a new sample.

Samples can be created at random, by selecting the first rows, or by a filter-based, anomaly-based, stratified, or cluster-based condition. To restrict the analysis to columns of a specific data type, you can filter the columns by that type (for example, all the timestamp columns).
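
Dataprep generates these samples for you, but if you want an intuition for the difference between, say, random and stratified sampling, the following pandas sketch illustrates it (purely illustrative, not Dataprep code; the DataFrame and the country column are invented):

# Illustrative pandas sketch of two of the sampling strategies mentioned above.
import pandas as pd

df = pd.DataFrame({
    "country": ["DE", "DE", "DE", "US", "US", "FR"],
    "revenue": [120, 95, 80, 300, 250, 60],
})

# Random sample: pick roughly half of the rows uniformly at random.
random_sample = df.sample(frac=0.5, random_state=42)

# Stratified sample: sample within each group, keeping at least one row per
# country so every group stays represented.
stratified_sample = (
    df.groupby("country", group_keys=False)
      .apply(lambda g: g.sample(n=max(1, len(g) // 2), random_state=42))
)

print(random_sample)
print(stratified_sample)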

It is possible to select a specific interval of data by directly clicking on the distribution graph of a particular column.

Once you select a column, a column category, and optionally a range of values, a suggestion box appears on the right with a list of potentially relevant transformations to apply.

Dataprep Wrangler Page with Suggestion Box

The available transformations can be found on the top bar and include, for example, the following main functionalities:

Data Cleaning:

  • Rename columns
  • Change the data type of a column
  • Remove duplicates
  • Modify, extract, or concatenate strings
  • Format dates

Data Discovery:

  • Filter data according to specified conditions
  • Identify outliers
  • Calculate aggregations like count or sum

Data Enrichment:

  • Generate primary keys
  • Join multiple tables
  • Add metadata and lookup tables

Advanced Operations:

  • Split columns
  • Pivot tables
  • Manipulate arrays and JSON files
  • Identify missing and mismatched data

A reference explanation of all the transformations available in Dataprep can be found here.
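
The recipe steps themselves are built in the UI (or in Dataprep's Wrangle language), but for readers coming from code, the following pandas sketch shows a rough equivalent of a few of the cleaning steps listed above: renaming a column, changing a data type, removing duplicates, and formatting dates. The DataFrame and column names are invented for illustration.

# Rough pandas analog of some of the Dataprep cleaning steps above (illustrative only).
import pandas as pd

df = pd.DataFrame({
    "Order Date": ["2021-08-01", "2021-08-02", "2021-08-02"],
    "amount": ["10.5", "20.0", "20.0"],
})

# Rename a column.
df = df.rename(columns={"Order Date": "order_date"})

# Change the data type of a column.
df["amount"] = df["amount"].astype(float)

# Remove duplicate rows.
df = df.drop_duplicates()

# Format dates: parse the strings into datetimes, then reformat them.
df["order_date"] = pd.to_datetime(df["order_date"]).dt.strftime("%d.%m.%Y")

print(df)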

Dataprep Wrangler Page

In the screenshot above, the red box highlights the available functions and the blue box highlights the column filter.

Below is a summarized sequence of the steps required to apply transformations to a dataset:

Step 5 — Click on edit recipe

Step 6 — Select a column or a column group from the filter icon

Step 7 — Optionally select a range of values from the distribution graph or by editing the transformation wrangler code

Step 8 — Apply multiple recipe transformations and/or filter columns

Step 9 — Create a new sample of the data and repeat the steps above until you are satisfied

Part C — Run Dataflow Job

Once you are satisfied with the current transformation recipe (i.e. the list of transformations to be applied), possibly refined over multiple samples, you can consolidate the dataset into a new file by running a Dataflow job.

The output location defaults to a new Cloud Storage bucket, but the result can also be written directly into BigQuery if necessary. You can follow the job in the Dataflow console. The data can be exported as compressed CSV, compressed JSON, or Avro. It is also possible to use Automator to run the job in Dataflow periodically or according to a specific schedule.
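
If the job writes to Cloud Storage and you later want the result in BigQuery after all, the exported file can also be loaded with the BigQuery client library. Here is a minimal sketch; the bucket, file name, and table ID are placeholders, and a CSV export with a header row is assumed.

# Minimal sketch: load a Dataprep output file from Cloud Storage into BigQuery.
# URI and table ID are placeholders; a CSV export with a header row is assumed.
from google.cloud import bigquery

client = bigquery.Client()

uri = "gs://my-dataprep-output-bucket/cleaned_sales.csv"  # assumed output file
table_id = "my-project.my_dataset.cleaned_sales"          # assumed destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")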

Dataprep Run Job Page

Below is a summarized sequence of the steps required to export the transformed dataset:

Step 10 — Click on Run on the top right corner

Step 11 — Inside the publishing action, select the destination file name and type (either in Cloud Storage or BigQuery). The destination must be in the same location as the source file

Step 12 — Optionally add a SQL script

Step 13 — Click on Run on the bottom right corner

Step 14 — Wait for the Dataflow Job to complete and you will find your new transformed dataset in the specified location

Dataflow Pipeline
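
As a quick sanity check after the job finishes, you can list the objects written to the output bucket, for example with the Cloud Storage client library. The bucket name and prefix below are placeholders.

# Minimal sketch: list the output files that the job wrote to Cloud Storage.
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs("my-dataprep-output-bucket", prefix="jobrun/"):
    print(blob.name, blob.size)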

Limitations

The size of each sample is limited to 10 MB, and files uploaded directly can be at most 1 GB.

Although the import supports a variety of file encoding types, the application only displays UTF-8 encodings (UTF-32 is not supported).

Furthermore, the maps generated in visual profiling with a State and a Zip Code column only display data for the United States.

Depending on the Dataprep edition and subscription selected, there might be further limitations.

Conclusion

In this blog post, Cloud Dataprep was briefly introduced, along with the basic steps for using it, some examples of transformations that can be applied, and its main limitations. This tool is particularly useful for cleaning and analyzing a dataset before performing any other complicated ETL transformation.

The best way to familiarize yourself with Cloud Dataprep is by trying it out directly. Have fun with Data Analysis!

More detailed information about specific Dataprep functions might be published in future blog posts from datadice.

Further Links

This post is part of our new Google Cloud series in the Data School of datadice. We will be sharing the best insights that we’ve learned through years of creating data solutions.

If you want to learn more about how to use Google Data Studio and take it to the next level in combination with BigQuery, check our Udemy course here.


datadice
Data School

Data Analytics Boutique, based in Coburg, DE. Building modern, customized and cloud-based data warehouse solutions. https://www.datadice.io/