Dataprep is all you need for a data preparation job on GCP

Zhejie Hu
FLUX Tech Blog
Sep 25, 2021 · 5 min read

ETL pipelines are crucial in virtually every application that involves data these days. In the good old days, simple Python scripts did the job well, but as times change, the ability to handle real-time data streams and large batch workloads has become essential, especially for companies like ours, FLUX Inc. Right now we handle more than 10 TB of data per day, plus millions of server access logs generated by our performance analytics and cloud providers.

A typical data pipeline architecture

Traditionally, building data pipelines means coding in Java or Python with frameworks like Hadoop, Spark, or Beam, which requires intensive work over weeks or months. But for a startup that moves quickly from idea to idea, speed of development is a must. That is why I want to introduce a solution we use heavily at our company for quickly preparing data pipelines, with a graphical UI and automated code generation: Dataprep.

Google Cloud Dataprep is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning. This article is based on a marketing dataset from Kaggle, and our task is to complete a data preparation job that produces clean, structured data through Dataprep. It involves no coding, which saves time, and its wide range of built-in functions covers the vast majority of needs. A simple flow chart of the data preparation job is shown as follows:

Made by Lucidchart

Now, let’s start our journey!

Preparations

Before building the data flow in Dataprep, you need to do some preparation work.

  • Prepare a Cloud Storage bucket or a BigQuery table for the output.

We chose Cloud Storage and created a target bucket named flux_target in the asia-northeast1 region.

Create a new bucket
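If you prefer to script this step instead of clicking through the console, a minimal sketch using the google-cloud-storage Python client could look like the following; the project ID is a placeholder for your own setup.

```python
# A rough sketch: create the target bucket with the google-cloud-storage
# client instead of the Cloud Console. Assumes the library is installed
# (`pip install google-cloud-storage`) and credentials are configured.
from google.cloud import storage

client = storage.Client(project="your-gcp-project")  # placeholder project ID
bucket = client.create_bucket("flux_target", location="asia-northeast1")
print(f"Created bucket {bucket.name} in {bucket.location}")
```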

Configuration for Running

This step is important, although it can alternatively be done after the next step. It helps you avoid errors like the following:

"Cannot read and write in different locations: source: asia-northeast1, destination: US", "reason": "invalid"

You need to change the run configuration. You will find the configuration page by clicking your account icon at the bottom left of the home page and then selecting the item named Preferences.

You need to verify the following things:

  • The Upload directory, Job run directory, and Temp directory are created in the target bucket (e.g. flux_target) or in other buckets in the same region,
  • The Regional Endpoint and Zone in the execution settings are consistent with asia-northeast1 (a quick programmatic check of the bucket region is sketched below).
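As a sanity check, you can confirm the bucket region programmatically before running a job. This is a sketch with the google-cloud-storage client; the project ID is again a placeholder.

```python
# A quick sanity check that the target bucket really lives in the same
# region as the Dataprep job's regional endpoint (asia-northeast1).
from google.cloud import storage

client = storage.Client(project="your-gcp-project")  # placeholder project ID
bucket = client.get_bucket("flux_target")
# GCS reports locations in upper case, e.g. "ASIA-NORTHEAST1"
assert bucket.location == "ASIA-NORTHEAST1", f"unexpected region: {bucket.location}"
```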

Once the above configuration is complete, all you need to do is build a data flow.

Building a Customized Dataflow

Dataprep provides a data flow template from Cloud Storage to BigQuery. As our raw data is in BigQuery, we chose to start from scratch and create a new flow.

Now, you have a data flow with three components:

  • Dataset (Input),
  • Recipe for Transforming,
  • Output for transformed data.

Add Dataset

You can click the Dataset "+" icon to add your raw data from a specific BigQuery table or a Cloud Storage bucket. Uploading the raw data from your local computer is also allowed, and you can find the data file in the Upload directory later. In this article, we take the marketing data from Kaggle as an example. After uploading the data and renaming the data flow to marketing_dataflow, we get the following result.
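As an aside, if you would rather stage the Kaggle CSV in Cloud Storage first and import it from there, a minimal sketch might look like this; the object path and local file name are assumptions for illustration.

```python
# A sketch of staging the Kaggle CSV in the target bucket so that Dataprep
# can import it from Cloud Storage instead of a local upload.
from google.cloud import storage

client = storage.Client(project="your-gcp-project")   # placeholder project ID
bucket = client.bucket("flux_target")
blob = bucket.blob("raw/marketing_data.csv")           # hypothetical object path
blob.upload_from_filename("marketing_data.csv")        # hypothetical local file name
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```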

Edit Recipe

After you click Edit recipe at the top right of the page, a sample of your dataset is pulled up.

You get a quick overview of all the columns with their distributions and data types, and Dataprep flags the columns with missing values or inappropriate data types. Cleaning data becomes easy: you just click a problematic column, and Suggestions appear on the right side of the page. For example, if we click the Income column, which has some missing values, Delete row is suggested.

You can click Add to add this step to the recipe.
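For intuition, the suggested Delete row step is roughly equivalent to the following pandas one-liner; the file name is a placeholder for the Kaggle data.

```python
# What the suggested "Delete row" step does, expressed in pandas:
# drop every row whose Income value is missing.
import pandas as pd

df = pd.read_csv("marketing_data.csv")   # placeholder for the Kaggle file
df = df.dropna(subset=["Income"])
```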

Once you click New Step, a new transformation is ready to be added to the recipe, and there are many transformations to choose from. Some useful ones are listed below, with a rough code sketch after the list.

  • Formulas for numeric value transformation
  • Rename or delete rows
  • Find and replace values
  • Count matches
  • Extract data to generate a new row
  • Group by, merge, split
  • Object transformations (e.g. string to array)
  • A variety of statistical functions
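
To give a feel for what these transformations do, here is a rough pandas sketch of a few of them. The column names and values are assumptions based on the Kaggle marketing dataset and may differ from yours.

```python
# Rough pandas equivalents of a few of the transformations listed above.
# The column names are assumptions based on the Kaggle marketing dataset.
import pandas as pd

df = pd.read_csv("marketing_data.csv")                  # placeholder file name

# Rename a column
df = df.rename(columns={"Year_Birth": "birth_year"})    # assumed column name

# Find and replace values
df["Marital_Status"] = df["Marital_Status"].replace("Alone", "Single")  # assumed values

# Group by and aggregate
income_by_education = df.groupby("Education")["Income"].mean()

# Split one column into several
date_parts = df["Dt_Customer"].str.split("-", expand=True)  # assumed date format
```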

For example, we use the binary encoder to transform Education into numeric values, and the preview of the results is shown as follows:

You can find more usage examples and documentation about how to manage good recipes here.
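If you want to reproduce the Education encoding outside Dataprep, a minimal pandas sketch could look like the following; the file name is a placeholder and this is not Dataprep's exact encoder, just the general idea.

```python
# A minimal sketch of encoding the categorical Education column into
# numeric indicator columns (not Dataprep's exact encoder, just the idea).
import pandas as pd

df = pd.read_csv("marketing_data.csv")                  # placeholder file name
encoded = pd.get_dummies(df["Education"], prefix="Education")
df = pd.concat([df.drop(columns=["Education"]), encoded], axis=1)
```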

Manage Output and Run Your Job

This is the simplest step if you have completed the Configuration for Running. Once your recipe is done, click Run at the top right of the page. You can add manual or scheduled destinations through Add or Edit.

Finally, get your job running. You can see the state of the running job in Dataflow as follows:

It takes a few minutes for the job to run. Once it completes successfully, you can view the details of the transformations and find your prepared data in the target bucket (e.g. flux_target).
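A quick way to verify the output from a notebook is to read it straight back from the bucket. This sketch assumes pandas with the gcsfs integration is installed, and the object path is only an illustration.

```python
# A quick way to verify the prepared output: read it back from the target
# bucket. Requires `pip install pandas gcsfs`; the object path is a placeholder.
import pandas as pd

df = pd.read_csv("gs://flux_target/output/marketing_prepared.csv")
print(df.shape)
print(df.head())
```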

Thanks for reading, enjoy your Dataprep!

If you want to see more tutorials like this, don’t forget to subscribe to our publication!
