image source: Motionwares blog

ETL on Google Cloud with Dataprep

Muhammad Balogun
Geek Culture
7 min read · Jul 5, 2021


NB: It is assumed that the reader has a basic understanding of Big Data and of Google Cloud Platform, or at least of some cloud platform.

What is ETL?

ETL refers to the three sequential steps in data integration: Extract, Transform and Load. The process is used to blend data from disparate sources such as relational databases, flat files, web services, RSS feeds, XML, JSON and IoT devices. In the course of this sequence, data is extracted from one or more sources, transformed into a uniform format suitable for analysis, and loaded into a data warehouse or another repository of integrated data.

ETL makes it possible for businesses to gather enormous quantities of data from heterogeneous sources, bring them together in a single centralized system, and effectively turn them into actionable business intelligence.

Extract, Load, Transform (ELT) is an alternative to ETL in which data is loaded into a data lake in its raw format and only transformed afterwards.
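To make the distinction concrete, here is a minimal ETL sketch in Python, with pandas and a local SQLite database standing in for the source and the warehouse; the file, table and column names are purely illustrative.

```python
# Minimal ETL sketch. pandas and SQLite stand in for the real source and
# warehouse; "sales_raw.csv" and its columns are hypothetical.
import sqlite3

import pandas as pd

# Extract: pull raw data from a flat file (any of the sources above would do).
raw = pd.read_csv("sales_raw.csv")

# Transform: clean and standardize into a uniform format.
clean = raw.drop_duplicates()
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")
clean = clean.dropna(subset=["order_date", "amount"])

# Load: write the cleaned data into the target store (a warehouse in practice).
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="replace", index=False)
```

In an ELT flow the same raw data would be loaded first and the cleaning steps would run inside the target system afterwards.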

Google Cloud Dataprep
Designed by Trifacta, Dataprep is a fully managed Google Cloud data service for exploring, cleaning, structuring and enriching structured and unstructured data.

Dataprep is highly visual and interactive: it renders a file as tables and visual representations so users can explore the data at a glance. This makes the data easier to manipulate and accelerates the entire process.

Data extraction and transformation can be done entirely in code, but that is time-intensive and error-prone. With its well-developed UI, Dataprep suggests data transformations and creates a more efficient workflow without the hassle of writing code.

To run our data processing pipeline, we would rely on Google Cloud Dataflow. Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. It is serverless, cost-effective, fast and can handle both streaming and batch jobs.
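For context, the sketch below shows roughly what a hand-written Apache Beam pipeline looks like in Python; Dataprep generates an equivalent pipeline from your recipe automatically, so you never write this yourself, and the file names used here are placeholders.

```python
# A toy Apache Beam pipeline of the kind Dataflow executes.
# Dataprep builds its own pipeline from the recipe; this is only illustrative.
import apache_beam as beam

with beam.Pipeline() as p:  # runs locally by default; Dataflow is one of several runners
    (
        p
        | "Read" >> beam.io.ReadFromText("input.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 3)
        | "Format" >> beam.Map(",".join)
        | "Write" >> beam.io.WriteToText("output", file_name_suffix=".csv")
    )
```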

Dataprep ETL pipeline architecture

Open the Google Cloud console and sign in.
You can view the list of Google Cloud Products and Services by clicking the Navigation menu at the top-left.

The next step is to enable the Dataprep API. In the Cloud Console, enter “Dataprep API” in the top search bar and click on the result for Dataprep API.

Step 1: The Extraction Process

In this first step of the ETL process, structured and unstructured data are imported and consolidated into a single repository. Data can be extracted from disparate sources such as the following (a code sketch of this consolidation step appears after the list):

  1. On-premises environments.
  2. Cloud Storage.
  3. Data warehouses.
  4. Flat files.
  5. Databases.
  6. Applications.
image source: https://cloud.google.com/dataprep
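If you were to perform this consolidation in code rather than through Dataprep, it might look something like the sketch below; the file, database and table names are all hypothetical.

```python
# A hedged sketch of the extraction step: consolidating data from disparate
# sources into one staging dataset. All names here are hypothetical.
import sqlite3

import pandas as pd

csv_orders = pd.read_csv("orders.csv")            # flat file
json_orders = pd.read_json("orders_feed.json")    # JSON export from a web service

with sqlite3.connect("legacy_app.db") as conn:    # on-premises database
    db_orders = pd.read_sql_query("SELECT * FROM orders", conn)

# Bring everything together in a single repository (here, one DataFrame).
staging = pd.concat([csv_orders, json_orders, db_orders], ignore_index=True)
```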

Click the Navigation Menu, then scroll down to Big Data to choose Dataprep.

On the Dataprep homepage, click Import Data to commence the Ingestion stage.

You can drag and drop a file, choose one from your local machine, or import data directly from Google Cloud Storage, Google Sheets or BigQuery.

By clicking the New button at the bottom-left corner, datasets can also be accessed from other sources like Microsoft SQL Server, Oracle database, IBM DB2, Cassandra and cloud applications like Salesforce.

Alternatively, you can click Create Flow to begin the process.

You can add data from different sources in the Extraction phase. After choosing your data, click Use in new flow.

Your dataset is then displayed in the Transformer grid as a table with bar charts.

Step 2: The Transformation Process

Transformation is generally considered to be the most important part of the ETL process. Data transformation creates compatibility between various datasets and enhances their quality and integrity.

Dataprep allows you to easily cleanse data using point and click interactions.

The Transformer grid provides visual information about the dataset and suggests transformations with each user interaction.

In the image above, the data type is displayed at the top-left of the column (ABC, which represents the String type) and the column name is Index. At the top-right of each column is a column menu, which is an easy way to start the transformation process.

Beneath these is the horizontal data quality bar, which indicates the health of the data in each column: grey segments depict missing values while red segments represent mismatched values. The bar charts present the data as vertical bars with heights proportional to the values they represent.
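Dataprep computes these indicators for you, but a rough pandas analogue of what the data quality bar measures could look like the following; the file and column names are assumptions for illustration.

```python
# Rough analogue of the data quality bar: the share of missing and
# mismatched values per column. "orders.csv" and "amount" are hypothetical.
import pandas as pd

df = pd.read_csv("orders.csv")

missing = df.isna().mean()  # grey portion of the bar: missing values

# "Mismatched" here means values that are present but do not parse as the
# expected type, e.g. non-numeric strings in a numeric column.
amount_as_number = pd.to_numeric(df["amount"], errors="coerce")
mismatched = (amount_as_number.isna() & df["amount"].notna()).mean()

print(missing)
print(f"Mismatched values in 'amount': {mismatched:.1%}")
```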

We transform a dataset using a recipe. A recipe is a sequence of steps that transforms the selected dataset into a new dataset. These steps can include mathematical operations, standardization, deduplication, cleaning of mismatched and missing data, structuring, sorting, combining datasets and other tasks that improve the quality of the dataset. Recipes are written in the Wrangle language.
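The Wrangle syntax itself is beyond the scope of this post, but an analogous set of recipe steps expressed in pandas (with hypothetical column names) gives a feel for what a recipe does.

```python
# Recipe-style transformations expressed in pandas, for illustration only;
# real Dataprep recipes are written in Wrangle. All names are hypothetical.
import pandas as pd

df = pd.read_csv("orders.csv")

df = df.drop_duplicates()                                     # deduplication
df["country"] = df["country"].str.strip().str.upper()         # standardization
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")   # fix mismatched values
df = df.dropna(subset=["amount"])                             # clean missing values
df = df.sort_values("order_date")                             # sorting
```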

We can also select one of the icons on top of the transformer grid to transform one or more columns.

We can view all the transformation steps in the panel on the right. Click Edit recipe to explore and modify the recipes.

So far, the steps only apply transformations to the data displayed in the browser; the source data itself is not modified.

Once you are satisfied with the transformations, the next step is to generate the results by running all the recipe steps on Dataflow and publishing a new data output.

Click Run at the top-right to run all the recipe steps.

Choose Dataflow and click Run.
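Under the hood, Dataprep compiles the recipe into a Beam job and submits it to Dataflow on your behalf. For context, submitting your own Beam pipeline to Dataflow looks roughly like the sketch below; the project, region and bucket names are placeholders.

```python
# Sketch of submitting a Beam pipeline to Dataflow directly.
# Dataprep does the equivalent for you when you choose Dataflow and click Run.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # placeholder project ID
    region="us-central1",                # placeholder region
    temp_location="gs://my-bucket/tmp",  # placeholder staging bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.csv")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output")
    )
```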

You can view the status and progress of the current job in the Jobs workspace. You get an overview, the output destination, a profile and a dependency graph of the job. You can explore these tabs to learn more about your transformed data.

You can download your result file from the output destination.

You can click the Profile tab to validate that the output data is accurate.

You can also view the job graph and metrics on Dataflow by clicking View Dataflow job at the top-right corner.

Step 3: The Loading Process

The final step in the ETL process is to load the transformed data into a new destination. You can load your new data into BigQuery or Google Cloud Storage, download it to your local machine, or send it to any other repository of integrated data.
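If you ever need to perform this load step outside of Dataprep, the BigQuery client library can do it in a few lines; the bucket, project, dataset and table names below are placeholders.

```python
# Loading a CSV result from Cloud Storage into BigQuery programmatically.
# All resource names are placeholders, not taken from this walkthrough.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/dataprep-output/orders.csv",
    "my-gcp-project.analytics.orders",
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```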

By default, your data is published to this location. To add a new location, use the Add publishing action and choose to publish in a different file format or to a BigQuery table.

By default, the result is published in CSV format, although other formats are available.

Data can be loaded all at once (full load) or at scheduled intervals (incremental load).
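Continuing the hedged BigQuery example above, the same distinction maps to the write disposition of a load job.

```python
# Full vs incremental load expressed as BigQuery write dispositions.
from google.cloud import bigquery

full_load = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE  # replace the table
)
incremental_load = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND    # append new rows only
)
```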

You can schedule the Dataprep job to run on a regular basis to keep the data fresh. This is done by adding a schedule under Schedule Flow and deciding on the frequency at which you want the job to run.

Many factors may dictate the choice between a full load and an incremental load. Whatever the case, if you choose the full load option it is recommended that you run the job during off-peak periods for better performance. Incremental loads can be run at any time.

Hopefully, there will be a comprehensive follow-up post on how to schedule jobs on Dataflow.
