Building Data Pipelines Without a Single Line of Code

An intelligent and intuitive way of exploring data with Google Cloud Dataprep.

Harshit Tyagi
Nov 2 · 13 min read

Tired of writing tons of lines of Python or Java to build ETL pipelines or explore your datasets? Scared off by programming when you just want to analyze data? Looking for a UI-based analytical platform that can help you wrangle those large datasets? Is integrating different tools a big, unwieldy task for you?

Voila! All your prayers have been answered by an amazing tool from Trifacta called Cloud Dataprep. It is a partner service integrated with Google Cloud Platform (GCP) that enables numerous workflows in the data engineering and analysis domain.
In this post, you’ll not only learn how to use Dataprep but also become aware of its integrations with other GCP services. We’re going to build a data transformation pipeline with the help of Cloud Dataprep and BigQuery.

Overview of Cloud Dataprep

Dataprep is a web-based, serverless, no-ops, intelligent data service for visually performing exploratory analysis on your dataset. It helps you wrangle and explore structured and unstructured data to be processed further for machine learning, reporting dashboards, analytical frameworks, etc.
Its amazing ability to suggest and predict the next transformation on the loaded data makes analysis a cakewalk for anyone who hesitates to write code.
But where does Dataprep position itself in the field of data engineering, especially within the GCP infrastructure?

A GCP-based data engineering pipeline typically consists of some or all of the cloud services in the image above.
Data can be ingested at any entry point, such as BigQuery, Cloud Storage, or the usual file upload. Then comes the major section where you engineer your data to prepare it for deeper model building and analysis. Here, raw data flows into one or more (as per your use case) data processing engines, like Cloud Dataprep or Cloud Dataflow, and the transformed data is output and stored in either BigQuery or Cloud Storage.

Let’s dive into the tasks to get the pipeline running:

Breakdown of tasks

  1. Setting up your cloud console
  2. Importing data into BigQuery
  3. Connecting Bigquery to Dataprep
  4. Exploratory analysis on Dataprep web interface
  5. Data wrangling and pipeline recipe
  6. Scheduling Dataprep jobs

Shedding more light on each of the above:

1. Setting up your cloud console

Once you’re logged into your Google Cloud Console, create a new project to use for this tutorial. Click the dropdown on the navbar at the top and click +NEW PROJECT.

After the project is created successfully, make sure the newly created project is selected in your environment, as reflected in the dropdown:

For more details on project creation and management, here is the link to the documentation.


2. Importing data into BigQuery

After setting up the development environment, we need to set up BigQuery, which plays the role of both input and output storage in the pipeline. Here is how the flow looks:

The main focus of this post is Dataprep, but we’ll need BigQuery for importing the data (ingestion) and for exporting the output after the transformation cycle completes in Dataprep.

Instructions for importing data into BigQuery:

  1. From the navigation menu (hamburger icon) in the Cloud Console, click BigQuery. You will be taken to the BigQuery console, where the left pane lists options like Query history, Saved Queries, etc., and at the bottom, under Resources, your project should be displayed. Click to select your project; on the right, under the query editor, the CREATE DATASET option becomes active.

2. Click CREATE DATASET and enter the Dataset ID as ecommerce_analytics, which is going to be the name of our dataset under the project. With all other fields set to their defaults, click the Create dataset button at the bottom.

3. We have our dataset ready to store data in the form of tables. We are going to use the obfuscated Google Analytics 360 data from the Google Merchandise Store by querying a subset of it from the bigquery-public-data project. It comes pinned with every new project.
The query is using standard SQL. Type in or paste the SQL query in the query editor:

4. Click Run. Here is how the left pane should look now:

The query has created a new table named subset_user_sessions which contains 33,664 user sessions that occurred on 1st January 2017. You can preview the data by selecting the table in the left pane and then clicking the Preview option on the right.

Now that we have the data imported into BigQuery, let’s move on to Dataprep and link the two so we can explore this data.

3. Connecting BigQuery to Dataprep

Here, we will get started with Cloud Dataprep and connect our BigQuery table to it. Following are the steps for the task:

Note: You need the Chrome browser, as Cloud Dataprep works only with Google Chrome.

  1. Open Dataprep from the navigation menu > Dataprep.
  2. You will be required to enable billing to use Dataprep; if you don’t already have a billing account, create one from the Billing section in the Cloud Console navigation menu.
  3. Accept the terms of service and then authorize Trifacta and Google LLC to access your account information.

4. Allow Trifacta to access project data by clicking the Allow button. It’s going to take a few minutes to sync up the account info:

5. Choose the Google account you want to sign in with and allow Trifacta to access that account information:

6. Once the account setup is done, a first-time-setup dialog box will prompt you for a Google Cloud Storage location to save the files we create within Dataprep. Make sure the right project is selected and then click Continue.

7. Welcome to the main playground, Cloud Dataprep dashboard. Once we’re here, we start the connection procedure with BigQuery. Click the Create Flow button in the top-right.

8. Enter User Analytics Pipeline as the Flow Name and add a flow description.

9. It’s time to import the dataset from BigQuery: click Import & Add Datasets, then select BigQuery from the left pane.

10. Click your dataset and, once its tables load, click the Create Dataset icon as shown below:

On the right, a pane previews the table data, click the Import & Add to Flow button at the bottom.
The connection is successfully built, and you can now add new recipes to the flow, as the Add new Recipe option becomes available on the right.

4. Exploratory Analysis on Dataprep UI

With our BigQuery project data now imported into Dataprep, we can start the exploratory data analysis.
The first and foremost step in EDA is to make a connection with the data at hand, which helps you assess the potential insights that can be drawn from it. There are different ways to go about this initial data comprehension exercise. The way I do it: read the schema description if available, then go through each column and prepare a list of good questions to answer from those columns, or combinations of columns.

Taking it from where we left off in the previous section, click the Add new Recipe button in the right pane and then click Edit Recipe to look at the data in the Transformer view.

You can read about the dataset here to understand the meaning of each column. So, let’s define a few questions and extract their answers using Dataprep’s tools and features. This will help us get our head around the data and the platform as well:

  1. What is the shape (number of rows and columns) of the dataframe?

At the bottom of the transformer view, you’ll find that our sample contains 32 columns, 12.5K rows, and 4 different data types.

2. What are a column’s min, max, and lower/upper quartiles?

To find the statistical and pattern details of a column, click the dropdown next to the column name and then select Column Details. Let’s find the min/max values for the following column:

  • productQuantity

Clicking Column Details will take you to the summary and stats page:

The minimum quantity bought was 1 while the max was 50.

Similarly,
Min/Max Pageviews: 1/115
We’ll find out the min/max time spent on the site after correcting that column in a few steps.
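Under the hood, these column statistics are simple aggregates. For readers who want to see the computation, here is a rough Python equivalent on toy values (made up for illustration, not the real dataset):

```python
import statistics

# Toy stand-in for the productQuantity column (made-up values).
product_quantity = [1, 2, 1, 5, 50, 3, 1, 10]

col_min = min(product_quantity)
col_max = max(product_quantity)
# statistics.quantiles with n=4 returns the three cut points:
# lower quartile, median, upper quartile.
q1, median, q3 = statistics.quantiles(product_quantity, n=4)

print(col_min, col_max, q1, median, q3)
```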

Note: The answers may differ for your subset.

3. What does the histogram under the column header represent?

The histogram depicts the categories of values in the column. For example, the v2ProductName categories can be seen in the above screenshot. If you hover over a bar, it tells you the frequency of that category in the column, like this:

Now, using this let’s try to answer the next question.

4. What are the top 5 countries by the number of sessions originated?

They are in the order: United States > United Kingdom > India > Canada > Japan
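The histogram bars are essentially per-category frequency counts, so the same "top N categories" answer could be sketched in Python with a Counter (the country values below are made up for illustration):

```python
from collections import Counter

# Toy stand-in for the country column of our sessions table.
countries = ["United States", "India", "United States", "United Kingdom",
             "Canada", "United States", "United Kingdom", "Japan", "India"]

# most_common(n) returns the n highest-frequency categories first.
top = Counter(countries).most_common(3)
print(top)
```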

5. What does the color of the column bar signify?

You must have noticed the bar under each column heading, with different colors signifying missing and mismatched values.

Grey represents missing values, while red denotes values mismatched with the column’s data type. We’ll be correcting these in the next section.

6. What are the common sources of traffic?
To answer this, we need to look at the channelGrouping column (see the schema description). There are 7 sources, and the most common ones are:

Organic search > referral > direct

7. What are the common product categories?
To answer this, we need to look at the v2ProductCategory column (see the schema description). There are 31 categories, and the most common ones are:

ACME > apparel > electronics

5. Data Wrangling and recipe making

Now that we know the problems in our dataset, let’s build the transformation recipe. We need to define data cleaning steps for the columns we found problems with, like the mismatched time column above.

  1. Correcting the data type of the timeOnSite column: Click the dropdown next to the column name, then Change type > Integer. This changes the data type of all the values to Integer, representing the number of seconds the user spent on the site.
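For intuition, the type change behaves roughly like a cast where unparseable cells become nulls rather than failing. A hedged Python sketch on made-up values:

```python
def to_int(value):
    """Cast a raw cell to int, mapping bad/missing values to None
    (roughly what a type change does to mismatched cells)."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

# Toy raw timeOnSite values (strings, nulls, and junk).
raw_time_on_site = ["523", "61", None, "n/a", "1024"]
time_on_site = [to_int(v) for v in raw_time_on_site]
print(time_on_site)  # [523, 61, None, None, 1024]
```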

An important step here is to keep an eye on the recipe icon; click it and you’ll see the step that was just added, like this:

This recipe will keep updating itself as we work through our data wrangling process.

2. Dropping unused columns: If you observe carefully, a few columns have entirely grey bars, which indicates they contain only null values (no valid values). These columns are of no use to us.

Note: We should not blindly delete every all-null column here, since we are working on only a subset of the data. Read a column’s description if you are not sure about it before deleting.

How do we delete a column?
Click the grey bar under the column header; Dataprep will highlight all the missing-value rows.
Now, click the menu dropdown > Delete.

Repeat the process for the itemQuantity and itemRevenue columns; we’ll rely on the productQuantity and productRevenue columns instead, as they contain values that will come in handy for revenue reporting purposes.
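For readers who prefer code, dropping columns that are entirely null can be sketched like this (rows as dicts, column names assumed for illustration):

```python
rows = [
    {"fullVisitorId": "123", "itemQuantity": None, "productQuantity": 2},
    {"fullVisitorId": "456", "itemQuantity": None, "productQuantity": 1},
]

# A column is droppable only if *every* row has a null in it.
all_null = {col for col in rows[0]
            if all(row[col] is None for row in rows)}

# Rebuild each row without the all-null columns.
cleaned = [{c: v for c, v in row.items() if c not in all_null}
           for row in rows]
print(cleaned)
```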

Now the updated recipe looks like this:

3. Deleting duplicate rows: Another very common and useful step in EDA is dropping duplicate rows, which in pandas we’d do with the drop_duplicates function; here, we’re going to use the toolbar at the top.

Click the Filter rows icon and then select Remove duplicate rows.
You’ll be asked whether to add this step to the recipe, so it runs on all incoming datasets in the future.
Click Add.
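The Remove duplicate rows step keeps the first occurrence of each row and drops exact repeats; a minimal plain-Python sketch of the same idea:

```python
# Toy rows (visitor id, pageviews); the third row duplicates the first.
rows = [("u1", 10), ("u2", 5), ("u1", 10), ("u3", 7)]

seen = set()
deduped = []
for row in rows:
    if row not in seen:       # keep only the first occurrence
        seen.add(row)
        deduped.append(row)

print(deduped)  # [('u1', 10), ('u2', 5), ('u3', 7)]
```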

4. Adding a custom formula to the recipe: The totalTransactionRevenue column contains the total revenue generated from each transaction. From the schema documentation, you can see that it has been multiplied by 10⁶, which is not obvious from the values.

From the dropdown menu > Calculate > Custom formula:

In the right pane that opens up, add the formula:

We have added a custom DIVIDE formula to the totalTransactionRevenue column, and the preview is shown in the transformer view. Click Add and it gets added to the recipe.
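The custom formula is just a per-row division by 10⁶; a sketch of the equivalent computation (the raw values below are made up for illustration):

```python
raw_revenue = [59990000, 120000000, None]  # illustrative raw values

def to_currency_units(v):
    # Equivalent of DIVIDE(totalTransactionRevenue, 1000000);
    # null revenue stays null.
    return None if v is None else v / 1_000_000

total_transaction_revenue = [to_currency_units(v) for v in raw_revenue]
print(total_transaction_revenue)  # [59.99, 120.0, None]
```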

5. Merging columns: Merging is another very important and common task in EDA, and Dataprep can do it in the most effortless manner as well. Let’s create a primary-key column for each of these user sessions.

We are using fullVisitorId, time, and v2ProductName for this task, as their combination gives a unique value for each session. The values are separated by a |, and we name the new column main_id.

Click Add.
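Conceptually, the merge step concatenates the three fields with a | separator; a minimal sketch on made-up session values:

```python
sessions = [
    {"fullVisitorId": "392", "time": "120", "v2ProductName": "Android Hoodie"},
    {"fullVisitorId": "871", "time": "305", "v2ProductName": "YouTube Mug"},
]

for s in sessions:
    # main_id = fullVisitorId | time | v2ProductName
    s["main_id"] = "|".join([s["fullVisitorId"], s["time"], s["v2ProductName"]])

print(sessions[0]["main_id"])  # 392|120|Android Hoodie
```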

These are a few steps I’ve walked you through; there is a bunch of other operations and manipulations you can pull off using this intelligent tool.

Here is how the recipe looks now:

6. Running and Scheduling Dataprep jobs

With the pipeline now ready to run, click Run Job in the top-right corner.
We need to add the transformed data back to BigQuery, in a new table, for analytical or modeling purposes.
Click Add Publishing Action on the right and then select BigQuery as the output destination. The synced dataset will become available; here, create a new table using the Create table option in the right pane.

Click Add at the bottom and here you go:

Click Run Job in the bottom-right corner of the page after setting the region and machine type you need. The job will take a few minutes, and once it is done you can check your results in BigQuery, where the table with the transformed data awaits your eyes!

Scheduling jobs

Another important feature is the ability to schedule job runs in Dataprep. On the Flows page, click the three-dot icon at the top and then select Schedule Flow.

You can choose the timezone, frequency, days, and time, as shown below:

Conclusion

Who would have thought that the laborious task of pipeline creation could be simplified to this extent? The Google Cloud team is adding some unmatched tools and features, redefining the meaning of cloud computing as we speak.
Having estimated the number of man-hours Dataprep has saved us, I can’t think of a more useful, well-connected, and UI-friendly tool. There are many more interesting workflows to talk about, and I’m excited to unravel the stories I have to tell about data engineering on GCP. Stay tuned!!

PS: If you found this post helpful, don’t forget to clap and follow me for more on Data Science/Engineering, Statistics, and AI.
