How To Load Data Into Google BigQuery on Dataproc and AutoML

Dash Desai
Feb 23 · 4 min read

What is Google Dataproc?

Dataproc is a low-cost, Google Cloud Platform integrated, easy to use managed Spark and Hadoop service that can be leveraged for batch processing, streaming, and machine learning use cases.

What is Google BigQuery?

BigQuery is an enterprise grade data warehouse that enables high-performance SQL queries using the processing power of Google’s infrastructure.

Load Data Into Google BigQuery and AutoML

In this blog, we will review ETL data pipeline in StreamSets Transformer, a Spark ETL engine , to ingest real-world data from Fire Department of New York (FDNY) stored in Google Cloud Storage (GCS), transform it, and store the curated data in Google BigQuery.

Once the transformed data is made available in Google BigQuery, it will be used in AutoML to train a machine learning model to predict the average incident response time for the FDNY.

Sample Data

The dataset is made available through the NYC Open Data website. The 2009–2018 historical dataset contains average response times of the FDNY. The data is partitioned by incident type (False Alarm, Medical Emergency, and so on), borough, and the number of incidents during a particular month.

Here’s what the sample FDNY data looks like:

YEARMONTH,INCIDENTCLASSIFICATION,INCIDENTBOROUGH,INCIDENTCOUNT,AVERAGERESPONSETIME 2009/07,All Fire/Emergency Incidents,Citywide,40850,04:27 2009/07,False Alarm,Citywide,2655,04:07 2009/07,Medical Emergencies,Manhattan,4895,04:17 2009/07,Medical False Alarm,Citywide,408,04:13 2009/07,NonMedical Emergencies,Manhattan,4446,04:51 2009/07,NonStructural Fires,Citywide,1495,04:19 2009/08,All Fire/Emergency Incidents,Citywide,41723,04:30

Data Pipeline Overview

Before running the pipeline in StreamSets Transformer, you can preview the pipeline against the configured Dataproc cluster to examine the data structure, data types, and verify the transformations at every stage. This is also a great way to . For more information on pipeline preview, refer to the debug data pipelines documentation.

Remove Future Information

Because this is a historical dataset and we’re using it to train a machine learning model, we need to remove information that would not be known at the beginning of the month. In this case, that is INCIDENTCOUNT To remove this field from every record, we’ll use a Field Remover processor.

Data Transformations

Labels or target variables in machine learning models are of numeric data type. In this case, the field value of AVERAGERESPONSETIME is transformed in the following steps:

  • Remove “:” using expression — replace(AVERAGERESPONSETIME,”:”,””)
  • Convert from time to seconds and from string datatype to integer using Spark SQL expression — round((AVERAGERESPONSETIME / 100) * 60 + (AVERAGERESPONSETIME % 100))

Data Pipeline Execution

Running the StreamSets Transformer data pipeline displays various metrics in real-time. For example, batch processing time taken by each stage as shown below. This is a great way to start looking into fine tuning the processing and transformations.

Google BigQuery

Once the pipeline runs successfully, the Google BigQuery table is auto-created, if it doesn’t already exists, and the transformed data is inserted into the table. This dataset is then readily available for querying as shown below.

The transformed data stored can then be imported directly from the BigQuery table for training a machine learning model in AutoML.

Using AutoML you can build on Google’s machine learning capabilities and create custom machine learning models.

Import Data

Select Target Column

Train Machine Learning Model

That’s it! We went from loading raw, real-world data into Google BigQuery to creating a machine learning model in AutoML without any coding or scripting!

Build Your Spark ETL and ML Data Pipelines

It goes without saying that training models, evaluating them, model versioning, and serving different versions of the model are non-trivial undertakings and that is not the focus of this post. That said, however, StreamSets Transformer makes it really easy to load data into Google BigQuery and AutoML.

Checkout these helpful resources and get started quickly with running your Spark ETL data pipelines . Here are some other technical blogs related to Machine Learning that you might be interested in reading:

Learn more about StreamSets For Google Cloud Platform .

Originally published at on February 23, 2021.

Data Scientists must think like an artist when finding a solution

Sign up for AI & ART


A weekly collection of the best news and resources on AI & ART Take a look.

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information about our privacy practices.

Check your inbox
Medium sent you an email at to complete your subscription.

Data Scientists must think like an artist when finding a solution, when creating a piece of code.Artists enjoy working on interesting problems, even if there is no obvious answer.

Dash Desai

Written by

Director of Platform, Technical Evangelism @ StreamSets | #DataScience | #MachineLearning | #BigData | #CloudComputing | #Travel | #Photography @natureunraveled

Data Scientists must think like an artist when finding a solution, when creating a piece of code.Artists enjoy working on interesting problems, even if there is no obvious answer.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store