Engineer Data in Google Cloud — Google Cloud Challenge Lab Walkthrough
Intro to Challenge Labs
Google provides an online learning platform called Google Cloud Skills Boost, formerly known as QwikLabs. On this platform, you can follow training courses aligned to learning paths, to particular products, or for particular solutions.
One type of learning experience on this platform is called a quest. This is where you complete a number of guided hands-on labs, and then finally complete a Challenge Lab. The challenge lab differs from the other labs in that goals are specified, but very little guidance on how to achieve the goals is given.
I occasionally create walkthroughs of these challenge labs. The goal is not to help you cheat your way through the challenge labs! But rather:
- To show you what I believe to be an ideal route through the lab.
- To help you with particular gotchas or blockers that are preventing you from completing the lab on your own.
If you’re looking for help with a challenge lab, then you’ve come to the right place. But I strongly urge you to work your way through the quest first, and to try the lab on your own, before reading further!
With all these labs, there are always many ways to go about solving the problem. I generally like to solve them using the Cloud Shell, since I can then document a more repeatable and programmatic approach. But of course, you can use the Cloud Console too.
The “Engineer Data in Google Cloud” Lab
This was a fun lab! It is part of the Google Cloud Professional Data Engineer Certification learning path.
It tests your knowledge of BigQuery and of BigQuery ML. BigQuery ML allows quick creation of machine learning models directly from within BigQuery, using SQL. So BigQuery ML is an excellent choice when you already have tabular data and you know your way around SQL.
BigQuery models are built and used in four phases, like this:
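As a rough sketch of those phases (the dataset, model, and column names below are placeholders, not the lab's):

```sql
-- 1. Prepare training data: a table (or SELECT) of features plus a label.
-- 2. Create and train the model.
CREATE OR REPLACE MODEL mydataset.my_model
OPTIONS (model_type = 'linear_reg', input_label_cols = ['label']) AS
SELECT feature_1, feature_2, label
FROM mydataset.training_data;

-- 3. Evaluate the trained model.
SELECT * FROM ML.EVALUATE(MODEL mydataset.my_model);

-- 4. Use the model to make predictions on new data.
SELECT * FROM ML.PREDICT(MODEL mydataset.my_model,
  (SELECT feature_1, feature_2 FROM mydataset.new_data));
```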
Scenario
I’m going to walk you through this Challenge Lab.
We’re told:
“You have started a new role as a Data Engineer for TaxiCab Inc. You are expected to import some historical data to a working BigQuery dataset, and build a basic model that predicts fares based on information available when a new ride starts. Leadership is interested in building an app and estimating for users how much a ride will cost. The source data will be provided in your project.”
What we already know:
- We have been provided with a dataset called `taxirides`, with imported historical data in the table `historical_taxi_rides_raw`. We will explore this data, and use it to train our model.
- We are also given another table: `taxirides.report_prediction_data`. We will be applying our model against this data to make predictions. Consequently, we need to be mindful that our model can only be built using fields that are present in `report_prediction_data`, or fields that can be calculated from data in this table.
Objectives
- Setup
- Explore
- Clean The Data
- Create the Model
- Evaluate the Model
- Make Predictions
1. Setup
We are provided with a number of variables that we should make a note of. Your values will look different, but for the purposes of this page, these are the values I’m using:
2. Explore the Data
Let’s first examine the data by running this SQL query in BigQuery:
WITH
daynames AS
(SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek),
taxitrips AS (
SELECT
(tolls_amount + fare_amount) AS fare_amount_797,
daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
pickup_longitude AS pickuplon,
pickup_latitude AS pickuplat,
dropoff_longitude AS dropofflon,
dropoff_latitude AS dropofflat,
trip_distance,
passenger_count AS passengers
FROM `taxirides.historical_taxi_rides_raw`, daynames
)
SELECT *
FROM taxitrips
Notes about this query:
- We add up `tolls_amount` and `fare_amount`, and return the sum as `fare_amount_797`. This is the value we want to predict with our model; i.e. it is the label.
- We dynamically create a table called `taxitrips` by joining the `taxirides.historical_taxi_rides_raw` table with `daynames`.
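A small gotcha in this query: `EXTRACT(DAYOFWEEK ...)` returns `1` (Sunday) through `7` (Saturday), and `ORDINAL` indexes arrays starting at `1`, so the two line up without any offset. You can verify this with a throwaway date (the literal below is just an example):

```sql
-- 2015-01-04 fell on a Sunday, so DAYOFWEEK returns 1,
-- and ORDINAL(1) picks the first array element: 'Sun'.
SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat']
         [ORDINAL(EXTRACT(DAYOFWEEK FROM TIMESTAMP '2015-01-04 12:00:00'))]
       AS dayofweek
```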
The results look something like this:
3. Clean the Data
We need to clean the data in order to create our model. There are many ways we could do this. One good option is to visually clean the data using Dataprep, and then execute the cleaning with a Dataprep-generated Dataflow pipeline.
But personally, I think it’s easy enough to clean the data using SQL, and generate the new table directly using this SQL.
Here’s my guidance:
- You need to save the cleaned data in a new table, with the name we were given. This will be the training data for our model.
- Thus, we need to include the features we want to test in our model. The trick here is that we want our model features to ideally have the same field names as the test data we will test the model on later. So it makes sense to make the field names look the same as the prediction time data.
- These are the features I wanted to test:
  - Hour of day
  - Pick up location
  - Drop off location
  - Distance
  - Number of passengers
- However, although `trip_distance` is a field in our historical data, it is not a field present in our `report_prediction_data`. So, we can’t just use `trip_distance`. But, as strongly hinted by the lab instructions, we can calculate distance using the BigQuery geography functions.
- We need to follow the rules given in the lab. For me, these rules include:
  - Only include data where `trip_distance` is greater than the supplied value (i.e. `2`).
  - Only include data where `passenger_count` is greater than the supplied value (i.e. `2`).
  - Only include data where `fare_amount` is greater than the supplied value (i.e. `$3`).
  - Remove rows with unreasonable geographic coordinates.
  - Ensure the number of rows in the cleaned data is under 1 million. (The source data has over 1 billion rows.)
- Ensure the number of rows in the cleaned data is under 1 million. (The source data has over 1 billion rows.)
Let’s start by doing this:
WITH
daynames AS (
SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek
),
taxitrips AS (
SELECT
(tolls_amount + fare_amount) AS fare_amount_797,
daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
pickup_longitude AS pickuplon,
pickup_latitude AS pickuplat,
dropoff_longitude AS dropofflon,
dropoff_latitude AS dropofflat,
trip_distance,
passenger_count AS passengers
FROM `taxirides.historical_taxi_rides_raw`, daynames
WHERE
trip_distance > 2 AND
passenger_count > 2 AND
    fare_amount > 3 AND
ABS(pickup_longitude) <= 90 AND pickup_longitude != 0 AND
ABS(pickup_latitude) <= 90 AND pickup_latitude != 0 AND
ABS(dropoff_longitude) <= 90 AND dropoff_longitude != 0 AND
ABS(dropoff_latitude) <= 90 AND dropoff_latitude != 0
)
SELECT * FROM taxitrips
This is a good start. It filters according to the rules, removes a few unusual pickup and dropoff coordinates, and ultimately reduces the record count from over 1 billion to about 72 million. But we need to get to under 1 million. So here’s a cool trick:
WITH
params AS (
SELECT
1 AS TRAIN,
2 AS EVAL
),
daynames AS (
SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek
),
taxitrips AS (
SELECT
(tolls_amount + fare_amount) AS fare_amount_797,
daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
pickup_longitude AS pickuplon,
pickup_latitude AS pickuplat,
dropoff_longitude AS dropofflon,
dropoff_latitude AS dropofflat,
trip_distance,
passenger_count AS passengers
FROM `taxirides.historical_taxi_rides_raw`, daynames, params
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 150)
IN (params.TRAIN, params.EVAL) AND
trip_distance > 2 AND
passenger_count > 2 AND
fare_amount >= 3 AND
ABS(pickup_longitude) <= 90 AND pickup_longitude != 0 AND
ABS(pickup_latitude) <= 90 AND pickup_latitude != 0 AND
ABS(dropoff_longitude) <= 90 AND dropoff_longitude != 0 AND
ABS(dropoff_latitude) <= 90 AND dropoff_latitude != 0
)
SELECT * FROM taxitrips
In this new query, we hash the pickup timestamp to a number using `FARM_FINGERPRINT`, and then apply a `MOD` function to this number, using mod `150`. If the result is `1` or `2`, then we keep the row. Thus, we are ultimately reducing the size of the data roughly 75-fold, which gets us to under 1 million rows.
Of course, we could have achieved the same reduction just with mod `75`. But by creating the two enumerated constants `params.TRAIN` and `params.EVAL`, we have the ability to reuse this query to return half the data as training data, and the other half as evaluation data. We might not use this, but it’s good to have the option.
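If you want to sanity-check the sampling rate before materializing anything, a quick count like this works (a sketch; hash buckets are only approximately uniform, so expect roughly, not exactly, 2 in 150):

```sql
SELECT
  COUNTIF(MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 150)
            IN (1, 2)) AS kept,
  COUNT(*) AS total
FROM `taxirides.historical_taxi_rides_raw`
-- kept / total should come out at roughly 2/150, i.e. about 1.3%
```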
However, we still have a problem. We’re still generating our cleaned data using `trip_distance`, but this field won’t be available to us in our test data. So instead, we need to calculate distance. We can do it like this:
CREATE OR REPLACE TABLE taxirides.taxi_training_data_388 AS
WITH
params AS (
SELECT
1 AS TRAIN,
2 AS EVAL
),
daynames AS
(SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek),
taxitrips AS (
SELECT
(tolls_amount + fare_amount) AS fare_amount_797,
pickup_datetime,
daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
pickup_longitude AS pickuplon,
pickup_latitude AS pickuplat,
dropoff_longitude AS dropofflon,
dropoff_latitude AS dropofflat,
ST_Distance(ST_GeogPoint(pickup_longitude, pickup_latitude),
ST_GeogPoint(dropoff_longitude, dropoff_latitude)) AS trip_distance,
passenger_count AS passengers
FROM `taxirides.historical_taxi_rides_raw`, daynames, params
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),150)
IN (params.TRAIN, params.EVAL) AND
trip_distance > 2 AND
passenger_count > 2 AND
fare_amount > 3 AND
ABS(pickup_longitude) <= 90 AND pickup_longitude != 0 AND
ABS(pickup_latitude) <= 90 AND pickup_latitude != 0 AND
ABS(dropoff_longitude) <= 90 AND dropoff_longitude != 0 AND
ABS(dropoff_latitude) <= 90 AND dropoff_latitude != 0 AND
ST_Distance(ST_GeogPoint(pickup_longitude, pickup_latitude),
ST_GeogPoint(dropoff_longitude, dropoff_latitude)) > 2
)
SELECT * FROM taxitrips
Notes about the above query:
- We’re now calculating `trip_distance` from the pickup and dropoff locations.
- In the `WHERE` clause, I’m including only rows where both the `historical_taxi_rides_raw.trip_distance` and the calculated `trip_distance` are `> 2`.
- Finally, I’m storing the results in a new table, with the required name.
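One unit-related caveat worth knowing: `ST_DISTANCE` returns meters, whereas the raw `trip_distance` column in the NYC taxi data is in miles, so the two `> 2` filters are not the same threshold. The lab accepts this, but it matters when interpreting the trained model's `trip_distance` feature. A quick query makes the units obvious (the coordinates are just illustrative points near Manhattan):

```sql
-- Two points about 0.1 degrees apart: ST_DISTANCE returns roughly
-- 14,000, i.e. meters rather than kilometers or miles.
SELECT ST_DISTANCE(ST_GEOGPOINT(-74.0, 40.7),
                   ST_GEOGPOINT(-73.9, 40.8)) AS distance_meters
```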
Here’s the schema of the new table:
And here’s what its data looks like:
4. Create the Model
Now we’re ready to create the model.
Since we’ve already got the cleaned data in a new table, we can simply use that data to create the model. We’re trying to predict numeric values, rather than make boolean predictions. So it makes sense to create a linear regression model using BigQuery ML.
CREATE OR REPLACE MODEL taxirides.fare_model_926
OPTIONS
(model_type='linear_reg', labels=['fare_amount_797']) AS
WITH
params AS (
SELECT
0 AS TRAIN,
1 AS EVAL
),
training_data AS (
SELECT * EXCEPT(pickup_datetime, TRAIN, EVAL)
FROM `taxirides.taxi_training_data_388`, params
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 2) = params.TRAIN
)
SELECT * FROM training_data
Note how I’m still using the `params` enumeration. But when building the model, I’m only using half the data; i.e. the half where the mod-2 result is `0`.
The model takes a couple of minutes to build. If we then open the model and click on the Evaluate tab:
5. Evaluate the Model
We’re told that our model needs to have an RMSE (Root Mean Squared Error) of less than 10. Clearly, we’ve already achieved this goal, as seen from the screenshot above: we have a mean squared error of just over 10, so the square root of this value is going to be a little over 3. But let’s run a query to be sure:
SELECT
*,
SQRT(mean_squared_error) as rmse
FROM ML.EVALUATE(MODEL taxirides.fare_model_926, (
WITH
params AS (
SELECT
0 AS TRAIN,
1 AS EVAL
),
training_data AS (
SELECT *
FROM `taxirides.taxi_training_data_388`, params
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 2) = params.EVAL
)
SELECT * FROM training_data
))
You can see that this query is very similar to the query that I used to build the model. But this time, I’m using the other half of the training data to evaluate the model.
Here’s the result:
You can see that the RMSE is `3.4`. So this is good! If our RMSE were over 10, we would have to do more feature engineering to refine our model.
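As an aside, the per-iteration training loss behind the model's Evaluate tab can also be queried directly; for a `linear_reg` model, `loss` here is the mean squared error:

```sql
SELECT iteration, loss, eval_loss
FROM ML.TRAINING_INFO(MODEL taxirides.fare_model_926)
ORDER BY iteration
```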
6. Make Predictions
Finally we’re ready to predict total fare for all the rows in our test data:
CREATE OR REPLACE TABLE taxirides.2015_fare_amount_predictions AS
SELECT * FROM ML.PREDICT(MODEL taxirides.fare_model_926, (
WITH
daynames AS (
SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek
),
test_data AS (
SELECT
daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
pickuplon,
pickuplat,
dropofflon,
dropofflat,
ST_Distance(ST_GeogPoint(pickuplon, pickuplat),
ST_GeogPoint(dropofflon, dropofflat)) AS trip_distance,
passengers
FROM `taxirides.report_prediction_data`, daynames
)
SELECT * FROM test_data
)
)
In the query above we’re applying the model to the report_prediction_data
table, and storing the results in a new table.
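`ML.PREDICT` names its output column `predicted_<label>`, so the new table gains a `predicted_fare_amount_797` column alongside the input features. To spot-check a few rows:

```sql
SELECT predicted_fare_amount_797, trip_distance, passengers
FROM `taxirides.2015_fare_amount_predictions`
LIMIT 10
```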
The result looks like this:
And that’s it. We’re done!
Before You Go
- Please share this with anyone that you think will be interested. It might help them, and it really helps me!
- Feel free to leave a comment 💬.
- Follow and subscribe, so you don’t miss my content. Go to my Profile Page, and click on these icons: