Predicting taxi fare with BigQuery ML

Florent Legras
CodeShake
Published in
5 min readSep 6, 2018

Bigquery comes with a new feature named BigQuery ML allowing to create machine learning models in SQL. For now two model architectures can be used, Linear regression for regression problem and Logistic regression for classification. Today we will use Bigquery ML on a well known dataset “new york taxi trips”. Our goal is to predict the trip price. To do so we will create a model and train it on 2014 and 2015 data and validate our model on 2016 trips. Train dataset is composed of ± 39Go of data (300 million rows) and the validation set contains 15Gb (130 millions rows).

Analyse Data:

Before we start to create model we need to analyse the data. If you look at the data you can see that tips are included into total_amount.

Analyse amount data from public data set.

We don’t want to include tips in our model, because it’s too random, so we use tolls_amount + fare_amount for total price.

Now let’s check outliers data.

Ok, we get strange latitudes and longitudes for New York taxi rides. For our use case we will take only taxi rides in latitude [40°,42°] and longitude [-75°,-73°]. In addition we will take only prices greater than $0 and lower than $1500.

Map of [40°,42°] latitudes and [-75°,-73°] longitudes

Benchmark

To start our exploration we need to do a benchmark to see if we perform better than a simple hypothesis. For our benchmark we choose to calculate the average cost per distance and predict for each trip the cost with this formula:

taxiPrice = euclidianDist *avgCostPerDist

We got a 53.32 in MSE (Mean Square Error). This means that when we predict with our benchmark model we got a average of $7.3 difference with truth.

We got our initial benchmark.

Benchmark = $7.3

First Model

Now we can start to use BigQuery ML to try to create a better model than our initial benchmark. For the first model we start simple. We just create a Linear Regression over the features without doing anything more.

Features are:

  • Euclidean distance between pickup and dropoff.
  • Longitude distance
  • Latitude distance
  • Day of week
  • Hour
  • Passengers

We have :

  1. CREATE MODEL statement, with the path where the model will be saved.
  2. The options specify the algorithm — in this case, a linear regression algorithm, with total_amount being the label

About 10 minutes later, the model is trained and evaluation results have been populated for each iteration:

First model stats

That pretty cool with BigqueryML you can train and evaluate 5 models in 10 minutes for a dataset of 39Go. Bigquery stop creating model when the gain of new model is less than 0.1 % or after the 10th model created (Can be change with create model parameters).

The last model has an Evaluation data loss (MSE) of 25.9 approximately $5.08. We are better than our benchmark at the first attempt!

To be sure let’s check our model with validation dataset

We have a MSE of 26.75 and RMSE of $5.17.

Simple Model = $5.17

Second Model:

Now we will add features into our model to help it perform better. We cheat a little and check on Kaggle which features can be used. This Notebook is a great start to exploring data and add new features.

We add places of interest and compute distance between those places and the taxi trip.

In addition, we choose to add weather data. You can get this data from bigquery public dataset. Let’s check if we perform better with all these new features.

Second model stats

Ok now with all our features we have 21.8489 in MSE or $4.67. That great!!! Let’s check with the validation set:

We have the same range of performances on the validation set, this is great!

Second model= $4.66

Now let’s check error distribution for our model. As we can see 80% of prediction errors are between $-3 and $4.

BigQuery ML tips:

Identifying categorical data:

If you have categorical data (like day number in our case) casting it to String allows BQML to identify this feature to be one-hot-encoded. BigQuery ML automatically one-hot-encoding some data types (BOOL, STRING, BYTES, DATE, DATETIME, or TIME).

Choose your evaluation metric:

When you perform a ML.EVALUATE() on your model, Bigquery gives many metrics.

For our use case we choose MSE to evaluate our model. But others metrics can be more appropriate in some cases. We won’t describe all of them and their pros and cons, but it’s important to know that they exist.

Use regularization:

In our example we didn’t need to do regularization when we train our model. But it’s possible to do L1 or L2 regularization.

Conclusion:

Creating a model in Bigquery ML is really easy, just a few SQL requests are required and our model is trained and ready to predict. At the end, we have a model which performs well for a linear problem. Bigquery ML simplies Machine Learning by allowing all SQL developers to create ML models.

In our use case we can continue to improve the model by adding new features but with a linear model it will be difficult to significantly improve performance. One possible update will be to do a model per day instead of a general model. For example, we tried a model dedicated for the Tuesdays and we observed better performances.

--

--

Florent Legras
CodeShake

Engineering Director Data @Sfeir, Curious about all kind of things