Building a machine learning ensemble classifier on NY taxi data to predict no tips vs generous tips with Python & Google BigQuery

I demonstrate the power of Google BigQuery by building a classifier that predicts whether a New York City taxi ride will end in a generous tip or no tip at all. Along the way I explore the dataset and the relationships within it. I also visualize pickups around the city; the resulting scatterplot essentially draws the streets of New York.

I feature the BigQuery UI, the Python API, the pandas functions for executing SQL queries directly from a Jupyter notebook, and sklearn. I describe challenges I overcame within BigQuery, including some syntax differences between queries executed via the API and via the UI. A lot of the online documentation appears out of date, so my sample code should prove a useful resource for your own projects.

Finally, for those interested in building a classifier and meta-classifier, I do so using simple-to-understand labels and inputs, which should serve as a helpful reference for people implementing their own classifiers. I also demonstrate how to deal with missing values.

Background on BigQuery

BigQuery is the public implementation of Dremel, a query service used internally at Google.

Dremel can scan 35 billion rows without an index in tens of seconds. Dremel, the cloud-powered massively parallel query service, shares Google’s infrastructure, so it can parallelize each query and run it on tens of thousands of servers simultaneously.

The first step in using BigQuery is to set up a Google Cloud account by signing up and accepting the authorization agreements. You will also need to provide a credit card for billing purposes. On first sign-up you receive $300 worth of credits, which is more than enough for exploring and running sample queries.

You will then be logged into the Google Cloud dashboard. The next step is to create a project (you can have multiple).

You then access BigQuery by clicking the menu on the left-hand side and choosing “BigQuery”.

This will then load the BigQuery web UI.

We see that the NY taxi trips table holds 130 GB of data across 1.1 billion rows. BigQuery takes only about 30 seconds to process all of it.

I first import pandas, since it includes a method specifically designed to query BigQuery and return the results as a pandas DataFrame.

I also import matplotlib for visualizations.

The data I am using is already stored within Google Cloud and is available to all. I use it for demo purposes because anyone can access it and run the queries.

As an initial demonstration and exploration of the data, I pass SQL directly into the pandas method to count the number of taxi trips in the data by month.
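A minimal sketch of what such a monthly count could look like. The table name `nyc-tlc.yellow.trips`, the `pickup_datetime` column, and the helper names `monthly_trips_sql` and `run_query` are assumptions for illustration, not taken from my notebook. The query uses standard SQL (backtick-quoted table names), and the runner assumes the `pandas-gbq` package, which provides the BigQuery reader that pandas has exposed.

```python
def monthly_trips_sql(table="nyc-tlc.yellow.trips"):
    """Return a standard-SQL query that counts trips per calendar month.

    The default table and the pickup_datetime column are assumptions.
    """
    return f"""
        SELECT
          FORMAT_TIMESTAMP('%Y-%m', pickup_datetime) AS month,
          COUNT(*) AS trips
        FROM `{table}`
        GROUP BY month
        ORDER BY month
    """


def run_query(sql, project_id):
    """Run sql on BigQuery and return the result as a pandas DataFrame."""
    # Imported lazily so the query builder works without pandas-gbq installed.
    import pandas_gbq

    return pandas_gbq.read_gbq(sql, project_id=project_id, dialect="standard")


print(monthly_trips_sql())
```

In a notebook this would be called as `df = run_query(monthly_trips_sql(), project_id="my-project")`, where `my-project` is a placeholder for your own Google Cloud project ID.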

I then explore the data and examine the relationships between tips and different features.

I then aggregate the number of pickups at each longitude and latitude coordinate in NY.
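One way to express this aggregation is to round the coordinates so that nearby pickups fall into the same bucket. This is a sketch only: the table name, the `pickup_latitude` and `pickup_longitude` columns, the rounding precision, and the rough bounding box around the city are all assumptions.

```python
def pickup_counts_sql(table="nyc-tlc.yellow.trips", decimals=4):
    """Return a standard-SQL query counting pickups per rounded coordinate.

    Table, column names and the bounding box are illustrative assumptions.
    """
    if not isinstance(decimals, int) or decimals < 0:
        raise ValueError("decimals must be a non-negative integer")
    return f"""
        SELECT
          ROUND(pickup_latitude, {decimals}) AS lat,
          ROUND(pickup_longitude, {decimals}) AS lon,
          COUNT(*) AS pickups
        FROM `{table}`
        WHERE pickup_latitude BETWEEN 40.5 AND 41.0
          AND pickup_longitude BETWEEN -74.3 AND -73.6
        GROUP BY lat, lon
    """


print(pickup_counts_sql())
```

The bounding box drops obviously bad GPS readings (for example coordinates of 0, 0) that would otherwise stretch the plot.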

I can then feed this data into a scatterplot, which produces a reasonably clear outline of NY based on where taxis have picked people up.
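A hedged sketch of the plotting step, assuming the aggregated data arrives as `(lat, lon, pickups)` rows. The helper names `to_scatter_arrays` and `plot_pickups`, the marker-size scaling, and the output file name are hypothetical choices, not the exact code behind my plot.

```python
def to_scatter_arrays(rows):
    """Split (lat, lon, pickups) rows into x, y and marker-size lists."""
    xs = [lon for _lat, lon, _n in rows]
    ys = [lat for lat, _lon, _n in rows]
    # Scale marker sizes into the range 0.5 to 5.0 by pickup count.
    biggest = max((n for _lat, _lon, n in rows), default=1) or 1
    sizes = [0.5 + 4.5 * n / biggest for _lat, _lon, n in rows]
    return xs, ys, sizes


def plot_pickups(rows, path="pickups.png"):
    """Draw the pickups as a scatterplot and save it to path."""
    # Imported lazily so the data helper works without matplotlib installed.
    import matplotlib.pyplot as plt

    xs, ys, sizes = to_scatter_arrays(rows)
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(xs, ys, s=sizes, c="black", alpha=0.5, linewidths=0)
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    ax.set_aspect("equal")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return path
```

Small, semi-transparent markers matter here: with many points, large opaque markers would merge into a blob instead of tracing the streets.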

From my data exploration, I see that only 3% of trips paid by credit card result in no tip. For classification purposes I therefore decided to predict the two extremes, both of which require a manual override by the passenger in a NY taxi: no tip and a very generous tip (“generous” being above 30% of the fare).
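The labeling rule can be sketched as a small function. The 30% threshold comes from the definition above; the function name, the label strings, and the handling of missing or non-positive fares are assumptions for illustration.

```python
GENEROUS_THRESHOLD = 0.30  # "generous" means a tip above 30% of the fare


def label_trip(fare_amount, tip_amount, threshold=GENEROUS_THRESHOLD):
    """Return 'no_tip', 'generous', or None for trips in between."""
    # Skip rows with missing values or fares that cannot form a ratio.
    if fare_amount is None or tip_amount is None or fare_amount <= 0:
        return None
    if tip_amount == 0:
        return "no_tip"
    if tip_amount / fare_amount > threshold:
        return "generous"
    return None  # ordinary tips are left out of the classification task


print(label_trip(10.0, 0.0))
print(label_trip(10.0, 3.5))
print(label_trip(10.0, 2.0))
```

Trips that return `None` fall between the two extremes and are simply not part of the training data.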

I then extract the relevant trips and label them accordingly.

I start with two separate datasets, label each one, and then combine them.

I balance the data so that half is generous tips and half is no tips, which means my classifier only has to beat 50% accuracy to be considered useful.
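One common way to get a 50/50 split is to downsample the larger class to the size of the smaller one. The sketch below assumes that approach; the function name `balance_classes` and the fixed random seed are hypothetical, and other balancing strategies would work as well.

```python
import random


def balance_classes(rows, label_of, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    if not by_label:
        return []
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)  # avoid having all rows of one class first
    return balanced


rows = [("generous", i) for i in range(10)] + [("no_tip", i) for i in range(3)]
balanced = balance_classes(rows, label_of=lambda row: row[0])
print(len(balanced))
```

With a balanced set, always guessing one class scores 50%, so anything meaningfully above that reflects signal in the features.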

Having now gotten my labels (Y) and features (X), I build the classifier. I use sklearn's KNN, Naive Bayes, logistic regression, and decision tree classifiers, and I feed their outputs into an ensemble or “meta-classifier”, which then gives the result.
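A sketch of this setup using sklearn's `VotingClassifier` with majority (hard) voting and 5-fold cross-validation. The voting scheme, the hyperparameters, and the synthetic data from `make_classification` are assumptions; the synthetic data stands in for the taxi features so the example runs without BigQuery, and its scores will not match my results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X, y, cv=5):
    """Cross-validate four base classifiers and a voting ensemble.

    Returns {name: (mean accuracy, standard deviation)}.
    """
    base = [
        ("DecisionTreeClassifier", DecisionTreeClassifier(random_state=0)),
        ("LogReg", LogisticRegression(max_iter=1000)),
        ("KNN", KNeighborsClassifier(n_neighbors=5)),
        ("NB", GaussianNB()),
    ]
    # Hard voting: the ensemble predicts the class most base models chose.
    ensemble = VotingClassifier(estimators=base, voting="hard")
    results = {}
    for name, model in base + [("Ensemble", ensemble)]:
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[name] = (scores.mean(), scores.std())
    return results


# Synthetic stand-in for the balanced taxi features and labels.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
for name, (mean, std) in evaluate_models(X, y).items():
    print(f"Accuracy: {mean:.2f} (+/- {std:.2f}) [{name}]")
```

To use it on the real data, replace `X` and `y` with the feature matrix and the no-tip/generous labels built earlier.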

When we run the training and tests, we see the following results:

Accuracy: 0.67 (+/- 0.12) [DecisionTreeClassifier]
Accuracy: 0.91 (+/- 0.14) [LogReg]
Accuracy: 0.77 (+/- 0.10) [KNN]
Accuracy: 0.67 (+/- 0.01) [NB]
Accuracy: 0.83 (+/- 0.12) [Ensemble]


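Given the above results, with an ensemble accuracy of 0.83 against a 50% baseline, I have successfully built a meta-classifier that predicts whether a trip is more likely to result in no tip or a “generous” tip for the driver. Note that logistic regression on its own (0.91) scored higher than the ensemble in this run.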

Classifiers such as this may be useful in catching fraud. For example, this data uses only credit card transactions, because all of their tip information is captured; with cash, drivers likely do not report tips because of the tax consequences. Classifiers such as the one built here can be used to understand when we should expect tips to be recorded, or not, and to identify anomalies.

Github Repo: