Taking on the Kaggle Taxi Challenge

Wenqin Ye · Wenqin's Blog · Aug 4, 2017 · 4 min read

I got an email about Kaggle's taxi challenge. Normally I would be hesitant to join these competitions, but this one seemed really inviting because it rewards people for contributing helpful kernels that I and others can learn from. Here I will share my own ideas and what I've learned so far.

If you are not familiar with the problem, the goal is to predict the trip time for a taxi ride given information such as the trip's start location, end location, the date and time of the trip, and the number of passengers.

I noticed a lot of kernels started with data exploration, so that is what I did as well. One aspect of the data that I looked at was the relationship between the distance of a trip and its duration. Intuitively, longer distances should result in longer trip times, but in fact there was only a weak correlation between distance and trip time. In reality, many other factors matter for trip time, especially at short distances, such as the weather or traffic congestion, and all of these are hidden variables not given in the data.
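If you want to reproduce this check, here is a rough sketch (not my exact exploration code) using the haversine distance between pickup and dropoff; the column names follow the competition's train.csv:

import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in kilometers.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

df = pd.read_csv("train.csv")
df["distance_km"] = haversine_km(df.pickup_latitude, df.pickup_longitude,
                                 df.dropoff_latitude, df.dropoff_longitude)

# Prints a weak correlation between straight-line distance and duration:
print(df["distance_km"].corr(df["trip_duration"]))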

Idea 1: Modeling hidden variables with changes in speed

I had the idea of letting the neural network represent the hidden variables by outputting adjustments to some base speed, then using the adjusted speed to calculate the trip time with the formula:

time = trip_distance/adjusted_speed.

For example, the neural network might have 10 output neurons:

[-1, -2, 3, 1, 2, 4, 0.5, 0, 0.2, 1] -> sum = 8.7 km/h

The adjusted speed would then be (assuming a base speed of 10 km/h):

10 km/h + 8.7 km/h = 18.7 km/h.

Predicted trip duration (assuming a 5 km trip):

5 / 18.7 ≈ 0.267 hours, or about 962 seconds.
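Here is a minimal sketch of this design in PyTorch; the framework, the layer sizes, and the clamp that keeps the speed positive are illustrative assumptions, not my exact setup:

import torch
import torch.nn as nn

class SpeedAdjustmentNet(nn.Module):
    # Predicts trip time by summing learned adjustments onto a base speed.
    def __init__(self, n_features, n_adjustments=10, base_speed=10.0):
        super().__init__()
        self.base_speed = base_speed  # km/h
        self.body = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Linear(64, n_adjustments),  # the 10 adjustment neurons, in km/h
        )

    def forward(self, features, distance_km):
        adjustments = self.body(features)                 # shape (batch, 10)
        speed = self.base_speed + adjustments.sum(dim=1)  # adjusted speed, km/h
        speed = speed.clamp(min=0.1)  # guard against non-positive speeds
        return distance_km / speed * 3600.0               # trip time in seconds

net = SpeedAdjustmentNet(n_features=8)  # 8 is a placeholder feature count
times = net(torch.randn(4, 8), torch.full((4,), 5.0))  # four 5 km trips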

One of the problems with this design is that the error gradients tended to be very large. For example, if the network predicts a time of 650 seconds and the actual time is 750 seconds, the error signal is 100. Raw residuals of that size made the weights unstable and caused the outputs and weights to explode.

Idea 2: Regression to Classification

To counter the large error signals, I turned the network from a regression problem into a classification problem. I took the maximum trip time in the dataset (5,060 seconds) and split the range into 91 buckets, each representing a 60-second interval. The new objective of the neural network was to predict the time bucket that a taxi trip belongs to; because this is classification rather than regression, the error signals won't be massive, and cross-entropy loss can be used.
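In code, the bucketing and the classification objective look roughly like this (the layer sizes and the tiny placeholder data are illustrative, not my exact setup):

import numpy as np
import torch
import torch.nn as nn

BUCKET_SECONDS = 60
N_BUCKETS = 91

# Map each duration (seconds) to a class label: bucket 0 = 0-59 s,
# bucket 1 = 60-119 s, ..., with the last bucket catching the longest trips.
durations = np.array([420, 962, 5060])  # placeholder durations
labels = np.minimum(durations // BUCKET_SECONDS, N_BUCKETS - 1)

# A minimal bucket classifier; cross-entropy takes raw logits plus
# integer bucket labels, so no softmax layer is needed here.
model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, N_BUCKETS))
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(len(labels), 8)  # placeholder feature tensor
loss = loss_fn(model(features), torch.as_tensor(labels))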

To stabilize training further, I normalized the inputs to zero mean and unit variance.
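With scikit-learn this is a one-liner; the feature matrix below is a placeholder:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 8))  # placeholder features

scaler = StandardScaler()  # per-column zero mean, unit variance
X_train = scaler.fit_transform(X_train)
# Reuse the same fitted scaler on validation/test inputs:
# X_test = scaler.transform(X_test)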

After training the network for a couple of minutes on my MacBook Pro, I got an average classification accuracy of 10% on a set of 700,000 examples, which is better than the 0.39% accuracy of random guessing but still leaves much to be desired.

(Sample predictions: left column = predicted time, right column = actual time.)

After this I was pretty much stumped. But then one night, while thinking about autoencoders and hidden activations, I had the idea to look at the activations of the output neurons to see if there was anything interesting, and there certainly was!

Here are some of the results (the x-axis represents each output neuron, and the y-axis represents its activation):

I was expecting the outputs to have a single spike at the actual time bucket and to be close to zero everywhere else. In reality, the neural network naturally outputs a distribution over likely time buckets, with a peak at the most likely one.
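You can see this yourself by plotting the softmax of the output layer for a single trip. Roughly (the model here stands in for the trained bucket classifier, and x for one normalized input row):

import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 91))  # stand-in
x = torch.randn(8)  # stand-in for one normalized feature row

with torch.no_grad():
    probs = F.softmax(model(x.unsqueeze(0)), dim=1).squeeze(0)

plt.bar(range(len(probs)), probs.numpy())
plt.xlabel("output neuron (time bucket)")
plt.ylabel("activation (softmax probability)")
plt.show()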

Why does it do this? My theory is that the network outputs a relative plot of the most common trip times given the features, which means there are many examples with similar features but different times. This suggests there are many more unaccounted-for features that would make the neural net more accurate.

What those features are, I have yet to figure out. I'm getting close, but still not quite there. I'll just have to keep trying new ideas until something works out!

Check out my Jupyter notebook for the source code I used:
