Predicting New York City Taxi Fare Price and Shiny Dashboard

The NYC Taxi Fare App

I developed an interactive map with statistics and graphs to visualize the distributions of the variables and the geographic distribution of taxi rides across the city.

Link to Shiny App: NYC Taxi Fare Shiny App

Introduction

Kaggle’s New York City Taxi Fare Prediction Competition tasked its participants with predicting the fare amount for a taxi ride in New York City given the pickup and dropoff locations. While distance is an obviously important predictor and gives a rough estimate on its own, a model based on distance alone yields an RMSE of roughly $5–$8, depending on the model.

The dataset is quite large: over 55 million rows. Working through this competition, participants learn how to handle large datasets and how to solve problems with cloud computing services and/or TensorFlow for deep learning models. The evaluation metric is RMSE. This project involved preprocessing, cleaning, and modeling a large dataset.

Data

The training set contains 55 million rows (5 GB) with six predictor variables and one target variable, fare_amount in USD. The test set contains 10,000 rows. The features are:

  1. pickup_datetime
  2. pickup_longitude
  3. pickup_latitude
  4. dropoff_longitude
  5. dropoff_latitude
  6. passenger_count

Project Overview

Data Cleaning and Preprocessing

Summary of Data

The original training set contained rows with extreme or inaccurate values: passenger counts greater than 200, fares over $1,000, and rides with zero passengers.

All predictor variables contained missing values. Because the training set is more than 1,000 times larger than the test set, removing outliers and rows with negative fare amounts or missing values improved performance without causing overfitting.

Distributions of Fare Amount and Passenger Count from NYC Taxi Fare Shiny App

Rides with fare amounts greater than $200 were removed. Many rows contained longitude and latitude coordinates located in the ocean rather than on land; these rows were also removed from the training set. Because of the dataset's large size, the data had to be cleaned in chunks; after cleaning, each chunk was appended to a new CSV file, as sketched below.
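A minimal sketch of this chunked cleaning with pandas. The file names, chunk size, bounding box, and passenger upper bound are assumptions; only the $200 fare cap, the negative-fare rule, and the zero-passenger rule come from the text above.

```python
import pandas as pd

# Rough bounding box around the NYC area; coordinates outside it
# (e.g., points in the ocean) are dropped. Exact bounds are an assumption.
LON_MIN, LON_MAX = -74.3, -72.9
LAT_MIN, LAT_MAX = 40.5, 41.7

def clean_chunk(chunk):
    chunk = chunk.dropna()  # all predictors contained missing values
    chunk = chunk[(chunk['fare_amount'] > 0) & (chunk['fare_amount'] <= 200)]
    chunk = chunk[chunk['passenger_count'].between(1, 6)]  # upper bound is an assumption
    for col in ('pickup_longitude', 'dropoff_longitude'):
        chunk = chunk[chunk[col].between(LON_MIN, LON_MAX)]
    for col in ('pickup_latitude', 'dropoff_latitude'):
        chunk = chunk[chunk[col].between(LAT_MIN, LAT_MAX)]
    return chunk

# Stream the 55M-row file in chunks and append each cleaned chunk to a new CSV.
first = True
for chunk in pd.read_csv('train.csv', chunksize=5_000_000):
    clean_chunk(chunk).to_csv('train_clean.csv', mode='w' if first else 'a',
                              header=first, index=False)
    first = False
```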

Feature Engineering

I added 13 new features. Since the dataset is quite large, each new feature noticeably increased its size and memory usage, so ordinal and numerical features were converted to float32 to save memory.
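For example, downcasting all float64 columns to float32 halves their memory footprint (the file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv('train_clean.csv')  # placeholder name for the cleaned data

# Convert every float64 column to float32, halving its memory use.
float_cols = df.select_dtypes(include='float64').columns
df[float_cols] = df[float_cols].astype('float32')

print(f"{df.memory_usage(deep=True).sum() / 1e9:.2f} GB")
```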

Many travelers take taxis to and from airports instead of driving. Taxis have regulated fares to and from Newark Liberty International Airport (EWR) and John F. Kennedy International Airport (JFK). Trips between EWR and New York City cost the regular metered fare plus a $17.50 surcharge and tolls. Trips between JFK and Manhattan have a flat fare of $52 plus tolls. The regular metered fare, which is based on a combination of time and distance, applies to all trips to and from LaGuardia Airport (LGA). Fares for trips to and from airports therefore tend to be higher.

New features

distance: Distance between the pickup and dropoff locations.

pickup_jfk: Distance between the pickup location and JFK airport.
dropoff_jfk: Distance between the dropoff location and JFK airport.

pickup_ewr: Distance between the pickup location and EWR airport.
dropoff_ewr: Distance between the dropoff location and EWR airport.

pickup_lga: Distance between the pickup location and LGA airport.
dropoff_lga: Distance between the dropoff location and LGA airport.

year, month, day of month, hour, day of week: The year, month, day of month, hour, and day of week of the pickup date and time.
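A minimal sketch of how these features can be computed with a vectorized haversine distance. The airport coordinates are approximate, and the function and column names are my own:

```python
import numpy as np
import pandas as pd

# Approximate airport coordinates (latitude, longitude).
AIRPORTS = {'jfk': (40.6413, -73.7781),
            'ewr': (40.6895, -74.1745),
            'lga': (40.7769, -73.8740)}

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two sets of coordinates."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))  # Earth radius ~6371 km

def add_features(df):
    # Trip distance.
    df['distance'] = haversine(df['pickup_latitude'], df['pickup_longitude'],
                               df['dropoff_latitude'], df['dropoff_longitude'])
    # Distances between each trip endpoint and the three airports.
    for name, (lat, lon) in AIRPORTS.items():
        df[f'pickup_{name}'] = haversine(df['pickup_latitude'],
                                         df['pickup_longitude'], lat, lon)
        df[f'dropoff_{name}'] = haversine(df['dropoff_latitude'],
                                          df['dropoff_longitude'], lat, lon)
    # Date and time components of the pickup.
    dt = pd.to_datetime(df['pickup_datetime'])
    df['year'], df['month'] = dt.dt.year, dt.dt.month
    df['day_of_month'], df['hour'] = dt.dt.day, dt.dt.hour
    df['day_of_week'] = dt.dt.dayofweek
    return df
```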

Modeling

Correlation Heatmap

The correlations between the features and the target are very low. Nonlinear models such as tree-based models do better in cases like this, where none of the predictors is highly correlated with the target variable. I chose LightGBM for its speed and performance: it is roughly 8–10 times faster than XGBoost. Because of LightGBM's efficiency, training the model on 55 million rows took less than 3 hours. I used the pickle library to save the model as a binary file for convenience.
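A sketch of the training and pickling steps; the file name is a placeholder and the hyperparameters are illustrative guesses, not the values actually used:

```python
import pickle
import lightgbm as lgb
import pandas as pd

train = pd.read_csv('train_features.csv')  # placeholder: cleaned data with engineered features
features = [c for c in train.select_dtypes('number').columns if c != 'fare_amount']

# Illustrative hyperparameters; the actual settings are not listed in this post.
params = {'objective': 'regression', 'metric': 'rmse',
          'learning_rate': 0.05, 'num_leaves': 31}

dtrain = lgb.Dataset(train[features], label=train['fare_amount'])
model = lgb.train(params, dtrain, num_boost_round=500)

# Save the trained booster as a binary file with pickle.
with open('lgb_model.pkl', 'wb') as f:
    pickle.dump(model, f)
```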

In order to load all the data into memory and train the model, I used Paperspace, an affordable high-performance cloud computing and ML development platform for building, training, and deploying machine learning models.

Feature Importance (LightGBM)

Distance is the most important feature. The most important features are highly related and not independent: distance is derived from the pickup latitude, dropoff latitude, pickup longitude, and dropoff longitude. Since most taxis operate in Manhattan, most trips to and from airports are going to be between an airport and Manhattan. Passenger count, day of week, month, and year are not nearly as important for predicting fare amount.
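To see which features drive the model, the trained booster exposes split- and gain-based importances; a quick way to print them, reusing `model` from the sketch above:

```python
# Gain-based importance: total loss reduction contributed by each feature's splits.
importances = sorted(
    zip(model.feature_name(), model.feature_importance(importance_type='gain')),
    key=lambda pair: pair[1], reverse=True)
for name, gain in importances:
    print(f'{name:>20} {gain:,.0f}')
```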

Unfortunately, even the C7 server's 30 GB of RAM is not enough to train a LightGBM model on 15+ GB of data, so I trained on 70% of the training data. I used the test set as the validation set in order to implement early stopping. The final model had an RMSE of 3.28023 on the C7 server; however, the RMSE score on Kaggle was 2.88475.
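A sketch of how the 70% sample and early stopping might be wired up, reusing `train`, `features`, and `params` from the training sketch above; `valid` stands in for a labeled hold-out set. Note that `lgb.early_stopping` is the callback API in recent LightGBM releases (older versions take an `early_stopping_rounds` argument instead):

```python
import lightgbm as lgb

# Sample 70% of the training data so it fits in 30 GB of RAM.
sampled = train.sample(frac=0.7, random_state=42)

dtrain = lgb.Dataset(sampled[features], label=sampled['fare_amount'])
dvalid = lgb.Dataset(valid[features], label=valid['fare_amount'], reference=dtrain)

model = lgb.train(
    params, dtrain,
    num_boost_round=5000,
    valid_sets=[dvalid],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],  # stop when validation RMSE stalls
)
```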

Conclusions

The LightGBM model did very well. If it had been trained on the entire training set, performance would likely have been even better. LightGBM performed well even without cross-validated parameter tuning via grid search, which would have taken days to return the best parameter estimates.

Further Improvements

Adding a neighborhood feature and using unsupervised methods to cluster similar data points may improve accuracy.
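For instance, a minimal k-means sketch that buckets coordinates into pseudo-neighborhoods; the cluster count and column names are arbitrary choices, and `train` is the feature DataFrame from earlier:

```python
from sklearn.cluster import MiniBatchKMeans

# Fit pseudo-neighborhoods on pickup coordinates; k=40 is an arbitrary choice.
coords = train[['pickup_latitude', 'pickup_longitude']].to_numpy()
kmeans = MiniBatchKMeans(n_clusters=40, random_state=42).fit(coords)

# Use the cluster ids as categorical neighborhood features.
train['pickup_neighborhood'] = kmeans.labels_.astype('int16')
train['dropoff_neighborhood'] = kmeans.predict(
    train[['dropoff_latitude', 'dropoff_longitude']].to_numpy()).astype('int16')
```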