Feature Engineering with Geospatial Data: Predicting NYC Cab Trip Duration

Claudia Ng
Published in Analytics Vidhya
9 min read · Jun 20, 2020

Mobility data has surged in popularity recently due to COVID-19, so I wanted to work on a prediction problem involving geospatial data. I decided to tackle the NYC Cab Trip Duration Kaggle competition, where the objective is to predict the trip duration of NYC cab rides from primarily geospatial and temporal features.

Using a LightGBM model, I achieved an RMSLE score of 0.38109, which would place me at #177 of 1,254 entries on the public leaderboard, or roughly the top 14% (Kaggle doesn’t publish late submission scores on the leaderboard).

RMSLE score of predictions scored by Kaggle
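
For readers unfamiliar with the metric, RMSLE (root mean squared logarithmic error) compares the logarithms of predicted and actual durations, so it penalizes relative rather than absolute errors. The snippet below is a minimal sketch of the standard formula, not Kaggle's scoring code; the function name `rmsle` and the sample values are illustrative assumptions.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error over two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), which keeps zero-valued durations well defined.
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Made-up durations, purely for illustration.
actual = [600, 1200, 300]
predicted = [650, 1000, 320]
score = rmsle(actual, predicted)
```

A perfect prediction yields a score of 0, and lower is better.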

In this article, I will outline the six steps taken to arrive at the final model, with a major focus on geospatial feature engineering.

About the Data

The original train dataset from Kaggle contains almost 1.5 million cab rides in New York City taken from January 1 to June 30, 2016. The dataset has 11 columns, broken down by category:

  • Target/label column: [trip_duration]
  • Geospatial data: [pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude]
  • Temporal data: [pickup_datetime, dropoff_datetime]. I dropped the dropoff_datetime column when training the model to avoid leakage, since the difference between the dropoff and pickup times gives the trip duration directly.
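
The leakage concern with dropoff_datetime can be illustrated with a small pandas sketch. The rows below are made-up values rather than actual competition data, and the sketch assumes trip_duration is recorded in seconds; the variable names are illustrative.

```python
import pandas as pd

# Toy rows using the column names listed above (values are invented).
rides = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2016-01-01 00:00:00", "2016-03-15 08:30:00"]),
    "dropoff_datetime": pd.to_datetime(["2016-01-01 00:10:00", "2016-03-15 09:00:30"]),
    "trip_duration": [600, 1830],  # assumed to be in seconds
})

# Subtracting pickup from dropoff reproduces the target exactly,
# so a model given dropoff_datetime could trivially "predict" the label.
recovered = (rides["dropoff_datetime"] - rides["pickup_datetime"]).dt.total_seconds()
leaks_target = bool((recovered == rides["trip_duration"]).all())

# Drop the leaky column (and the label itself) before building features.
features = rides.drop(columns=["dropoff_datetime", "trip_duration"])
```

Here `leaks_target` evaluates to True for the toy rows, which is why the column is excluded from training.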
