Feature Engineering with Geospatial Data: Predicting NYC Cab Trip Duration
Mobility data has surged in popularity recently due to COVID-19, so I wanted to work on a prediction problem involving geospatial data. I decided to tackle the NYC Cab Trip Duration Kaggle competition, where the objective is to predict trip duration of NYC cab rides given primarily geospatial and temporal features.
Using a LightGBM model, I achieved an RMSLE score of 0.38109, which would put me at position #177 of 1254 entries on the public leaderboard, or roughly the top 14% (Kaggle doesn’t publish late submission scores on the leaderboard).
In this article, I will outline the six steps taken to arrive at the final model, with a major focus on geospatial feature engineering.
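For readers unfamiliar with the metric, here is a minimal sketch of RMSLE using its standard definition: the square root of the mean squared difference between log(1 + predicted) and log(1 + actual). The function name `rmsle` and the sample durations are illustrative assumptions, not code or data from this project.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), so a duration of 0 seconds stays well defined
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical trip durations in seconds, made up for illustration only
actual = [600, 1200, 300]
predicted = [660, 1000, 330]
print(rmsle(actual, predicted))

# Leaderboard arithmetic from the text: rank 177 of 1254 entries
print(f"{177 / 1254:.1%}")
```

Because the error is computed on logarithms, the metric penalizes relative rather than absolute mistakes: being off by one minute matters far more on a five-minute trip than on an hour-long one.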
About the Data
The original training dataset from Kaggle contains almost 1.5 million cab rides taken in New York City from January 1 to June 30, 2016. The dataset has 11 columns, broken down by category:
- Target/label column: [trip_duration]
- Geospatial data: [pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude]
- Temporal data: [pickup_datetime, dropoff_datetime]. I dropped the dropoff_datetime column when training the model for fear of leakage: subtracting the pickup time from the dropoff time gives the trip duration directly.