Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Feature Engineering with Geospatial Data: Predicting NYC Cab Trip Duration

9 min read · Jun 20, 2020

Mobility data has surged in popularity recently due to COVID-19, so I wanted to work on a prediction problem involving geospatial data. I decided to tackle the NYC Cab Trip Duration Kaggle competition, where the objective is to predict trip duration of NYC cab rides given primarily geospatial and temporal features.

Using a LightGBM model, I achieved an RMSLE score of 0.38109, which would place me at #177 of 1,254 entries on the public leaderboard, or roughly the top 14%. (Kaggle doesn’t publish late submission scores on the leaderboard.)

RMSLE score of predictions scored by Kaggle
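
For readers unfamiliar with the metric, RMSLE (root mean squared logarithmic error) compares the logarithms of predicted and actual values, so it penalizes relative rather than absolute error. Below is a minimal sketch of how it could be computed with NumPy. The function name `rmsle` and the sample durations are illustrative assumptions, not code or data from the competition.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which stays defined when a value is zero
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Made-up trip durations in seconds, for illustration only
actual = [600, 1200, 300]
predicted = [660, 1000, 330]
print(rmsle(actual, predicted))
```

Because the error is measured on a log scale, being off by one minute on a five-minute ride costs more than being off by one minute on an hour-long ride.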

In this article, I will outline the six steps taken to arrive at the final model, with a major focus on geospatial feature engineering.

About the Data

The original training dataset from Kaggle contains almost 1.5 million cab rides taken in New York City from January 1 to June 30, 2016. The dataset has 11 columns, broken down by category:

  • Target/label column: [trip_duration].
  • Geospatial data: [pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude].
  • Temporal data: [pickup_datetime, dropoff_datetime]. I dropped the dropoff_datetime column when training the model to avoid leakage: the dropoff time minus the pickup time gives the trip duration directly, and it would not be known at prediction time.
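
To illustrate why dropoff_datetime counts as leakage, here is a small pandas sketch. The two rows are made up to mimic the column names listed above and are not taken from the Kaggle file. The variable names (`trips`, `features`, `target`) are likewise assumptions for illustration.

```python
import pandas as pd

# Two made-up rows that mimic the Kaggle schema; values are illustrative only
trips = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38"],
    "trip_duration": [455, 663],  # seconds
})
for col in ["pickup_datetime", "dropoff_datetime"]:
    trips[col] = pd.to_datetime(trips[col])

# Dropoff minus pickup reproduces the target exactly in this toy example
elapsed = (trips["dropoff_datetime"] - trips["pickup_datetime"]).dt.total_seconds()
leaks_target = bool((elapsed == trips["trip_duration"]).all())
print(leaks_target)

# Keep only information that would be available when the ride starts
features = trips.drop(columns=["dropoff_datetime", "trip_duration"])
target = trips["trip_duration"]
```

A model given both timestamps could learn the subtraction and score almost perfectly on the training data while being useless for rides that have not finished yet.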

Published in Analytics Vidhya

Written by Claudia Ng

Data Scientist | FinTech | Language Enthusiast