New York City Taxi Trip Duration Prediction

Fabio Pecora
CUNY CSI MTH513
8 min read · May 2, 2022

Introduction

My team and I built our final solution for the competition “New York City Taxi Trip Duration”. Our challenge was to build a model that predicts the total ride duration of taxi trips in New York City. The NYC Taxi and Limousine Commission provided our primary dataset, which includes the following variables:

Working with the dataset

We approached this competition starting from the dataset, looking for data that could have a meaningful impact on any car trip, and we immediately thought about the weather. After some research we found a reliable dataset containing more than ten years of weather information. These are the variables in the weather dataset:

Comparing our original dataset to the new weather dataset, we noticed that the weather data covers a much longer time period than the taxi data, so we first had to restrict it to the relevant year.

weather["DATE"] = pd.to_datetime(weather["DATE"])
weather['year'] = weather['DATE'].dt.year
weather_2016 = weather[weather["year"]== 2016]
weather_2016.drop(['STATION',"NAME","year"], axis = 1, inplace = True)

We extracted the year from the date with the lines of code above to keep only the rows we are interested in (year 2016). After that, using the date as a key, we can easily perform a left merge, which keeps every row of the left dataframe.

# train and test are assumed to already have a DATE column (the pickup date),
# which serves as the merge key
left_merge = pd.merge(left=train, right=weather_2016, on="DATE", how="left")
left_merge_test = pd.merge(left=test, right=weather_2016, on="DATE", how="left")
train = left_merge.loc[:, left_merge.columns != 'DATE']
test = left_merge_test.loc[:, left_merge_test.columns != 'DATE']

We now have all the weather data in our dataset, but we only want to keep what we really need for our final model, to avoid overfitting. Instead of preserving all the weather variables (precipitation, snow, minimum temperature, maximum temperature, and average wind speed), we added two new variables called “good_weather” and “t_mean”.

train["good_weather"] = ((train['PRCP'] == 0) & (train['SNOW'] == 0))
train.drop(['AWND', 'PRCP', 'SNOW'], axis=1, inplace=True)
train["t_mean"] = ((train['TMAX']) + (train['TMIN']))/ 2
train.drop(['TMAX', 'TMIN'], axis=1, inplace=True)

The “good_weather” variable tells us whether there was no rain and no snow on the day of the trip, and the “t_mean” variable gives us the average of the maximum and minimum temperatures. We’re not done with the weather yet, since we want to apply a label encoder to our “good_weather” variable so that the column only contains 1s and 0s.

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(train['good_weather'])
train['good_weather'] = encoder.transform(train['good_weather'])
test['good_weather'] = encoder.transform(test['good_weather'])
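Transforming the test set above assumes that test already has the same “good_weather” and “t_mean” columns. That step isn’t shown in the post, but it would simply mirror the train-set feature engineering (column names as in the weather dataset):

# Assumption: the test set received the same weather merge, so we mirror
# the train-set feature engineering before encoding good_weather
test["good_weather"] = ((test['PRCP'] == 0) & (test['SNOW'] == 0))
test["t_mean"] = (test['TMAX'] + test['TMIN']) / 2
test.drop(['AWND', 'PRCP', 'SNOW', 'TMAX', 'TMIN'], axis=1, inplace=True)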

Our work with the weather dataset is done, so we can start focusing on two other main aspects of our model: the average speed of the trip and the rush hours. If the average speed exceeds the speed limit, we do not want to consider that trip, because we only care about legal trips. To calculate the speed, we first need the distance and the duration: the speed is the distance from the pickup position to the drop-off position divided by the duration of the trip. To find the distance we will use the haversine formula, which determines the great-circle distance between two points on a sphere given their longitudes and latitudes, and is very important in navigation.
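For reference, the standard form of the haversine formula, which the code below implements, is

$$a = \sin^2\!\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right), \qquad d = 2R \arcsin\!\left(\sqrt{a}\right),$$

where $\varphi_1, \varphi_2$ are the latitudes, $\Delta\varphi$ and $\Delta\lambda$ are the latitude and longitude differences in radians, and $R$ is the Earth’s radius (about 6371 km; the code below uses 6369 km).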

from math import radians, cos, sin, asin, sqrt

def haversine(row):
    # Great-circle distance (in km) between the pickup and drop-off points
    lon1 = row['pickup_longitude']
    lat1 = row['pickup_latitude']
    lon2 = row['dropoff_longitude']
    lat2 = row['dropoff_latitude']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6369 * c  # Earth's radius in km
    return km

We apply the formula to find our distance:

train['distance'] = train.apply(haversine, axis = 1)

Now that we have the distance and the duration (trip_duration), we can proceed to calculate the speed:

duration = train['trip_duration']  # trip duration in seconds
train['speed'] = train.distance / duration * 2236.936292

We multiplied by 2236.936292 to convert from km/s to miles/hour.
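To see where 2236.936292 comes from (1 mile = 1.609344 km and 1 hour = 3600 seconds):

# km/s to mph: multiply by seconds per hour, divide by km per mile
km_per_mile = 1.609344
seconds_per_hour = 3600
factor = seconds_per_hour / km_per_mile
print(round(factor, 6))  # 2236.936292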

Now we can drop the trips whose speed exceeds the limit:

#NYS max speed limit 65mph
train = train[(train.speed < 65)]
train.drop(['speed'], axis = 1, inplace = True)

As previously mentioned, we also decided to determine the rush hours in a day (the times of day with the most traffic):

def rush_hour_f(row):
    # 1 = morning rush, 2 = midday, 3 = evening rush, 0 = everything else
    rhour = row['real_hour']
    if 6 <= rhour <= 10:
        return 1
    if 10 < rhour < 16:
        return 2
    if 16 <= rhour <= 20:
        return 3
    return 0
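The post doesn’t show the function being applied; assuming real_hour is the hour extracted from the pickup timestamp, the new feature would be added the same way we applied haversine:

# Assumption: real_hour was extracted beforehand from the pickup timestamp, e.g.
# train['real_hour'] = pd.to_datetime(train['pickup_datetime']).dt.hour
train['rush_hour'] = train.apply(rush_hour_f, axis=1)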

We also decided to create a new variable named “is_weekend” so that, instead of dealing with 7 different days, we only deal with two groups of days (Monday to Thursday and Friday to Sunday), which is what we care about in terms of traffic.

train['is_weekend'] = train['weekday'] > 4
encoder.fit(train['is_weekend'])
train['is_weekend'] = encoder.transform(train['is_weekend'])
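Since the comparison already produces booleans, a simpler equivalent to the label-encoding step would be a plain integer cast:

# Equivalent to fitting and applying a LabelEncoder on a boolean column:
# True/False map directly to 1/0
train['is_weekend'] = (train['weekday'] > 4).astype(int)
test['is_weekend'] = (test['weekday'] > 4).astype(int)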

At this point we decided to make a graph showing all the positions of the pickups and drop-offs, looking for anomalies. In the following graph the pickups are in blue and the drop-offs in orange.

Code for the graph:

import folium
from folium import Marker

# Create a map centered on New York City
m = folium.Map(location=[40.7, -74], tiles='openstreetmap', zoom_start=4)

# Add points to the map (change head(n) for the number of points)
for idx, row in outlier_pickup.head(100).iterrows():
    Marker([row['pickup_latitude'], row['pickup_longitude']],
           popup='Pickup', icon=folium.Icon(color='blue')).add_to(m)
for idx, row in outlier_dropoff.head(100).iterrows():
    Marker([row['dropoff_latitude'], row['dropoff_longitude']],
           popup='Dropoff', icon=folium.Icon(color='orange')).add_to(m)
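The snippet above assumes outlier_pickup and outlier_dropoff already exist. How we built them isn’t shown here, but a sketch consistent with the bounding box used later (longitude between -75 and -73, latitude between 40 and 42) would be:

# Hypothetical reconstruction: trips whose pickup or drop-off falls outside
# the NYC bounding box used for filtering further below
outlier_pickup = train[~(train['pickup_longitude'].between(-75, -73) &
                         train['pickup_latitude'].between(40, 42))]
outlier_dropoff = train[~(train['dropoff_longitude'].between(-75, -73) &
                          train['dropoff_latitude'].between(40, 42))]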

Graph:

Based on this graph we noticed that some of the trips in our train dataset are unexpected and far from the others. These are cases of very long trips that we do not want to consider in our train dataset, so we will drop them. There are probably some outliers in the test dataset as well, but we only drop the ones in the train dataset, because at prediction time we need to handle every scenario.

train = train[(train.trip_duration < 1000000)]
train = train[train['pickup_longitude'].between(-75, -73)]
train = train[train['pickup_latitude'].between(40, 42)]
train = train[train['dropoff_longitude'].between(-75, -73)]
train = train[train['dropoff_latitude'].between(40, 42)]
duration = train['trip_duration']
# Work with the log of the duration; the submission applies np.exp to invert this
train['trip_duration'] = np.log(train['trip_duration'].values)

We just selected the trips whose coordinates fall within our longitude and latitude area of interest, and the outliers are automatically dropped.

Models

Our work with the dataset is done; from this point our main attention is on building the right model for the prediction. We made a few main considerations before choosing the models. The first was prioritizing algorithms that maximize the performance of the model. The second was model complexity, because even if a higher level of complexity can lead to greater performance, it also involves larger costs; in addition, we wanted to be able to explain what was happening in our model, and the more complex the model is, the harder it is to explain. The dataset size was another important consideration, because the amount of training data available is one of the main factors to consider when choosing a model. The last consideration was training time and cost. We tried to balance time, cost, and performance as well as we could.

After discussing all these considerations, we first decided to use a Random Forest Regressor. For this, and for any model we used, we first needed to split the dataset into training and testing sets to evaluate how well our model performs. The training set is used to fit the model, and its statistics are known. The second set, the testing set, is held out and used solely for prediction.

from sklearn.model_selection import train_test_split
X_train, X_testing, y_train, y_testing = train_test_split(X, Y, test_size=0.01, random_state=42)
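The split assumes that X and Y have already been defined, which isn’t shown in the post; a setup consistent with the preprocessing above would use the log-transformed trip_duration as the target and the remaining engineered columns as features:

# Assumed setup (not shown in the post): log trip_duration as the target,
# the engineered columns as the features
# (any remaining raw datetime/string columns would also need to be dropped)
Y = train['trip_duration']
X = train.drop(['trip_duration', 'id'], axis=1)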

The following lines of code show the procedure we used to create the model.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42, n_estimators=50)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_testing)

We opted to utilize 50 estimators for our model. We created our prediction after fitting the model with the training data. We were really pleased with this approach because it produced a good result in about 19 minutes and 38 seconds (which is not too bad).

Here is the final result of our prediction:

Why is this a good result? In this competition, the lower the score the better: the closer your score is to 0, the more accurate the prediction. Before starting to build a model, we checked the results of many other solutions; on the leaderboard they range from 0.28976 (first place) to 6.51592 (last place). So, we are looking for a low number as the result of our model.
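The competition is scored with a root mean squared logarithmic error (RMSLE); since our target is already log-transformed, a plain RMSE on the held-out split gives a close local estimate of the leaderboard score before submitting. A quick sanity check could look like this:

import numpy as np
from sklearn.metrics import mean_squared_error

# y_testing and y_pred are both in log space (np.log of the duration),
# so their RMSE approximates the leaderboard's RMSLE
rmse = np.sqrt(mean_squared_error(y_testing, y_pred))
print(f"Validation RMSE (log space): {rmse:.5f}")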

Even though we were satisfied, we decided to continue our search for the ideal model. We tried XGBoost and LightGBM. XGBoost is a decision-tree-based ensemble machine learning method that uses a gradient boosting framework. LightGBM is a distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks. Both algorithms performed quite well and were almost equivalent in terms of performance, but LightGBM was significantly faster. This is not the only reason for our choice; we mainly chose this model for the following advantages it offers:

· Faster training speed and higher efficiency.

· Lower memory usage.

· Better accuracy.

· Support of parallel, distributed, and GPU learning.

· Capable of handling large-scale data.

We imported LightGBM and we trained it using the best parameters.

#~250s
import lightgbm as lgb

lgb_params = {
    'learning_rate': 0.1,
    'max_depth': 25,
    'num_leaves': 1000,
    'objective': 'regression',
    'feature_fraction': 0.9,
    'bagging_fraction': 0.5,
    'max_bin': 1000,
}

# Training on all labeled data using the best parameters
lgb_df = lgb.Dataset(X_var, Y_var)
lgb_model = lgb.train(lgb_params, lgb_df, num_boost_round=1500)

We then generated our prediction:

y_pred = lgb_model.predict(X_testing)

We created the submission using the ID of the test dataset:

submission = pd.DataFrame({'id': test.id, 'trip_duration': np.exp(y_pred)})
submission.to_csv('submission.csv', index=False)
submission.head()

And we finally got our best result.

Final considerations

Why is this our best result?

In terms of performance, 0.37919 is a great result, especially compared to our previous result of 0.40658, an improvement of 0.02739.

In terms of time, we gained about 8 minutes and 23 seconds, since the run takes 11 minutes and 15 seconds instead of the previous 19 minutes and 38 seconds.

In terms of complexity, we must consider that LightGBM produces much more complex trees by following a leaf-wise split approach rather than a level-wise approach. This can sometimes lead to overfitting, but we avoided it by setting the max_depth parameter.

In terms of size, LightGBM can sometimes overfit small datasets, but it handles big datasets very well. Our dataset isn’t enormous, but it’s far from small, which is why the model performed very well.

Our final result of 0.37919 would have placed our team 193rd on a leaderboard of 1254 submissions.
