Linear Regression Model on the New York Taxi Trip Duration Dataset using Python

Anuradha Das
Published in Analytics Vidhya
8 min read · Sep 19, 2019

Anuradha took the Applied Machine Learning course and presents her project on the popular NYC Taxi Trip Duration dataset.

Photo by Fabien Bazanegue on Unsplash

This is a continuation of a previous Exploratory Data Analysis story on the same data set. If you haven’t read it yet, I would highly recommend checking it out first. It can be found here. The data set used can be downloaded from here.

Here, we will carry on from where we left off.

A quick recap of the EDA done so far:

  1. We loaded the relevant libraries and data set.
  2. We identified the categorical and continuous features.
  3. We did univariate analysis of some of the columns and bivariate analysis of some with respect to the target column trip_duration. We identified some outliers through these processes and imputed them with relevant values.
  4. We reached some conclusions based on our EDA.

Here we will continue the transformation and analysis of some of the features to identify anomalous data points. We will also conduct some outlier treatment, imputing or dropping those points before we move into the final model-building stage.

We will also import new Python Libraries when and where needed.
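For reference, a minimal sketch of the set-up carried over from the EDA story might look like the following. The CSV file name is an assumption; point it at wherever you saved the downloaded data.

# assumed set-up from the previous EDA story -- the file name/path is hypothetical
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('nyc_taxi_trip_duration.csv')
# the datetime conversion below was already done in the EDA story
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['dropoff_datetime'] = pd.to_datetime(df['dropoff_datetime'])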

Let’s look at some of the features first:

Trip_Duration

We will have a look at the largest and smallest occurring Trip_Duration values.

print('The largest 5 trip duration values are as follows : \n {}'.format(df['trip_duration'].nlargest(5)))
print('The number of rows with 1 as their trip duration value is {}'.format(len(df[df['trip_duration']==1])))

We see that there is one very large value and 13 trips with a duration of just 1 second, which is absurd. Hence we drop these rows.

df=df[df.trip_duration!=df.trip_duration.max()]
df=df[df.trip_duration!=df.trip_duration.min()]

We will now create another column with the trip_duration represented in hours. This will later be used for finding the speed of each trip.

df['trip_duration_hour']=df['trip_duration']/3600 

Passenger Count

We will have a look at the passenger count frequencies.

df.passenger_count.value_counts()

Here the number of records with passenger counts 0, 7 and 9 is very small compared to the entire data set. Hence, we will drop these rows.

df=df[df.passenger_count<=6]
df=df[df.passenger_count!=0]

Pickup_datetime and Dropoff_datetime

We have already converted these two columns into datetime type in the last story. Now we create new columns depicting the month and the day of the week on which each trip took place.

df['pickup_day']=df['pickup_datetime'].dt.day_name()
df['dropoff_day']=df['dropoff_datetime'].dt.day_name()
df['pickup_month']=df['pickup_datetime'].dt.month
df['dropoff_month']=df['dropoff_datetime'].dt.month

Let’s have a look at the distribution of pickup and drop-off months.

df['pickup_month'].value_counts()
df['dropoff_month'].value_counts()

All the months have a fairly uniform distribution of trips. No data is present for pickup months beyond June, but a few drop-offs are recorded in July. These may be outliers as well, so we will have to look into them.

For the drop-offs done in July we will find the frequency distribution of the corresponding pickup months. We find the corresponding dates as well.

print(df[df.dropoff_month==7].pickup_datetime.dt.month.value_counts())
print(df[df.dropoff_month==7].pickup_datetime.dt.day.value_counts())

Thus we see that all the pickups were done on 30th June for the drop-offs in July. So the data seems fine.

Trip Distance, Speed, Time

We are creating a function which returns the distance between a pair of latitudes and longitudes using the haversine distance formula.

# a function to calculate the great-circle distance from latitudes and longitudes
from math import radians, cos, sin, asin, sqrt

def haversine(df):
    lat1, lon1, lat2, lon2 = df.pickup_latitude, df.pickup_longitude, df.dropoff_latitude, df.dropoff_longitude
    R = 6372.8  # Earth radius in kilometers; use 3959.87433 for miles
    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
    c = 2*asin(sqrt(a))
    return R * c

We will apply this function to each of the rows and create a new feature distance which stores the distance between the pickup and dropoff points in kilometers.

df['distance'] = df.apply(lambda x: haversine(x), axis = 1)

We will have a look at the distribution of this distance feature against the trip_duration value.

sns.scatterplot(x='distance',y='trip_duration',data=df)

We can see several outliers with values far beyond 200 km and many trips with distance = 0 km. These may be rows depicting cancelled rides. Let’s have a look at how many such rides there are.

print('The number of rows with distance = 0 is {}'.format(len(df[df.distance==0])))

That’s quite a number! We will not drop these rows. Rather, we will replace these values with the average distance.

mean_dist=df['distance'].mean()
df.loc[df['distance']==0,'distance']=mean_dist

We will now create a new feature called speed. This will help us identify data points where the time taken and the distance covered do not match up. We will also have a look at the distribution of trip speed.

df['speed']=df['distance']/df['trip_duration_hour']
sns.boxplot(df['speed'])

Here we see several outliers. The average speed of a taxi in New York City is about 11 km/hour, yet the data has several points with a speed way beyond that.
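Purely as an illustrative check (not part of the original workflow), we could count how many trips exceed an arbitrary threshold, say 100 km/hour, and summarise the speed distribution:

# illustrative sanity check -- the 100 km/h threshold is an arbitrary choice, not from the original analysis
print('Rows with speed above 100 km/h: {}'.format(len(df[df['speed'] > 100])))
print(df['speed'].describe())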

We will now have a look at the distribution of the distance variable against the trip duration in hours.

sns.scatterplot(x='distance',y='trip_duration_hour',data=df)

Here we see several data points where the distance is < 20 km yet the time taken is > 10 hours. This is absurd given an average speed of about 11 km/hour; it might be due to extreme road congestion. Let’s log transform these columns and have a look at the distribution again.

df['log_distance']=np.log(df.distance)
df['log_trip_duration']=np.log(df.trip_duration_hour)
sns.scatterplot(x='log_distance',y='log_trip_duration',data=df)

Here we see that the log-transformed values of trip duration and distance have a somewhat linear relationship. But there are still some anomalous data points where the duration value does not change even as the distance changes.

We will thus drop the rows with log_trip_duration > 2.

df=df[df.log_trip_duration<2]

Since we have added several columns to our data set, let’s have a look at them.

df.columns

Now, we won’t use all of them to build our model, as this would make the model very complex. We create a new data frame data2 to select only the features which had some effect on the target variable trip_duration. We drop certain features as they were transformed into other features, and we drop the nominal features as well.

e.g. latitudes and longitudes were converted to distance, and the pickup and drop-off datetimes were converted to the corresponding months and weekdays, etc.

data2=df.loc[:,['passenger_count','store_and_fwd_flag','trip_duration', 'pickup_day', 'dropoff_day', 'pickup_month',
'dropoff_month','pickup_timezone','dropoff_timezone','speed','log_distance','distance']]

We will now transform the categorical features from data2 dataframe through one hot encoding.

data2=pd.get_dummies(data2,columns=['store_and_fwd_flag','pickup_day','dropoff_day','pickup_month','dropoff_month','pickup_timezone', 'dropoff_timezone'])

Now we will have a look at the correlation heatmap between the features so that it is easier for us to choose the best ones.
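A minimal sketch of how this heatmap can be drawn with seaborn (the figure size and colour map are arbitrary choices), together with each feature’s correlation with the target:

# correlation heatmap of the encoded features -- figure size and colour map are arbitrary choices
plt.figure(figsize=(20, 12))
sns.heatmap(data2.corr(), cmap='coolwarm')
plt.show()
# correlation of each feature with the target, sorted
print(data2.corr()['trip_duration'].sort_values(ascending=False))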

Thus we see that some features have a high correlation with other features and some are not correlated at all.

First we will create a model with the mean of trip_duration as the prediction. Then we will create a baseline model with only distance, which has a correlation > 0.5 with trip_duration. Next, we will choose the other features that are positively correlated with trip_duration and create the third model.

We will split our data into two parts. The first part will be used to train the model and the second part will be used for testing.

Within the first part we will use K-Fold cross validation (k=20).

We define the baseline model columns, the columns to be used in the actual model building, and the target column. We have removed the speed feature from the predictor columns as it is highly correlated with distance and can lead to multicollinearity.

base_line_col=['distance']
predictor_cols=['passenger_count','distance','store_and_fwd_flag_N','store_and_fwd_flag_Y',
'pickup_day_Friday','pickup_day_Monday','pickup_day_Saturday','pickup_day_Sunday',
'pickup_day_Thursday','pickup_day_Tuesday','pickup_day_Wednesday','dropoff_day_Friday',
'dropoff_day_Monday','dropoff_day_Saturday','dropoff_day_Sunday','dropoff_day_Thursday',
'dropoff_day_Tuesday','dropoff_day_Wednesday','pickup_month_1','pickup_month_5','pickup_month_6',
'dropoff_month_1','dropoff_month_5','dropoff_month_6','pickup_timezone_late night',
'pickup_timezone_midday','pickup_timezone_morning','dropoff_timezone_evening',
'dropoff_timezone_late night','dropoff_timezone_midday','dropoff_timezone_morning']
target_col=['trip_duration']

We will define a function which will take the model object, the test data, the train data, the predictor columns and the target column. We will use Root Mean Squared Error as the evaluation metric. More on that here.

The function will print out the RMSE of the train data, the average value of the RMSE at each fold of K-Fold cross validation and the test data. It will also return the predicted values on the test data. We have imported the required libraries.

from sklearn import metrics
from sklearn.model_selection import cross_val_score

def modelfit(estimator, data_train, data_test, predictors, target):
    # fit the model on the training data
    estimator.fit(data_train[predictors], data_train.loc[:, target])
    # predictions on the train data
    train_pred = estimator.predict(data_train[predictors])
    # 20-fold cross-validation score on the train data
    cv_score = cross_val_score(estimator, data_train[predictors], data_train.loc[:, target],
                               cv=20, scoring='neg_mean_squared_error')
    cv_score = np.sqrt(np.abs(cv_score))
    # print model report
    print("\nModel Report")
    print("RMSE on Train Data: %.4g" % np.sqrt(metrics.mean_squared_error(data_train.loc[:, target].values, train_pred)))
    print("CV Score : Mean - %.4g | Std - %.4g | Min - %.4g | Max - %.4g" % (np.mean(cv_score), np.std(cv_score), np.min(cv_score), np.max(cv_score)))
    # predictions and RMSE on the test data
    test_pred = estimator.predict(data_test[predictors])
    print("RMSE on Test Data: %.4g" % np.sqrt(metrics.mean_squared_error(data_test.loc[:, target].values, test_pred)))
    return test_pred

We will now split the data into train and test sets in an 80:20 ratio.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
validation_size = 0.20
seed = 7
X_train, X_test = train_test_split(data2,test_size=validation_size, random_state=seed)

We will first create a model using the mean value as the predicted value for each test data point.

from math import sqrt
from sklearn.metrics import mean_squared_error
mean_pred = np.repeat(X_train[target_col].mean(), len(X_test[target_col]))
sqrt(mean_squared_error(X_test[target_col], mean_pred))

The RMSE from this is 674.5795545337122. We will now use this value as the base and try to achieve a RMSE less than this.

Next we will take the distance feature as the only predictor column and build a linear regression model. We will have a look at the RMSE obtained.

alg1 = LinearRegression(normalize=True)
print('The baseline model')
y_pred=modelfit(alg1, X_train, X_test,base_line_col,target_col)
coef1 = alg1.coef_
print('The coefficient is {}'.format(coef1))

As we can see, all the RMSE values are much lower than the mean-prediction RMSE, so our model performed better. We printed the coefficient fitted by the model as well.

Now we will take all the predictor columns and build a regression model.

alg2 = LinearRegression(normalize=True)
y_pred = modelfit(alg2, X_train, X_test, predictor_cols, target_col)
coef1 = pd.Series(alg2.coef_[0], predictor_cols).sort_values()
coef1.plot(kind='bar', title='Model Coefficients')

We see that the regression model with all the columns performed even better. We also plotted the coefficients fitted for each feature.

We will now plot the residuals to see whether the assumption of homoscedasticity holds.

# residuals of the full model on the test set
residuals = y_pred - X_test[target_col]
plt.figure(figsize=(10, 6), dpi=120, facecolor='w', edgecolor='b')
f = range(0, len(residuals))
k = [0 for i in range(0, len(residuals))]
plt.scatter(f, residuals, label='residuals')
plt.plot(f, k, color='red', label='zero line')
plt.xlabel('fitted points ')
plt.ylabel('residuals')
plt.title('Residual plot')
plt.legend()

Here the distribution of the residuals is homoscedastic, so this assumption of linear regression holds true.

That’s it for today. We will later explore more ways to improve our model and feature selection.

Thanks for reading!
