Exploratory Data Analysis of New York Taxi Trip Duration Dataset using Python

Published in

Analytics Vidhya

11 min read · Aug 1, 2019

Anuradha took the Applied Machine Learning course and presents her project on the popular NYC Taxi Trip Duration dataset.

Data analysis is one of the most crucial steps of the model-building process. In this article I perform data analysis on the NYC Taxi Trip Duration dataset. The dataset and problem statement are taken from the Applied Machine Learning course by Analytics Vidhya, which offers a number of such real-life projects.

Let us now discuss the problem statement for the project.

Problem Context:

A typical taxi company faces the common problem of efficiently assigning cabs to passengers so that the service is smooth and hassle-free. One of the main issues is determining the duration of the current trip, so that the company can predict when the cab will be free for the next trip.

The data set contains data on several taxi trips and their durations in New York City. I will now apply different data analysis techniques to gain insights into the data and determine how the target variable, trip duration, depends on the other variables.

Let's start!

Import Required Libraries

First we will import all the necessary libraries needed for analysis and visualization.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
sns.set()

Now that we have all the necessary libraries, let's load the data set into the pandas DataFrame df.

df=pd.read_csv('nyc_taxi_trip_duration.csv')

We read the dataset into the DataFrame df and will have a look at the shape, columns, column data types and the first 5 rows of the data. This gives a brief overview of the data at hand.

df.shape

This returns the number of rows and columns

df.columns

Here’s what we know about the columns:

Demographic information of Customer & Vendor

id : a unique identifier for each trip

vendor_id : a code indicating the provider associated with the trip record

passenger_count : the number of passengers in the vehicle (driver entered value)

Information about the Trip

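pickup_datetime : date and time when the meter was engaged

dropoff_datetime : date and time when the meter was disengaged

pickup_longitude : the longitude where the meter was engaged

pickup_latitude : the latitude where the meter was engaged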

dropoff_longitude : the longitude where the meter was disengaged

dropoff_latitude : the latitude where the meter was disengaged

store_and_fwd_flag : This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server (Y=store and forward; N=not a store and forward trip)

trip_duration : (target) duration of the trip in seconds

Thus we have a data set with 729322 rows and 11 columns. There are 10 features and 1 target variable, trip_duration.

df.dtypes

This returns the data type of the columns

df.head()

This returns the first 5 rows of the Data set

Thus we get a glimpse of the data set by looking at the first 5 rows returned by df.head(). Optionally we can specify the number of rows to be returned, by sending it as a parameter to the head() function.

Some observations about the data:

The columns id and vendor_id are nominal.
The columns pickup_datetime and dropoff_datetime are stored as object which must be converted to datetime for better analysis.
The column store_and_fwd_flag is categorical

Let's look at the numerical columns.

df.describe()

This returns a statistical summary of the numerical columns

The returned table gives certain insights:

There are no numerical columns with missing data.
The passenger count varies between 1 and 9, with most trips carrying 1 or 2 passengers.
The trip duration varies from 1 s to 1939736 s (about 539 hours). There are definitely some outliers present which must be treated.
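As a quick check of the conversion above, the maximum duration can be turned into hours and days. This is only a small sketch; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Largest trip_duration value quoted above, in seconds.
max_duration_s = 1939736

# Plain arithmetic: 3600 seconds per hour.
hours = max_duration_s / 3600

# pandas can also express the same span as days, hours, minutes and seconds.
as_timedelta = pd.to_timedelta(max_duration_s, unit="s")

print(round(hours, 1))
print(as_timedelta)
```

A single trip spanning more than three weeks is a strong hint that this value is an outlier.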

Let's have a quick look at the non-numerical columns.

non_num_cols=['id','pickup_datetime','dropoff_datetime','store_and_fwd_flag']
print(df[non_num_cols].count())

The counts of the specified columns are returned

There are no missing values in the non-numeric columns either.
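Missing values can also be counted directly rather than inferred from the row counts. The sketch below uses a small made-up frame (toy, not the real data set) to show the idea; on the actual data one would call the same method on df.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "id": ["id1", "id2", "id3"],
    "passenger_count": [1, 2, np.nan],        # one deliberately missing value
    "store_and_fwd_flag": ["N", None, "N"],   # one deliberately missing value
})

# isna() marks missing cells; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a count of 0 has no missing values.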

The 2 columns pickup_datetime and dropoff_datetime are now converted to datetime format, which makes analysis of date and time data much easier.

df['pickup_datetime']=pd.to_datetime(df['pickup_datetime'])
df['dropoff_datetime']=pd.to_datetime(df['dropoff_datetime'])

Univariate Analysis

Let's have a look at the distribution of various variables in the data set.

Passenger Count

sns.distplot(df['passenger_count'],kde=False)
plt.title('Distribution of Passenger Count')
plt.show()

A histogram of the number of passengers in each trip

Here we see that mostly 1 or 2 passengers take a cab. Large groups travelling together are rare.

The distribution of Pickup and Drop Off day of the week

df['pickup_datetime'].nunique()
df['dropoff_datetime'].nunique()

The returned values are 709359 and 709308. This shows that there are many different pickup and drop off dates in these 2 columns.

So it's better to convert these dates into days of the week so that a pattern can be found.

df['pickup_day']=df['pickup_datetime'].dt.day_name()
df['dropoff_day']=df['dropoff_datetime'].dt.day_name()

Now let's look at the distribution of the different days of the week.

df['pickup_day'].value_counts()

A frequency distribution of the different pickup days.

df['dropoff_day'].value_counts()

A frequency distribution of the different dropoff days.

Thus we see that most trips were taken on Friday and the fewest on Monday. The distribution of trip duration across the days of the week is something to look into as well.
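Because value_counts() sorts by frequency, the days come back in no particular calendar order. A minimal sketch of putting them back in weekday order is shown below; the timestamps and the names pickups and day_order are made up for illustration.

```python
import pandas as pd

# Hypothetical pickup timestamps, for illustration only.
pickups = pd.Series(pd.to_datetime([
    "2016-02-29 16:40:21",  # Monday
    "2016-03-11 23:35:37",  # Friday
    "2016-03-11 08:10:00",  # Friday
    "2016-02-21 17:59:33",  # Sunday
]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# reindex() restores calendar order and fills days that never occur with 0.
day_counts = (pickups.dt.day_name()
                     .value_counts()
                     .reindex(day_order, fill_value=0))
print(day_counts)
```

The same day_order list could also be passed as the order argument of sns.countplot to keep the bars in weekday order.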

The distribution of days of the week can be seen graphically as well.

figure,ax=plt.subplots(nrows=2,ncols=1,figsize=(10,10))
sns.countplot(x='pickup_day',data=df,ax=ax[0])
ax[0].set_title('Number of Pickups done on each day of the week')
sns.countplot(x='dropoff_day',data=df,ax=ax[1])
ax[1].set_title('Number of dropoffs done on each day of the week')
plt.tight_layout()

The distribution of the # of pickups and drop offs done on each day of the week

The distribution of Pickup and Drop Off hours of the day

The time part is represented by hours, minutes and seconds, which is difficult to analyze directly. Thus we divide the day into 4 time zones: morning (4 hrs to 10 hrs), midday (10 hrs to 16 hrs), evening (16 hrs to 22 hrs) and late night (22 hrs to 4 hrs).

def timezone(x):
    if x>=datetime.time(4, 0, 1) and x <=datetime.time(10, 0, 0):
        return 'morning'
    elif x>=datetime.time(10, 0, 1) and x <=datetime.time(16, 0, 0):
        return 'midday'
    elif x>=datetime.time(16, 0, 1) and x <=datetime.time(22, 0, 0):
        return 'evening'
    elif x>=datetime.time(22, 0, 1) or x <=datetime.time(4, 0, 0):
        return 'late night'
    
df['pickup_timezone']=df['pickup_datetime'].apply(lambda x :timezone(datetime.datetime.strptime(str(x), "%Y-%m-%d %H:%M:%S").time()) )df['dropoff_timezone']=df['dropoff_datetime'].apply(lambda x :timezone(datetime.datetime.strptime(str(x), "%Y-%m-%d %H:%M:%S").time()) )
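To see how the boundaries are assigned, the timezone function can be tried on a few edge cases. The sketch below repeats the function so that it runs on its own; the sample times are made up.

```python
import datetime

def timezone(x):
    # Same boundaries as the function above: each part ends exactly on the hour.
    if x >= datetime.time(4, 0, 1) and x <= datetime.time(10, 0, 0):
        return 'morning'
    elif x >= datetime.time(10, 0, 1) and x <= datetime.time(16, 0, 0):
        return 'midday'
    elif x >= datetime.time(16, 0, 1) and x <= datetime.time(22, 0, 0):
        return 'evening'
    elif x >= datetime.time(22, 0, 1) or x <= datetime.time(4, 0, 0):
        return 'late night'

# Boundary cases: the exact hour belongs to the part that is ending.
samples = [datetime.time(4, 0, 0), datetime.time(4, 0, 1),
           datetime.time(10, 0, 0), datetime.time(16, 0, 0),
           datetime.time(22, 0, 0), datetime.time(23, 30, 0)]
labels = [timezone(t) for t in samples]
print(labels)
```

So a pickup at exactly 04:00:00 still counts as late night, and one at 10:00:00 still counts as morning.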

Let's look at the distribution of the timezones.

figure,ax=plt.subplots(nrows=1,ncols=2,figsize=(10,5))
sns.countplot(x='pickup_timezone',data=df,ax=ax[0])
ax[0].set_title('The distribution of number of pickups on each part of the day')
sns.countplot(x='dropoff_timezone',data=df,ax=ax[1])
ax[1].set_title('The distribution of number of dropoffs on each part of the day')
plt.tight_layout()

The distribution of # of pickups and drop offs done on each part of the day

Thus we observe that most pickups and drop-offs occur in the evening, while the fewest occur in the morning.

Let's add columns for the hour of the day at which the pickup and the drop-off happened.

figure,ax=plt.subplots(nrows=1,ncols=2,figsize=(10,5))
df['pickup_hour']=df['pickup_datetime'].dt.hour
df.pickup_hour.hist(bins=24,ax=ax[0])
ax[0].set_title('Distribution of pickup hours')
df['dropoff_hour']=df['dropoff_datetime'].dt.hour
df.dropoff_hour.hist(bins=24,ax=ax[1])
ax[1].set_title('Distribution of dropoff hours')

The distribution of # of pickups and drop offs done on each hour of the day

The 2 distributions are almost identical and are also consistent with the earlier division of the day into 4 parts.

Distribution of the store and forward flag

df['store_and_fwd_flag'].value_counts()

The returned frequency distribution of the Yes/No Flag

The number of N flags is much larger. We can later see whether the flag has any relation to the duration of the trip.

Distribution of the trip duration

sns.distplot(df['trip_duration'],kde=False)
plt.title('The distribution of the trip duration')

This histogram shows extreme right skewness, hence there are outliers. Let's see the boxplot of this variable.

sns.boxplot(x=df['trip_duration'], orient='h')
plt.title('A boxplot depicting the trip duration distribution')

Thus we see there is only one value near 2000000 while all the others are somewhere between 0 and 100000. The one near 2000000 is definitely an outlier which must be treated.

Let's have a look at the 10 largest values of trip_duration.

print( df['trip_duration'].nlargest(10))

The 10 largest values in the column

The largest value is much greater than the 2nd and 3rd largest trip durations. This might be due to an error during data collection, or it might be a legitimate value. Since such a huge value is unlikely, it is better to drop this row before further analysis.

Alternatively, the value can be replaced by the mode or median of trip duration.
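As a rough sketch of the replacement option, the outlier can be overwritten with the median of the remaining rows. The frame toy and its values are made up for illustration; this is one possible way to do it, not the approach used in the rest of the article, which drops the row.

```python
import pandas as pd

# Toy durations in seconds; the last value mimics the extreme outlier.
toy = pd.DataFrame({"trip_duration": [400, 650, 900, 1939736]})

# Flag the outlier and compute the median from the other rows only,
# so that the outlier itself does not influence the replacement value.
is_outlier = toy["trip_duration"] == toy["trip_duration"].max()
median_duration = toy.loc[~is_outlier, "trip_duration"].median()

# Overwrite the flagged row with the median.
toy.loc[is_outlier, "trip_duration"] = median_duration
print(toy)
```

Replacing keeps the row (and its other columns) in the data set, whereas dropping removes it entirely.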

df=df[df.trip_duration!=df.trip_duration.max()]

Let's have a look at the distribution of trip_duration after we have dropped the outlier.

sns.distplot(df['trip_duration'])
plt.title('Distribution of the trip duration after the treatment of outliers')

The distribution of the trip duration in seconds after removing the outlier

There is still extreme right skewness. Thus we will divide the trip_duration column into intervals.

The intervals are decided as follows:

less than 5 hours
5–10 hours
10–15 hours
15–20 hours
more than 20 hours

# bin edges in seconds: 5 hours = 18000 s, 10 hours = 36000 s, and so on
bins=np.array([0,18000,36000,54000,72000,90000])
df['duration_time']=pd.cut(df.trip_duration,bins,labels=["< 5", "5-10", "10-15","15-20",">20"])
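Since trip_duration is stored in seconds while the intervals listed above are in hours, the bin edges have to be converted. The sketch below derives them from the hour boundaries and bins a few made-up durations; the names edges_hours, edges_seconds and durations are illustrative.

```python
import numpy as np
import pandas as pd

SECONDS_PER_HOUR = 3600

# Interval boundaries in hours, converted to seconds.
edges_hours = np.array([0, 5, 10, 15, 20, 25])
edges_seconds = edges_hours * SECONDS_PER_HOUR

labels = ["< 5", "5-10", "10-15", "15-20", ">20"]

# Toy durations in seconds (made up for illustration).
durations = pd.Series([600, 18000, 18001, 60000, 86000])

# pd.cut uses right-closed intervals by default, e.g. (0, 18000].
binned = pd.cut(durations, edges_seconds, labels=labels)
print(binned.tolist())
```

Because the intervals are closed on the right, a trip of exactly 5 hours (18000 s) falls into the "< 5" bin, and a duration of 0 s would fall outside every bin.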

Distribution of pickup longitude

sns.distplot(df['pickup_longitude'])
plt.title('The distribution of Pick up Longitude')

Distribution of drop off longitude

sns.distplot(df['dropoff_longitude'])
plt.title('The distribution of Drop off Longitude')

Distribution of dropoff latitude

sns.distplot(df['dropoff_latitude'])
plt.title('The distribution of drop off Latitude')

Distribution of pickup latitude

sns.distplot(df['pickup_latitude'])
plt.title('The distribution of pick up Latitude')

We see that the pickup longitude and the dropoff longitude have almost the same kind of distribution, while the pickup latitude and the dropoff latitude have slightly different distributions.

Distribution of vendor_id

df['vendor_id'].hist(bins=2)

As expected, the two vendors account for a similar number of trips.

Bivariate Analysis

Let's now look at the relationship between each variable and the target variable trip_duration.

The relationship between Trip Duration and The day of the week

sns.catplot(x="pickup_day",y="trip_duration",kind="bar",data=df,height=6,aspect=1)
plt.title('The Average Trip Duration per PickUp Day of the week')
sns.catplot(x="dropoff_day",y="trip_duration",kind="bar",data=df,height=6,aspect=1)
plt.title('The Average Trip Duration per Dropoff Day of the week')

The graphs show the average trip duration for each day of the week. The error bars give some indication of the uncertainty around that estimate.

Thus the highest average time taken to complete a trip is on Thursday, while trips on Monday, Saturday and Sunday take the least time.
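The bar heights in such a plot are simply per-day means, so the same numbers can be read off with a groupby. A minimal sketch on made-up data (the frame toy is not the real data set):

```python
import pandas as pd

# Toy trips: pickup day and duration in seconds, invented for illustration.
toy = pd.DataFrame({
    "pickup_day": ["Monday", "Monday", "Thursday", "Thursday", "Sunday"],
    "trip_duration": [600, 800, 1000, 1400, 700],
})

# Mean duration per day, largest first.
mean_by_day = (toy.groupby("pickup_day")["trip_duration"]
                  .mean()
                  .sort_values(ascending=False))
print(mean_by_day)
```

Having the means as numbers makes it easier to compare days whose bars look almost equal in the chart.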

But this is not enough. We must also take into consideration the percentage of short, medium and long trips taken on each day.

ax1=df.groupby('pickup_day')['duration_time'].value_counts(normalize=True).unstack()
ax1.plot(kind='bar', stacked=True)
plt.title('The Distribution of percentage of different duration of trips')

The graph shows a percentage distribution of the trips of different duration within each day of the week.

This does not give much insight, as the number of trips in the 0–5 hours range is much larger on all days.
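The table behind such a stacked percentage chart can be inspected directly. The sketch below builds it for a small made-up frame (toy); the column names mirror the ones used in this article, and the values are invented.

```python
import pandas as pd

# Toy data: pickup day and duration bucket, for illustration only.
toy = pd.DataFrame({
    "pickup_day": ["Monday", "Monday", "Monday", "Monday", "Sunday", "Sunday"],
    "duration_time": ["< 5", "< 5", "< 5", "5-10", "< 5", ">20"],
})

# Share of each duration bucket within each day; every row sums to 1.
shares = (toy.groupby("pickup_day")["duration_time"]
             .value_counts(normalize=True)
             .unstack(fill_value=0))
print(shares)
```

When one bucket dominates every row, as "< 5" does here for Monday, the stacked bars look almost identical and the smaller buckets are hard to compare.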

Let's look at the percentage of only the longer trips (with duration > 5 hours).

figure,ax=plt.subplots(nrows=1,ncols=3,figsize=(15,5))
ax1=df[(df.duration_time !="< 5")].groupby('pickup_day')['duration_time'].count()
ax1.plot(kind='bar',ax=ax[0])
ax[0].set_title('Distribution of trips > 5 hours')
ax2=df[(df.duration_time !="< 5")].groupby('pickup_day')['duration_time'].value_counts(normalize=True).unstack()
ax2.plot(kind='bar', stacked=True,ax=ax[1])
ax[1].set_title('Percentage distribution of trips > 5 hours')
ax3=df[(df.duration_time !="< 5")].groupby('pickup_day')['duration_time'].value_counts().unstack()
ax3.plot(kind='bar',ax=ax[2])
ax[2].set_title('A compared distribution of trips > 5 hours')

The 3 graphs present 3 types of information:
The leftmost graph shows the number of trips (> 5 hours) taken on each day of the week.
The middle one shows the percentage distribution of trips of different durations (> 5 hours) within each day of the week.
The right one shows the frequency distribution of trips of different durations (> 5 hours) within each day of the week.

Some key points:

Most trips lasting > 5 hours were taken on Thursday, followed by Friday and Wednesday. (left graph)
Most trips of duration 5–10 and 10–15 hours were taken on Thursday. (right graph)
But the highest percentage of trips longer than 20 hours was taken on Sunday and Saturday. (middle graph)

The relationship between Trip Duration and The time of the day

figure,(ax1,ax2)=plt.subplots(ncols=2,figsize=(20,5))
# catplot is figure-level and cannot draw into existing axes, so barplot is used here
ax1.set_title('Distribution of pickup hours')
sns.barplot(x="pickup_hour", y="trip_duration",data=df,ax=ax1)
ax2.set_title('Distribution of dropoff hours')
sns.barplot(x="dropoff_hour", y="trip_duration",data=df,ax=ax2)
plt.show()

The highest average trip durations are for trips started in midday (between 14 and 17 hours) and the lowest are for trips taken in the early morning (between 6 and 7 hours).

The relationship between passenger count and duration

sns.relplot(x="passenger_count", y="trip_duration", data=df, kind="scatter")

Here we see that passenger count has no clear relationship with trip duration. Note, however, that there are no long trips for higher passenger counts such as 7 or 9, while trip durations are more or less evenly distributed only for a passenger count of 1.

The relationship between vendor id and duration

sns.catplot(x="vendor_id", y="trip_duration",kind="strip",data=df)

Here we see that vendor 1 mostly provides cabs for short trips, while vendor 2 provides cabs for both short and long trips.

The relationship between store forward flag and duration

sns.catplot(x="store_and_fwd_flag", y="trip_duration",kind="strip",data=df)

Thus we see that the flag was set (Y) only for short trips; for long trips the record was never stored and forwarded.

The relationship between geographical location and duration

sns.relplot(x="pickup_latitude", y="dropoff_latitude",hue='pickup_timezone',row='duration_time',data=df);

Here’s what we see

for shorter trips (< 5 hours), the pickup and dropoff latitudes are more or less evenly distributed between 30° and 40°
for longer trips (> 5 hours), the pickup and dropoff latitudes are all concentrated between 40° and 42°

sns.relplot(x="pickup_longitude", y="dropoff_longitude",hue='pickup_timezone',row='duration_time',data=df);