Green Taxi Analysis- NYC

Rohit Kabra
Web Mining [IS688, Spring 2021]
9 min read | May 13, 2021

Rohit Kabra | Jinal Kalpesh Shah | Rahul Araveti

Transportation is an essential part of our lives, and in New York City where driving is not a viable option most of the time, public transportation and taxis are the only way to get around. — Author: Simi Linton

Green Taxis, also known as Boro Taxis (or Boro Cabs), were a bold attempt to balance the business needs of taxi owners with the needs of the tens of thousands of New Yorkers who live and work outside Manhattan and want to hail a cab. If you are a cab driver, it makes sense to work the densest areas, where demand is highest: instead of ten minutes between fares, you might wait one. As a result, yellow cab drivers focused almost entirely on Manhattan’s central business district and largely ignored neighborhoods such as Washington Heights, Sunset Park in Brooklyn, and Corona in Queens. It is hard to blame drivers for trying to survive under what are essentially inadequate market conditions.

In August 2013, the New York City Taxi and Limousine Commission (TLC) therefore introduced a fleet of green cabs to meet the unmet demand for taxi rides in the outer parts of New York City. The goal was to give the residents of Brooklyn, Queens, the Bronx, and Upper Manhattan more access to metered taxis.

Areas Map of NYC where Green Taxis Serve.

Green Taxis can only pick up street-hail passengers in northern Manhattan (north of West 110th Street and East 96th Street) and anywhere in the other four boroughs (except the airport areas).
They can drop you off anywhere in the five boroughs, including the airport areas.

To understand better how this program works, take a look at the following map:
– The green color corresponds to the zone where the green cabs can pick you up and drop you off,
– The yellow and grey colors are the zone where the green cabs can only drop you off.

With high demand for app-based pre-booking services such as Uber and Lyft, the street-hail business seems to be declining in NYC. A Forbes report suggests that Green Taxis initially performed well in some NYC boroughs but that revenue later dropped sharply. It is therefore interesting to analyze the recent performance of green taxis and what the NYC Taxi and Limousine Commission and the drivers can do to generate more revenue.

Tools and Technologies:

For this analysis we used Python in a Jupyter Notebook installed through Anaconda, an open-source, web-based interactive environment. We relied on several Python packages and libraries: requests, pandas, numpy, scipy, seaborn, and matplotlib.

Data Source:

The data used for this analysis was obtained from the NYC Open Data website, which provides trip-level records for green taxis from January 2020 through December 2020.

We collected the data through the SODA API. The Socrata Open Data API (SODA) provides programmatic access to this dataset, including the ability to filter, query, and aggregate data. Endpoint: https://data.cityofnewyork.us/resource/pkmi-4kfn.json

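import json
import urllib3

# Assumption: `http` is a urllib3 connection pool (it was not defined in the original snippet)
http = urllib3.PoolManager()

# Get data of NYC green taxi for 2020
url = "https://data.cityofnewyork.us/resource/pkmi-4kfn.json"
taxi = http.request('GET', url)
taxi.status  # 200 means the request succeeded
# Decode the JSON payload into Python objects (a list of trip records)
data = json.loads(taxi.data.decode('utf-8'))
data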

The above code returns the data in JSON format.

Green Taxi Data in json format.
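
SODA endpoints cap the number of rows returned by a single request, so retrieving a full year of trips typically means paging through the results. The sketch below only builds the paged query URLs using the `$limit`, `$offset`, and `$where` query parameters. The helper names `build_soda_url` and `page_urls`, the page size, and the row total are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/pkmi-4kfn.json"

def build_soda_url(base_url, limit, offset, where=None):
    """Return a SODA query URL for one page of results (hypothetical helper)."""
    params = {"$limit": limit, "$offset": offset}
    if where:
        params["$where"] = where  # SoQL filter expression
    # Keep "$" unescaped so the parameter names stay readable
    return base_url + "?" + urlencode(params, safe="$")

def page_urls(base_url, total_rows, page_size):
    """Yield one URL per page needed to cover total_rows."""
    for offset in range(0, total_rows, page_size):
        yield build_soda_url(base_url, page_size, offset)

# Illustrative numbers only: 120,000 rows fetched 50,000 at a time
urls = list(page_urls(BASE_URL, total_rows=120_000, page_size=50_000))
print(len(urls))
print(urls[0])

# A filtered query, e.g. only trips longer than 5 miles
filtered_url = build_soda_url(BASE_URL, 10, 0, where="trip_distance > 5")
print(filtered_url)
```

Each URL can then be fetched with the same `http.request('GET', url)` call used above and the decoded pages concatenated into one list.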

After converting our dataset to a pandas DataFrame, we obtain 1,734,051 rows and 19 columns.

2020 Green Taxi Data set
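
The conversion step is not shown in the article. A minimal sketch of it, assuming the decoded JSON is a list of dictionaries whose values may arrive as strings, could look like the following. The sample records are made up for illustration, and `records_to_frame` is a hypothetical helper name.

```python
import pandas as pd

# Two made-up records shaped like the API output (illustrative values only)
sample_records = [
    {"vendorid": "2", "lpep_pickup_datetime": "2020-01-01T00:45:58.000",
     "trip_distance": "1.28", "fare_amount": "6.5"},
    {"vendorid": "1", "lpep_pickup_datetime": "2020-01-01T07:10:00.000",
     "trip_distance": "12.4", "fare_amount": "36"},
]

def records_to_frame(records, numeric_cols, datetime_cols):
    """Build a DataFrame and coerce the listed columns to proper dtypes."""
    frame = pd.DataFrame.from_records(records)
    for col in numeric_cols:
        # Unparseable values become NaN instead of raising
        frame[col] = pd.to_numeric(frame[col], errors="coerce")
    for col in datetime_cols:
        frame[col] = pd.to_datetime(frame[col], errors="coerce")
    return frame

results_df = records_to_frame(
    sample_records,
    numeric_cols=["vendorid", "trip_distance", "fare_amount"],
    datetime_cols=["lpep_pickup_datetime"],
)
print(results_df.dtypes)
print(results_df.shape)
```

Coercing the types up front matters for the later steps: means, histograms, and the `.dt.hour` accessor all require numeric or datetime columns rather than strings.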

Data Set:

Each row represents a single green taxi trip in 2020. The columns capture pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. Let’s first list each column along with its data type.

VendorID — Number

lpep_pickup_datetime — Date & Time

lpep_dropoff_datetime — Date & Time

store_and_fwd_flag — Plain Text

RatecodeID — Number

PULocationID — Number

DOLocationID — Number

passenger_count — Number

trip_distance — Number

fare_amount — Number

extra — Number

mta_tax — Number

tip_amount — Number

tolls_amount — Number

ehail_fee — Plain Text

improvement_surcharge — Number

total_amount — Number

payment_type — Number

trip_type — Number

congestion_surcharge — Number

With the dataset loaded, we need to prepare it before beginning the analysis. It is important to remove the unwanted columns and filter out null values.

Data Preparation:

We begin by removing the ehail_fee and store_and_fwd_flag columns, and then drop the rows that contain null values.

results_df = results_df.drop(columns=['ehail_fee', 'store_and_fwd_flag'])
results_df = results_df.dropna()  # dropna returns a new frame, so assign it back
2020 Green Taxi Data set — Resultant data set after removing unwanted columns and null values.

The resulting data set consists of 1,205,954 rows and 18 columns.
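
Before dropping rows, it can help to see which columns actually carry the null values. The sketch below uses a small made-up frame (not the real data) to show the check, and why a column that is entirely empty should be dropped before calling `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values; ehail_fee is empty in every row
toy = pd.DataFrame({
    "trip_distance": [1.2, np.nan, 3.4, 0.8],
    "fare_amount": [6.5, 9.0, np.nan, 5.0],
    "ehail_fee": [np.nan, np.nan, np.nan, np.nan],
})

# Number of null values per column
null_counts = toy.isna().sum()
print(null_counts)

# Dropping rows first would remove everything, because ehail_fee is always null
rows_if_dropna_first = len(toy.dropna())

# Dropping the empty column first keeps the complete rows
cleaned = toy.drop(columns=["ehail_fee"]).dropna()
print(rows_if_dropna_first, len(cleaned))
```

The order of the two operations is the design choice here: removing columns that are mostly empty first avoids discarding otherwise complete trips.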

Exploratory Data Analysis:

Our first goal is to analyze the distance of a typical trip. A histogram shows how trip distances are distributed in our data.

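import numpy as np
import matplotlib.pyplot as plt

# Summary statistics of trip distance (both are sensitive to outliers)
mean = np.mean(results_df.trip_distance)
std = np.std(results_df.trip_distance)
# Histogram of the raw trip distances
plt.hist(results_df.trip_distance, bins=40, color='b')
plt.title("Trip Distance Distribution With Outliers")
plt.xlabel("Trip Distance in miles")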

The result is a right-skewed histogram. Because the x-axis stretches up to about 200,000 miles, it is evident that the data set contains extreme outliers.

Histogram representing the trip distance.

It is important to identify and address outliers in any data set, because they can skew the results and mislead the interpretation. They also have a significant effect on the mean and standard deviation. Outliers can be identified with a box-and-whisker plot, which Python’s seaborn library can draw.

Seaborn: A library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures.

import seaborn as sns
plt.figure(figsize=[10,2])
sns.boxplot(x=results_df['trip_distance'])  # pass the series as x for a horizontal box plot
Box-Whisker plot determining the trip distance.
Outliers of trip distance.

The outliers caused the distribution to be extremely skewed to the right.

After removing the outliers, we plot the histogram of trip distance again.
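
The article does not show how the outliers were removed. One common approach, sketched here as an assumption rather than as the authors' method, is the interquartile-range (IQR) rule with the conventional 1.5 multiplier, which is also the convention box plots use for their whiskers. The helper name `remove_iqr_outliers` and the toy values are illustrative.

```python
import pandas as pd

def remove_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]

# Toy data: six plausible trips and one extreme value
toy = pd.DataFrame(
    {"trip_distance": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 200000.0]}
)

trimmed = remove_iqr_outliers(toy, "trip_distance")
print(len(toy), "->", len(trimmed))
print(trimmed["trip_distance"].max())
```

The trimmed frame can then be passed to the same `plt.hist` call as before. Other choices, such as a fixed distance cap or a z-score threshold, would be equally defensible and would shift the cutoff.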

Histogram of trip distance without outliers.

Analysis: The above graph shows that most green taxi trips are short, between 0 and 5 miles.

Now let’s determine the mean and median trip distance by hour of the day. To do this, we create a new column, hour, that stores the pick-up hour.

df["hour"] = pd.to_datetime(df.lpep_pickup_datetime).dt.hour

Similarly, we create a column for the day of the week (pandas encodes Monday as 0 and Sunday as 6).

df["dayofweek"] = pd.to_datetime(df.lpep_pickup_datetime).dt.dayofweek

Now let’s plot the mean and median trip distance by hour of the day, and likewise by day of the week.

# Assumption: a 3x2 grid of axes; `axs` was not defined in the original snippet
fig, axs = plt.subplots(3, 2, figsize=(16, 18))

# Mean trip distance per hour of day
vals = df.groupby("hour")["trip_distance"].mean().reset_index()
axs[0][0].bar(vals.hour, vals.trip_distance, color='r')
axs[0][0].set_title("Mean Trip Distance vs Hour of day")
axs[0][0].set_xlabel("Hour of Day")
axs[0][0].set_xticks(vals.hour)
axs[0][0].set_ylabel("Mean Trip Distance (miles)")

# Median trip distance per hour of day
vals = df.groupby("hour")["trip_distance"].median().reset_index()
axs[0][1].bar(vals.hour, vals.trip_distance, color='r')
axs[0][1].set_title("Median Trip Distance vs Hour of day")
axs[0][1].set_xlabel("Hour of Day")
axs[0][1].set_xticks(vals.hour)
axs[0][1].set_ylabel("Median Trip Distance (miles)")
print("="*60)
Mean and Median Trip Distance vs Hour of Day

From the above graph we can see that trip distance is highest during the early morning hours. This suggests that people tend to use green taxis for longer trips early in the morning.

Similarly, let’s plot on which days of the week people travel the longest distances.

# Mean trip distance per day of the week
vals = df.groupby("dayofweek")["trip_distance"].mean().reset_index()
axs[1][0].bar(vals.dayofweek, vals.trip_distance, color='g')
axs[1][0].set_title("Mean Trip Distance vs Day of the Week")
axs[1][0].set_xlabel("Day of the Week")
axs[1][0].set_xticks(vals.dayofweek)
axs[1][0].set_ylabel("Mean Trip Distance (miles)")

# Median trip distance per day of the week
vals = df.groupby("dayofweek")["trip_distance"].median().reset_index()
axs[1][1].bar(vals.dayofweek, vals.trip_distance, color='g')
axs[1][1].set_title("Median Trip Distance vs Day of the Week")
axs[1][1].set_xlabel("Day of the Week")
axs[1][1].set_xticks(vals.dayofweek)
axs[1][1].set_ylabel("Median Trip Distance (miles)")
Mean and Median Trip Distance vs Day of the week.

From the above graph, trip distances are shortest on Wednesdays and Thursdays, so people use green taxis for longer routes less on those days.

Let’s further determine the number of rides by hour of the day. This gives us a clearer picture for our trip distance analysis.

vals = df.groupby("hour")["vendorid"].count()
axs[2][0].bar(range(0,24),vals.values,color = 'b')
axs[2][0].set_title("Number of Trips vs Hour of day")
axs[2][0].set_xlabel("Hour of Day")
axs[2][0].set_xticks(range(0,24))
axs[2][0].set_ylabel("Number of Trips")
Number of trips by hour of the day.

If we examine the graph carefully, it is evident that far fewer rides are taken during the night than from early morning onward.

This brings us to the first conclusion of our exploratory data analysis on trip distance:

Analysis: Green Taxis pick up very few fares during late-night hours. One interpretation is that people prefer to pre-book their cabs late at night out of safety concerns. Most rides begin in the early morning, around the start of office hours, and there is a spike in the number of rides around 5pm and 6pm, when offices usually close and people head home. Because New York is so densely populated, pre-booking a ride is not always feasible, and long waiting times with pre-booking services may lead riders to hail a Green Taxi instead.

We can draw one more conclusion from the above graphs. When we compare trip distance by hour of the day with the number of rides by hour of the day, we see that the longest trips occur between 5am and 7am, whereas the number of rides during these hours is very low. This suggests that the few early-morning riders mostly take longer routes; one example could be travelling to the airport.
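
The comparison between distance and ride count can also be made in a single table instead of two separate charts. The sketch below uses a made-up frame with the same `hour` and `trip_distance` column names as the article; the values are illustrative and not taken from the real data.

```python
import pandas as pd

# Toy data: few long early-morning trips, many short evening trips
toy = pd.DataFrame({
    "hour": [5, 5, 17, 17, 17, 17],
    "trip_distance": [9.0, 11.0, 2.0, 3.0, 2.5, 2.5],
})

# Mean distance and number of trips per hour, side by side
hourly = (
    toy.groupby("hour")
       .agg(mean_distance=("trip_distance", "mean"),
            trips=("trip_distance", "size"))
       .reset_index()
)
print(hourly)
```

Hours with a high `mean_distance` but a low `trips` count are the ones the paragraph above describes: few rides, but long ones.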

Conclusion:

Drivers can use the analysis above to plan where to work. It could help them be present in residential areas during early morning hours for pick-ups, and in the business districts of New York around office closing hours. This would give them an opportunity to pick up more rides and thus earn more money. The analysis can also help the TLC plan strategies to obtain more revenue from green taxis.

Limitations:

The data set contains many null values, which makes an exact analysis difficult. It also cannot be used by the TLC to track the movement of green taxis. For example, the total fare amount recorded for longer rides is often zero. It is quite possible that drivers agree on fixed prices for many long rides, so the fare amount is recorded as zero.
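
One way to quantify the zero-fare issue is to count how many long trips carry a fare of zero. The sketch below runs on a made-up frame using the article's `trip_distance` and `fare_amount` column names. The 10-mile cutoff for a "long" trip is a hypothetical choice, not something the article specifies.

```python
import pandas as pd

# Toy data with illustrative values only
toy = pd.DataFrame({
    "trip_distance": [0.9, 14.2, 18.5, 2.1, 11.0],
    "fare_amount": [5.5, 0.0, 0.0, 8.0, 34.0],
})

LONG_TRIP_MILES = 10  # hypothetical cutoff for a "long" trip

# Select long trips, then the subset recorded with a zero fare
long_trips = toy[toy["trip_distance"] >= LONG_TRIP_MILES]
zero_fare_long = long_trips[long_trips["fare_amount"] == 0]

share = len(zero_fare_long) / len(long_trips)
print(f"{len(zero_fare_long)} of {len(long_trips)} long trips "
      f"have a zero fare ({share:.0%})")
```

Running the same filter on the full data set would show how widespread the problem is before any revenue figures are interpreted.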

Future Scope:

This analysis can be combined with NYC location data to determine which neighborhoods have the highest demand for Green Taxis. We could also gather New York traffic data and estimate how much traffic green taxis cause. Further, with an Uber dataset we could compare the revenue model before and after the introduction of e-hail services in New York.

Lessons Learned:

While building this analysis, we learned how green taxi drivers could plan their trips and make maximum use of these findings to increase their earnings.

We came across several APIs, along with tools and technologies for fetching data from them. Working with real-world data helped us think more analytically and develop our skills in Python and its libraries.
