NYC Green Taxi Analysis

Rachit Agarwal
Analytics Vidhya
5 min read · Aug 2, 2020

In my free time, I analyzed NYC Green Taxi data for 2015 to see what recommendations I could offer Green Taxi drivers using machine learning. To learn more about the background of the data and the available download options, I visited the NYC OpenData website for the 2015 Green Taxi Trip Data. I completed the analysis in Python, and the code can be found on my GitHub repo.

About the Data set

The complete Green Taxi data for 2015 consists of 19.8 million rows and 21 columns. It includes records of all trips completed in green taxis in NYC in 2015. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

Problem Handling

I broke the problem down into 3 phases:

  • Data acquisition & wrangling
  • Exploratory Data Analysis
  • Predictive modelling

Let's go through all 3 phases in order.

Data Acquisition & Wrangling

I obtained the data by downloading the 2015 CSV from the website. I also explored and downloaded NYC zonal data from opendatasoft. Alternatively, the data can be fetched through API requests.

Data Cleaning:

  • Handled null values in the data: the Ehail_fee column is 100% NULL, so it was removed.
Boxplot for Pickup & Dropoff Latitude & Longitude
  • The box plots above show that the pickup and drop-off coordinates contain some outliers; for example, one pickup location was near the UK.
Boxplot for Trip Distance
  • The box plot above shows the distribution of trip distance. Customers usually take short trips, so removing the outliers reduced the skewness in the data.
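The outlier removal described above could be sketched as follows. This is a minimal illustration, not the exact code from the repo: the bounding box values, the 1.5 × IQR rule, the helper names, and the toy rows are all assumptions, and the column names are assumed to match the dataset.

```python
import pandas as pd

# Assumed, approximate bounding box around New York City (not from the article).
NYC_LAT = (40.4, 41.0)
NYC_LON = (-74.3, -73.6)


def drop_coordinate_outliers(df, lat_col, lon_col):
    """Keep only rows whose coordinates fall inside the assumed NYC box."""
    in_box = df[lat_col].between(*NYC_LAT) & df[lon_col].between(*NYC_LON)
    return df[in_box]


def drop_iqr_outliers(df, col, k=1.5):
    """Keep rows within [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whiskers."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]


# Toy rows for illustration only; the third pickup sits near the UK.
trips = pd.DataFrame({
    "Pickup_latitude": [40.71, 40.68, 51.50, 40.80, 40.75],
    "Pickup_longitude": [-73.95, -73.94, -0.12, -73.93, -73.90],
    "Trip_distance": [1.2, 2.5, 3.0, 1.8, 60.0],
})

cleaned = drop_coordinate_outliers(trips, "Pickup_latitude", "Pickup_longitude")
cleaned = drop_iqr_outliers(cleaned, "Trip_distance")
```

The same coordinate filter would be applied to the drop-off columns as well.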

Feature Engineering:

  • Converted the pickup time from 12-hour format to 24-hour format and derived new date and time columns.
  • Created a new coordinates feature, combining latitude and longitude, in both datasets so that they can be mapped efficiently instead of searching on two columns.
  • Used the combined coordinates from both datasets to divide New York into different cities.
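A rough sketch of the first two feature engineering steps is shown here. The raw timestamp format, the column names `pickup_time` and `pickup_coordinates`, and the 3-decimal rounding are assumptions made for illustration; the mapping of coordinates to cities is not shown because it depends on the zonal data.

```python
import pandas as pd

# Toy rows for illustration only; the timestamp format is assumed.
trips = pd.DataFrame({
    "pickup_time": ["01/01/2015 12:15:00 AM", "01/01/2015 01:30:00 PM"],
    "Pickup_latitude": [40.71234, 40.68456],
    "Pickup_longitude": [-73.95123, -73.94987],
})

# Parse the 12-hour (AM/PM) string; the resulting datetime is 24-hour based.
trips["pickup_datetime"] = pd.to_datetime(
    trips["pickup_time"], format="%m/%d/%Y %I:%M:%S %p"
)
trips["pickup_date"] = trips["pickup_datetime"].dt.date
trips["pickup_hour"] = trips["pickup_datetime"].dt.hour

# One (lat, lon) key per row, so lookups use a single column.
# Rounding to 3 decimals is an assumed precision, not the article's.
trips["pickup_coordinates"] = list(zip(
    trips["Pickup_latitude"].round(3),
    trips["Pickup_longitude"].round(3),
))
```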

Exploratory Data Analysis

Data Distribution:

The bar charts show the distribution of the data across different categories.
  • The 1st bar chart shows that two vendors provide green cabs and that the 2nd one dominates the market.
  • The 2nd and 3rd charts show that passengers usually travel alone, followed by pairs and groups, and that customers usually travel under the standard fare category.

Trips exploration:

Trips exploration bar chart
  • After cleaning the dataset, we can say that customers normally don't take long journeys in the cab.
  • If we analyze the distance travelled by hour, we can clearly see that customers take longer journeys in the early morning hours or after office hours. One possible explanation is that these customers are travelling to airports.
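The distance-by-hour pattern above can be computed with a simple group-by. This is a small sketch on toy numbers, assuming a `pickup_hour` column (0 to 23) has already been derived from the pickup time; the values are invented purely to show the mechanics.

```python
import pandas as pd

# Toy rows for illustration only.
trips = pd.DataFrame({
    "pickup_hour": [5, 5, 9, 9, 18, 18],
    "Trip_distance": [8.0, 10.0, 2.0, 3.0, 1.5, 2.5],
})

# Summary of trip distance for each pickup hour.
distance_by_hour = (
    trips.groupby("pickup_hour")["Trip_distance"]
    .agg(["mean", "median", "count"])
    .sort_index()
)

# Hour with the longest average trip.
longest_hour = distance_by_hour["mean"].idxmax()
```

Plotting `distance_by_hour["mean"]` as a bar chart gives the kind of view described in the bullets.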

Pick-Up & Drop-off Pattern:

Pickup & Drop off scatter plot
  • The pickup and drop-off pattern above shows that people travel to the downtown area.
  • Tourists may travel from the airport and get off at the most popular NYC attractions, although green taxis are not allowed to pick up customers in downtown.
  • Alternatively, employees may take cabs from their homes to the bay areas near Brooklyn.

Predictive Modelling

In this phase, I built an unsupervised model to recommend to NYC Green Taxi drivers which other areas they could work in without affecting the amount they currently earn.

To prevent memory issues, I loaded only a random 20% sample of the entire dataset, then applied the same data cleaning process as above to get the final data for building the model.
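One way to draw a 20% random sample without holding the full CSV in memory is to read it in chunks and sample each chunk. This is a sketch under assumptions: the helper `sample_csv`, the chunk size, and the seeds are hypothetical, and the in-memory demo CSV stands in for the real file.

```python
import io

import pandas as pd


def sample_csv(source, frac=0.2, chunksize=100_000, seed=42):
    """Read a CSV chunk by chunk and keep a random fraction of each chunk."""
    sampled = [
        # A different seed per chunk avoids picking the same row positions.
        chunk.sample(frac=frac, random_state=seed + i)
        for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize))
    ]
    return pd.concat(sampled, ignore_index=True)


# Small in-memory CSV standing in for the real download (illustration only).
demo_csv = io.StringIO(
    "trip_id,Trip_distance\n"
    + "\n".join(f"{i},{i % 7}" for i in range(1000))
)
sample = sample_csv(demo_csv, frac=0.2, chunksize=250)
```

With the real data, `source` would be the path to the downloaded CSV.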

Data Modelling for K-Means

I grouped the cities in NY state by their pickup counts across the day, broken down into 24 slots. From this, we can build clusters and see which cities are grouped together, so that a driver who wants to change working location can do so.

The total number of pickups differs from city to city, so the counts need to be normalized to get unbiased results.

Scaled Data for K-Means

I scaled the counts using a simple normalization (Value/Total).
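The city-by-hour table and the Value/Total scaling could look roughly like this. It is a sketch on toy rows: the column names `city` and `pickup_hour` are assumptions, and the rows are invented just to show the shape of the result.

```python
import pandas as pd

# Toy rows for illustration only: one row per pickup.
trips = pd.DataFrame({
    "city": ["Woodhaven"] * 4 + ["Astoria"] * 2,
    "pickup_hour": [8, 8, 18, 23, 8, 18],
})

# Pickup counts per city (rows) and hour of day (24 columns, zeros filled in).
hourly_counts = (
    trips.groupby(["city", "pickup_hour"]).size()
    .unstack(fill_value=0)
    .reindex(columns=range(24), fill_value=0)
)

# Value/Total: each count divided by that city's total pickups,
# so every row sums to 1 regardless of how busy the city is.
hourly_share = hourly_counts.div(hourly_counts.sum(axis=1), axis=0)
```

After this step, a busy city and a quiet city with the same daily rhythm end up with similar rows, which is what the clustering compares.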

K-Means model building:

New York has 35 zones, and I used just 10 clusters to group the 95 cities. I could have done better by dividing New York into sectors, but I couldn't find a relevant data source to map them.
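A minimal K-Means sketch with scikit-learn is shown below, assuming the input is a 95 × 24 matrix of normalized hourly pickup shares. The random profiles are a stand-in for the real data, and `n_init` and `random_state` are assumed settings, not the ones used in the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in data: 95 cities x 24 hourly shares, each row summing to 1.
profiles = rng.dirichlet(np.ones(24), size=95)

# 10 clusters, as in the article; the other settings are assumptions.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(profiles)

# Number of cities assigned to each cluster id.
cluster_sizes = np.bincount(labels, minlength=10)
```

Each entry of `labels` is the cluster id of the corresponding city, which can then be joined back to the city names.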

Recommendation

  • Important factors a driver can consider before moving to a different location:

Vendor ID | Passenger_Count | Trip_distance | Fare_Amount | Tip Amount | Total Amount

  • Woodhaven belongs to cluster 0, which contains 57 other cities, but if we filter on total amount and total number of pickups, we are left with only 8 other cities.

Hence, a driver who works in Woodhaven and earns $17 per trip across 253 pickups can drive in 8 other cities where his/her revenue won't be impacted.
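The filtering step behind this recommendation might be sketched as follows. The rule used here (same cluster, average total amount and pickups at least as high as the current city) is an assumption, since the exact thresholds are not stated above; the helper `similar_cities`, the column names, and every row except the Woodhaven figures are hypothetical.

```python
import pandas as pd

# Per-city summary; only the Woodhaven numbers come from the article.
summary = pd.DataFrame({
    "city": ["Woodhaven", "City A", "City B", "City C"],
    "cluster": [0, 0, 0, 1],
    "avg_total_amount": [17.0, 18.5, 9.0, 17.5],
    "pickups": [253, 300, 400, 260],
})


def similar_cities(summary, city):
    """Cities in the same cluster that are not worse on amount or pickups."""
    current = summary.loc[summary["city"] == city].iloc[0]
    same_cluster = summary["cluster"] == current["cluster"]
    not_worse = (
        (summary["avg_total_amount"] >= current["avg_total_amount"])
        & (summary["pickups"] >= current["pickups"])
    )
    return summary[same_cluster & not_worse & (summary["city"] != city)]


alternatives = similar_cities(summary, "Woodhaven")
```

In this toy table only "City A" survives: "City B" pays less per trip and "City C" sits in a different cluster.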

Future Work

As future work, additional datasets such as NYC weather could be plugged in to predict surges in a particular city or area, which would benefit green cab drivers.

Rachit Agarwal is a Data Scientist with 4 years of mixed experience across an R&D lab and consulting.