Green Taxi Data Analysis in New York City
Taxis are a popular mode of transportation in New York City, which initially had only one type of taxi, the yellow cab. Over time, the taxi operation zone centered on the Manhattan area. In August 2013, the New York City Taxi and Limousine Commission introduced the Green Taxi Fleet to New York City. Green Taxi, also known as Boro Taxi (or Green Cabs), is an effort to try to create a balance between the business needs of taxi owners and the needs of the tens of thousands of New Yorkers who live and work to hail taxis. However, the number of green taxi pickups in New York City is still low.
As a data analyst, we were tasked with analyzing historical taxi trip data to identify user characteristics in an effort to increase green taxi users. Once we know the characteristics of users who have used green taxis, we can then provide recommendations to relevant stakeholders, such as the executive team and marketing team, to formulate marketing strategies and future business development.
In this blog, I will share my work on analyzing green taxi in New York City using data collected by the NYC Taxi and Limousine Commission. The following blog contains the report of my analysis of the January 2023 data. Data processing and analyses are completed using Python. The code and outputs are in my GitHub, don’t forget to check file README. You can also check my Tableau Dashboard to get a summary of my analysis in the form of a visualization (here).
Data Structure
This project uses three datasets from three different CSV files. The first file contains a dataset of historical taxi trip (here), the second file contains data on the names of borough zones in New York City and their location ID (here). And the last file is a dataset containing population counts for each borough in New York City (here).
1. Datasets Taxi Trip
The green taxi data from January 2023 has 68.211 rows and 20 columns. The features available in this dataset can be categorized into the following domains:
Numerical Discrete (13):
- VendorID, RatecodeID, PULocationID, DOLocationID, passenger_count, extra, mta_tax, tolls_amount, ehail_fee, improvement_surcharge, congestion_surcharge, payment_type, trip_typeVendorID, RatecodeID, PULocationID, DOLocationID, passenger_count, extra, mta_tax, tolls_amount, ehail_fee, improvement_surcharge, congestion_surcharge, payment_type, trip_type
Numerical Continuous (4):
- trip_distance, fare_amount, tip_amount, total_amount
Datetime (2):
- lpep_pickup_datetime, lpep_dropoff_datetime
Categorical Nominal (1):
- store_and_fwd_flag
2. Datasets Location ID
Datasets Location ID has 265 rows and 4 columns. The features available in this dataset can be categorized into the following domains:
Numerical Discrete (1):
- LocationID
Categorical Nominal(3):
- Borough, Zone, Service_Zone
3. Datasets Population
Datasets Location ID has 177 rows and 6 columns. The features available in this dataset can be categorized into the following domains:
Numerical Discrete (1):
- Zip
Categorical Nominal(3):
- Borough, post_office, Neighborhood
Numerical Continuous (2):
- population, density
Data Preparation
We begin to prepare our datasets by merging three CSV files into one dataset named df. Then we remove the ehail_fee and store_and_fwd_flag columns, these columns are not in the description and are not used in our analysis.
The first step that I took in the data preparation was to select data that was not used in the analysis and data that did not make sense. Records that met the following criteria were excluded from all subsequent analyses:
- Exclude LocationID 264 and 265. These location IDs are not in the data description.
- Exclude data outside of January 2023.
- Exclude the drop-off zone at the “Country Club”. This will affect the filling of missing values in the congestion_surcharge columns.
- Exclude RatecodeID 99. These value are not in the data description.
- Exclude data duplication.
- Eliminate data anomalies.
The next step is to handle datasets that may affect the result of the analysis:
- Handling Negative Value, negative values in this data set are impossible and unrealistic. Those values are costs that must be paid, such as the amount of fares, extras, taxes, tips, and other additional costs. The solution is that we absolutize these values.
- Handling Missing Value, filling in missing values follows the following rules:
- congestion_surcharge = Fill in the missing value for this column by looking at the drop-off zone. Zones that have a chance of congestion above 0.7 are assumed to be congested, so we fill in a value of 2.75. In addition, we assume that the zone does not experience congestion, so we give it a value of 0.
- DO_Population = The missing value in this column is because the zone is in Newark Airport, and this zone is recorded as having no population, so we fill it with zero population.
- DO_Density = same condition as the DO_Population column, we fill it with zero density.
- RatecodeID = The filling of this field follows the rules set by the NYC TLC Trip (here). We fill in RatecodeID as 1 if the standard fare is set for the trip, 2 if the drop-off zone is JFK airport, 3 if the drop-off zone is Newark airport, 4 if the drop-off zone is “Nassau” or “Westchester”, 5 if the negotiated fare is set for the trip, and 6 if the group ride is set for the trip.
- passanger_count =We fill these columns with a value of 1, following the minimum number of users for each trip.
- payment_type = We fill these columns with a value of 5, with 5 being the unknown category.
- trip_type = We fill these columns with a value of 3, and we create a new category for the type of trip, which is the “other” category. - Handling Outlier
- First selecting the implausible values for each column with nan values, then doing imputation using the multivariate imputation method (the explanation of the multivariate imputation method can be seen in the following medium post). After imputation, we check whether the distribution after handling outliers has the same distribution pattern. Then we use the KS test with two samples to test this, and we get a p-value greater than the alpha value of 0.05, then we have enough evidence that the two samples have the same distribution pattern.
The next step is to add a new column, namely the trip duration column, which is obtained from the difference between the dropoff datetime column and the pickup datetime column. And the average speed column, which is obtained from the trip distance divided by the trip duration.
And the last step is to drop the zip, post_office, neighborhood, and vendorID columns, as these columns are not used in the analysis to be performed.
Analyze
- Total Pickup per Borough
From the graph above, we can take away the insight that the distribution of pick-up taxis is uneven for each borough in New York City, with the highest proportion of pick-ups in Manhattan at 60.06% of the total and the lowest proportion of pick-ups in Staten Island (one of the suburbs) at a very low proportion of 0.02% of the total trips.
2. Market Potential Green Cabs
The market potential graph above is obtained by dividing the number of residents who have used green taxis by the total population of each Borough. We can take insight from the fact that Manhattan has a high proportion of about 3.28% of the total population using green taxis (51.736 out of 61 million residents). The second highest proportion is in Queens, with 0.97% of the total population using green cabs. For the Bronx, Brooklyn, and Staten Island, since these three boroughs have a fairly low proportion (below 0.5% of the total population), we only conduct further analysis for Manhattan and Queens.
3. Drop-off Zone per Borough
Next, let’s look at the five highest drop-off zones for each Borough. It can be seen from the graph above that in the borough of Manhattan, the highest drop-off zones are in East Harlem South, East Harlem North, Upper East Side North, Central Harlem, and Upper West Side North. In the city of Queens, the most drop-off zones are in Forest Hills, Jackson Heights, Astoria, Elmhurst, and LaGuardia Airport.
4. Distribution Pickup Date
Based on the line diagram above, in the city of Manhattan, on weekdays, the number of green taxi users always increases and always decreases on weekends. This indicates that in Manhattan, taxi users are dominated by residents who are going to work, considering that Manhattan is the center of business in New York City. Taxi users in Queens do not always experience an increase in the number of green taxi users in the weekday category, as seen on the 2nd and 16th, there was no increase in taxi users.
5. Number of Pickup per Hour
In Manhattan, in the Weekday category, peak hours occur during commuting hours and after office hours. In the Weekend category, it tends to be stable. The number of taxi users on weekends is much lower than on weekdays. For the city of Queens in the Weekday category, the number of taxi users always increases from 7 a.m. to 6 p.m. and experiences a decline in demand from 7 p.m. to 5 a.m. the next day. In the Weekend category, the pattern is similar to that of the Weekday category, but the frequency of taxi use is lower than that of the Weekday category. The number of taxi users on weekends is lower than on weekdays.
6. Proporsi of Payment Type per Borough
Green taxi users traveling from Manhattan are more likely to use a credit card as a payment method than cash or other payment methods. Taxi users traveling from the Queens section of the city more often use credit cards and cash as payment methods.
Summary
From the above analysis, it can be concluded that taxis are preferred as a mode of transportation for low-distance trips (considering the taxi fare). Manhattan’s most popular drop-off zones are in East Harlem South, East Harlem North, Upper East Side North, Central Harlem, and Morningside Heights. Queens’ most popular drop-off zones are in Forest Hills, Jackson Heights, Astoria, Elmhurst, and LaGuardia Airport. The highest demand in Manhattan occurs on weekdays during office hours (8–9 a.m.) and office hours (3–6 p.m.). On weekends, it tends to be stable (11 a.m. to 7 p.m.). In Queens, taxi demand is always increasing (7 a.m.–6 p.m.) both on weekdays and weekends, but the frequency on weekends is lower than on weekdays. Users in Manhattan have a higher proportion of credit cards compared to other payment methods. The proportion of both credit card and cash payment methods for users who come from Queens has the same proportion. This analysis will also help the TLC plan different strategies to obtain more pickup trips from green taxis.
Recommendations
Recommended business strategies in the borough of Manhattan are:
- The company can provide rebates, coupons, and promotions for trips destined for East Harlem South, East Harlem North, Upper East Side North, and Central Harlem.
- The company can lower the fares of metered taxis (further analysis is needed to lower the fares).
- Maximize the number of green taxi fleets during office departure and return hours (08.00–09.00) and (15.00–16.00) for weekdays. For weekends, companies can propose or provide weekend promos (11.00–19.00).
- Provide users with cashback promotions by using Credit Card payment methods.
- Apply a referral code because green taxi users in Manhattan are dominated by company employees.
Recommended business strategies in the state of Queens are:
- The company can also provide discounts, coupons, and promotions for trips with final destinations in Forest Hills, Jackson Heights, Astoria, Elmhurst, and LaGuardia Airport.
- The company can reduce the taxi meter rate in Queens, which is the largest borough in New York (further analysis is needed to reduce the meter rate).
- Maximize the green taxi fleet on weekdays with the highest peak demand at 18:00.
- Can provide cashback promotions for users who use credit cards or cash.