Looking Through the Taxi Meter - Analysis of the NYC Green Taxi data

Jiamin Han
Feb 28, 2019


Image source: New York Post

This project used NYC green taxi data collected by the NYC Taxi and Limousine Commission (TLC). This post reports my analysis of the September 2015 data. For a deeper dive into the green taxi background and the data set, see NYC TLC. Data processing and analyses were completed using Python. The code and outputs are in my GitHub repo.

Data Structure

The green taxi data from September 2015 has 1,494,926 rows and 21 columns. The available features in this dataset could be broadly categorized into the following domains.

1) Administrative data:

VendorID, Store_and_fwd_flag, RateCodeID

2) Trip data:

lpep_pickup_datetime, lpep_dropoff_datetime, Pickup_longitude, Pickup_latitude, Dropoff_longitude, Dropoff_latitude, Passenger_count, Trip_distance, Trip_type

3) Payment data:

Fare_amount, Extra, MTA_tax, Tip_amount, Tolls_amount, Ehail_fee, Improvement_surcharge, Total_amount, Payment_type

For brevity, detailed information about these features is not listed in this blog and is available in the data dictionary from NYC Taxi and Limousine Commission.

Analysis of Trip Distance

This section focuses on the distribution of trip distances of NYC green taxi rides during September 2015. The taxi trip data were captured by taxi meters, which are subject to numerous sources of error, for instance, hardware and software malfunctions, wireless signal issues, and human interference. Therefore, I first eliminated records with implausible data based on common sense and the empirical distribution of this dataset. The distribution of trip distance was examined after sample selection.

1. Sample selection

Figure 1. Sample selection flowchart

Figure 1 shows the sample selection process and the number of records being excluded from analyses. A total of 1,441,688 records were included in the analytical dataset. Records that met the following criteria were excluded from all subsequent analyses:

1) Distance of 0 miles or distance ≥ 50 miles;

2) Duration of 0 minutes or duration ≥ 200 minutes;

3) Average speed ≤ 1 MPH or average speed of ≥ 240 MPH;

4) Base fare < $2.50 or ≥ $250.00, or tip amount > twice the base fare;

5) With invalid longitude or latitude data, or with a trip distance shorter than the geographic distance (using the Vincenty method [1]) between the pickup and drop off point by more than 1 mile, or traveled outside the 5 boroughs or 3 airports of NYC.
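Criteria 1 to 4 above can be sketched as a single predicate. This is a minimal plain-Python illustration, not the code from the repo: the function name `keep_trip`, its argument names, and the toy records are hypothetical, and the sketch ignores both criterion 5 (which needs geodata) and the flat-rate exception described later in this section.

```python
def keep_trip(distance_mi, duration_min, fare, tip):
    """Return True if a trip passes exclusion criteria 1-4 (hypothetical helper)."""
    # Criterion 1: distance of 0 miles or >= 50 miles
    if distance_mi <= 0 or distance_mi >= 50:
        return False
    # Criterion 2: duration of 0 minutes or >= 200 minutes
    if duration_min <= 0 or duration_min >= 200:
        return False
    # Criterion 3: average speed <= 1 MPH or >= 240 MPH
    speed_mph = distance_mi / (duration_min / 60.0)
    if speed_mph <= 1 or speed_mph >= 240:
        return False
    # Criterion 4: base fare < $2.50 or >= $250.00, or tip > twice the base fare
    if fare < 2.50 or fare >= 250.00:
        return False
    if tip > 2 * fare:
        return False
    return True


# Invented records for illustration only
trips = [
    {"distance_mi": 2.0, "duration_min": 10.0, "fare": 9.5, "tip": 2.0},     # plausible
    {"distance_mi": 0.0, "duration_min": 5.0, "fare": 4.0, "tip": 0.0},      # zero distance
    {"distance_mi": 60.0, "duration_min": 90.0, "fare": 150.0, "tip": 0.0},  # too far
    {"distance_mi": 1.0, "duration_min": 3.0, "fare": 5.0, "tip": 20.0},     # tip > 2x fare
]
kept = [t for t in trips if keep_trip(**t)]
print(len(kept))
```

Only the first toy record survives all four checks.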

In order to select data according to the above criteria, the following features were derived:

Average speed. The average speed was calculated as the ratio of trip distance to trip time.

Geographic location. To determine if a trip started and ended within the 5 boroughs (Manhattan, Bronx, Brooklyn, Queens, and Staten Island) and the 3 major airports (John F. Kennedy International Airport (JFK), LaGuardia Airport (LGA), and the Newark Liberty International Airport (EWR)), pickup and drop off points were spatially joined with the NYC taxi zone polygons. Trips with pickup/drop off points that did not fall into any taxi pickup zones were deemed out of the scope of this analysis and eliminated.
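The distance check in criterion 5 compares the metered distance with the straight-line distance between pickup and drop off. The sketch below uses the haversine (great-circle) formula as a simpler stand-in for the Vincenty method named above; the two differ slightly because Vincenty works on an ellipsoid. The function names, the 1-mile `tolerance_mi` parameter, and the approximate airport coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles (spherical approximation)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


def distance_is_plausible(trip_distance_mi, lat1, lon1, lat2, lon2, tolerance_mi=1.0):
    """False when the metered distance is more than tolerance_mi shorter
    than the straight-line distance between the two points."""
    return trip_distance_mi >= haversine_miles(lat1, lon1, lat2, lon2) - tolerance_mi


# Approximate coordinates of JFK and LGA (assumed values, for illustration)
jfk = (40.6413, -73.7781)
lga = (40.7769, -73.8740)
straight_line = haversine_miles(*jfk, *lga)
print(round(straight_line, 1))
```

A metered distance of 5 miles between these two points would be flagged as implausible, while 12 miles would pass.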

Imputation of trip distance and duration. Long-distance taxi rides could be charged a flat rate or a negotiated fare, in which case the driver may not necessarily use the meter to calculate the price. Therefore, trips with a distance of 0 miles or a duration of 0 minutes were retained if the trip was a flat-rate ride and crossed boroughs. In the analytical dataset, a total of 228 trips met these criteria. The distances and durations of these trips were then imputed based on the fare amount, rate code, and the median speed of trips with valid distance and duration data.

2. Distribution of trip distance

The distribution of trip distance in the raw data is extremely right-skewed, with most data clumped at the lower range and a few records of implausibly large values (Figure 2A). After data selection, the right skewness lessened (Figure 2B). In the analytical dataset, the mean distance was 3.0 miles, with a standard deviation of 2.9 miles. The median distance was 2.0 miles, and 95% of trips traveled between 0.4 and 11.2 miles. Since distance is highly right-skewed, the median is a better measure of the central tendency of the distribution.

Figure 2. Distribution of trip distance in the raw NYC green taxi data (A) and the analytical dataset (B)

Based on the histogram, I hypothesized that the distribution of distance may follow the log-normal distribution. This hypothesis is evidenced by the near-normal distribution of log-transformed trip distance (Figure 3A). This hypothesis is further supported by the quantile-quantile (Q-Q) plot of log-transformed distance (Figure 3B), in which the observed values closely track the expected values except for some deviations at the lower range.
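The effect of the log transform can be illustrated on synthetic data. The sketch below draws log-normal "distances" (the parameters 0.7 and 0.8 are arbitrary choices that give a median near 2 miles, not fitted values) and compares the skewness before and after taking logs. It uses simulated numbers only, not the taxi data.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: mean of cubed standardized values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


random.seed(0)
# Synthetic right-skewed "distances"; parameters are illustrative assumptions
distances = [random.lognormvariate(0.7, 0.8) for _ in range(10_000)]
log_distances = [math.log(d) for d in distances]

print("mean > median:", statistics.fmean(distances) > statistics.median(distances))
print("skewness raw:", round(skewness(distances), 2))
print("skewness log:", round(skewness(log_distances), 2))
```

If a variable is log-normal, its log should have skewness near zero, which is what the histogram and Q-Q plot in Figure 3 check visually.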

Figure 3. Distribution of log-transformed trip distances

Exploratory Analysis of Airport Trips

In this section, the mean and median trip distance by hour of day was first reported, and the general pattern of change in median trip distance over the day was described. Then, trips to and from airports were identified. For airport trips, I examined the frequency of pickups by hour of day, their costs, and their pickup locations throughout NYC.

1. Mean and median trip distance

The mean and median trip distance by hour of day is shown in Figure 4A, and the exact means and medians are reported in Appendix 1. The mean distance was consistently higher than the median distance in every hour. This is most likely because the distribution of trip distance is right-skewed, and the mean could be inflated by a few long-distance trips. As such, the median is a fairer representation of the central tendency of all distances.

Based on the median values, Figure 4A suggests that the distance of a typical green taxi trip fluctuated around 2 miles during most of the day. Trips that started between 4:00 PM and 6:00 PM tended to be the shortest, and trips that started between 4:00 AM and 6:00 AM were the longest. The surge in long-distance trips during the morning is likely driven by trips to the airports or other long-distance rides.
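A grouped summary like the one behind Figure 4A can be sketched in a few lines. The records below are invented, and the timestamp format is assumed to match the `lpep_pickup_datetime` column; the point is only to show how one long trip pulls the hourly mean above the median.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

# (pickup timestamp, trip distance in miles): invented toy records
trips = [
    ("2015-09-01 05:10:00", 9.8),
    ("2015-09-01 05:40:00", 2.1),
    ("2015-09-01 05:55:00", 3.0),
    ("2015-09-01 17:05:00", 1.2),
    ("2015-09-01 17:30:00", 1.6),
    ("2015-09-01 17:45:00", 6.5),
]

by_hour = defaultdict(list)
for pickup, dist in trips:
    hour = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").hour
    by_hour[hour].append(dist)

# hour -> (mean distance, median distance)
summary = {h: (round(mean(d), 2), median(d)) for h, d in sorted(by_hour.items())}
for hour, (avg, med) in summary.items():
    print(hour, avg, med)
```

In both toy hours the mean exceeds the median because of a single long ride.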

Figure 4. Mean and median distance of NYC green taxi by hour of day (A). The percent of airport trips by hour of day (B) closely matches the hourly mean and median distance

2. Identifying airport taxi rides

Airport taxi rides were considered as trips that originated or terminated at the 3 NYC area airports: John F. Kennedy International Airport (JFK), LaGuardia Airport (LGA), and the Newark Liberty International Airport (EWR). I used the following criteria to define airport taxi rides:

1) With a rate code of 2 (JFK) or 3 (EWR)

2) Pickup at JFK, LGA, or EWR based on pickup taxi zone

3) Drop off at JFK, LGA, or EWR based on drop off taxi zone

A total of 35,079 trips met at least one of these criteria and qualified as airport taxi rides.
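The "at least one of these criteria" rule is a logical OR over the three conditions. A minimal sketch follows; the zone names in `AIRPORT_ZONES` are assumed labels and may not match the taxi zone lookup exactly, and the function name is hypothetical.

```python
# Assumed zone labels; the actual taxi zone names may differ
AIRPORT_ZONES = {"JFK Airport", "LaGuardia Airport", "Newark Airport"}
AIRPORT_RATE_CODES = {2, 3}  # 2 = JFK, 3 = Newark (EWR)


def is_airport_ride(rate_code, pickup_zone, dropoff_zone):
    """True when a trip meets at least one of the three airport criteria."""
    return (
        rate_code in AIRPORT_RATE_CODES
        or pickup_zone in AIRPORT_ZONES
        or dropoff_zone in AIRPORT_ZONES
    )


print(is_airport_ride(1, "Astoria", "LaGuardia Airport"))  # drop off at LGA
print(is_airport_ride(2, "Harlem", "Midtown"))             # JFK rate code
print(is_airport_ride(1, "Astoria", "Flushing"))           # none of the criteria
```

Note that LGA has no dedicated rate code, so LGA trips can only be identified through the pickup or drop off zone.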

3. Airport rides as an explanation for longer rides in the morning

The surge in long-distance taxi trips during the early morning was likely driven by more airport rides during the same period. To examine this hypothesis, I plotted the percentage of airport rides among all taxi trips by hour of day (Figure 4B). More than 10% of all taxi rides during the 5:00 AM hour were airport trips, and the percentage of airport trips between 4:00 AM and 7:00 AM was the highest among all hours of day. These data support the hypothesis that the longer median taxi ride during early morning was partially attributable to more airport trips. However, airport rides may not explain the relatively higher median trip distance during late night and early morning (after 10:00 PM and before 4:00 AM), given the low percentage of airport trips during those hours. Instead, it is possible that passengers are more willing to take a taxi rather than public transit (subway/bus) for longer trips late at night for safety and convenience.

4. Average airport ride cost

The overall average total cost for airport trips was $36.21, with a standard deviation of $19.83. The large variations in airport trip costs may be due to variations in trip distance. To evaluate this hypothesis, I estimated the average cost by airport and by destinations/origination of the airport trips, as is shown in Figure 5.

Figure 5. Box plot of airport taxi ride costs by airport and destination/origination (A, B), and bar plot of average cost by both airport and destination/origination (C)

Trip fare varied greatly by airport (Figure 5A). Consistent with common sense, taxi rides to EWR were the most expensive (average $90-$120) because the trip crosses the state line, the fare is not on a flat fee basis, and passengers are responsible for toll fees. Trips to LGA incurred the lowest cost (average $20-$40), which could be attributed to its close proximity to all boroughs compared to JFK and EWR.

Trip fare also varied greatly by its destination/origination (Figure 5B). For all airports, taxi rides to and from the Bronx were the most expensive, and trips to/from Queens were the cheapest. There was a large variation in taxi ride fare to/from Queens, which was polarized by the high costs to EWR and the low cost to LGA and JFK (Figure 5C).

5. Popular pickup locations for airport trips

To identify the most popular pickup spots for green taxi airport trips, I calculated the number of airport-ride pickups in each taxi zone during September 2015. In Figure 6 below, zones with more airport pickups are in darker blue. The most popular pickup zones for airport rides were in upper Manhattan near Morningside Heights and Harlem, and in Queens near Astoria, Long Island City, Jackson Heights, Elmhurst, and Flushing.

Figure 6. Heatmap of NYC taxi zone by frequency of airport ride pickups during September 2015. The color key represents the frequency of pickups, with darker blue indicating more pickups.

Tip Percentage Prediction

In this section, I built a series of prediction models to predict the tip percentage for eligible NYC green taxi trips in September 2015. This analysis followed the logical steps as outlined below.

  • Data filtering
  • Feature engineering and data cleaning
  • Deriving response variable
  • Feature selection and transformation
  • Model fitting and evaluation

1. Data filtering

Because the tip amount was only available for credit card transactions, only trips paid by a credit card were included. In addition, taxi rides that were paid by negotiated fare (rate code = 5) without tip were excluded as well, because the tip was most likely included in the total negotiated fare amount. A total of 680,241 trips were included in the prediction analyses.

2. Feature engineering and data cleaning

The following features were extracted based on the existing data.

  • Day of week
  • Week of a month
  • Weekday (Monday to Friday) or weekend (Saturday to Sunday)

Furthermore, the following missing or invalid values were replaced with the most frequent value.

  • Passenger count: For trips with 0 passengers, the passenger count was recoded as 1.
  • Trip type: Missing trip type was coded as street-hail.
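The calendar features listed above can be derived from the pickup timestamp alone. In the sketch below, the week of the month is assumed to follow Monday-starting calendar weeks, which is inferred from the later remark that the last week of September 2015 contained only 3 weekdays; the function name and the returned keys are hypothetical.

```python
from datetime import datetime


def calendar_features(pickup):
    """Derive day of week, week of month, and weekend flag from a timestamp string."""
    ts = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S")
    # Weekday of the 1st of the month (Monday = 0), used to align calendar weeks
    first_weekday = ts.replace(day=1).weekday()
    week_of_month = (ts.day - 1 + first_weekday) // 7 + 1
    return {
        "day_of_week": ts.weekday(),     # Monday = 0 ... Sunday = 6
        "week_of_month": week_of_month,  # 1-5 for September 2015
        "is_weekend": ts.weekday() >= 5, # Saturday or Sunday
    }


print(calendar_features("2015-09-06 23:15:00"))  # Sunday, end of week 1
print(calendar_features("2015-09-07 08:00:00"))  # Labor Day Monday, start of week 2
print(calendar_features("2015-09-30 12:00:00"))  # Wednesday in week 5
```

Under this convention, week 5 covers September 28 to 30 only.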

3. Definition of tip percentage

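Tip percentage was calculated as the tip amount divided by the total amount:

$$\text{Tip percentage} = \frac{\text{Tip amount}}{\text{Total amount}} \times 100\%$$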

The distribution of the tip percentage is shown in Figure 7. The mean tip percentage was 14.2% with a standard deviation of 7.0%. The tip percentage does not follow a normal distribution and is concentrated around a few typical values, i.e., 0%, 16.7%, 20%, and 23%. The Q-Q plot also suggests that data are highly clustered at the lower and upper range of the distribution. Interestingly, these values are equivalent to 0%, 20%, 25%, and 30% of the base fare, respectively, which are the preset tipping options shown at check out. This suggests that many passengers chose a preset tipping percentage or skipped tipping altogether, instead of selecting a custom tip percentage.
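The correspondence between the preset options and the observed peaks follows from simple arithmetic, assuming the tip percentage is the tip's share of the total paid and, as a simplification, that the total is just base fare plus tip (taxes and surcharges ignored). The helper name is hypothetical.

```python
def tip_share_of_total(tip_rate_on_fare):
    """Tip as a percentage of (fare + tip), given the tip as a fraction of the base fare.

    Simplifying assumption: total = fare + tip, with other charges ignored.
    """
    return 100 * tip_rate_on_fare / (1 + tip_rate_on_fare)


presets = {rate: round(tip_share_of_total(rate), 1) for rate in (0.0, 0.20, 0.25, 0.30)}
print(presets)  # {0.0: 0.0, 0.2: 16.7, 0.25: 20.0, 0.3: 23.1}
```

A 20% tip on a $10 fare is $2 out of $12 paid, or 16.7% of the total, which matches the second peak in the distribution.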

Figure 7. Distribution of tip percentage

4. Feature selection and transformation

All features in the dataset were considered candidate features for predicting tip percentage, except for “tip_amount” and “total_amount”, which are directly used to derive tip percentage. Including the total amount and tip amount as features would lead to spuriously good prediction accuracy, a problem known as “data leakage”. The pickup/drop off taxi zones were used for modeling because they are more interpretable than the longitude and latitude of locations. In addition, “ehail_fee” is null for all records, “store_and_fwd_flag” is irrelevant to tip percentage, and “payment_type” is uniform for all observations; these three features were therefore excluded from the model.

Figure 8A. Scatter plot of tip percentage and continuous features

Figure 8A shows the relationship between the tip percentage and continuous features. Figure 8B shows the relationship between tip percentage and categorical features. Because information about taxi trips is limited, these features were all included in the prediction models. Trip distance and time were not linearly correlated with tip percentage and were converted into categorical variables. Trip speed was highly correlated with distance and time and was therefore not included in the prediction model. The features below were included in the model.

Figure 8B. Bar plot of tip percentage and categorical features

Continuous features

  • Fare amount
  • Extra charge
  • MTA tax
  • Tolls amount
  • Improvement surcharge

Categorical features

  • Vendor ID: 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
  • Trip type: 1=street hail, 2=dispatch
  • Pickup hour: grouped into 2-hour windows (e.g., 1:00 AM–2:00 AM)
  • Trip distance: 0–1.31 miles, 1.31–2.46 miles, 2.46–4.50 miles, and 4.50+ miles
  • Trip time: 0–7 minutes, 7–12 minutes, 12–19 minutes, 19+ minutes.
  • Pickup location: Bronx, Brooklyn, Manhattan, Queens, Staten Island, JFK, LGA, EWR airports
  • Dropoff location: Bronx, Brooklyn, Manhattan, Queens, Staten Island, JFK, LGA, EWR airports
  • Week of September: week 1–5
  • Weekday or weekend: 1=weekday, 2=weekend
  • Rate code ID: 1= Standard rate, 2=JFK, 3=Newark, 4=Nassau or Westchester, 5=Negotiated fare
  • Passenger count: 1–2, 3–5, and 5+ passengers

All categorical variables were converted into k-1 level dummy variables in the prediction model.
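Binning a continuous feature and expanding it into k-1 dummy variables can be sketched as follows, using the trip distance cut points listed above. The helper names are hypothetical, the choice to place boundary values in the upper bin is an assumption, and the first level is used as the reference category.

```python
from bisect import bisect_right

# Cut points and labels taken from the trip distance categories above
DISTANCE_EDGES = [1.31, 2.46, 4.50]
DISTANCE_LABELS = ["0-1.31", "1.31-2.46", "2.46-4.50", "4.50+"]


def distance_bin(distance_mi):
    """Map a distance to its category (boundary values go to the upper bin)."""
    return DISTANCE_LABELS[bisect_right(DISTANCE_EDGES, distance_mi)]


def dummy_encode(value, levels):
    """k-1 dummy coding: the first level is the reference and maps to all zeros."""
    return [1 if value == level else 0 for level in levels[1:]]


for d in (0.5, 3.0, 10.0):
    label = distance_bin(d)
    print(d, label, dummy_encode(label, DISTANCE_LABELS))
```

Dropping one level avoids perfect collinearity with the intercept in a linear model; the coefficients of the remaining dummies are then read as differences from the reference category.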

5. Model fitting and evaluation

Several models were compared to select the best fit. Model fit was evaluated using standard metrics in a train-test setting. The training and testing datasets were split in a 7:3 ratio. Models were first fitted to the training dataset and then applied to the testing dataset to obtain fit statistics. Goodness-of-fit was assessed by the Root Mean Squared Error (RMSE) of tip percentage and the proportion of variance in tip percentage explained by the model (R2).
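A 7:3 random split can be sketched with the standard library alone. The function below is a hypothetical helper, not the one used in the analysis; the seed is arbitrary and only makes the split reproducible.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(10))
print(len(train), len(test))  # 7 3
```

Every record lands in exactly one of the two sets, so the test set stays unseen during fitting.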

RMSE. RMSE is an estimate of the standard deviation of the random component not explained by the model, and is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2}$$

Here Ŷi is the predicted value from the model, Yi is the observed data value, and n is the number of observations. An RMSE closer to 0 indicates a better fit. When the RMSE is less than the standard deviation of the tip percentage, the predictive model is considered useful.

R2. R2 is the square of the correlation between the response variable and the predicted responses, which can be calculated as:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}$$

Here Ŷi is the predicted value from the model, Yi is the observed data value, and Y̅ is the mean of the observed data. R2 ranges between 0 and 1, with a value closer to 1 indicating that a greater proportion of variance is accounted for by the model. For example, an R2 value of 0.80 means that the model explains 80% of the total variation in the response variable.
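Both metrics are short to compute by hand. The sketch below uses the standard definitions (root of the mean squared residual for RMSE; one minus the residual sum of squares over the total sum of squares for R2) on a few made-up tip percentages, and shows that a mean-only baseline has an R2 of zero.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Made-up tip percentages for illustration
observed = [0.0, 16.7, 20.0, 23.0]
baseline = [sum(observed) / len(observed)] * len(observed)  # intercept-only model

print(round(rmse(observed, baseline), 2))
print(round(r_squared(observed, baseline), 2))
```

For the intercept-only baseline, the RMSE equals the (population) standard deviation of the response, which is why the text compares RMSE against the standard deviation of tip percentage.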

1) Multivariable linear regression

Linear regression with the ordinary least squares method was first used to predict tip percentage. The full model with all features was compared with a baseline model with intercept only, which simply used the mean as the predicted value. Compared to the baseline model, the full model explained little additional variance in tip percentage (R2=0.040). The RMSE was estimated to be 6.87%, meaning that the predicted tip percentage typically deviates from the actual percentage by about 6.87 percentage points. For example, if the predicted tip percentage is 20%, the actual tip percentage will most likely fall between 13.13% and 26.87%.
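As a rough consistency check, the reported RMSE is close to what the reported R2 implies. For a least-squares fit, RMSE is approximately the standard deviation of the response times the square root of (1 - R2). The check below plugs in the rounded figures from the text; it is an approximation because the standard deviation was reported for the full sample while the RMSE comes from the test set.

```python
import math

sd_tip_pct = 7.0  # standard deviation of tip percentage reported earlier
r2 = 0.040        # R2 of the full linear model

# Approximate relationship for a least-squares fit: RMSE ~ SD * sqrt(1 - R2)
implied_rmse = sd_tip_pct * math.sqrt(1 - r2)
print(round(implied_rmse, 2))  # 6.86, close to the reported 6.87

# One-RMSE band around a predicted tip of 20%
low, high = 20 - 6.87, 20 + 6.87
print(round(low, 2), round(high, 2))  # 13.13 26.87
```

With an R2 this small, the model shrinks the prediction error by only about 2% relative to always predicting the mean.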

A summary of the linear regression model is shown in Appendix 2. Taxi trips with the following features accrued higher tip percentage: lower fare and extra charge, higher toll fee, shorter duration and distance, more passengers, picked up after 6 PM, picked up at LGA, dropped off at all airports, Manhattan downtown, and Staten Island, during the weekend, and equipped with the Creative Mobile Technologies meter.

Impact of extreme data. There were 514 trips with extremely large tips that were > 50% of total cost. A sensitivity analysis that excluded the extreme values slightly improved the model fit (Table 1), which suggests that the model’s performance was partially hindered by its inability to accurately predict large tips (i.e., tip percentage > 50%).

Polynomial model. A polynomial model was fitted by assuming a quadratic function between continuous features and the tip percentage. The model fit was slightly improved (Table 1), suggesting that the non-linear relationship between features and response only had a minor influence on model fit.

2) Regularization and gradient boosting

A few advanced models were applied to predict the tip percentage. Ridge regression was used to penalize the influence of features that have low predictive power but contribute high variance. Gradient boosting was used to reduce the prediction error by iteratively modeling the residuals. Additionally, XGBoost was applied because it adds regularization to gradient boosting, which is prone to overfitting. Model parameters were determined by cross-validation and grid search using the training dataset. Model fit was assessed in the testing dataset.

Table 1 summarizes the performance of all models. Because the more advanced models led to only marginal improvement in model fit, linear regression is preferred for its fast implementation and high interpretability.

6. Summary and future direction

In sum, the multivariable linear regression achieved unsatisfactory performance as measured by R2 and RMSE. Model fit was not improved significantly by regularization or gradient boosting methods. Therefore, the prediction models are not recommended to make predictions for tip percentage for green taxis in NYC. Nevertheless, based on the beta coefficients of linear regression model, taxi rides with lower cost, shorter duration/distance, and to/from airport and Manhattan downtown, during the evening and weekends may yield higher tip percentage. Notably, taxis that used Creative Mobile Technologies meters generated higher tips compared to VeriFone Inc. meters. This observation is congruent with news reports that found Creative Mobile Technologies uses the sum of base fare and tax/fees to calculate tip percentage on the payment interface, while VeriFone uses the base fare only. To improve the predictability of tip percentage for NYC green taxis, I would like to tackle the following issues in future analyses.

Why is the model fit so bad?

Figure 9. Diagnostic plot of linear regression

Diagnostic plots of the linear regression suggest that the residuals are neither normally distributed (i.e., violation of the normality assumption) (Figure 9A) nor stable over the predicted values (i.e., violation of the homoscedasticity assumption) (Figure 9B). The poor model fit could be due to a few factors.

1) It makes sense that a model with the mean tip percentage (14%) performed almost as well as the fully loaded model, because there is a social norm of tipping around 15–20% in the US and people tend to tip within this range.

2) Although the tip percentage was highly centered around the social norm (15–20%), there are some extreme values that disrupt the distribution, mostly from the 0% tippers. These extreme values contribute to the large prediction errors and are worth further examination. For instance, some of the 0% tips could be artifacts of tips paid in cash.

3) Key information that is predictive of tipping behavior is not available, for example, socio-demographic information about the passenger, the passenger's past tipping rate, the driver’s driving style, and the car's condition.

Preset vs. custom tip percentage

As discussed before, tip percentages are highly clumped at certain preset values, which significantly affects the normality of the residual distribution, a key assumption of linear regression. In Figure 10, I compared the distribution of tip percentage of all trips (Figure 10A) vs. trips with non-zero tips (Figure 10B), and vs. trips with custom tip percentage (i.e., tip percentage other than 0%, 16.7%, 20%, 23%) (Figure 10C). The distribution of tip percentage is closer to a normal distribution after removing the preset tip percentages, with a distinct shape (skewness=1.7, kurtosis=7.1) compared to the distribution of all tips (skewness= -0.6, kurtosis=1.5). Therefore, tipping by a preset or a custom percentage appears to be two separate processes that can be modeled in a two-part model. In such a model, step 1 would be a classification model to predict preset or custom tip percentage. In step 2, preset tippers would be further classified into the different preset rates; for custom tippers, the exact tip percentage would be estimated in a prediction model.

Figure 10. Distribution of tip percentage before and after removing certain preset percentages

Predicting tip amount as the response

In a final attempt to improve the prediction accuracy, I used the tip amount (in dollars) as the response variable, which roughly follows a log-linear distribution. Using the same set of features, a multivariable linear model explained a substantial portion of the variance in tip amount (R2 = 0.477). However, the RMSE remained large (RMSE = $1.70), which translated to an RMSE in tip percentage of approximately 6.9%. Therefore, directly modeling the tip amount did not improve the prediction accuracy for tip percentage.

Analysis of Trip Speed

In this section, I first compare the average trip speed across the 5 weeks of September 2015 using a one-way analysis of variance (ANOVA) test. I then test whether average speed also differs between weekdays and weekends, in addition to the week of month, using a two-way ANOVA test. Finally, I examine the average travel speed over the 24 hours of day and test the hypothesis that traffic affects hourly average speed.

1. Average speed of taxi trips in September 2015

The average speed was calculated as trip distance/trip time. For all taxi trips, the mean speed was 14.0 MPH, with a standard deviation of 6.2 MPH. The median speed was 12.7 MPH, with 95% of all trips falling between 5.9 and 30.0 MPH.

2. Average trip speed by week

To test if the average speed was the same among the 5 weeks of September 2015, I performed a one-way ANOVA test of average speed by week. The null hypothesis of ANOVA was that all weeks had the same mean speed; the alternative hypothesis was that at least one week had a different mean speed than the other weeks.
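The F statistic behind a one-way ANOVA is the ratio of between-group to within-group variance. The sketch below computes it by hand on invented speeds for three groups; it does not produce a p-value, which would require the F distribution (for example via `scipy.stats.f_oneway`, if SciPy is available).

```python
from statistics import fmean


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: between-group MS / within-group MS."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Invented trip speeds (MPH) for three weeks, for illustration only
speeds_by_week = [
    [13.0, 14.0, 15.0],
    [12.0, 13.0, 14.0],
    [14.0, 15.0, 16.0],
]
print(one_way_anova_f(speeds_by_week))  # 3.0
```

A larger F means the week means differ more than the spread within each week would suggest. With over a million trips, even small differences in mean speed can yield a very large F.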

The results are presented in Table 2; the null hypothesis was rejected. Therefore, the average speed was not the same for all weeks in September. Post-hoc pairwise comparisons show that the average speed differed between any two weeks, with weeks 2 and 5 having the slowest traffic (Appendix 3).

It is possible that the Labor Day weekend during week 2 may have caused some congestion during the first 2 work days of that week. Although there was no holiday or major event during the last week, I hypothesize that its slower speed arose because only 3 weekdays were included in that week, and weekdays tend to have slower traffic in general.

To test the hypothesis that weekday and weekend taxi rides have different mean speeds, I performed a two-way ANOVA by testing the main effects of the week of the month and weekday/weekend on the response variable, mean speed. The analysis assumed no interaction between the two factors. The result is summarized in Table 3, which shows that trip speed was statistically significantly different between weekday and weekend, independent of the week of the month. Weekend taxi rides on average were 1.04 MPH faster than weekday rides (95% confidence interval: 1.01–1.06 MPH). Compared with the one-way ANOVA with the week of month as the only independent variable, week of month explains a smaller sum of squares in the two-way ANOVA after accounting for the weekday/weekend variable. This is because some of the between-week variation in speed is attributed to the weekday vs. weekend difference. For example, week 5 included only weekdays and therefore had a lower average speed than other weeks, which included both weekdays and weekends.

3. Average trip speed by hour of day

The mean trip speed over the 24 hours of day is shown in Figure 11. It appears that trips in the early morning and late night were the fastest, and trips during morning and evening rush hours were the slowest. Stratified plots by trip direction, day of week and week of month show that the change in mean speed over time followed similar patterns across these groups. Therefore, taxi drivers can expect faster rides before 8 AM and after 6 PM.

Figure 11. Mean trip speed over the 24 hours of day (A) and by trip direction (B), day of week (C), and week of month (D)

One obvious explanation for the hourly change in speed is that the average speed slows down as the traffic becomes busier. To test this hypothesis, I first examined the relationship between hourly mean taxi speed and hourly total taxi pickups. Figure 12 is a scatter plot of the hourly mean taxi speed and hourly total taxi pickups, for each hour of the 30 days of September. Not surprisingly, hourly mean speed decreased linearly with the number of total taxi pickups, meaning average speed was slower when there were more taxi pickups per hour. To formally test the association between them, a linear regression was fitted with hourly taxi pickups as the independent variable and hourly mean speed as the dependent variable. The result shows that hourly pickups alone explained 45.7% of the variance in hourly mean speed. The model estimates that, for every 1,000 additional taxi pickups per hour, the hourly mean speed decreases by 1.8 MPH (95% confidence interval: 1.7–2.0, P<0.001).
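A single-predictor least-squares fit like this one has a closed form. The sketch below fits it by hand on invented points that lie exactly on a line with slope -1.8, chosen only to mirror the reported estimate; the real data are far noisier (R2 of 0.457), and the intercept of 18 MPH is an arbitrary assumption.

```python
from statistics import fmean


def simple_ols(x, y):
    """Closed-form simple linear regression; returns (slope, intercept, r_squared)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Invented hourly data: pickups in thousands, mean speed in MPH
pickups_thousands = [0.5, 1.0, 2.0, 3.0, 4.0]
mean_speed_mph = [17.1, 16.2, 14.4, 12.6, 10.8]

slope, intercept, r2 = simple_ols(pickups_thousands, mean_speed_mph)
print(round(slope, 2), round(intercept, 2), round(r2, 3))
```

Expressing pickups in thousands makes the slope read directly as the change in MPH per 1,000 additional pickups.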

Figure 12. Relationship between the hourly total taxi pickups and mean speed

4. Summary

In sum, on average the NYC green taxi ran at a speed of 14 MPH during September 2015. A few factors influenced the taxi speed:

  • Speed differed by week, with the Labor Day week and the last week of September having the slowest speed.
  • Speed was generally faster during the weekends, compared to weekdays.
  • Throughout the day, speed was faster during early morning and late night, and slower during morning and evening rush hours.
  • Hourly average speed decreased linearly as the total number of pickups increased.

These findings can be used in cases such as arrival time estimation and general driving planning, which may significantly improve forecast accuracy and achieve overall better travel experience.
