Is Uber Taking Over New York City?

Using individual trip-level data to answer important questions on ride sharing services’ influence on markets, drivers, and citizens

Jack Lundquist · Published in data.tale() · 9 min read · Apr 1, 2018

Jack Lundquist and Andrew Nell, NYU CUSP ‘18

Much has been made in the past few years of newly influential, “disruptive” companies like Uber, Airbnb, Grubhub, and TaskRabbit, all of which seem poised to transform their respective sectors. The rise of the sharing economy has brought with it an array of important questions, ranging from the economic to the moral to the political. To some extent, questions about the impact of these disruptive economic forces remain so relevant because we lack information about them. This is in part because many of these companies are new phenomena in our societies and economies, and in part because of their resistance to regulatory oversight. It is also, quite simply, because we have not had the data to answer even the most basic question about the sharing economy: to what extent have these companies woven themselves into the fabric of our lives, both as workers and consumers?

There have been some great attempts to answer these questions by FiveThirtyEight and some civic hackers, but they are limited by the data available. Apart from the data obtained by FiveThirtyEight (which covers a six-month period in 2014 and another six months in 2015) and aggregated weekly data published by the New York City Taxi & Limousine Commission (TLC), individual trip data by vehicle company remains elusive.

Our team found a way to use the publicly available data sets on taxi trips to acquire individual trip-level data for each vehicle service. We have published our code on GitHub so that those interested in ride sharing services can go about answering the important questions they may have about this rapidly changing sector. In this article we discuss our own analysis of this dataset and present a brief overview of our data integration process for those interested.

Our results: ridesharing is on the rise, but yellow taxis maintain a stronghold downtown

Figure 1: Taxi demand per service type, 2015–2017 (rolling 28 day mean)

Perhaps unsurprisingly, our analysis reveals that Uber has overtaken yellow taxis as the primary for-hire vehicle (FHV) service. Uber overtook yellow taxis as the largest holder of market share in the summer of 2017, and looks as if it will remain the largest for the foreseeable future. Yellow and green taxi trips have been declining at least since January 2015 (the cutoff for this analysis), although it looks as if a seasonal pattern in yellow taxi trips (more taken in the colder months and fewer in the summer, apart from the holiday season drop-off) remains. While Via and Lyft have a significantly smaller market share than Uber, the giant of the new ride-sharing services, they too have seen consistent growth since their data were included in the TLC FHV dataset around May of 2015. Both Lyft and Via now have a larger market share than the green taxi services, which are likely hampered in part by the geographic restrictions placed on them. The overall number of trips for all services has increased from 2015 to 2017, with the increase coming from ride sharing service trips. This suggests that these new ride sharing services, perhaps through convenience or pricing, have been successful at increasing demand for for-hire vehicles in New York City and potentially encouraging defection from other modes of transport.
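For readers who want to reproduce the smoothing used in Figure 1, here is a minimal sketch of a trailing 28-day rolling mean over daily trip counts. It uses only the Python standard library, assumes one count per consecutive day, and runs on invented numbers; the function name and sample data are illustrative and are not taken from our published code.

```python
from collections import deque
from datetime import date, timedelta

def rolling_mean(daily_counts, window=28):
    """Return (day, mean) pairs for each day that has a full trailing window.

    daily_counts is assumed to be a list of (day, count) pairs,
    one per consecutive day, in chronological order.
    """
    buf = deque(maxlen=window)
    running = 0
    out = []
    for day, count in daily_counts:
        if len(buf) == window:
            running -= buf[0]  # this value is about to be pushed out of the window
        buf.append(count)
        running += count
        if len(buf) == window:
            out.append((day, running / window))
    return out

# Invented daily counts: 100 trips a day for 28 days, then 128 a day.
start = date(2015, 1, 1)
daily = [(start + timedelta(days=i), 100 if i < 28 else 128) for i in range(56)]
smoothed = rolling_mean(daily)
```

With a DataFrame of daily counts, a rolling window in pandas would be the more usual way to do the same thing.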

Figure 2: Dominant taxi service by taxi zone (2015–2018). Service dominance of a zone is defined as the service with the most trips taken that month in that zone.
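The dominance definition in the caption above can be sketched in a few lines. This is an illustration under stated assumptions: each trip has already been reduced to a (month, taxi zone, service) triple, and the records below are invented. Ties are not handled specially; the service seen first wins.

```python
from collections import Counter, defaultdict

# Invented trip records: (month, taxi zone ID, service).
trips = [
    ("2017-06", 161, "yellow"),
    ("2017-06", 161, "yellow"),
    ("2017-06", 161, "uber"),
    ("2017-06", 61, "uber"),
    ("2017-06", 61, "uber"),
    ("2017-06", 61, "lyft"),
    ("2017-07", 61, "green"),
]

def dominant_service(trips):
    """Return {(month, zone): service with the most trips that month in that zone}."""
    counts = defaultdict(Counter)
    for month, zone, service in trips:
        counts[(month, zone)][service] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

dominant = dominant_service(trips)
```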

While Uber has become the service most riders in New York City prefer, yellow taxis have maintained a stronghold in Midtown and Downtown Manhattan, where a disproportionate number of New York City’s for-hire vehicle trips are taken (see below). This could be for any number of reasons, but may have to do with the near-ubiquitous presence of yellow taxis in these parts of Manhattan. It does not take long to hail a taxi in Downtown or Midtown, because so many taxis are around, while an app-based service may take time to arrive for pickup. Outside of this dense cluster, taxis are less common, and it is therefore much harder to find and successfully hail one. The convenience and ease of using Uber’s platform to request a vehicle seems to have contributed to its ability to overtake yellow taxis and other services as the dominant service in the less taxi-dense portions of New York City.

Figure 3: Overall taxi demand by taxi zone (2015–2017). Note the high volume of trips in zones containing JFK and LaGuardia airports (the bright zones on the eastern side of the map).

By late 2017, it does seem that Uber had begun to penetrate that stronghold, but whether it will overtake yellow taxis as the dominant service in New York City’s largest hub of employment remains to be seen. Yellow taxis have also maintained dominance in both zones where the city’s airports are located, though Uber is clearly gaining ground in these zones. The reason for this may be similar to the reason they have maintained dominance in Downtown and Midtown: an omnipresent and conveniently accessible fleet, particularly for international travelers who may not have the mobile data or internet access needed to use app-based services on arrival.

Figure 4(a) and 4(b): Biyearly trip volumes by taxi zone for Uber and yellow taxis, 4(a) and 4(b) respectively (2015–2017). While Uber has less volume in any one zone than yellow taxis do, it steadily increases its volume across taxi zones while yellow taxi volume declines steadily.

Uber has seen unprecedented growth in demand in areas outside of the traditional Manhattan taxi hub, as can be seen in Figure 2, where large parts of Brooklyn, Queens and Upper Manhattan are seeing increases in ridership. This could partially be attributed to phenomena such as gentrification, though it may simply be a symptom of a shift in ridership from other modes such as subway and bus to FHVs based on cost, reliability and accessibility. Whatever the cause, it certainly warrants more detailed research.

These insights — that Uber has become the dominant for-hire vehicle service; that for-hire vehicle demand has increased with the advent of ride sharing services; and that yellow taxis remain dominant in Downtown, Midtown and around airports — while interesting, are not novel. They could be generated using the summary datasets provided by the TLC, and are pretty intuitive for New Yorkers. The value of our work is in the generation of a much more granular record of for-hire vehicle trips taken in New York City, which allows for questions more fundamental to the lives and livelihoods of people to be answered: questions related to specific mobility patterns at various times of the day, questions around the defection of riders from one service to another, questions around the costs and benefits of different services to both drivers and riders, questions around the use of mode shift to explain neighborhood change, and more. Stay tuned for an update to this post, where we will analyze some of these questions and discuss the significant implications.

Data Processing

Datasets used in this analysis are listed below.

TLC Trip Record Data: Individual trips taken by yellow taxis, green taxis and for-hire vehicles. Yellow and green taxi data starts in 2009, and FHV data starts in 2015. Features include time, distance, cost, and (depending on vehicle type and month/year) either longitudes and latitudes or taxi zone associated with the pickup and drop off points. FHV data excludes any additional characteristics other than pick up taxi zones, although drop off zones are included post 2016.

Taxi Zones: Spatial boundaries established by the TLC in order to track trips in a more anonymized way than point-based locational data. This is the spatial scale at which many analyses of taxi and FHV data take place.

For-Hire Vehicle Bases: From the TLC website: “A New Livery Base is a TLC licensed business that dispatches TLC licensed for-hire vehicles designed to carry fewer than six passengers, excluding the driver, which charge for service on the basis of flat rate, time, mileage, or zones.” These base names contain information on the ride-sharing company (e.g. Uber, Lyft, Via, etc…) licensing the vehicles in question, which is what allows us to link trip data to specific ride-sharing companies.

When trying to source taxi datasets in NYC, the first ‘port of call’ is NYC’s Open Data Portal and the TLC’s data page. The standard data sets are separated into yellow taxi records, green taxi records, and records on any other for-hire vehicle trip. The yellow and green taxi data sets are rich with information: pick up and drop off coordinates, dates and times, fares, tips, etc. However, FHV datasets are much sparser. Data provided for each trip are the taxi base number (the “home base” of each taxi service), pick up date-time and pick up taxi zone (in INSERT DATE the TLC started to include drop off taxi zone and date-time, as well). With these features alone, it is impossible to disaggregate these trips by ride-sharing service, and therefore impossible to measure the market share of Uber or any other ride-sharing service. In order to perform this important disaggregation, an additional dataset is needed.

This dataset is the registry of FHV taxi bases. Each taxi service is registered to an individual base that is responsible for dispatching for-hire vehicles to destinations to pick up passengers. In the traditional taxi cab environment, this would be the base that one would call to request a taxi. Similarly, when app based services such as Uber started to become regulated in NYC, they were required to follow the same protocol as other FHV services already in existence and register to specific bases.

A schematic outlining the process to go from the TLC’s dataset of trip records to a record of trips taken by location and for-hire vehicle type. Each trip record contains a pickup zone (purple), which can easily be linked to a shapefile of taxi zones (orange; both are spatial data and therefore easy to link together) and thereafter plotted on a map. The Base ID (blue) of each FHV trip can be linked to the registry of taxi bases, which can then be linked to its corresponding “Alternate Licensee Name” (red), which contains information on the specific for-hire vehicle company.

The TLC conveniently supplies a registry of all registered FHV taxi bases (commercial, luxury and black) and data about these bases. There are over 800 taxi bases registered and upon initial inspection, there are no bases with any names that relate to well-known app based taxi services. One feature in this dataset is the ‘Alternate name of Licensee.’ While it is often empty (over 75% of data points), in the case of most ride-sharing services it includes the names of these services along with the name of the base.
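As a rough illustration of how the alternate licensee name can be turned into a service label, consider the sketch below. The keyword list, field names, base numbers and base names are all invented for the example; the real registry uses its own column names and values, and real name strings may need looser matching than the whole-word test used here.

```python
import re

# Hypothetical service keywords to look for in the alternate licensee name.
SERVICES = ("uber", "lyft", "via")

def service_from_alt_name(alt_name):
    """Map an alternate licensee name to a service label, or 'other'."""
    # Match whole words only, so that 'via' does not match inside 'aviation'.
    words = re.findall(r"[a-z]+", (alt_name or "").lower())
    for service in SERVICES:
        if service in words:
            return service
    return "other"

# Invented registry rows; real base numbers and names will differ.
registry = [
    {"base_id": "B00001", "base_name": "EXAMPLE ONE LLC", "alt_name": "UBER"},
    {"base_id": "B00002", "base_name": "EXAMPLE TWO INC", "alt_name": "Lyft"},
    {"base_id": "B00003", "base_name": "EXAMPLE THREE", "alt_name": "VIA - EXAMPLE"},
    {"base_id": "B00004", "base_name": "AVIATION CAR SERVICE", "alt_name": ""},
    {"base_id": "B00005", "base_name": "EXAMPLE FIVE", "alt_name": None},
]

# Lookup table from base ID to service label.
base_to_service = {
    row["base_id"]: service_from_alt_name(row["alt_name"]) for row in registry
}
```

Only the alternate name is inspected, since the base names themselves do not mention the app-based services; empty or missing alternate names fall through to "other".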

This means that, for every trip record linked to a base that includes the name of a ride-sharing service in its “Alternate name of Licensee” column, one can deduce which trips were conducted by which service. However, given that many of the bases in the registry lacked information on the “Alternate name of Licensee”, it could be the case that the trips one is able to link to a ride-sharing service are only a small piece of the total trips made by drivers working with that service. To verify the completeness of this method, we used a dataset released by Uber after a FOIL request by FiveThirtyEight. The bases from the FiveThirtyEight dataset and the Uber bases taken from the TLC-provided registry of bases were identical, except for one Uber base missing from the TLC list. We believe this base is no longer in service, and hence was off the TLC list when we conducted our analysis in October 2017. This means we can say with reasonable confidence that the TLC registry contains information on all the Uber registered bases (and likely the bases of other ride-sharing services).
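A completeness check of this kind boils down to comparing two sets of base IDs. The sketch below shows one way to do that; the base numbers are invented placeholders, not the actual Uber bases, and the function name is hypothetical.

```python
def compare_base_sets(registry_ids, reference_ids):
    """Report which base IDs appear in both lists and which appear in only one."""
    registry_ids, reference_ids = set(registry_ids), set(reference_ids)
    return {
        "in_both": registry_ids & reference_ids,
        "only_in_registry": registry_ids - reference_ids,
        "only_in_reference": reference_ids - registry_ids,
    }

# Invented IDs: bases labeled as Uber in the registry vs. bases in a reference dataset.
registry_uber = ["B00001", "B00002", "B00003"]
reference_uber = ["B00001", "B00002", "B00003", "B00009"]

report = compare_base_sets(registry_uber, reference_uber)
```

In this made-up example the report flags a single base that appears only in the reference data, the same kind of discrepancy described above.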

By combining the taxi base data set with the FHV individual trip data set, we can determine exactly which trip belongs to which service. Furthermore, by combining the taxi zone data set (also available from the TLC) with the yellow and green cab trips, we could deduce which taxi zone these cabs were operating in, allowing for a direct comparison between these traditional NYC taxi cab services and the newer “disruptors”. Stay tuned for our team’s more detailed comparison of these competing services!
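To make the linking step concrete, here is a minimal sketch of labeling FHV trips by service through a base lookup and then counting trips per service and pickup zone. It assumes a base-to-service table has already been built from the registry; the field names, base numbers and zone IDs are invented and this is not our published code.

```python
from collections import Counter

# Hypothetical lookup from base ID to service, derived from the base registry.
base_to_service = {"B00001": "uber", "B00002": "lyft"}

# Invented FHV trip records with a dispatching base and a pickup taxi zone.
fhv_trips = [
    {"base_id": "B00001", "pickup_zone": 61},
    {"base_id": "B00001", "pickup_zone": 61},
    {"base_id": "B00002", "pickup_zone": 61},
    {"base_id": "B00007", "pickup_zone": 161},
    {"base_id": "B00001", "pickup_zone": 161},
]

def count_trips(trips, lookup):
    """Count trips per (service, pickup zone); unknown bases are labeled 'other'."""
    counts = Counter()
    for trip in trips:
        service = lookup.get(trip["base_id"], "other")
        counts[(service, trip["pickup_zone"])] += 1
    return counts

trip_counts = count_trips(fhv_trips, base_to_service)
```

The same per-zone counts for yellow and green taxis can then be placed alongside these, once their pickup points have been assigned to taxi zones.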

Note: the integration and analysis described here summarizes and builds upon the work of a larger group project conducted for the course “Applied Data Science,” a course at NYU CUSP we were enrolled in during the fall 2017 semester. Thanks to Prince Abunku, Maham Khan and Lior Melnick for their contribution to this project!

Originally published at medium.com on April 1, 2018.
