How far do people commute using Bike Sharing Systems?

An analysis of 2017 Citibike trip distance

Juan Sokoloff
data.tale()
6 min read · May 31, 2018


Bike sharing systems combine two major trends: first, the “sharing economy” phenomenon, and second, the increasing use of bicycles as a mode of transportation in urban environments. These systems take advantage of the growing demand for bicycles while providing a shared service that does not require users to own a bike. Determining the limitations of existing bike sharing systems would allow cities to allocate resources better and improve the service. It would also help cities grow the niche that this type of service occupies within the mobility demands of a large city, since improving the system is likely to result in more customers opting for the bike sharing alternative.

I live in New York City (NYC) and wanted to understand how Citibike, NYC’s bike sharing system, is used as a commuting service. I’m particularly interested in one of the most obvious limitations of biking: long-distance trips. The motivation for this exercise is to estimate a “maximum acceptable distance” for trips using Citibike, a metric that could later be applied to bike sharing systems in other cities. For instance, if the maximum acceptable distance for a given city is X, but 80% of the residents have a commuting distance longer than X, the service could reach a usage rate of at most 20% (disregarding other variables that could affect the rate).

Analysis and results

To calculate the maximum acceptable distance, I used all the trips made with Citibike during 2017 (around 17 million) as the initial dataset. By eliminating trips taken on weekends and trips by unregistered users, I ended up with a dataset of people who use the system frequently (registered) and on workdays (Monday-Friday); these are the users I considered “commuters”. With the data prepared in this way, I started digging into my questions, the first being about time: how long do Citibike commuters bike for?
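The commuter filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `starttime` and `usertype`, the `"Subscriber"` label for registered users, and the four sample rows are hypothetical stand-ins, not the actual preprocessing used for this analysis.

```python
from datetime import datetime

# Hypothetical rows mimicking a Citibike trip export; the field names
# (starttime, usertype) and the "Subscriber" label are assumptions.
trips = [
    {"starttime": "2017-03-06 08:15:00", "usertype": "Subscriber"},  # Monday
    {"starttime": "2017-03-11 10:30:00", "usertype": "Subscriber"},  # Saturday
    {"starttime": "2017-03-07 18:05:00", "usertype": "Customer"},    # Tuesday
    {"starttime": "2017-03-10 07:50:00", "usertype": "Subscriber"},  # Friday
]

def is_commuter_trip(trip):
    """Keep trips by registered users that start Monday through Friday."""
    start = datetime.strptime(trip["starttime"], "%Y-%m-%d %H:%M:%S")
    is_weekday = start.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and trip["usertype"] == "Subscriber"

commuter_trips = [t for t in trips if is_commuter_trip(t)]
print(len(commuter_trips), "of", len(trips), "trips kept")
```

Only the Monday and Friday trips by registered users survive the filter; the Saturday trip and the unregistered user's trip are dropped.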

Fig1. Histogram, number of trips by trip duration

As can be seen in Fig1, most trips are relatively short. To be precise, the 50th percentile is 9.9 minutes and the 75th percentile is 16.1 minutes, showing a clear preference among commuters for short trips when using Citibike. The next question is: is there a relationship between the distance between stations (Euclidean, in kilometers) and the time it takes users to bike between them?
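One way to obtain a straight-line distance in kilometers from station coordinates is to project latitude and longitude onto a flat plane and take the Euclidean distance. The sketch below assumes an equirectangular projection, which is a reasonable approximation at city scale; the exact method behind the figures in this post is not specified, and the function name and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def station_distance_km(lat1, lon1, lat2, lon2):
    """Straight-line distance in km using an equirectangular projection."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    # East-west offset shrinks with latitude, north-south offset does not.
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_KM
    dy = math.radians(lat2 - lat1) * EARTH_RADIUS_KM
    return math.hypot(dx, dy)

# Hypothetical station coordinates in Manhattan, for illustration only.
same_station = station_distance_km(40.75, -73.99, 40.75, -73.99)  # circular trip
north_south = station_distance_km(40.75, -73.99, 40.76, -73.99)
print(round(same_station, 3), round(north_south, 3))
```

A trip that starts and ends at the same dock has a distance of zero no matter how long it lasted, which matters when comparing distance against duration.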

Fig2. Linear relationship between distance and trip duration for a random sample of 10,000 trips

To generate Fig2, I selected a random sample of 10,000 trips from the data; this allowed for statistically significant results while reducing computation time. Fortunately, the sample shows a relationship between travel time and distance. Nevertheless, we can also see that the error increases with travel time. This is because some users bike for a long time and return to the same station (circular trip), while others use their time to go as far as they can (linear trip). As the time increases, the difference in distance between circular and linear trips is likely to grow. Regardless, we have a decent fit, showing that the longer the trip lasts, the further the user went. To confirm this, we take a look at a random station (with over 19,000 trips) to see whether, among the trips starting at that dock, closer docks receive more trips than more distant ones (the bigger the blue dot, the more trips the dock received from the starting dock throughout 2017).
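The sampling and fitting step can be illustrated with synthetic data. Everything in this sketch is an assumption made for illustration: the trips are generated rather than taken from Citibike, linear trips are assumed to cover about 0.2 km per minute, 5% of trips are assumed to be circular, and `fit_line` is a hand-written least-squares helper rather than the tool used for Fig2.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(42)  # fixed seed so the sample is reproducible

# Synthetic (duration in minutes, distance in km) pairs, NOT Citibike data:
# linear trips cover about 0.2 km per minute, circular trips end where they began.
trips = []
for _ in range(50_000):
    minutes = rng.uniform(2, 45)
    circular = rng.random() < 0.05
    km = 0.0 if circular else 0.2 * minutes + rng.gauss(0, 0.3)
    trips.append((minutes, km))

sample = rng.sample(trips, 10_000)  # random sample without replacement
durations = [m for m, _ in sample]
distances = [d for _, d in sample]
intercept, slope = fit_line(durations, distances)
print(f"distance = {intercept:.2f} + {slope:.3f} * minutes (fitted)")
```

Because circular trips always have zero distance, their gap from the fitted line widens as duration grows, which is the same pattern of increasing error described above.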

Fig3. Origin-destination trips in random Citibike dock, where the size of the destination point represents the number of trips received from the starting dock.

As can be seen in the previous image, distance appears to be relevant to the number of trips made between stations. To confirm this statistically, we take another random sample of 10 origin docks (10 red dots, like the one in Fig3, each above the minimum threshold of origin trips and located in different neighborhoods of Manhattan), representing over 500,000 trips, and fit a linear regression between distance (in kilometers) and number of trips to the destination docks. These are the results:

Fig4. Linear fit for linear model predicting number of trips given a distance between stations.
Table 1. OLS regression results

As seen in the model, the distance between docks explains only 15.5% of the total variance in the number of trips. Nevertheless, it is important to note that as the distance increases the error decreases, meaning that the longer the trip is (in distance), the better you can predict the number of trips using only the distance between stations as the explanatory variable. This is useful when the objective is to find a “maximum acceptable distance”. From the model parameters (both significant), we know that the intercept is 230.06 trips and that every additional kilometer between stations costs 51.47 trips, which lets us calculate the distance at which the predicted number of trips reaches zero: 4.46 kilometers.

To confirm these results, I performed a Spatial Kernel Density Estimator (SKDE) analysis for each of the 10 selected stations. Put simply, the SKDE calculates the density of observations inside a kernel (a region of space), so if trips are concentrated in a given area, the results will be significant. If, on the contrary, trips are distributed randomly through space, the coefficient will be non-significant. For the selected stations, the results are significant at a 5% alpha for all starting docks, with the highest density of trips located near the starting dock. Below is an example for the same station used in Fig3.

Fig5. Spatial Kernel Density Estimator for randomly selected starting dock.
Table 2. Spatial Kernel Density Estimator results for 10 randomly selected starting docks (the same used in the OLS model)
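The zero-trip distance from the linear model in Table 1 comes from setting the predicted number of trips to zero and solving for distance. The sketch below uses the two coefficients as quoted in the text; the constant and function names are hypothetical. With these rounded coefficients the result is about 4.47 km, marginally above the 4.46 km quoted in the text, presumably because that figure was computed from unrounded coefficients (an assumption).

```python
# Coefficients as reported for the linear model (rounded to two decimals).
INTERCEPT_TRIPS = 230.06      # predicted trips at zero distance
SLOPE_TRIPS_PER_KM = -51.47   # change in predicted trips per extra kilometer

def predicted_trips(distance_km):
    """Linear model: trips = intercept + slope * distance."""
    return INTERCEPT_TRIPS + SLOPE_TRIPS_PER_KM * distance_km

# Setting predicted trips to zero and solving for distance.
zero_trip_distance_km = -INTERCEPT_TRIPS / SLOPE_TRIPS_PER_KM
print(round(zero_trip_distance_km, 2))
```

Beyond this distance the line would predict a negative number of trips, so the value is best read as the point where the linear model stops being meaningful rather than as a hard physical limit.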

Discussion

The success of Citibike cannot be overstated; generating around 17 million trips in 2017, it clearly has a positive impact on NYC mobility and the environment. Nevertheless, it still accounts for only a small fraction of all the trips made in NYC in a year, primarily because residents prefer other modes of transportation. One of the main reasons users choose other modes is that their daily commute is too long, and a better way to travel likely exists. As shown above, if the distance between stations is more than 4.46 km, the linear model predicts that no trips will occur.

When a location has a high density of jobs and households, the distance between work and home is likely to be short. Citibike is successful because New York, and particularly Manhattan, is very dense in both housing and jobs. For example, according to the Census Bureau, Manhattan had 2,408,160 jobs in 2015 in an area of 59.1 square kilometers, an average density of 40,747 jobs per square kilometer. This is an extremely high job density (the highest in the US). Seen in this context, Citibike shows how dense the job market has to be for bike sharing systems to be a real alternative for commuters. Understanding this is key for cities that want to implement this kind of system: to be successful (in terms of number of users), a system requires a short average commuting distance. In the case of New York, the maximum distance is 4.46 kilometers; this number could be different in other cities, though I would not expect it to vary much. To test this, further work could be done using Santander Cycles data from London.
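Applying the metric to another city amounts to asking what share of commutes falls within the maximum acceptable distance. The sketch below assumes a hypothetical list of ten commute distances; the numbers are invented for illustration and do not describe any real city.

```python
MAX_ACCEPTABLE_KM = 4.46  # zero-trip distance from the New York linear model

# Hypothetical commute distances (km) for ten residents of some other city.
commute_km = [1.2, 2.5, 3.1, 3.9, 4.4, 5.0, 6.8, 8.3, 11.5, 14.0]

within_reach = [d for d in commute_km if d <= MAX_ACCEPTABLE_KM]
potential_share = len(within_reach) / len(commute_km)
print(f"{potential_share:.0%} of commutes are within {MAX_ACCEPTABLE_KM} km")
```

In this made-up example, half of the commutes are short enough for bike sharing to be an option, which puts an upper bound on the usage rate before any other variable is considered.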
