Data Footprint of Bike-Sharing in London
IoT and Transport Big Data Analysis
In 2005 the city of Lyon introduced what would later be called the third generation of bike-sharing. This is one of the greatest transport innovations, with secondary benefits to health and environment, and, I believe, a linchpin of sustainable urbanization. Nowadays, one million bikes are part of bike-sharing programs in the world, with a steep upward trend.
The foundation of the third generation of bike-sharing is digital and communication technology that is finding increasing use in the so-called Internet of Things (IoT). At each location where bikes are picked up, the individual bikes are tracked, such that each rental generates a digital footprint: which bike, from where, to where, for how long, at what date and time, by whom. The physical things and the use thereof are mirrored in a database, which grows by one line per rental.
The city of London introduced a bike-sharing program in 2010, which by 2015 included 742 bike stations, 11,500 bikes and 10 million annual rentals, making it the second-largest bike-sharing program in Europe and part of the global top ten (Chinese cities have become the giants of bike-sharing programs in recent years). The bikes in London are called Santander Bikes after the Spanish bank that has sponsored the program since 2015, but they are also known as Boris Bikes after Boris Johnson, who championed the program during his tenure as Mayor of London.
Data is an asset and a source of insight if handled right. Open data programs are ways in which governments, NGOs, academia and companies make part of their data available in a more or less consistent format to the public. The city of London has made the data on bike-sharing open, among other transportation data. In what follows I illustrate some (far from all) ways in which the analysis and visualization of transport data from IoT technology can generate high-level to fine-grained insight.
A Comment on Approach to Analysis
In my experience, the most informative data analysis begins with the discovery of general trends involving a minimum of variables. The analysis next moves to find contrasts between the general and the increasingly specific in order to understand the variations and heterogeneity in the data. As this is done the data is progressively segmented over the different variables, and unless there is ample data to support the effort, conclusions become increasingly speculative, but still useful for hypothesis creation to guide further study and experimentation.
I will follow this approach below as I analyze, visualize and progressively attain a more detailed understanding of bike-sharing in London in 2015. To reach the more complex and interesting analysis more quickly, I will however leave out some details in the first part.
Weekdays and Weekends and Hour of the Day
Heat maps are a useful means of visualizing the coupling between two discrete variables. The heat map below shows how the total number of bike rentals is distributed over weekday and hour of the day.
Saturdays and Sundays show a distinct pattern compared to Mondays through Fridays. Weekend usage is spread out from noon to late afternoon. In contrast, weekdays are bi-modal: a large peak of rentals between 7 and 9 in the morning, and another peak around 17 and 18 in the evening. If we group the days into these two classes and compute the average daily use per hour, the two histograms below are obtained.
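The counting behind a heat map and the two histograms can be sketched in a few lines. This is a minimal illustration in plain Python, not the Pandas code used for the article; the timestamps are made up, and the choice to average over the dates that appear in the data is an assumption.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Made-up rental start times (illustrative only, not real data).
start_times = [
    "2015-06-01 08:05",  # Monday
    "2015-06-01 08:40",  # Monday
    "2015-06-01 17:30",  # Monday
    "2015-06-02 08:15",  # Tuesday
    "2015-06-06 13:10",  # Saturday
    "2015-06-07 13:45",  # Sunday
    "2015-06-07 15:20",  # Sunday
]
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in start_times]

# Heat-map cells: number of rentals per (weekday, hour), with Monday = 0.
heat = Counter((t.weekday(), t.hour) for t in parsed)


def day_class(t):
    """Classify a timestamp as falling on a weekday or a weekend."""
    return "weekend" if t.weekday() >= 5 else "weekday"


# Average rentals per hour for each class of day: total rentals in that
# hour divided by the number of distinct dates seen for that class.
totals = Counter((day_class(t), t.hour) for t in parsed)
dates = defaultdict(set)
for t in parsed:
    dates[day_class(t)].add(t.date())
hourly_mean = {key: n / len(dates[key[0]]) for key, n in totals.items()}

print(heat[(0, 8)], hourly_mean[("weekday", 8)])
```

With a full year of data every date appears, so dividing by the number of observed dates gives the average daily use per hour for each class.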
A hypothesis to explain this contrast comes to mind: weekday usage is mostly for commuting to and from work, while weekend usage is mostly for middle-of-the-day leisure trips including some late night trips from pub to home. I will revisit this hypothesis later.
Impact of Winter Gloom on Work and Play
Month-to-month variation in rentals is visualized below. Given the generous amount of cold rain during London winters, and the quite unpleasant feeling of soaked trousers, it is no surprise to observe a reduced number of bike rentals during the winter months.
There is also a minor but significant trend that weekend rentals are affected more by season than weekday ones. During the winter months, weekend rentals constitute one-fifth of total rentals, while in the summer months, these types of rentals constitute one-fourth of all rentals.
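The weekend share per month is a simple ratio. The sketch below uses invented daily counts, chosen only so that the toy result mirrors the one-fifth and one-fourth shares quoted above; the variable names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Invented (date, rentals that day) pairs; the counts are not real data.
daily = [
    (date(2015, 1, 9), 400),   # Friday
    (date(2015, 1, 10), 100),  # Saturday
    (date(2015, 7, 10), 600),  # Friday
    (date(2015, 7, 11), 200),  # Saturday
]

weekend = defaultdict(int)
total = defaultdict(int)
for day, rentals in daily:
    total[day.month] += rentals
    if day.weekday() >= 5:  # Saturday = 5, Sunday = 6
        weekend[day.month] += rentals

# Fraction of each month's rentals that took place on a weekend.
weekend_share = {month: weekend[month] / total[month] for month in total}
print(weekend_share)
```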
This again suggests, but does not prove, that weekend rentals are more leisurely in nature, and hence more “elastic” to perturbations from wind, temperature, precipitation or amount of daylight. The unfavorable conditions of winter are, on the other hand, met with a stiff upper lip when going to and from the workplace.
A View of The Network of Stations and Clusters
Each bike rental has a start and end station. A rental therefore defines an edge in a directed network. The visualization of complex network structures is as much art as science, and the image below is no exception.
Each circle in the image represents a node in the network, in other words one of the 742 bike rental stations. The size of the circle depends on the number of other rental stations connected to it, a property called the degree in graph theory. Each line represents the rentals that begin at one of the two stations and end at the other, where the width depends on the number of rentals that connect the two stations, and the color depends on the median time of day at which the rentals take place (green = most trips early in the day, red = most trips late in the day). Any connection that includes fewer than 50 rentals is not shown. Still, two-thirds of all trips are accounted for with just 15% of all edges in the network. Finally, the nodes that are highly connected have been put close together.
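The quantities behind such an image (edge weights, the rental threshold, node degree, and the share of trips carried by the retained edges) can be sketched as follows. The trips are made up, the threshold is lowered from 50 to suit the toy data, and counting the degree over all edges regardless of direction is an assumption rather than a documented choice.

```python
from collections import Counter, defaultdict

# Made-up (start_station, end_station) pairs, one per rental.
trips = [("A", "B")] * 5 + [("B", "A")] * 4 + [("A", "C")] * 2 + [("C", "D")]

MIN_RENTALS = 3  # the article uses 50; lowered here for the toy data

# Weight of each directed edge: the number of rentals along it.
edge_counts = Counter(trips)

# Edges that survive the threshold and would be drawn.
kept = {edge: n for edge, n in edge_counts.items() if n >= MIN_RENTALS}

# Degree: number of distinct other stations a station is connected to,
# ignoring direction and ignoring trips that return to the same station.
neighbours = defaultdict(set)
for start, end in edge_counts:
    if start != end:
        neighbours[start].add(end)
        neighbours[end].add(start)
degree = {station: len(others) for station, others in neighbours.items()}

trip_share = sum(kept.values()) / len(trips)  # fraction of trips shown
edge_share = len(kept) / len(edge_counts)     # fraction of edges shown
print(degree, trip_share, edge_share)
```

In the toy data half of the edges carry three quarters of the trips, the same kind of concentration as the two-thirds of trips on 15% of edges reported above.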
This representation suggests there is one large cluster of stations centered around two dominant source stations with both early and late connections. There is also a smaller cluster at the top of the image of densely connected stations with median times in the middle of the day. There are also several peripheral stations, which mostly form connections with a small number of other stations without forming distinct clusters. Every station can be reached from every other station, though, so no subset of nodes is separated from the rest, a property of the network called strongly connected.
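Strong connectivity of a directed network can be checked with two graph searches: every node must be reachable from an arbitrary root, and the root must be reachable from every node (equivalently, every node is reachable from the root in the reversed network). The sketch below is one standard way to do this in plain Python; the function names and the toy edges are illustrative assumptions.

```python
from collections import defaultdict, deque


def reachable(adjacency, source):
    """Return the set of nodes reachable from source by breadth-first search."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_strongly_connected(edges):
    """True if every node can reach every other node along directed edges."""
    forward, backward = defaultdict(set), defaultdict(set)
    nodes = set()
    for start, end in edges:
        forward[start].add(end)
        backward[end].add(start)  # same edge, reversed
        nodes.update((start, end))
    if not nodes:
        return True
    root = next(iter(nodes))
    return reachable(forward, root) == nodes and reachable(backward, root) == nodes


print(is_strongly_connected([("A", "B"), ("B", "C"), ("C", "A")]))
print(is_strongly_connected([("A", "B"), ("B", "C")]))
```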
An image like this gives a high-level view that points in directions to look for contrasts. These types of images can be deceiving though. One of the many possible rearrangements of nodes is almost certain to confirm our biases about the data rather than challenge them. Visualizations are indispensable in data analysis, yet their flexibility is also their weakness. Careful judgement and statistical analysis, where possible, are the antidote.
What follows next is a detailed view on select properties of the network of bike-sharing rentals of London.
Center of the Action: City Centre, Westminster and King’s Cross, Waterloo Station
Since each station has a location in the city, an intuitive way to illustrate the “transactions of bikes” between the nodes is shown in the dynamic image below.
On the map of London, each station is shown as a pie chart. The size of the pie depends on the number of rentals that begin or end at that station. The pie contains two colors: green for rentals that start at that station, pink for rentals that end at that station. Finally, the image is a GIF, which shows the dynamics over the hours of the day on a typical weekday.
First, the bi-modal quality of weekday bike-sharing is visible: a big burst of rentals in the morning, and another burst of rentals in the evening. Second, as in the abstract network image, there are two stations that stand out by being the largest of all during both morning and evening. They are:
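The numbers behind each pie are two counts per station and hour. A minimal sketch, with invented rentals and a hypothetical helper named pie:

```python
from collections import Counter

# Invented weekday rentals: (start_station, start_hour, end_station, end_hour).
rentals = [
    ("Waterloo", 8, "City", 8),
    ("Waterloo", 8, "City", 8),
    ("King's Cross", 8, "City", 8),
    ("City", 17, "Waterloo", 17),
    ("City", 17, "King's Cross", 18),
]

# Rentals starting (green) and ending (pink) per station and hour.
starts = Counter((s, sh) for s, sh, _, _ in rentals)
ends = Counter((e, eh) for _, _, e, eh in rentals)


def pie(station, hour):
    """Return (size, green_fraction, pink_fraction) for one station and hour."""
    green, pink = starts[(station, hour)], ends[(station, hour)]
    size = green + pink
    if size == 0:
        return 0, 0.0, 0.0
    return size, green / size, pink / size


print(pie("Waterloo", 8), pie("City", 8))
```

In this toy data Waterloo is all green at 8 and the City is all pink, which is the morning pattern described for the real data.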
- King’s Cross railway station to the north, a major hub in the London transportation network — evidently Santander bikes included.
- London Waterloo railway station south of the Thames, it too a major hub in the London transportation network.
Unlike previous visualizations, this image captures joint properties of space and time. If we “stand back” and view the dominant color at a section of the city, areas that are mostly receiving or being a source of bikes become clearly visible. A few observations:
- The two railway stations are dominant as sources of bikes in the morning, while the City Centre on the north shore of the Thames and Westminster “bloom in pink” during the morning hours. The City Centre is the home of many businesses in London, including the finance center, and Westminster is where the government of the United Kingdom is situated.
- The inverse relation is observed in the evening, when the railway stations become big pink disks, and the City Centre and Westminster turn green.
- During the middle of the day, the stations in and around Hyde Park see the most traffic, quite evenly divided with respect to green and pink.
- Late in the day, most stations show little activity. The two areas in which activity remains somewhat longer are Soho (east of Hyde Park) and Shoreditch in the east part of London, which are areas known for their entertainment.
An identical visualization of the typical weekend is shown below.
Again, trends we have already established are visible. We also learn that weekends see far more activity around Hyde Park and Soho than in the City Centre. Again, we see Soho and Shoreditch having prolonged activity as far as bike rentals are concerned, possibly a sign that beers and bikes are not mutually exclusive.
If we aggregate the number of rentals per station, segment by weekday and weekend, and sort by count, the histogram below is obtained (after truncation). It reaffirms what the above analysis has found.
No Trajectory But Still An Idea of Direction
The rental data recorded by the London bike-sharing program only include when and where the trip started, and when and where it ended. In between these two points, no data exists. This precludes a study of the travel trajectory of the London cyclists. However, some idea of the direction of the flow can be found from the relative location of start and end stations.
The image above shows, for each station and each hour of the day, the dominant direction of the end stations relative to the given start station. For example, we see a large triangle in the morning for the King’s Cross station. The triangle points south, which means that most trips started in the morning at King’s Cross end at a station south of King’s Cross (at a bearing between 135 and 225 degrees). For stations that divide their trips evenly between directions, the dominant direction is not as dominant. This property is visualized by the saturation of the color. A high saturation means that nearly 100% of all trips are in the same direction, while a faded color means that the trips have no single dominant direction.
There is a great deal of heterogeneity in this image, but it affirms the idea of how the two railway stations are sources for morning commuters going to the City Centre and Westminster. It is interesting to note that in the evening, the rentals made in the City Centre are numerous, but without a single dominant direction. This suggests that bike rentals at the end of the workday in the City Centre are not only for trips back to the railway stations.
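A sketch of how a dominant direction and its saturation could be computed: take the initial compass bearing from the start station to each end station, bin the bearings into four 90-degree sectors (south being 135 to 225 degrees, as above), and report the most common sector with its share of trips. The coordinates are rough, made-up values near King’s Cross, and the four-sector binning is an assumption about how the image was built.

```python
import math
from collections import Counter


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(x, y)) % 360.0


def sector(bearing):
    """Map a bearing to N (315-45), E (45-135), S (135-225) or W (225-315)."""
    return ["N", "E", "S", "W"][int(((bearing + 45.0) % 360.0) // 90.0)]


def dominant_direction(start, ends):
    """Return (sector, share of trips in that sector) for one start station."""
    counts = Counter(sector(bearing_deg(*start, *end)) for end in ends)
    direction, n = counts.most_common(1)[0]
    return direction, n / len(ends)


# Approximate, illustrative coordinates (latitude, longitude).
start = (51.530, -0.123)
ends = [(51.515, -0.120), (51.510, -0.125), (51.505, -0.118), (51.531, -0.090)]

direction, share = dominant_direction(start, ends)
print(direction, share)
```

The share plays the role of the color saturation: close to 1 when nearly all trips head the same way, close to 0.25 when trips spread evenly over the four sectors.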
Preference For Bike Rides: Less Than Two Kilometers
Without trajectories of bikes in transit, the travel distance cannot be known. But again, a rough idea can be obtained by determining the distance as the crow flies between start and end stations. The blue line below is the histogram of that distance.
The distribution peaks at 1.2 kilometers (0.75 miles), which for reference is about the same distance as from Big Ben to Buckingham Palace.
In order to fully appreciate this data, a baseline is needed to compare against. As I have discussed elsewhere, defining baselines for a comparison is not trivial, but important. For this case I choose the following baseline: let the 10 million rentals start uniformly over London, and end at a station that is chosen uniformly and independently of the start station. Under these uniform and independent conditions, the distribution of travel lengths shown in orange above is obtained.
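Both curves can be sketched with the haversine formula for the distance as the crow flies. For the baseline I assume here that "uniformly" means every station is equally likely as start and as end, so that every ordered pair of stations (same-station pairs included) is equally likely and the baseline can be enumerated exactly instead of simulated. The station coordinates and trips are made up.

```python
import math
from collections import Counter
from itertools import product

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def histogram(distances, bin_km=0.5):
    """Fraction of trips per distance bin, keyed by the bin's lower edge."""
    counts = Counter(int(d // bin_km) for d in distances)
    return {b * bin_km: n / len(distances) for b, n in sorted(counts.items())}


# Hypothetical stations (latitude, longitude) and observed trips.
stations = {"P": (51.500, -0.120), "Q": (51.505, -0.120), "R": (51.530, -0.120)}
trips = [("P", "Q"), ("Q", "P"), ("P", "P"), ("P", "R")]

# Observed distances, one per rental.
actual = [haversine_km(*stations[s], *stations[e]) for s, e in trips]

# Baseline: all ordered pairs of stations, each equally likely.
baseline = [haversine_km(*stations[s], *stations[e])
            for s, e in product(stations, repeat=2)]

print(histogram(actual))
print(histogram(baseline))
```

Comparing the two histograms bin by bin shows where actual trips are over- or under-represented relative to independent, uniform station choice.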
With a baseline defined, we can make further sense of the blue curve. It clearly shows that actual bike trips tend to be displaced towards shorter separations. Trips above two kilometers certainly occur, but they are less frequent than the baseline. A reasonable hypothesis is that for trips above two kilometers there is a preference among bike users to use other means of transportation (e.g. bus, subway). Bike-sharing is a way to “fill in the gaps” between subway/bus stations and work. There is an important deviation, though: the relatively high number of trips that start and end in the same station, what in graph theory is called loops or self-loops. In part, they can be explained as leisure trips.
Hyde Park — A Small World of Its Own
In the artistic network image above, there was one large and one small cluster visible. The small cluster turns out to consist of stations in and around Hyde Park. Because they are close to each other we expect, from the previous analysis, that they will be connected, so no surprise there. But it turns out that the connections they form are different from the typical connections, quantitatively speaking.
The diagram below shows the degree of each station in the network (how many other stations it connects to) and its relation to the number of rentals the given station is involved in. The bulk of the data points follows a linear correlation: the greater the number of other stations a given station connects to, the more rentals take place at that station, or vice versa. There are however some points that have a lower degree than the number of rentals suggests. If all stations that are part of either Hyde Park or Kensington Gardens (immediately to the west of Hyde Park) are colored green, the pattern in the diagram below is found.
In other words, the stations in and around the largest park in London generate a large number of rentals, but the flow of bikes does not spread to as many other stations as one would expect based on trends seen for other stations. A closer inspection of the data shows that bikes rented in and around Hyde Park and Kensington Gardens end in the same set of stations more often than bikes rented at typical stations. Hyde Park is clearly a small world of its own where leisure bike rides take place, mostly during weekend afternoons, summers in particular.
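One way to make "a lower degree than the number of rentals suggests" quantitative is to fit a straight line of degree against rentals and look at the residuals. The sketch below uses ordinary least squares written out in plain Python; the station names and numbers are invented, and a residual analysis is my illustration rather than the method behind the diagram.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented (rentals, degree) per station; "Park" mimics a park station
# with many rentals spread over comparatively few other stations.
stations = {
    "S1": (1000, 100),
    "S2": (2000, 200),
    "S3": (3000, 300),
    "S4": (4000, 400),
    "Park": (4000, 150),
}

xs = [rentals for rentals, _ in stations.values()]
ys = [degree for _, degree in stations.values()]
slope, intercept = fit_line(xs, ys)

# Residual: observed degree minus the degree the linear trend predicts.
residuals = {name: degree - (slope * rentals + intercept)
             for name, (rentals, degree) in stations.items()}

lowest = min(residuals, key=residuals.get)
print(lowest, round(residuals[lowest], 1))
```

Stations with strongly negative residuals are the ones that fall below the bulk of the points in a degree-versus-rentals diagram.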
Descriptive to Predictive to Value Creation by (Open) Data
The data analysis so far has been descriptive. Through it we discover properties and trends of past bike sharing in London. This itself is valuable. I will let this conclude the analysis, and only describe how the analysis of this particular (or any general) transportation data could continue.
The logical next step is to define a model that quantitatively embodies the hypotheses proposed following the descriptive analysis. For example, based on the analysis above, we could construct a mathematically simple model that predicts the number of rentals at any given station for any given month, weekday and hour of the day. With that in hand, we could proceed to discover specific days that deviate from the trend (bank holidays, Chelsea playing in a Champions League game, storms from the Atlantic, construction work) and gradually refine the model for increased accuracy.
That however requires the incorporation of data from other sources. A great next step in the data analysis would be to suggest new stations for bike rental. But in order to do that we have to go beyond the historical rental data. Now the question is: what conditions of the city and of a particular neighborhood make it conducive to a large number of satisfied bike renters if we build a new station there? The model must incorporate other variables: demographics, the subway network, social media sentiment and so on. This puts the focus on value creation: to optimize the conditions for the people commuting, not just the means by which they do part of their commute. Data analysis is one approach to this, but it is only possible when multiple sources of data can come together, which is again a reason for open data.
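A very simple instance of such a model, sketched under my own assumptions, predicts the historical mean count for each combination of station, month, weekday and hour, and flags observations that deviate from that mean by more than a relative threshold. The function names, the threshold and the counts are all hypothetical.

```python
from collections import defaultdict


def fit_mean_model(history):
    """Mean rentals per key, where key = (station, month, weekday, hour)."""
    sums, ns = defaultdict(float), defaultdict(int)
    for key, count in history:
        sums[key] += count
        ns[key] += 1
    return {key: sums[key] / ns[key] for key in sums}


def flag_deviations(model, observations, threshold=0.5):
    """Labels of observations whose relative deviation exceeds the threshold."""
    flagged = []
    for label, key, count in observations:
        expected = model.get(key)
        if expected and abs(count - expected) / expected > threshold:
            flagged.append(label)
    return flagged


# Invented counts for Waterloo, June, Mondays (weekday 0), 8 o'clock.
key = ("Waterloo", 6, 0, 8)
history = [(key, 100), (key, 110), (key, 90)]
model = fit_mean_model(history)

new_days = [("2015-06-22", key, 105), ("2015-06-29", key, 30)]
flagged = flag_deviations(model, new_days)
print(model[key], flagged)
```

Flagged days are candidates for an explanation outside the rental data, which is where additional data sources come in.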
- All visualizations, except the abstract network, are done with Tableau Public.
- The network visualization is done with Cytoscape.
- All operations are done using the very flexible Pandas library for Python; calculations are otherwise implemented from basic Python and NumPy functions.
- Data is obtained for 2015 in the CSV file format after a brief registration with the Transport for London data portal.