It is well known that cycling is one of the most efficient modes of transport in dense urban areas. Riding a bike gives one of the shortest commuting times when the distance is under 10 km. Buenos Aires is a perfect city for riding a bike: it is sunny, it is almost flat, and its streets form an almost regular grid of parallel and perpendicular lines, which makes orientation easy while riding. There are two main ways to ride a bike in BA: on your own bicycle or using the public bike-sharing network.
In this post we will focus on the latter by analyzing the open dataset of public bicycle usage in the city of Buenos Aires for the year 2018.
Previously, my colleagues Fede Catalano and Billy Mosse described the spatio-temporal behavior of the network and also inferred hidden social patterns. Here we will study the detection of station communities. Some of the questions that arise are: Are all the stations equal? Are some more “special” than others? If so, what are the main differences between them? How can we find a similarity measure between stations?
Building the graph
The first step to understand the communities of stations is to build a network where each node is a station and each edge between two stations is weighted by the number of bicycle journeys between them. Our station network has 200 nodes and … 38483 edges! Every journey between the same pair of stations increases the edge weight, so the more journeys between a pair of stations, the bigger the weight between those nodes.
Once we have the list of edges (from node A to node B and so on) we build the adjacency matrix. This symmetric matrix holds, at each position, the edge weight for the pair of nodes given by its row and column. So position “i-j” of the matrix contains the number of journeys between node “i” and node “j”.
Adjacency matrices are a powerful representation of graphs that lets us explore graph properties using mathematical methods, for example finding communities within the network. A community (aka “cluster”) is defined as a group of nodes whose similarity to each other is higher than their similarity to any node outside the community. In our case, the similarity between two nodes is the number of journeys between them.
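As a rough illustration of this step, here is a minimal sketch that turns a list of trips into a symmetric weighted adjacency matrix. The trip records and the `build_adjacency` helper are hypothetical (they are not the real dataset schema), and we assume that trip direction is ignored and that trips starting and ending at the same station are dropped.

```python
from collections import Counter

import numpy as np

# Hypothetical trip records: (origin station, destination station).
trips = [
    ("Planetario", "Plaza Italia"),
    ("Plaza Italia", "Planetario"),
    ("Planetario", "Plaza Italia"),
    ("Retiro", "Plaza Italia"),
]


def build_adjacency(trips):
    """Return (stations, A), where A[i, j] counts journeys between stations i and j."""
    stations = sorted({name for trip in trips for name in trip})
    index = {name: i for i, name in enumerate(stations)}
    # Assumption: A->B and B->A count towards the same undirected edge,
    # and same-station round trips are ignored.
    weights = Counter(frozenset((a, b)) for a, b in trips if a != b)
    A = np.zeros((len(stations), len(stations)), dtype=int)
    for pair, weight in weights.items():
        i, j = (index[name] for name in pair)
        A[i, j] = A[j, i] = weight
    return stations, A


stations, A = build_adjacency(trips)
```

Each repeated journey only bumps a counter, so the matrix stays small (one row and one column per station) no matter how many trips the dataset contains.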
With the graph built, we can load it into the Gephi tool and run a modularity analysis. It estimates the number of communities and how compact they are. Modularity is a score that typically falls between 0 and 1 (it can be slightly negative for partitions that are worse than random). The closer it is to 1, the more strongly connected the communities are internally and the fewer connections they have with nodes from other communities. Modularity close to 0 corresponds to the opposite. Our graph does not have strong modularity, so its communities are not well defined.
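To make the score more concrete, here is a small sketch of the standard (Newman) modularity formula for a partition that is already known. It is not what Gephi runs internally, since Gephi also searches for the partition; the `modularity` function and the toy graph (two triangles joined by one edge) are our own illustration.

```python
import numpy as np


def modularity(A, labels):
    """Newman modularity Q of a given partition of an undirected weighted graph."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)                          # weighted degree of each node
    two_m = k.sum()                            # twice the total edge weight
    same = labels[:, None] == labels[None, :]  # True where i and j share a community
    expected = np.outer(k, k) / two_m          # expected weight under a random null model
    return ((A - expected) * same).sum() / two_m


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])  # 5/14, about 0.357
q_single = modularity(A, [0] * 6)            # 0: one big community is no better than random
```

Splitting the toy graph into its two triangles scores clearly above zero, while putting every node in one community scores exactly zero.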
We can also look at its degree distribution.
From the adjacency matrix we can go on to discover communities using Spectral Clustering (SC), one of the most popular methods for finding clusters in graphs, based on eigen-decomposition. The pseudo-code of an SC algorithm is:
- Compute the Adjacency Matrix A
- Compute the Degree Diagonal Matrix D
- Get the Laplacian Matrix by L = D-A
- Apply eigen-decomposition on L to obtain its corresponding eigen-values and eigen-vectors.
- Take the “n” smallest eigen-values and select their corresponding “n” eigen-vectors.
- Use the selected eigen-vectors as the coordinates of a subspace and then apply the k-means algorithm in it to obtain the clusters.
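The steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not the exact code used for the analysis: the `kmeans` helper is a plain Lloyd's algorithm with a deterministic farthest-point initialisation that we chose for reproducibility, and the toy graph (two triangles joined by one edge) stands in for the real station network.

```python
import numpy as np


def kmeans(X, k, n_iter=100):
    """Plain Lloyd's k-means with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        # Pick the point farthest from all centers chosen so far.
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_clustering(A, k):
    """Cluster the nodes of an undirected graph using the Laplacian L = D - A."""
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))                     # degree diagonal matrix
    L = D - A                                      # unnormalised Laplacian
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigenvalues in ascending order
    embedding = eigenvectors[:, :k]                # eigenvectors of the k smallest eigenvalues
    return kmeans(embedding, k)


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

labels = spectral_clustering(A, k=2)
```

On this toy graph the two triangles end up in different clusters. `np.linalg.eigh` is used because the Laplacian of an undirected graph is symmetric, and it returns the eigenvalues already sorted from smallest to largest.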
If we project the adjacency matrix data onto two dimensions and label each station with its corresponding cluster, we can see our communities and how they are distributed. The clusters clearly overlap to some degree, since some samples (bike stations) sit right in the middle of two communities.
In addition, the number of stations per cluster is quite balanced: each community contains almost the same number of stations.
Now the most interesting part: finding social patterns within communities. First we transformed the raw data into a mathematical object (a graph), and from there we determined the clusters via mathematical methods. Now we want to figure out which social findings are hidden in our communities.
Our analysis reveals that communities are distributed by geographical location. This means that stations within each community are closer to each other than to stations from different communities.
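One simple way to get such a two-dimensional view is a PCA projection of the rows of the adjacency matrix. The sketch below assumes PCA; other projection methods would work as well, and the `project_2d` helper and the toy matrix are hypothetical.

```python
import numpy as np


def project_2d(A):
    """Project each row of A (one station) onto its first two principal components."""
    X = np.asarray(A, dtype=float)
    X = X - X.mean(axis=0)  # centre each column
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T     # shape: (number of stations, 2)


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

coords = project_2d(A)
```

Each row of `coords` can then be drawn as a point in a scatter plot and coloured by its cluster label.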
- Community 1 is correlated with Comuna 14 (Palermo) and Comuna 2 (Recoleta).
- Community 2 is correlated with Comuna 1 (Microcentro) and Comuna 4 (La Boca).
- Community 3 is correlated with Comuna 3 (San Cristobal), Comuna 5 (Boedo) and part of Comuna 6 (Caballito).
The stations with the highest number of trips between them are Planetario and Plaza Italia, with 2328 trips! We know that this area is very touristy and is perhaps one of the favorite choices of BA visitors :)
- The bicycle trips are relatively short. Perhaps they complement the other legs of a multi-modal journey rather than replace them.
- The logistics of bicycles (making sure that all stations have bikes available by moving them by truck) is the responsibility of the Ministry of Transport. Given the communities we found, these logistics could be organized around each cluster.
- Build a graph using the duration of each bike trip instead of only the number of trips.
- Analyze the “balance” between the number of bikes that enter a community and the number that leave it.
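As a starting point for the balance idea, here is a minimal sketch that counts, for each community, the bikes arriving from other communities minus the bikes leaving towards them. The trip records, the station names and the `community_of` mapping are all hypothetical, and trips that stay inside one community are ignored.

```python
def community_balance(trips, community_of):
    """Net bikes gained per community: arrivals from outside minus departures to outside."""
    balance = {community: 0 for community in set(community_of.values())}
    for origin, destination in trips:
        src, dst = community_of[origin], community_of[destination]
        if src != dst:  # trips inside a community do not change its balance
            balance[src] -= 1
            balance[dst] += 1
    return balance


# Hypothetical station-to-community assignment and directed trips.
community_of = {"A": 1, "B": 1, "C": 2}
trips = [("A", "B"), ("A", "C"), ("C", "A"), ("B", "C")]

balance = community_balance(trips, community_of)
```

A negative value means the community loses more bikes than it receives, so it is a candidate for truck refills.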