Clustering bus lines in upstate New York

Jdavitz
INST414: Data Science Techniques
4 min read · Apr 25, 2024

I started this assignment wanting to use clustering to find out how similar different bus lines in the DC area are to each other. This information would be useful to a planner or executive at WMATA, which is currently experiencing significant revenue shortfalls and is looking to reduce service in order to cut costs. Knowing which lines overlap would inform decisions about which services can be cut while still providing effective service to the largest possible area.

The data that would answer this question is a list of unique bus stops and a list of bus lines with every stop each line makes. This would allow identification of duplicate or similar lines that could be removed without a major impact on service for customers. I was not able to find this specific data for WMATA in the DC area; they only listed stops and bus lines, not the association between them. Eventually I found this type of data for a few bus lines in upstate New York. The dataset was smaller than I had hoped, but it came as a CSV from the State of New York open data website with a list of stops, details about each, and the bus lines servicing each stop.

Once I had downloaded the data, I needed to clean it by removing the extra columns, which were either blank or held information I did not need.

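# bus_stops is the DataFrame loaded from the downloaded CSV.
# Copy the two needed columns so renaming does not touch the original frame.
bus_stops_slim = bus_stops[['Stop ID', 'Routes']].copy()
bus_stops_slim.columns = ['stop_id', 'routes']
bus_stops_slim.head()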

This left me with a simple table in which each row contained the ID of a bus stop and the routes that service it.

Next I needed to split the routes column so I could access the individual routes serving each stop.

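def split_routes(row):
    # Turn the comma-separated routes string into a list of route names,
    # trimming any whitespace left around each name
    row.routes = [r.strip() for r in str(row.routes).split(',')]
    return row

bus_stops_slim = bus_stops_slim.apply(split_routes, axis=1)
bus_stops_slim.head()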

Once the routes had been split into a list that could be iterated over, I added each stop to a dictionary keyed by bus line, where each value is another dictionary mapping a stop to the number of times the line stops there. Finally, I turned this into a data frame, which gave me a matrix with one row per bus route and one column per stop.

stop_map = {}
for index, row in bus_stops_slim.iterrows():
    for route in row.routes:
        # Count how many times each route serves each stop
        this_route = stop_map.get(route, {})
        this_route[row.stop_id] = this_route.get(row.stop_id, 0) + 1
        stop_map[route] = this_route

# One row per route, one column per stop; stops a route never serves become 0
index = list(stop_map.keys())
rows = [stop_map[k] for k in index]
routes_df = pd.DataFrame(rows, index=index)
routes_df = routes_df.fillna(0)
routes_df
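The same route-by-stop matrix can also be built without an explicit loop. The following is only a minimal sketch, not the code used for this post. It runs on a small made-up table (the stop IDs and route numbers in `toy_stops` are invented for illustration) and assumes the routes arrive as comma-separated strings.

```python
import pandas as pd

# Toy stand-in for bus_stops_slim; the stop IDs and routes are made up
toy_stops = pd.DataFrame({
    "stop_id": [1, 2, 3],
    "routes": ["801, 807", "801", "807, 922"],
})

# One row per (stop, route) pair: split the string, then explode the lists
pairs = toy_stops.assign(routes=toy_stops["routes"].str.split(","))
pairs = pairs.explode("routes", ignore_index=True)
pairs["routes"] = pairs["routes"].str.strip()

# Count how often each route serves each stop; absent pairs become 0
toy_matrix = pd.crosstab(pairs["routes"], pairs["stop_id"])
print(toy_matrix)
```

Here `pd.crosstab` does the counting that the nested dictionaries do above, so the result has routes as rows and stops as columns in the same layout as `routes_df`.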

These are the features I use for clustering: routes can be compared based on the stops they service. More similar bus lines will have more stops in common and, in theory, will be clustered together. To measure similarity I am using scikit-learn’s KMeans algorithm, which relies on Euclidean distance.
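To make the idea concrete, here is a small sketch of how Euclidean distance and k-means behave on a route-by-stop matrix. The four routes and six stops are made up, and the `n_init` and `random_state` settings are my own choices for reproducibility rather than anything taken from the assignment code.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up route-by-stop matrix: rows are routes, columns are stops (1 = served)
toy_routes = np.array([
    [1, 1, 1, 0, 0, 0],  # route A
    [1, 1, 0, 0, 0, 0],  # route B, overlaps heavily with A
    [0, 0, 0, 1, 1, 1],  # route C
    [0, 0, 0, 0, 1, 1],  # route D, overlaps heavily with C
])

# Euclidean distance between two routes: shared stops contribute nothing,
# every stop served by only one of the two routes adds to the distance
dist_ab = np.linalg.norm(toy_routes[0] - toy_routes[1])
dist_ac = np.linalg.norm(toy_routes[0] - toy_routes[2])

# k-means assigns each route to the nearest centroid in Euclidean distance
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(toy_routes)
print(dist_ab, dist_ac, labels)
```

Routes A and B differ in one stop while A and C differ in all six, so A sits much closer to B than to C, and k-means groups A with B and C with D.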

To select k, I decided to try a small set of fixed values and ran the clustering for each k from 2 to 5. Immediately the clusters looked very unbalanced, with cluster 1 always having over 100 entries and the other clusters having fewer than 5. After looking at the results (below), I decided to continue with just k=3.

When I reran the clustering with only k=3, I got the following results:
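Trying several values of k and comparing the cluster sizes can be sketched roughly as follows. The matrix here is synthetic random data standing in for `routes_df`, so the sizes it prints say nothing about the Albany routes, and `n_init` and `random_state` are assumptions of mine.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 0/1 route-by-stop matrix standing in for routes_df (not real data)
rng = np.random.default_rng(0)
toy_matrix = (rng.random((30, 12)) < 0.2).astype(float)

sizes_by_k = {}
for k in range(2, 6):  # k = 2, 3, 4, 5
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_matrix)
    # Number of routes assigned to each cluster label
    sizes_by_k[k] = np.bincount(labels, minlength=k)
    print(k, sizes_by_k[k])
```

Printing the sizes side by side makes an imbalance easy to spot, such as one cluster holding most of the routes at every value of k.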

For cluster 0 there were 6 entries, for cluster 1 there were 134, and for cluster 2 there were 2.

In cluster 0, two entries I looked at were bus lines 922 and 923, which had some stops in common and both appeared to be commuter lines running from outside Albany, NY into the city.

In cluster 1, I looked at two of the results, the 807 and the 801. Both lines serve the outskirts of Albany, NY, and while I didn’t see any stops in common, they serve the same type of purpose.

In cluster 2, there were two small lines specifically serving downtown Albany, NY.
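Checking by eye whether two routes share stops is error-prone, so the overlap can also be read straight from the route-by-stop matrix. This is a sketch on a made-up matrix; `shared_stops` is a hypothetical helper I am introducing for illustration and is not part of the assignment code.

```python
import pandas as pd

# Made-up route-by-stop matrix in the same layout as routes_df
toy_df = pd.DataFrame(
    [[1, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 1]],
    index=["A", "B", "C"],
    columns=[101, 102, 103, 104],
)

def shared_stops(matrix, route_a, route_b):
    """Return the stop IDs served by both routes (hypothetical helper)."""
    both = (matrix.loc[route_a] > 0) & (matrix.loc[route_b] > 0)
    # Keep only the stops where the combined mask is True
    return both[both].index.tolist()

print(shared_stops(toy_df, "A", "B"))
print(shared_stops(toy_df, "A", "C"))
```

An empty list for two routes in the same cluster would confirm that they were grouped together for some reason other than direct overlap.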

The first major limitation of this analysis is the size of the transit system surrounding Albany, NY, which is where my data came from. I would have preferred a large metropolitan area like DC or NYC but had to work with what was available. Additionally, the only metric I used to cluster the lines was the stops they share. Some cities offer geographic data for their actual bus routes, so combining that with the stop data, or with information on when the lines are active or on ridership, would make the analysis considerably richer, though also more complex. I don’t believe bias was introduced into the clustering or the data, since it only involved bus routes.

GitHub: https://github.com/not-senate/414-assignment-4
