From Sensing to Sensemaking: Analyzing, Visualizing, and Modeling New York City’s Buses
This article, the third in a series, highlights the work of students in Cornell Tech’s Urban Tech master’s program leveraging high-frequency bus journey tracking data scraped from the New York MTA’s BusTime API. This was carried out during year-long small group & independent study projects in 2021.
Big data is the new oil!
No it’s not. I hate that metaphor more than most.
But streams of high-frequency sensor data, like the data I started scraping from the NY MTA’s BusTime API in October 2020, are a vital raw material for sensemaking.
Soon after I started scraping the BusTime API, to track the status of every bus in New York City every 60 seconds, I found an opportunity to share the data with our students at Cornell Tech as part of a unique year-long degree requirement, the “Specialization” course. This course, which spans students’ first and second year, puts students into small groups to pursue independent research projects in close consultation with an advisor. Students begin their work in the spring with a 3-credit commitment that focuses on exploring their research topic, homing in on a question, and collecting the tools and techniques they’ll need to carry out the work in the fall, when the commitment doubles to 6 credit-hours.
This bus data is rich. By the fall, we’d accumulated nearly a half-billion observations of up to 2,000 buses operating on more than 250 routes over nearly a full year. During that time, the bus system had seen the city start to bounce back from the public health crisis and economic slump of the pandemic, severe weather events including heat emergencies and historic, torrential downpours, and more. There was so much to explore in this data.
Below we take a quick look at the projects Cornell Tech’s urban tech graduate students carried out to analyze, visualize, and model the New York City bus system during 2021.
A diff Tool for Big Bus Data
While I’d built an API for delivering bulk data to student researchers, it wasn’t simple to turn that raw data—essentially a pile of breadcrumbs—into useful information. For instance, computing basic service metrics like travel speed required multiple steps of fetching data, transforming it into a clean dataframe, and computing summary statistics. What’s more, comparing different time periods or different routes added more complexity.
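To make those steps concrete, here is a minimal sketch of the last one: turning a single bus’s breadcrumbs into an average speed. It assumes each breadcrumb is a hypothetical `(unix_seconds, lat, lon)` tuple, which is not necessarily the shape our API returns, and it measures straight-line distance between pings, so it will understate the distance traveled along a winding route.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6_371_000  # mean Earth radius in meters
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))


def mean_speed_mph(breadcrumbs):
    """Average speed over one vehicle's (unix_seconds, lat, lon) pings.

    The tuple layout is an assumption made for this sketch.
    Returns None when there are too few pings or no elapsed time.
    """
    pings = sorted(breadcrumbs)  # order by timestamp
    if len(pings) < 2:
        return None
    distance_m = sum(
        haversine_m(a[1], a[2], b[1], b[2]) for a, b in zip(pings, pings[1:])
    )
    elapsed_s = pings[-1][0] - pings[0][0]
    if elapsed_s <= 0:
        return None
    return (distance_m / elapsed_s) * 2.236936  # meters per second to miles per hour
```

Summing ping-to-ping distances rather than measuring first-to-last keeps the estimate reasonable when a route doubles back on itself.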
diff is an old Unix command-line tool used to compare the contents of two files. Online versions of this trusty tool make it easy to see the differences between versions of a file, and diff is the backbone of version control systems like git. Could we use this approach to provide a simple web-based tool for comparing different slices of the bus data archive?
That was the thinking behind an interactive query and comparison tool developed by Jeremy Shaffer. With this tool, data seekers can construct two queries, with different filters for route and time period, and compute summary statistics comparing across them. This allows for comparisons on the same route at different times, or between different routes at the same time (or different routes at different times, though this is probably less useful).
Highlights of Jeremy’s work include:
- Dynamic retrieval of data from our API
- Computation of average travel speed and bunching, taking into account the unique geography of each route via published GTFS geometries
- An elegant, performant interactive dashboard implemented in Streamlit.
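The two-query idea can be sketched in a few lines. This is an illustration only, not Jeremy’s implementation: the record fields (`route`, `hour`, `speed_mph`), the function names, and the sample values are all made up for the example.

```python
from statistics import mean


def summarize(observations, route, start_hour, end_hour):
    """Filter by route and hour-of-day window, then summarize speed."""
    speeds = [
        o["speed_mph"]
        for o in observations
        if o["route"] == route and start_hour <= o["hour"] < end_hour
    ]
    return {"n": len(speeds), "mean_speed_mph": mean(speeds) if speeds else None}


def diff_summaries(a, b):
    """Return b minus a for every numeric metric present in both summaries."""
    return {
        key: b[key] - a[key]
        for key in a
        if isinstance(a[key], (int, float)) and isinstance(b.get(key), (int, float))
    }


# Made-up sample observations, for illustration only.
sample = [
    {"route": "M15", "hour": 8, "speed_mph": 6.0},
    {"route": "M15", "hour": 9, "speed_mph": 8.0},
    {"route": "M15", "hour": 18, "speed_mph": 10.0},
    {"route": "B46", "hour": 8, "speed_mph": 12.0},
]
morning = summarize(sample, "M15", 7, 10)
evening = summarize(sample, "M15", 16, 20)
print(diff_summaries(morning, evening))
```

Keeping the filter and the comparison as separate functions is what makes the "same route, different times" and "different routes, same time" cases fall out of the same code.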
Maps for Playback and Real-Time
While most of our students were enrolled in the Urban Tech master’s specialization, Jamie (Hao) Geng joined our group from Cornell Tech’s Connective Media program. Jamie’s work zeroed in on developing a way to grapple visually with the mass of data contained in our system. Naturally, this quickly led to the development of a map-based tool, with two “channels”:
- a “history” view* for playing back the hourly bulk data files provided by our API
- a “live” view* for fetching and viewing a weather-radar-style loop of the last 5 minutes of bus movements citywide (which we felt provided a more compelling visual experience than the snail-like pace of the animation on the MTA’s new interactive subway map)
We’re already thinking about how to further refine the “live” map as a public-facing visualization, and perhaps as a public display in the Bloomberg Center at Cornell Tech.
Highlights of Jamie’s work include:
- A performant React-based map leveraging deck.gl for rendering big, animated data sets — including a variety of options for viewing high-frequency urban data and derived metrics in summary and in detail
- A clever “loop” animation that shows “real-time” bus movements in a more informative and visually appealing way than an actual real-time display.
- A serverless (AWS Lambda) backend to retrieve and parse the latest bus positions from the BusTime API
*n.b. Because of some idiosyncrasies in the way Cornell’s campus network routes IP addresses for Amazon Web Services, some of these maps may not work well for Cornell users. For now, we suggest using an external network or personal hotspot until we can resolve the routing issues.
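For readers curious how a radar-style loop might be assembled, here is a rough sketch written in Python for readability (Jamie’s map itself is React and deck.gl). It assumes pings arrive as hypothetical `(unix_seconds, vehicle_id, lat, lon)` tuples, and the 60-second frame length is a choice made for this example, not a documented setting of the live view.

```python
def build_loop_frames(pings, now, window_s=300, frame_s=60):
    """Bucket the trailing window of pings into ordered animation frames.

    Each frame maps vehicle_id -> (lat, lon). The tuple layout and the
    default frame length are assumptions made for this sketch.
    """
    start = now - window_s
    frames = [dict() for _ in range(window_s // frame_s)]
    for ts, vehicle_id, lat, lon in sorted(pings):  # oldest first
        if start <= ts < now:
            index = int((ts - start) // frame_s)
            frames[index][vehicle_id] = (lat, lon)  # latest ping in a bucket wins
    return frames
```

Playing the frames in order and then jumping back to the first one gives the looping effect: five minutes of movement compressed into a few seconds.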
Predicting the Impact of Subway Disruptions on Bus Crowding
By far the most interesting development in bus data in 2020 was the introduction of passenger counts. This data was quickly exploited by the MTA and a variety of app providers to broadcast information on levels of crowding aboard buses during the pandemic. But Lars Kouwenhoven wondered whether the data could reveal other patterns in the bus system. As we explored the possibilities, Lars zeroed in on an important question: what happens to ridership on buses when service on nearby subway lines is disrupted, and can we predict that surge in crowding reliably?
Lars spent the spring developing a robust machine learning model to predict the number of passengers on future trips, as well as on trips made by buses without occupancy sensors. But work on the larger project of predicting the impacts of subway disruptions continues. The relative rareness of these events (just half of 1 percent of bus observations were flagged as being affected by a subway disruption), as well as the complexity of the relationships between the two network topologies, limited the model’s effectiveness. More data and more refinement are expected to improve it.
Highlights of Lars’ work include:
- Implementation and evaluation of an XGBoost-based machine learning pipeline to predict the number of passengers on buses in the future, as well as on the 60 percent of buses without occupancy sensors. (Though the model had some shortcomings, the mean absolute error was 4.09, and there was potential for future improvement.)
- Grabbing, parsing, and merging data from the MTA service alert feed to determine the period of subway disruptions and affected bus lines.
- Use of a cyclical time transformation to improve the model’s understanding of relationships between late night and early morning travel.
- Identification of interesting transit rider behaviors around subway disruptions, notably a surge in riders onto nearby buses just before the issuance of an unplanned subway disruption alert, indicating real-time reactions to the service change.
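The cyclical time transformation deserves a quick illustration. A common way to do it, sketched below, is to map the hour of day onto a circle with sine and cosine so that 11 p.m. and midnight end up as neighbors instead of 23 units apart. The source does not say exactly which time fields Lars encoded, so the hour-of-day feature and the helper names here are assumptions.

```python
from math import cos, hypot, pi, sin


def encode_hour_cyclical(hour):
    """Map an hour of day (0-23) to a point on the unit circle."""
    angle = 2 * pi * (hour % 24) / 24
    return sin(angle), cos(angle)


def encoded_distance(hour_a, hour_b):
    """Straight-line distance between two encoded hours."""
    ax, ay = encode_hour_cyclical(hour_a)
    bx, by = encode_hour_cyclical(hour_b)
    return hypot(ax - bx, ay - by)


# 23:00 and 00:00 are as close as 00:00 and 01:00 once encoded.
late_to_midnight = encoded_distance(23, 0)
midnight_to_one = encoded_distance(0, 1)
```

Both values are needed as features: sine alone cannot tell 3 a.m. from 9 a.m., and cosine alone cannot tell 6 a.m. from 6 p.m.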
Buswatcher-Insights: A Public Codebase for Machine Learning
Normally, the Specialization project spans a Spring semester and the following Fall semester. It’s kind of a cool setup, because students go off in the summer and come back with new ideas and sometimes big leaps from tinkering and noodling they’ve done in between stints in internships or on the beach.
Alexander Amy and Sanket Shah took another path, pivoting away from a separate Specialization project they’d worked on in the Spring and joining the bus data group in September for a crash course in transit tracking. Like Lars Kouwenhoven (see above), they intended to exploit the bus data to do an in-depth machine learning project.
As we explored possible experiments, a series of extreme weather events that had occurred in New York City over the summer was fresh in my mind: torrential downpours caused by the remnants of Hurricane Ida, and a series of historic heat waves. Buses are one of our best tools for shifting people from cars and taxis to a lower-emission form of travel. But was anyone thinking about climate-proofing the bus network itself? A quick survey of relevant literature revealed a good body of work on how precipitation can drive people from buses, but heat was less studied.
This project brought together 3 sets of data:
- the bus positions provided by our API,
- a data set containing the locations of bus shelters published by the city,
- weather data indicating heat, humidity, and rainfall for the city.
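As a rough sketch of what bringing these three sources together can look like, the snippet below attaches a shelter flag and an hourly weather reading to each bus observation. Everything about the data shapes is an assumption made for illustration: the field names (`timestamp`, `next_stop_id`), matching shelters to stops by ID, and weather keyed by the start of each hour in Unix seconds. It is not Alex and Sanket’s actual pipeline.

```python
def join_features(observations, shelter_stop_ids, hourly_weather):
    """Attach a shelter flag and hourly weather to each bus observation.

    observations: list of dicts with hypothetical "timestamp" (unix seconds)
        and "next_stop_id" keys.
    shelter_stop_ids: set of stop IDs assumed to have a shelter.
    hourly_weather: dict mapping hour-start unix seconds to a dict of readings.
    """
    joined = []
    for obs in observations:
        hour_key = obs["timestamp"] - obs["timestamp"] % 3600  # floor to the hour
        weather = hourly_weather.get(hour_key)
        if weather is None:
            continue  # skip rows with no weather reading rather than guess
        joined.append(
            {**obs, "has_shelter": obs["next_stop_id"] in shelter_stop_ids, **weather}
        )
    return joined
```

Dropping rows without a weather reading keeps the sketch honest. A real pipeline might instead interpolate or carry the last reading forward.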
Alex and Sanket worked hard to build out a prototype model. However, the phenomena they were trying to model ultimately proved too complex to pin down definitively in one semester. Rather than leaving the project there, the team regrouped and embarked on a different and highly constructive effort: to clean up and compile their codebase to hand off to future students, and to put their experiments online in an interactive Streamlit application that allows us to view and interact with their data, models, and evaluation metrics.
Highlights of Alex and Sanket’s work include:
- Creation of an effective, generalized, and extensively documented set of methods and tools to accelerate the work of future students and researchers working with New York City bus data
- Creation of an interactive web app that “shows” instead of “tells” the implications of different model and feature choices made during the experiments.
In the final post, we look to 2022 and the opportunities and challenges facing the next group of student researchers digging into New York City’s bus system.
Further Reading
This series of four articles documents the thinking and work of our bus data working group at Cornell Tech throughout 2020–2021, and lays out a roadmap for 2022.
- The first article, “Can Better Data Unlock A Bright Future for Buses?”, provides background on the project — its inspiration, its aims, and its importance.
- The second article, “The Bigness of Bus Data”, digs into the techniques we developed to retrieve, store, and distribute data on the operations of New York City buses. We’ll show you some code, and show you where you can fetch slices of the data we’ve culled, as well as the raw original responses we parsed it from (and why you’d ever want to do that in the first place.)
- Article three, “From Sensing to Sensemaking: Models, Analytics, and Visualizations”, explores student work in modeling, analytics, and visualization built on top of this data. These projects demonstrate what easy access to longitudinal data about buses can unlock. They include:
  - a diff-style data analysis tool for comparing performance of bus service by route and period developed by Jeremy Shaffer;
  - a predictive model created by Lars Kouwenhoven for understanding how service disruptions on nearby subway lines affect bus crowding;
  - a predictive model exploring interactions between severe heat and rainfall, inadequate shelters at bus stops, and ridership (a harbinger of structural challenges to come that could limit the system’s utility as a tool for carbon emission reductions), put together by Alexander Amy and Sanket Shah; and,
  - an interactive map-based data visualization of the entire New York City bus system designed by Jamie (Hao) Geng, with channels for switching between a real-time view of what’s happening right now, and “playback” of historical observations.
- Finally, article four, “Towards A Global Bus Observatory” lays out our vision and plan for 2022, and the opportunities and technical challenges we anticipate in the year ahead.