We built a robot to detect bad real-time data
And now transit agencies are using it to fix their broken real-time tracking feeds.
Real-time data is Transit’s crown jewel. It tells you exactly when to expect your ride to show up. When real-time data is available, you can plan your day around it. But when real-time data goes down, you end up missing buses… trams… trains… weddings…
For the most part, agencies’ real-time errors are little mistakes. A server crashes, or a piece of code breaks. A single feed usually doesn’t publish that much real-time rubbish, but at Transit, we manage hundreds of real-time feeds. The rubbish adds up.
About a year ago, our team realized we could no longer sustain the way we were handling real-time data errors. It was a manual process, which partially relied on existing tools (checking real-time feeds for wacky data) and partially relied on manual labour (emailing agencies, one by one, letting them know their data was wack).
Whenever a user complained that real-time data was missing, we’d have to inspect the feed, make sure the error wasn’t our fault, and then email the agency being like heyyyyyyy…… can you fix this?
Enter Morgan Freeman
Our error checker analyzes agency real-time feeds (in real-time), identifying errors proactively instead of relying on user reports. If Robot Morgan Freeman discovers an error, he emails the agency automatically, detailing what’s wrong. But how does he discover real-time errors in the first place?
To publish real-time data, Transit relies on two different agency feeds: their real-time API (usually in a “GTFS-RT” format) and their static GTFS. The API publishes real-time locations and predictions. We match that real-time data to the agency’s static GTFS (which contains transit timetables, route info, etc.). By matching the API to the GTFS, we know which vehicle corresponds to which transit line and scheduled departure. If the API data and the GTFS data don’t match up… we’re in trouble.
To make sure there are no discrepancies, Morgan Freeman pings one random trip on every agency feed, every few minutes. First, he checks if the API data is fresh (no older than 10 minutes). Then he checks that the API and GTFS data match up. If Morgan notices that something is wrong, he flags the API to check it for errors. Then, instead of pinging just one random trip, he starts pinging LOTS of trips on that feed. Usually, when there’s smoke, there’s fire.
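Here's a rough sketch of that spot-check loop in Python. It is not Morgan's actual code: the function names, the plain lists of trip-IDs, and the status strings are all assumptions made for illustration. Only the 10-minute freshness window and the "one random trip first, then lots of trips" escalation come from the description above.

```python
import random
import time

MAX_AGE_SECONDS = 10 * 60  # freshness window from the post: no older than 10 minutes


def is_fresh(feed_timestamp, now=None):
    """Return True if the feed was generated within the last 10 minutes."""
    now = time.time() if now is None else now
    return (now - feed_timestamp) <= MAX_AGE_SECONDS


def check_feed(feed_timestamp, realtime_trip_ids, static_trip_ids, now=None, rng=random):
    """Spot-check one random trip, then widen to every trip if it fails.

    All names and status strings here are hypothetical.
    """
    if not is_fresh(feed_timestamp, now):
        return {"status": "stale", "unmatched": []}
    if not realtime_trip_ids:
        # Nothing to sample from (assumed handling, not described in the post).
        return {"status": "empty", "unmatched": []}

    # Step 1: ping a single random trip.
    sample = rng.choice(sorted(realtime_trip_ids))
    if sample in static_trip_ids:
        return {"status": "ok", "unmatched": []}

    # Step 2: smoke detected, so check every trip on the feed.
    unmatched = sorted(t for t in realtime_trip_ids if t not in static_trip_ids)
    return {"status": "mismatch", "unmatched": unmatched}
```

Sampling one trip keeps the routine check cheap across hundreds of feeds, and the full sweep only runs once a feed has already given a reason for suspicion.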
So what sort of errors does Morgan encounter? For one, there are mismatched trip-IDs. Trip-ID tags are used to identify specific trips in the data feed. They can differentiate between, say, a bus that’s scheduled to leave Port Authority at 08:00 vs. the one that’s scheduled to leave at 08:15. Sometimes, the trip-IDs published by the real-time API are different from the ones in the GTFS. This creates problems for Transit: if we can’t match the GTFS-RT data to the static GTFS trip data, we’ll know where a specific vehicle is, but we’ll have no way of knowing what route this vehicle is assigned to, or what trip it’s supposed to be on. So when you look up that line, you won’t be able to get real-time data.
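To make the matching problem concrete, here's a minimal sketch. It assumes the static side is the text of a GTFS trips.txt file (which carries route_id and trip_id columns) and that real-time vehicles have already been decoded into plain dicts with a trip_id key. That dict shape and both function names are hypothetical. Real GTFS-RT feeds are protobuf and would need decoding first.

```python
import csv
import io


def load_static_trips(trips_txt):
    """Map trip_id -> route_id from the text of a GTFS trips.txt file."""
    reader = csv.DictReader(io.StringIO(trips_txt))
    return {row["trip_id"]: row["route_id"] for row in reader}


def match_vehicles(vehicles, static_trips):
    """Split real-time vehicles into those we can and can't tie to a route."""
    matched, unmatched = [], []
    for vehicle in vehicles:
        route_id = static_trips.get(vehicle["trip_id"])
        if route_id is None:
            # We know where the vehicle is, but not what route or trip it serves.
            unmatched.append(vehicle)
        else:
            matched.append({**vehicle, "route_id": route_id})
    return matched, unmatched
```

A vehicle in the unmatched list still has a position, but with no route attached there's nothing useful to show a rider.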
One common reason you get mismatched trip-IDs? Your agency is publishing real-time data from one vendor (the company that equipped vehicles with GPS), while the GTFS is getting exported by another vendor (the company that made the agency’s scheduling software). These vendors don’t always play nicely together.
Mismatched trip-IDs are by far the most common errors we encounter, but other errors include empty API responses and bad URLs.
If we get an empty API response, it’s worse than a mismatched trip-ID: it means there’s no real-time data to match any more! These errors might happen if an agency’s server goes down. Then there are bad URLs: when we try to download a transit file, the URL is inaccessible or corrupt, or the data is unparsable.
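As a sketch, those two failure modes could be told apart like this. Everything here is an assumption for illustration: the function names, the returned labels, and especially the parser, where JSON stands in for the real feed format so the example stays self-contained.

```python
import json
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=10):
    """Download the raw feed body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def classify_feed(url, fetch=fetch_bytes, parse=json.loads):
    """Return a (category, reason) pair; the labels are hypothetical."""
    try:
        body = fetch(url)
    except (urllib.error.URLError, ValueError, OSError):
        # Malformed URL, DNS failure, timeout, HTTP error, and so on.
        return ("bad_url", "inaccessible")
    try:
        entities = parse(body)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueErrors.
        return ("bad_url", "unparsable")
    if not entities:
        # The server answered, but there is no real-time data to match.
        return ("empty_response", None)
    return ("ok", None)
```

Passing fetch and parse in as arguments makes it easy to swap in a real GTFS-RT decoder, or a fake fetcher when testing.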
How we resolve these errors
Back in the day, we relied on user reports to discover real-time errors. We’d use some internal tools to confirm the error, make sure the error wasn’t our fault, then (manually) type up a report for the agency. That solution wasn’t scalable. So we built software to automatically detect real-time errors. But that just added to our pile of problems: we realized that way more real-time errors were occurring than we thought. There was no way we could type up a manual report every time something broke. We needed to develop a fully automated solution.
Now, our email reports are automated too. When we spot an error in a real-time feed for the first time, we send the agency a summary of what the error is (e.g. mismatched trip-IDs) and where we found it in their data (e.g. the range of trip-IDs that were affected).
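A first-time report along those lines could be assembled like the sketch below. The wording, the function name, and the idea of showing the range as the first and last trip-ID in sorted order are all assumptions. The post doesn't say what the real emails look like. The message is only built here, not sent.

```python
from email.message import EmailMessage


def build_error_report(agency_email, error_type, affected_trip_ids):
    """Build (but don't send) a summary email for a newly spotted error."""
    ids = sorted(affected_trip_ids)
    message = EmailMessage()
    message["To"] = agency_email
    message["Subject"] = f"Real-time feed error detected: {error_type}"

    lines = [
        f"What we found: {error_type}",
        f"Trips affected: {len(ids)}",
    ]
    if ids:
        # Assumed presentation of "the range of trip-IDs that were affected".
        lines.append(f"Trip-ID range: {ids[0]} to {ids[-1]}")
    lines.append("Reply to this email if you'd like to receive future alerts.")

    message.set_content("\n".join(lines))
    return message
```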
Then, we ask the agency if they’d like future alerts. When we started rolling out Project Morgan Freeman, we wanted to gently test the waters. We weren’t sure if this was something agencies wanted. After all, getting an email at 7am telling you ALL REAL-TIME SYSTEMS DOWN is probably not the best way to start your morning.
Thankfully, agencies have been super receptive: every agency response we’ve gotten has asked to subscribe to future updates.
Before our automatic alert system was put in place, agencies were often unaware of real-time errors. We assumed whenever we filed an error report, we were just one complainant among many. That wasn’t even close to the case: many agencies just didn’t know.
If their vendors didn’t inform them about a real-time outage, and no irate customer had contacted them, agencies could continue publishing bad real-time data for hours, or days. Now, these errors can be resolved within minutes. Which keeps our users happy, and our customer support team sane.
Are all errors made equal?
No. Our error-checking software monitors hundreds of real-time data feeds, but not all feeds are given the same amount of attention. A feed that’s relied on by millions of our users (or by an official agency partner) will be higher in our priority stack than a feed that’s relied on by a couple of users. The system helps us track the priority of each feed by weighing how many users depend on it. We use that data to make decisions like: How much dev time (if any) should we commit to a fix? Should we be pinging the agency every few hours, or every twenty minutes? Should we be alerting our users if their real-time data is down?
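One way to picture that weighting is the sketch below. The scoring rule, the partner multiplier, the 100,000-user threshold, and the 180-minute stand-in for "every few hours" are all made-up numbers for illustration. Only the idea of ranking feeds by how many users depend on them, and the twenty-minute interval, come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Feed:
    name: str
    active_users: int
    official_partner: bool = False


def priority(feed):
    """Score a feed by how many users depend on it (hypothetical formula)."""
    score = feed.active_users
    if feed.official_partner:
        score *= 2  # assumed boost for official agency partners
    return score


def alert_interval_minutes(feed, high_threshold=100_000):
    """Ping high-priority agencies every 20 minutes, others every few hours."""
    return 20 if priority(feed) >= high_threshold else 180


def rank_feeds(feeds):
    """Order feeds from highest to lowest priority."""
    return sorted(feeds, key=priority, reverse=True)
```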
This is just one of many interesting problems our engineering team is working on behind the scenes. While an average rider will never notice that Transit has been in touch with their agency (with the help of a magical robot named Morgan) to expedite real-time data fixes, it’s this sort of engineering that makes the difference between a transit app that’s usually right and one that’s reliably right. Ultimately, the purpose of Transit is to empower public transit: to give riders an extra reason to take it. Making real-time data more reliable is a huge step forward.
Feel like working on our next exciting back-end project (code name Project Jack Nicholson)? We’d love to hear from you.
And if you’re an agency that wants to fix real-time data as soon as errors appear, get in touch with our friend, Morgan Freeman.