How we made a website to track our bus system in 36 hours

Brian Seel (cylussec)
10 min read · Apr 4, 2019
An MTA bus at the end of its run (Photo: Brian Seel)

Like many American cities, Baltimore has a bus system that can be slow, unreliable, and frustrating. In 2016, after cancelling the first east/west grade-separated rail project in Baltimore, Governor Larry Hogan started the first overhaul of the bus system in 50 years, promising a more reliable, high-frequency bus system that would let people ignore schedules and instead just 'show and go'. This overhaul was branded 'BaltimoreLink'.

As a user of the system, I have found the reality to be a bit different. Instead of high-frequency buses, there are large gaps, bus bunching, and buses that plod through downtown traffic. As a rider, you can send a picture of bunched buses, tweet at the MTA about how it's been 40 minutes since a 'high frequency' bus showed up, or scream into the ether when the transit system makes you late.

However, as part of the BaltimoreLink rollout, each of the MTA's buses was outfitted with Swiftly GPS units, which transmit bus location data every 10–15 seconds to enable real-time tracking. This has been a revolutionary change to the system, because it lets riders know exactly when a bus will arrive based on its location, instead of relying on a schedule.

A screenshot of the Transit App, which relies on the Swiftly data generated by GPS units on each bus. (Screenshot by Brian Seel)

One consumer of this data is the Transit App, which lets users get transit-based directions to their destination, see which bus lines are nearby, and see how far away each bus is. This allows riders to walk up to a stop a few minutes before the bus arrives, instead of showing up and hoping the bus is on its way.

This data is powerful, but the MTA does not share the historical data that Swiftly provides. The public has to rely on the agency's self-reported on-time data, which has a spotty history and is only released occasionally. The MTA revealed that it had previously counted a 'no show' bus as perfectly on time. It also changed the definition of 'on time' from the industry standard of 1 minute early/5 minutes late to a more lenient 2 minutes early/7 minutes late. Even by these metrics, the on-time percentage is less than 70%, and it is usually only released when a citizen files a Public Information Act request.

The MTA does collect all of this historical data, and it has internal dashboards that show the actual arrival times of buses for each day, when bus runs were cut, and the on-time percentage under the stricter standard. These dashboards are not shared with the public.

As users, how are we supposed to know if the system is improving without complete data? How are we supposed to know how the system stacks up against other systems when we use a different definition of 'on time'? And how are we supposed to know if the MTA is telling the truth without solid data?

That is the issue we wanted to address.

Technical Background

During the 6th Baltimore Hackathon, our team wanted to build something that would collect this historical data so we could start answering questions about how well the system is operating. After we got an API key from the Maryland Transit Administration, we had access to all of the Swiftly data. Swiftly provides data in the General Transit Feed Specification (GTFS), a standardized format with information about stops, routes, trips (instances of routes), and vehicles, among other data. Swiftly adds to that by providing GTFS-Realtime data, which includes real-time location data, arrival predictions, and the logic to determine which route each vehicle is on; that is the data we are looking to store.
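If you want to poke at the feed yourself before setting up any infrastructure, a few lines of Ruby are enough to pull one snapshot of the trip updates. This is only a sketch; it reuses the same endpoint and Authorization header that show up in the Logstash configuration later in this post, and the API key is a placeholder.

require "net/http"
require "uri"

# Minimal sketch: pull one snapshot of the Swiftly GTFS-Realtime feed.
# The response body is a binary protobuf, not JSON, so it needs a
# protobuf decoder before it is readable.
uri = URI("https://api.goswift.ly/real-time/mta-maryland/gtfs-rt-trip-updates")
request = Net::HTTP::Get.new(uri)
request["Authorization"] = "xxxOUR-API-KEYxxx"  # placeholder API key

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

File.binwrite("trip-updates.pb", response.body)
puts "Saved #{response.body.bytesize} bytes of GTFS-Realtime data"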

One of our team members had experience with the ELK stack, and we decided that would be a good way to process and store the data. ELK stands for Elasticsearch, Logstash, and Kibana, and the stack did much of the heavy lifting that made this project achievable over a weekend. Elasticsearch is a NoSQL search engine that allows for easy access to the stored data. Logstash handles ingesting, parsing, and writing the data into that store. Kibana is the frontend, with some pretty powerful data visualization tools.

Infrastructure

We decided to use an AWS instance of the ELK stack, because we wanted something that would be easy to set up, would bypass the issues of configuring the software ourselves, and would be easy to replicate. That last point ended up being very important, as we needed to add more resources to the VM at one point during the weekend, and needed to clone the VM when we were all trying to work on similar issues.

The setup seems to cost about $1 per day, even for a large VM, which makes it very palatable. I hope to find an organization that will take this over long term down the road, but in the meantime this is a level I can easily sponsor myself.

Implementation

Note: the full code is available on my GitHub.

We started with the Bitnami version of the ELK Stack VM, running version 6.6 of the Elastic Stack. We originally started with a small image with 2 GB of memory, but quickly ran out of memory because of Java. We bumped up to a large VM with 8 GB of memory, and raised the heap size by editing /opt/bitnami/logstash/config/jvm.options so the maximum heap size was 8 GB.

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms1g
-Xmx8g

We might have been able to get by with a smaller image and less memory, but time is precious in a hackathon and VMs are cheap. We will probably look to scale that down later, as there are efficiencies we can find with more time.

With Logstash, you use input plugins to bring data in, filters to manipulate it, and output plugins to push it to storage. Since we needed to read data from the Swiftly REST API, we used the http_poller input plugin. We are essentially crafting the request, with the API key in the Authorization header; the snippet below is the entry inside the plugin's urls block.

realtime => {
  # Supports all options supported by ruby's Manticore HTTP client
  url => "https://api.goswift.ly/real-time/mta-maryland/gtfs-rt-trip-updates"
  method => get
  headers => {
    Authorization => "xxxOUR-API-KEYxxx"
  }
}

We then set up the request to run every minute with the 'schedule' section, as that is the highest frequency the cron timing allows. The buses emit data every 15 seconds, so it might be nice to pull more frequently, but we decided this was fine. Based on the way we set up our script, we were able to work around that limitation.

request_timeout => 60
# Supports "cron", "every", "at" and "in" schedules by rufus scheduler
schedule => { cron => "* * * * * UTC" }

We then needed to convert the data from protobuf to JSON. This was one of our major sticking points. We figured we could use the GTFS-Realtime bindings from Google's repo, but those did not work for some reason, and I have no idea why they are in that repo. We recompiled the original spec ourselves and got a version that worked, which is referenced in the include_path below.

codec => protobuf {
  class_name => "TransitRealtime::FeedMessage"
  include_path => ['/opt/bitnami/logstash/gtfs-realtime.pb.rb']
}

At this point, we have pulled in the data and parsed it with the protobuf library. Every minute, we pull in a massive amount of data about where each bus is, and we need to distill that down to the most important parts. We decided to take a 'stop centric' view of the world, where we only care about when a bus passes a stop. This is a much simpler model than a 'bus centric' one, where we would have to track where each bus is at all times and store a lot more data; a sketch of the kind of record this produces is below.
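As a rough illustration, a single stop-centric record boils down to one document per trip/stop pair. The field names and coordinates below are illustrative rather than an exact copy of what our pipeline emits; the IDs echo the trip update example shown later in this post.

# Hypothetical sketch of a single stop-centric record (one per trip/stop pair).
# Field names and coordinates are illustrative, not the exact index schema.
stop_event = {
  "trip_id"        => "2261246",
  "route_id"       => "11032",
  "stop_id"        => "1335",
  "stop_sequence"  => 34,
  "predicted_time" => 1553738272,  # Swiftly's predicted arrival (epoch seconds)
  "stop_lat"       => 39.3,        # illustrative coordinates from the static GTFS data
  "stop_lon"       => -76.6
}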

We used a Logstash ruby filter, which runs free-form Ruby code to do all of the parsing and event emitting. Our input data is basically a bus on a route, plus all the future stops it will make. For example:

entity {
  id: "2261246"
  trip_update {
    trip {
      trip_id: "2261246"
      start_date: "20190327"
      schedule_relationship: SCHEDULED
      route_id: "11032"
      direction_id: 1
    }
    stop_time_update {
      stop_sequence: 34
      arrival {
        time: 1553738272
      }
      stop_id: "1335"
      schedule_relationship: SCHEDULED
    }
    stop_time_update {
      stop_sequence: 35
      arrival {
        time: 1553738355
      }
      stop_id: "1336"
      schedule_relationship: SCHEDULED
    }
    stop_time_update {
      stop_sequence: 36
      arrival {
        time: 1553738373
      }
      stop_id: "1337"
      schedule_relationship: SCHEDULED
    }
...

This shows that there is a trip underway, with a specific route it is following, and based on its current location, it has a set of stops that it should hit. In this case, the bus is in the middle of its route, so the first stop is stop_sequence 34, which means that it has already hit its first 33 stops.
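To give a feel for how the distillation works, here is a simplified sketch of a Logstash ruby filter script that splits one of these trip updates into one event per upcoming stop. It is not a copy of chop.rb: the field paths, helper names, and stop lookup are assumptions, and the real script does more.

# Simplified sketch only, not the actual chop.rb. Field paths are assumptions.
# @stops is a lookup of static GTFS stop data keyed by stop_id, loaded once
# at startup (see the register sketch further down).
def filter(event)
  events = []
  trip_id = event.get("[trip_update][trip][trip_id]")
  updates = event.get("[trip_update][stop_time_update]") || []

  updates.each do |update|
    stop = @stops[update["stop_id"]]
    next if stop.nil?   # skip stops we have no static data for

    e = event.clone
    e.set("trip_id", trip_id)
    e.set("stop_id", update["stop_id"])
    e.set("stop_sequence", update["stop_sequence"])
    e.set("predicted_arrival", update["arrival"]["time"])
    e.set("stop_lat", stop["stop_lat"])
    e.set("stop_lon", stop["stop_lon"])
    # One ID per trip/stop pair, so re-polling the feed every minute
    # overwrites the prediction instead of creating duplicates.
    e.set("aggregate_id", "#{trip_id}-#{update['stop_id']}")
    events << e
  end

  events   # returning an array emits one event per stop
end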

Swiftly is doing an immense amount of heavy lifting in the background: taking the bus location, snapping it to a route, determining which direction it is going, which stops it has already hit, and which stops it still needs to hit, and providing an estimated arrival time for each stop. The predictions are based on historical data, since traffic during the day and the length of stop lights can make a big difference in how long it takes to reach a stop. This information goes well beyond what static GTFS provides.

Having access to a plain-text JSON version of the data was very useful during development, and for that we used Postman. There were many times we needed to sanity check what we were seeing, so we were in Postman regularly. The ruby filter is specified with the following code:

filter {
  ruby {
    path => "/var/chop.rb"
  }
...

The file, /var/chop.rb, is available on my GitHub account and defines two functions. The register function is called when Logstash first starts up; this is where we load in all of the static GTFS data, into variables we can access later during the run. The second function, filter, is called for each incoming event, along the lines of the sketch above.

We needed a quick and dirty way to get this done because of our deadline, so we just load the data directly into memory. This is not efficient: it takes a long time to load and uses a gargantuan amount of memory. The load time could be improved by parsing the data once and saving it in a binary format, similar to what Python does with its pickle library. The memory footprint could be improved by only keeping the data we actually use; we load the entire stops file, but only use the ID and scheduled arrival time.
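For reference, a stripped-down version of that startup step might look like the following. This is a sketch under the same caveats as before: the file path and field selection are assumptions, and the real register function in chop.rb loads considerably more of the static GTFS data.

require "csv"

# Sketch of the register hook Logstash calls once at startup when a ruby
# filter uses a script file. The path is an assumption; the column names
# (stop_id, stop_lat, stop_lon) come from the GTFS spec.
def register(params)
  @stops = {}
  CSV.foreach("/var/gtfs/stops.txt", headers: true) do |row|
    # Keep only the fields used downstream instead of the whole row.
    @stops[row["stop_id"]] = {
      "stop_lat" => row["stop_lat"].to_f,
      "stop_lon" => row["stop_lon"].to_f
    }
  end
end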

The second filter that we use is mutate, which takes the fields output by the previous filter and renames them into a location point. This enables us to use Kibana's maps to visually show stop locations, bus locations, and how buses were faring against their schedules.

...
mutate {
  rename => {
    "stop_lon" => "[location][lon]"
    "stop_lat" => "[location][lat]"
  }
}

Finally, the data is output to Elasticsearch, using %{aggregate_id} as the document ID. The data is then accessible under the 'Discover' tab in Kibana.

output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    document_id => "%{aggregate_id}"
    index => "geo-rt-trip-updates"
  }
}

Examples

We got the data flowing about an hour before judging, so there weren't many graphs we could show. One that was very interesting was a run of the 71 bus, which goes from a light rail station in the suburbs to downtown Baltimore. The graph to the side shows that the run started about a minute off schedule and then went wildly off schedule as it progressed. On further inspection, we found that the turning point was after the bus crossed the bridge into the city.

Another was a map view of the same kind of information. The 22 bus goes from Northwest Baltimore to Southeast Baltimore, and we were able to make a map showing how it got more and more off schedule as the run progressed. Unfortunately, higher-resolution maps require a Kibana license, which is why the picture to the side is grainy, but it shows the power of this data and how visualizations can paint a picture of what is happening.

Both of these visualizations were done on single runs because of our limited dataset, but the same analysis can easily be done across a larger dataset once we have collected one.

Follow-ups

There is still quite a bit of work to do, the biggest piece being simply collecting a substantial amount of data. We also need to start building graphs that tell meaningful stories about what is happening in the dataset, which has proven to be a challenge. We might need to look at alternate front ends, as Kibana does not seem entirely suited to displaying the kinds of data we want to display; its strength centers around log analysis.

We also skipped over handling 'cancelled' runs because of our deadline. Cancelled runs are ones where the bus never departs, either because the bus was out of service, the driver did not show up, or the bus was so far behind that it had to rush to the end of the route to catch up. These are usually known as 'cut runs', and our setup currently throws them away.

We also need to improve the initial parsing, to reduce the load time and the memory footprint. The load only happens once when the server starts up, but the memory footprint is significant; the JVM was taking up 10 GB in our testing.

We can also look at pulling in data from other cities to see how the MTA stacks up. We made many assumptions because the MTA data looks a certain way, so we might find issues in places where we did not strictly adhere to the API spec.

Either way, there is still plenty of work to be done on this project!
