Programmatic PATH Real-time Arrival Data

Matt Razza
10 min read · Apr 9, 2019


TL;DR: Over a weekend I wrote an API to programmatically access PATH’s real-time arrival data. You can check out an example web app here, the UI source here, and the API source here. If you just want to call the API, skip to the end.

Heads up: I’m not going to outline exactly how this data works but this still gets pretty technical. You’ve been warned. ☺

RidePATH

Fairly recently the Port Authority of New York and New Jersey began installing real-time arrival screens in the stations of the Port Authority Trans-Hudson (PATH) rail system. As a daily commuter, this is something I had been eagerly awaiting and was a natural extension of the CBTC and PTC work the Port Authority had been doing for some time. I was even more excited when this data became available via the RidePATH mobile app (iOS/Android). Unfortunately, the app’s user experience, especially with respect to the real-time functionality, leaves a lot to be desired.

After selecting Real-Time from the bottom navigation menu you’re greeted with a loading spinner which quickly gives way to a message: “Just a moment, we’re connecting your device to Real-Time data.” Often, this message stays on the screen for tens of seconds for seemingly no reason. Other times it is visible for only an instant. Worse yet, once real-time information has finally loaded, if you switch to a different station (say from Journal Square to Grove Street) you’ll need to wait through the loading spinner and “Just a moment…” message yet again (once for every station).

Waiting for real-time data in the RidePATH app.

This got me wondering: What’s going on here? Is the performance of the underlying PATH API really this inconsistent? Sometimes the RPC calls complete in hundreds of milliseconds, other times orders of magnitude more. Why is this? Can we do better?

Investigation

Armed with some technical questions I spent a few hours investigating what was going on. My first approach was to connect my Android phone to my computer via Android Debug Bridge (adb) and look at netstat and similar tools while using the RidePATH app. This yielded some results but didn’t get me very far so I’ll skip ahead. The next approach is the obvious one: download the RidePATH APK and decompile it. Before decompiling the classes.dex file and looking at the Java code, I browsed around the APK in 7-zip. Interestingly, there is an assemblies directory which contains a large number of .dll files.

C# Assemblies in the RidePATH APK

This immediately screamed Xamarin — a cross-platform app development framework Microsoft acquired in 2016. Xamarin allows you to build mobile apps in C# and run them on every mobile platform. I’ve never used Xamarin but it appears the compiled MSIL libraries are placed directly within the APK (I suspect a .NET JIT, perhaps Mono, is contained in the APK and runs these DLLs). This means it’s time to decompile C# code!

There are many .NET decompilers. I had dotPeek installed and used that but JustDecompile is another good option. There are 129 .NET assemblies in the RidePATH APK but most are external dependencies like a JSON serialization library or an HTTP library. The six DLLs prefixed with PA.Path. are where the PATH-specific code is. I dragged all these into dotPeek and started poking around. One of the assemblies called PA.Path.Webservice looked particularly promising. Clicking around a bit revealed a RealTimeService class within the PA.Path.Webservice.Services namespace.

RealTimeService class

This seemed like it must be it. A function called GetRealTimeSchedule exists which makes an HTTP GET request to different URLs depending on the station (Hoboken, Newport, etc.) and direction of the train (to NY/to NJ). For example, in the development environment the URL for NY-bound trains coming to Grove Street would be:

https://path-dev-mpp-apptestservice.azurewebsites.net/datafiles/grv_tony.txt

Hitting this URL in your web browser seems promising. JSON data is returned that seems to describe upcoming trains:

[
  {
    "secondsToArrival": "68",
    "lineColor": "ff9900",
    "headSign": "33rd Street",
    "alternativeText": "",
    "lastUpdated": "2018-10-09T13:19:32.712369"
  },
  {
    "secondsToArrival": "899",
    "lineColor": "ff9900",
    "headSign": "33rd Street",
    "alternativeText": "",
    "lastUpdated": "2018-10-09T13:19:32.712369"
  },
  {
    "secondsToArrival": "",
    "lineColor": "d93a30",
    "headSign": "World Trade Center",
    "alternativeText": "Check Schedule",
    "lastUpdated": "2018-10-09T13:19:32.712369"
  }
]
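To make the shape of this payload concrete, here is a minimal Python sketch that reads it. The field names come from the response above; the helper names (`describe_train`, `describe_all`) and the trimmed `SAMPLE` are mine, and the fallback to alternativeText when secondsToArrival is empty is my reading of the third entry, not something confirmed by the app.

```python
import json
from typing import List, Optional

# Two entries trimmed from the response shown above.
SAMPLE = """[
  {"secondsToArrival": "68", "lineColor": "ff9900", "headSign": "33rd Street",
   "alternativeText": "", "lastUpdated": "2018-10-09T13:19:32.712369"},
  {"secondsToArrival": "", "lineColor": "d93a30", "headSign": "World Trade Center",
   "alternativeText": "Check Schedule", "lastUpdated": "2018-10-09T13:19:32.712369"}
]"""


def describe_train(entry: dict) -> str:
    """Render one entry, falling back to alternativeText when no countdown is given."""
    raw = entry["secondsToArrival"]  # note: a string, possibly empty
    seconds: Optional[int] = int(raw) if raw else None
    if seconds is None:
        return f'{entry["headSign"]}: {entry["alternativeText"]}'
    return f'{entry["headSign"]}: {seconds // 60} min {seconds % 60} s'


def describe_all(payload: str) -> List[str]:
    """Parse the JSON array and describe every train in it."""
    return [describe_train(entry) for entry in json.loads(payload)]
```

Calling `describe_all(SAMPLE)` yields one line per train, with the countdown converted from the string field.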

This data is old, but maybe that’s because we’re hitting the development environment and they don’t bother updating it. After tracking down the production URL, I made the GET request to it. 404. Crap.

I spent the next 30+ minutes clicking around the assemblies trying to figure out whether they were using the RealTimeService differently than I originally thought or whether there was some other class I should be looking at, but nothing jumped out. I reached out to a friend of mine and we started talking about this problem. He suggested looking at a class I had repeatedly skipped over. Right above RealTimeService, in the same assembly, was RealTimeBusService. For some reason I had assumed this must be related to the real-time position of passenger buses (despite the fact that the PATH does not run bus lines) and never considered that the bus in the name could be referring to a message bus or some other technical term. This was it. I found it.

RealTimeBusService class

This class creates a subscription to an Azure Service Bus topic (Microsoft’s hosted message broker service) and passes incoming messages to UI components within the app. Having found the right class I started working up the call stack. From here it was fairly straightforward to piece together how the rest of the app worked.

High-Level Implementation

Where’s the Data

At a high level, the PATH’s real-time train arrival data is published to multiple Service Bus topics periodically by some external service. When you open the RidePATH mobile app and select Real-Time, your phone creates a subscription to one (or more) of the topics and starts listening for newly published messages. When a message arrives, the upcoming trains are loaded and displayed in the app. This explains the latency issue and its inconsistent behavior.

As the event publishing is totally asynchronous and completely unaware of the state of any individual subscriber, the act of creating a new subscription does not guarantee prompt receipt of a message. Depending on when you switch to the Real-Time tab and create your subscriber, you’ll be waiting different amounts of time for the next message to be published. For instance, if you end up opening the Real-Time tab immediately following a message publish event, you’ll be waiting as long as a minute for the next message to be published. However, if you’re lucky enough to open the tab right before a message is published, there would be no perceived latency as a message will arrive right after creating your subscriber.
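The timing argument can be put into a few lines of Python. This is only a sketch under an assumption: that messages are published on a fixed interval (60 seconds here, based on the "as long as a minute" observation above). The real publisher may not be that regular, and `wait_for_next_publish` is a name I made up.

```python
def wait_for_next_publish(opened_at: float, interval: float = 60.0,
                          first_publish: float = 0.0) -> float:
    """Seconds a brand-new subscriber waits for the next message.

    Assumes publishes happen at first_publish, first_publish + interval, ...
    A subscriber created exactly at a publish instant is treated as receiving
    that message (wait of 0); in practice it might just miss it.
    """
    elapsed = (opened_at - first_publish) % interval
    return 0.0 if elapsed == 0 else interval - elapsed


# Opening the tab one second after a publish means waiting almost the full interval,
# while opening it one second before a publish feels nearly instant.
just_missed = wait_for_next_publish(1.0)    # 59.0
lucky = wait_for_next_publish(59.0)         # 1.0
```

If opening times are spread evenly across the interval, the typical wait works out to roughly half the interval, which matches the "sometimes instant, sometimes tens of seconds" behavior in the app.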

This design is a natural choice for this kind of data, but it is also highly questionable for a number of reasons:

  1. PRO: Presumably some external system periodically produces estimates of train arrival times at various stations. If you wanted to quickly and easily update multiple systems with the new arrival times, some kind of pub/sub or message bus system seems like an obvious choice. In this case, PATH is probably trying to update all the installed displays in the stations and users of the mobile app at the same time.
  2. PRO: As a commuter, it’s not uncommon to just miss a train so it makes sense to want as low latency as possible when displaying the updated arrival times. Many alternatives to the pub/sub approach, like polling, would need aggressive refresh frequencies to meet a low latency requirement.
  3. CON: The inability to get the latest estimate upon creating a subscriber creates a poor user experience (as outlined initially) and could be avoided entirely with a more conventional API.
  4. CON: Azure Service Bus has a strict limit on the number of concurrent subscribers. On the Premium Tier (the highest tier, which PATH is using) you may only have 2,000 subscriptions to any given topic. This is especially bad as closing the mobile app does not delete its subscription. Instead, the subscription ID is stored on the device and reused the next time the app is opened. In my testing PATH has hit this subscriber limit, albeit only on the Newark station, which is the default station for the Real-Time screen.

Side-note: #4 means that this system is open to trivial denial of service attacks where an attacker creates many subscribers and consumes all available quota. Please be good citizens and do not unnecessarily consume the subscriber quota.

How It Works

There are two quirks to the system. The first is that each station has its own topic. This explains why, after switching stations in the Real-Time tab, you need to wait for the spinner and “Just a moment…” message repeatedly. The act of switching stations results in the creation/connection of another subscriber to an independent topic. The app simply does not have the data for the next station and needs to connect to the topic and wait for another message to be published. The second quirk is that the upcoming trains are published in two messages, one per direction (one for trains headed to NY and one for trains headed to NJ). This means that the upcoming trains you see for any station are actually the union of two messages: the latest ToNY message and the latest ToNJ message.
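The "union of two messages" idea is easy to sketch. The class below is illustrative only: `StationBoard`, its method names, and the dictionary shape of a train are hypothetical, and the only things taken from the description above are the per-station topics and the ToNY/ToNJ split.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StationBoard:
    """Keeps the most recent message per (station, direction)."""
    latest: Dict[Tuple[str, str], List[dict]] = field(default_factory=dict)

    def on_message(self, station: str, direction: str, trains: List[dict]) -> None:
        # A newer message for the same station and direction replaces the old one.
        self.latest[(station, direction)] = trains

    def upcoming(self, station: str) -> List[dict]:
        # What the rider sees: the latest ToNY message plus the latest ToNJ message.
        merged: List[dict] = []
        for direction in ("ToNY", "ToNJ"):
            merged.extend(self.latest.get((station, direction), []))
        return merged


board = StationBoard()
board.on_message("grove_street", "ToNY", [{"headSign": "33rd Street"}])
board.on_message("grove_street", "ToNJ", [{"headSign": "Newark"}])
```

Until both directions have published at least once, `upcoming` returns only a partial view, which is one more reason a fresh subscriber can look empty or incomplete.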

Connecting to these topics is also a bit complicated. The URI representing the topics is static and encoded in the assembly but the token required to authenticate with the topic is not. Instead, the token is encoded in a SQLite database that gets downloaded to your device when you first open the RidePATH app. Getting access to the SQLite database and, eventually, the decoded token is slightly involved.

First, you need to check for updates to the database; this occurs when the app is launched. A GET request is made to the PATH RESTful API (v1/checkdbupdate). The checksum of the currently downloaded database is passed in and the response will either contain the checksum of a new database or be a 404 if the database is up to date. In order to make this API request, an API key is needed. This key is static and encoded in the assembly and must be passed as an HTTP header to all API requests. If a database update is needed, a second API call is made to download it (v1/file/clientdb). The new checksum is passed to the clientdb endpoint to ensure the correct version is retrieved. The returned binary blob is a zip file containing the SQLite database. The database is then extracted from the zip and opened.
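Here is a rough Python sketch of that update flow. The two endpoint paths and the 404 behavior come from the description above. Everything else is an assumption made for illustration: the `http_get` callable (which stands in for a real HTTP client that would attach the API key header), the `checksum` parameter name, the checksum arriving as plain text in the response body, and the zip containing exactly one file.

```python
import io
import zipfile
from typing import Callable, Optional, Tuple

# Hypothetical transport: (path, params) -> (status_code, body bytes).
# A real client would also send the static API key as an HTTP header.
HttpGet = Callable[[str, dict], Tuple[int, bytes]]


def fetch_database_if_stale(http_get: HttpGet,
                            current_checksum: str) -> Optional[Tuple[str, bytes]]:
    """Return (new_checksum, sqlite_bytes), or None when the local copy is current."""
    status, body = http_get("v1/checkdbupdate", {"checksum": current_checksum})
    if status == 404:
        return None  # 404 means the database is already up to date
    # Assumed: the response body carries the new checksum as text.
    new_checksum = body.decode("utf-8").strip()

    status, blob = http_get("v1/file/clientdb", {"checksum": new_checksum})
    if status != 200:
        raise RuntimeError(f"clientdb download failed with HTTP {status}")

    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        # Assumed: the archive holds a single member, the SQLite database.
        name = archive.namelist()[0]
        return new_checksum, archive.read(name)
```

Passing the transport in as a function keeps the sketch honest about what is unknown (base URL, header name) and makes the flow easy to exercise with a fake.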

Side-note: This SQLite database contains all the semi-static information used by the app. Train schedules, routes, station metadata, etc. This is what makes most of the app functionality usable offline.

With the latest database downloaded and opened you can get the Service Bus token with a simple query:

SELECT configuration_value FROM tblConfigurationData WHERE configuration_key = 'rt_ServiceBusEndpoint_Prod';
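If you would rather run that lookup from code, a minimal sketch with Python's built-in sqlite3 module could look like this. The table, column, and key names are the ones from the query; `read_config_value` is a hypothetical helper, and the database path is whatever you extracted the file to.

```python
import sqlite3
from typing import Optional


def read_config_value(conn: sqlite3.Connection, key: str) -> Optional[str]:
    """Look up one configuration value, or None if the key is absent."""
    row = conn.execute(
        "SELECT configuration_value FROM tblConfigurationData "
        "WHERE configuration_key = ?",
        (key,),  # bound parameter instead of string concatenation
    ).fetchone()
    return row[0] if row else None
```

Usage would be along the lines of `read_config_value(sqlite3.connect("path/to/extracted.db"), "rt_ServiceBusEndpoint_Prod")`, where the file path is a placeholder.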

The returned value is encoded and not directly usable. First you need to Base64 decode the value. The result is an AES-encrypted blob. The key and salt used to decrypt the blob are statically stored in the assembly. Once decrypted, this token is used to connect to the Service Bus topics.
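The two-step unwrapping can be sketched as below. Only the Base64 step is concrete. I am deliberately not showing the AES step: the cipher mode and how the key is derived from the key and salt are not covered here, so the sketch takes a caller-supplied `decrypt` function (hypothetical) and assumes the plaintext is UTF-8 text.

```python
import base64
from typing import Callable


def decode_token(encoded_value: str, decrypt: Callable[[bytes], bytes]) -> str:
    """Base64-decode the stored value, then hand the ciphertext to `decrypt`.

    `decrypt` stands in for the AES step, whose parameters are not given here.
    """
    ciphertext = base64.b64decode(encoded_value, validate=True)
    plaintext = decrypt(ciphertext)
    return plaintext.decode("utf-8")  # assumed: the token is UTF-8 text
```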

The subscriber ID used to connect to these topics is unimportant (so long as it is unique). The mobile app simply generates a GUID and stores it for later re-use. It does appear that unused subscribers are eventually purged from the topics but this seems to take days (hence the ability to run out of subscriber quota). For this reason, I would strongly recommend against generating new subscriber IDs — either by clearing your app cache repeatedly in the RidePATH mobile app (as the app will generate a new GUID each time) or by creating subscribers through code. Please either re-use existing subscriber IDs or be sure to explicitly delete your subscriber when you are done.
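In code, being a good citizen mostly means generating the GUID once and persisting it. A minimal sketch, assuming a plain text file is an acceptable place to keep it (the function name and file layout are mine, not the app's):

```python
import uuid
from pathlib import Path


def load_or_create_subscriber_id(path: Path) -> str:
    """Reuse the stored subscriber ID if there is one; otherwise create and save it."""
    if path.exists():
        stored = path.read_text(encoding="utf-8").strip()
        if stored:
            return stored
    subscriber_id = str(uuid.uuid4())
    path.write_text(subscriber_id, encoding="utf-8")
    return subscriber_id
```

Every run after the first returns the same ID, so repeated experiments do not eat into the subscriber quota.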

An Alternative: The API

Rather than do all this and deal with the complexities of connecting to the Service Bus topics, I thought: why not just use a simple HTTP API? This approach also has the benefit of not requiring you to wait for a new message to be published before displaying data. So that’s what I created.

I have a server running on Google Cloud Platform with a single subscriber per topic that stores the latest data in-memory and exposes it via HTTP and gRPC.

Simply make a GET request to:

https://path.api.razza.dev/v1/stations/<station_name>/realtime

Where <station_name> is one of the following:

newark
harrison
journal_square
grove_street
exchange_place
world_trade_center
newport
hoboken
christopher_street
ninth_street
fourteenth_street
twenty_third_street
thirty_third_street
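A small Python sketch of calling the endpoint, using only the standard library. The URL pattern and station names are the ones listed above; the function names, the timeout, and the choice to reject unknown station names before making a request are my own.

```python
import json
import urllib.request

BASE_URL = "https://path.api.razza.dev/v1/stations"

STATIONS = {
    "newark", "harrison", "journal_square", "grove_street", "exchange_place",
    "world_trade_center", "newport", "hoboken", "christopher_street",
    "ninth_street", "fourteenth_street", "twenty_third_street",
    "thirty_third_street",
}


def realtime_url(station: str) -> str:
    """Build the real-time URL, rejecting names that are not in the list above."""
    if station not in STATIONS:
        raise ValueError(f"unknown station: {station!r}")
    return f"{BASE_URL}/{station}/realtime"


def fetch_realtime(station: str, timeout: float = 10.0) -> dict:
    """GET the real-time data for one station and parse the JSON body."""
    with urllib.request.urlopen(realtime_url(station), timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_realtime("grove_street")` would return the parsed response for Grove Street, assuming the service is reachable.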

The response will be a JSON object that looks like this:

{
  "upcomingTrains": [
    {
      "lineName": "33rd Street",
      "lineColors": [
        "#4D92FB"
      ],
      "projectedArrival": "2019-04-11T02:54:59Z",
      "lastUpdated": "2019-04-11T02:49:03Z",
      "status": "ON_TIME"
    },
    {
      "lineName": "33rd Street via Hoboken",
      "lineColors": [
        "#4D92FB",
        "#FF9900"
      ],
      "projectedArrival": "2019-04-11T03:14:19Z",
      "lastUpdated": "2019-04-11T02:49:03Z",
      "status": "ON_TIME"
    },
    {
      "lineName": "World Trade Center",
      "lineColors": [
        "#65C100"
      ],
      "projectedArrival": "2019-04-11T02:55:59Z",
      "lastUpdated": "2019-04-11T02:49:03Z",
      "status": "ON_TIME"
    },
    {
      "lineName": "World Trade Center",
      "lineColors": [
        "#65C100"
      ],
      "projectedArrival": "2019-04-11T03:10:59Z",
      "lastUpdated": "2019-04-11T02:49:03Z",
      "status": "ON_TIME"
    }
  ]
}
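Since projectedArrival is an absolute timestamp rather than a countdown, a client has to do a little arithmetic to show "minutes away". A sketch, assuming timestamps always have the whole-second `Z` form seen in the sample (fractional seconds would need a different format string); `parse_utc` and `minutes_away` are hypothetical helper names.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def parse_utc(timestamp: str) -> datetime:
    """Parse timestamps shaped like 2019-04-11T02:54:59Z into aware datetimes."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def minutes_away(response: dict, now: datetime) -> List[Tuple[str, int]]:
    """Return (lineName, whole minutes until arrival), soonest first."""
    result = []
    for train in response["upcomingTrains"]:
        delta = parse_utc(train["projectedArrival"]) - now
        minutes = max(0, int(delta.total_seconds() // 60))  # clamp trains already due
        result.append((train["lineName"], minutes))
    return sorted(result, key=lambda item: item[1])
```

In a real client `now` would be `datetime.now(timezone.utc)`; passing it in keeps the function easy to check against a fixed sample.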

A demo web app can be found here. The API contract is defined here. And the repo for the server code is here.

Demo Web App

Note: This software is neither endorsed nor supported by the Port Authority of New York and New Jersey.

I may expand the API to include more than just real-time data. Happy coding! ☺
