Subways and Transit Updates in Real-Time

Tim Spann
Published in Cloudera
Feb 13, 2024

Apache NiFi, Apache Kafka, Apache Flink, JavaScript, Python, GTFS, PostgreSQL, SQL

- All Transit Systems: add a cache of the system list in PostgreSQL
- Using the Database Schema Registry in PostgreSQL
- Adding MTA Bus Systems to All Transit Systems
- Mobility Database Catalog

Source Code: https://github.com/tspannhw/FLaNK-Transit

The real-time data feeds for MTA Subways produce an interesting variant of GTFS data that I wasn't getting before: Trip Updates, Vehicle Positions, and Alerts all in one file. Well, that's a problem. So let's fix it with Python.

The way to fix this was to use the GTFS Python library to split the feed into the three separate files and then output just one, selected with a parameter. I also found out that the MTA requires a login passed as an HTTP header, so we had to set that.
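Below is a minimal sketch of that split, assuming the gtfs-realtime-bindings and requests packages; the feed URL and the header name (x-api-key) are placeholders, so swap in the values from the MTA developer documentation.

import sys

import requests
from google.transit import gtfs_realtime_pb2
from google.protobuf.json_format import MessageToJson

# Placeholders: set these to your MTA feed URL and API key.
FEED_URL = "https://example-mta-gtfs-feed"
API_KEY = "YOUR_MTA_API_KEY"


def fetch_feed(url: str, api_key: str) -> gtfs_realtime_pb2.FeedMessage:
    """Download and parse a GTFS-Realtime protobuf feed."""
    # The header name is an assumption; check the MTA docs for the exact one.
    response = requests.get(url, headers={"x-api-key": api_key}, timeout=30)
    response.raise_for_status()
    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(response.content)
    return feed


def split_feed(feed: gtfs_realtime_pb2.FeedMessage, which: str) -> list[str]:
    """Keep only one entity type: 'trip_update', 'vehicle', or 'alert'."""
    return [MessageToJson(entity) for entity in feed.entity if entity.HasField(which)]


if __name__ == "__main__":
    # Select which of the three embedded feeds to output with a parameter.
    which = sys.argv[1] if len(sys.argv) > 1 else "trip_update"
    for record in split_feed(fetch_feed(FEED_URL, API_KEY), which):
        print(record)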

After I handled the subway data, I found a number of other MTA feeds and another local one for me in Pennsylvania: SEPTA, the transit agency there, which publishes a lot of data as well.

There is a free list of subway stations; next I will build a generic processor that does a lookup, like the Halifax system.

https://data.ny.gov/resource/i9wp-a4ja.json

There is also a feed for the status of MTA Subway stations; I should look at that as well. The more feeds we can continue to add, whether relatively static, batch, or streaming, the better our analytics, predictions, ML models, and Generative AI become. Storing the tabular data in PostgreSQL and the text data in a vector database (and maybe a search engine like Apache Solr as well) to augment and enhance GenAI predictions seems useful. It is also data that we can distribute to all of our hybrid, multi-cloud, heterogeneous data platforms and systems. I will probably land a wide version of this data in Kudu, Ozone/S3/object storage, HBase, and/or Iceberg. I will experiment, perhaps with all of them, since storage pricing is cheap.

There are also dimensions to add from the TRANSCOM agency for road status, street and highway cameras, weather, haze, aircraft, news, government advisories, and alerts.

MTA Subway Data to HTML Viewer

We made the URL and key sensitive values to protect them; this is easy to handle in Python.
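One common way to keep those values out of the code, assuming they are supplied as environment variables (an assumption about the setup, not a record of how the flow stores them):

import os

# Read the feed URL and API key from the environment so they never land in source control.
# The variable names here are hypothetical.
FEED_URL = os.environ["MTA_FEED_URL"]
API_KEY = os.environ["MTA_API_KEY"]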

MTA Stations

Data Source:
https://data.ny.gov/resource/i9wp-a4ja.json

  1. InvokeHTTP: call the JSON REST endpoint
  2. SplitRecord: split the array into individual records
  3. EvaluateJsonPath: extract fields into attributes
  4. QueryRecord: drop the coordinates array
  5. UpdateRecord: add a UUID and a timestamp
  6. UpdateDatabaseTable: prepare the records for SQL and build the table if it doesn't exist
  7. PutDatabaseRecord: insert the records into PostgreSQL
  8. PublishKafkaRecord_2_6: send the records to a Kafka topic (mtastations) as JSON
  9. RetryFlowFile: try again on failure
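For illustration only, here is a rough Python sketch of a few of those steps (call the REST endpoint, drop the coordinate objects, add a UUID and a millisecond timestamp). The NiFi flow above does this with processors, so treat this as an approximation of the record shape rather than the actual implementation.

import json
import time
import uuid

import requests

STATIONS_URL = "https://data.ny.gov/resource/i9wp-a4ja.json"


def fetch_stations() -> list[dict]:
    """Fetch the station list and enrich each record roughly like the flow does."""
    rows = requests.get(STATIONS_URL, timeout=30).json()
    records = []
    for row in rows:
        # Drop nested georeference objects/arrays and uppercase the field names.
        record = {key.upper(): value for key, value in row.items()
                  if not isinstance(value, (dict, list))}
        record["UUID"] = str(uuid.uuid4())           # step 5: add UUID
        record["TS"] = str(int(time.time() * 1000))  # step 5: add timestamp
        records.append(record)
    return records


if __name__ == "__main__":
    print(json.dumps(fetch_stations()[0], indent=2))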

Example Record

{
  "DIVISION": "IRT",
  "LINE": "Broadway-7th Av",
  "BOROUGH": "M",
  "ENTRY": "YES",
  "VENDING": "YES",
  "STAFFING": "NONE",
  "STATIONLATITUDE": "40.840556",
  "ENTITY": "",
  "DAYTIMEROUTES": "A C 1",
  "ENTRANCEGEOREFERENCE": "[-73.940083,40.841024]",
  "NORTHSOUTHSTREET": "",
  "ENTRANCETYPE": "Stair",
  "ENTRANCELONGITUDE": "-73.940083",
  "STATIONNAME": "168th St",
  "STATIONGEOREFERENCE": "[-73.940133,40.840556]",
  "TS": "1707855985771",
  "CORNER": "",
  "EXITONLY": "NO",
  "EASTWESTSTREET": "",
  "ENTRANCELATITUDE": "40.841024",
  "UUID": "1c0fe294-1edb-4122-abcb-8c117ac396f3",
  "STATIONLONGITUDE": "-73.940133"
}
To check the loaded table:

select * from mtasubwaystations m
order by borough asc, division asc, line asc, stationname asc
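If you want to run that query outside of a SQL client, a small sketch with psycopg2 (connection settings are placeholders) could look like this:

import psycopg2

# Connection parameters are placeholders; point them at your PostgreSQL instance.
conn = psycopg2.connect(host="localhost", dbname="transit",
                        user="transit", password="transit")

with conn, conn.cursor() as cur:
    cur.execute("""
        select * from mtasubwaystations
        order by borough asc, division asc, line asc, stationname asc
    """)
    for row in cur.fetchmany(5):  # peek at the first few stations
        print(row)

conn.close()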

