Open Rail Data

What is Darwin?

The rail industry is all about data. That’s what we do at Assertis every day. Most of this data is not publicly available, but today I’d like to talk about the data that is — the Darwin Push Port.

What is Darwin? The documentation says it’s “an XML push feed that continuously streams information about the creation of, and changes to, train schedule records, together with train running predictions made by Darwin”.

In this short blog post I’ll describe what Darwin is, how it works, what data it carries and how to use it, finishing with a simple usage example. I’ll also include useful links, describe the data format and point you to code examples so you can get up to speed quickly.

Photo credit: Digital Actuaries

What does the data look like?

The amount of data available is huge. You get the schedules and live updates of trains across the whole UK. As the data is streamed live, with new messages arriving every second, you’ll probably need your own data storage. As mentioned earlier, the data is XML; it is delivered over an ActiveMQ STOMP stream. There is also a daily zip package of so-called Reference Data & Timetable, available via FTP and updated every night.

The Reference Data contains lookup data such as each station’s TIPLOC, CRS code and full English name. The Timetable data is what the name implies: the timetables of all trains for the upcoming 48 hours. The data arriving via the Push Port uses the same Timetable format, with the difference that it also includes actual train times and time predictions.

Let’s see what an example journey (Euston-Crewe) looks like and what those tags mean:
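To make the lookup idea concrete, here is a minimal sketch of turning Reference Data into two dictionaries, one keyed by TIPLOC and one by TOC code. The sample XML is hand-written, and the element and attribute names (`LocationRef`, `TocRef`, `tpl`, `crs`, `locname`, `tocname`) are assumptions about the file layout, so check them against the file you actually download.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. Element and attribute names are assumptions about
# the Reference Data layout, not copied from a real file.
SAMPLE_REF = """
<PportTimetableRef>
  <LocationRef tpl="EUSTON" crs="EUS" locname="London Euston"/>
  <LocationRef tpl="CREWE" crs="CRE" locname="Crewe"/>
  <TocRef toc="LM" tocname="London Midland"/>
</PportTimetableRef>
"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}LocationRef'."""
    return tag.rsplit("}", 1)[-1]


def load_reference(xml_text):
    """Return (locations keyed by TIPLOC, operator names keyed by TOC code)."""
    locations, tocs = {}, {}
    for elem in ET.fromstring(xml_text.strip()).iter():
        name = local_name(elem.tag)
        if name == "LocationRef":
            locations[elem.get("tpl")] = {
                "crs": elem.get("crs"),
                "name": elem.get("locname"),
            }
        elif name == "TocRef":
            tocs[elem.get("toc")] = elem.get("tocname")
    return locations, tocs


locations, tocs = load_reference(SAMPLE_REF)
```

With these in memory, a TIPLOC or TOC code found in a schedule can be translated into a readable name with a plain dictionary lookup, for example `locations["EUSTON"]["name"]`.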

Each schedule entry starts with a <Journey>. The parameters are:

  • rid — a unique train identifier in Darwin (generated by Darwin)
  • uid — the TUID from the DTD/CIF timetable feed
  • ssd — scheduled start date, the day on which the train starts its journey
  • toc — company running the train (in this case LM is London Midland, you can look it up in the Reference Data)

Inside <Journey> are the locations at which the train stops or which it passes.

<OR> is the origin station, <DT> the destination, <IP> an intermediate station at which the train stops, and <PP> a location which the train passes without stopping.

The parameters are:

  • tpl — TIPLOC
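  • wtp — working (operational) time of passing
  • pta — public (passenger-facing) scheduled arrival time
  • ptd — public (passenger-facing) scheduled departure time
  • wta — working (operational) scheduled arrival time
  • wtd — working (operational) scheduled departure time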
  • plat — platform at the station

As you probably noticed, <OR> only has departure parameters, <DT> only has arrival parameters, <IP> has both departure and arrival parameters, and <PP> only has a passing time.
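The structure described above can be read with a few lines of Python. This is only a sketch: the sample XML is hand-written, the rid, uid, times and the two middle TIPLOCs are invented for illustration, and real files may carry namespaces and extra attributes that this code simply passes through.

```python
import xml.etree.ElementTree as ET

# Hand-written sample journey. Identifiers, times and the PASSPT/CALLPT
# TIPLOCs are invented for illustration only.
SAMPLE_JOURNEY = """
<Journey rid="201701010000001" uid="X00001" ssd="2017-01-01" toc="LM">
  <OR tpl="EUSTON" plat="8" ptd="10:46" wtd="10:46"/>
  <PP tpl="PASSPT" wtp="10:52:30"/>
  <IP tpl="CALLPT" plat="2" pta="11:16" ptd="11:17" wta="11:15:30" wtd="11:17"/>
  <DT tpl="CREWE" plat="6" pta="12:40" wta="12:39"/>
</Journey>
"""

# Human-readable meaning of each location tag.
KINDS = {
    "OR": "origin",
    "IP": "calling point",
    "PP": "passing point",
    "DT": "destination",
}


def parse_journey(xml_text):
    """Return the journey attributes plus an ordered list of its locations."""
    journey = ET.fromstring(xml_text.strip())
    stops = []
    for elem in journey:
        tag = elem.tag.rsplit("}", 1)[-1]  # ignore any XML namespace
        if tag in KINDS:
            stops.append({"kind": KINDS[tag], **elem.attrib})
    return {**journey.attrib, "stops": stops}


journey = parse_journey(SAMPLE_JOURNEY)
for stop in journey["stops"]:
    print(stop["kind"], stop["tpl"])
```

Each location keeps whatever attributes it had in the XML, so an origin ends up with departure times only and a passing point with just `wtp`.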

How to get the data

First, you need to register an account at https://datafeeds.nationalrail.co.uk. After creating your account, log in to get the FTP credentials required to download the Timetable and Reference Data, as well as the STOMP credentials and queue name for the real-time data feed. STOMP clients are available for most popular languages.
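Whichever STOMP client you pick, it will hand you raw message bodies. The sketch below covers only that step and leaves the connection itself to the client library. It assumes that bodies may arrive gzip-compressed (worth confirming in the documentation); the function names and the sample message are made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET


def decode_message(body: bytes):
    """Turn one raw message body into the root element of its XML tree."""
    # Assumption: bodies may be gzip-compressed. Gzip data always starts
    # with the two magic bytes 0x1f 0x8b, so plain XML is left untouched.
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return ET.fromstring(body)


def handle_message(body: bytes, on_element):
    """Call on_element(name, attributes) for every element in the message."""
    for elem in decode_message(body).iter():
        name = elem.tag.rsplit("}", 1)[-1]  # ignore any XML namespace
        on_element(name, dict(elem.attrib))


# Invented message standing in for what a STOMP client would deliver.
fake_body = gzip.compress(b'<Pport><uR><TS rid="201701010000001"/></uR></Pport>')
handle_message(fake_body, lambda name, attrs: print(name, attrs))
```

In a real consumer, `handle_message` would be called from the client library's message callback, with `on_element` replaced by whatever stores or updates your data.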

It’s probably best to start by downloading the timetable and reference data and storing them in a database, and then to process the live updates as they arrive, storing or updating records accordingly.

If you’d like to check out some code examples, please see my attempt in Go here. It’s still at a very early stage, but it shows how easy it is to download and parse the data for further use.

What to use it for?

At Assertis we had a hackathon, and I was part of the team using Darwin to show which trains are running on time for a certain journey. Our architecture was quite simple: we parsed the live data and stored it in a MySQL database. As each train has its own unique ID (rid), it’s fairly easy to match incoming messages to the existing data.

On top of that, we had a simple PHP REST API reading data for a given route/journey, which we fetched in a plain one-page web application. It may not sound like a very sophisticated application, but we got it working in around 8 hours.

Lastly, if you’d like to start playing with Darwin, be sure to check the documentation and read the “FAQ” and, more importantly, the “Good Practice” pages.
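The matching-by-rid idea can be sketched as follows. This is not the hackathon code: it uses Python with an in-memory SQLite database instead of MySQL so that it runs anywhere, and the table and column names are hypothetical. It also assumes a train visits each TIPLOC only once, which will not hold for every route.

```python
import sqlite3


def open_store():
    """Create an in-memory database with one row per train per location."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE calls (
               rid     TEXT NOT NULL,
               tpl     TEXT NOT NULL,
               planned TEXT,
               latest  TEXT,
               PRIMARY KEY (rid, tpl)
           )"""
    )
    return db


def store_schedule(db, rid, calls):
    """Store planned times from the timetable; calls is [(tpl, planned), ...]."""
    db.executemany(
        "INSERT OR REPLACE INTO calls (rid, tpl, planned, latest) "
        "VALUES (?, ?, ?, NULL)",
        [(rid, tpl, planned) for tpl, planned in calls],
    )


def apply_update(db, rid, tpl, latest):
    """Attach a live time to the matching row; return True if one was found."""
    cur = db.execute(
        "UPDATE calls SET latest = ? WHERE rid = ? AND tpl = ?",
        (latest, rid, tpl),
    )
    return cur.rowcount == 1


db = open_store()
store_schedule(db, "201701010000001", [("EUSTON", "10:46"), ("CREWE", "12:40")])
matched = apply_update(db, "201701010000001", "CREWE", "12:44")
```

An update for an unknown rid returns `False`, which is a useful signal that the schedule has not been loaded yet or that the train was created after the last timetable download.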
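For the "is it on time?" part, the core calculation is just a comparison of two clock times. Below is a minimal sketch, not the original PHP code. It assumes times are `HH:MM` strings, that no train is more than 12 hours early or late, and that a 5-minute tolerance counts as on time; the function names and the threshold are illustrative choices.

```python
from datetime import datetime, timedelta


def delay_minutes(planned, actual):
    """Minutes late (negative means early) for two 'HH:MM' strings."""
    fmt = "%H:%M"
    diff = datetime.strptime(actual, fmt) - datetime.strptime(planned, fmt)
    # Assumption: a gap of more than 12 hours means the two times fall on
    # either side of midnight, so shift by one day to get the real gap.
    if diff < timedelta(hours=-12):
        diff += timedelta(days=1)
    elif diff > timedelta(hours=12):
        diff -= timedelta(days=1)
    return int(diff.total_seconds() // 60)


def is_on_time(planned, actual, tolerance=5):
    """True if the train is no more than `tolerance` minutes late."""
    return delay_minutes(planned, actual) <= tolerance
```

An API endpoint for a given journey could then run `is_on_time` over the stored planned and latest times at the destination and return the result for each train.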

Glossary

  • TIPLOC — Timing Point Location, an identifier of up to seven characters for a certain location (e.g. station, junction, depot), used by train planners to specify train arrival, departure and passing times
  • CRS — a three-letter code identifying a location, mostly stations (e.g. EUS for London Euston), usually seen on tickets
  • TOC — train operating company, name of a company running the given train
  • STOMP — Simple Text Oriented Messaging Protocol, a message streaming protocol

Author: Piotr Jura