Designing a GTFS Business Rule Engine — Part 1

Lucas Satabin
Mobimeo Technology
Mar 26, 2021

One of the key elements when dealing with data is to truly own that data. Part of this means having thorough knowledge of the data you receive from third parties and consolidating it, so that you can handle it the way you want and keep it stable. A prominent type of data we deal with at Mobimeo is GTFS files, which we get from various sources and on which we base many services. Even though the format is standardized, most of the standard focuses on the structure of the files, not on how the content is encoded, and as usual, there are as many different ways of doing it as there are GTFS files. This isn’t good for us, as we would like to have a unified way of handling our data. In this post and the following ones, we will go through this problem and devise a solution based on business rules, to show how it can be solved elegantly in a generic way, and propose an implementation in Scala.


What is it about?

General Transit Feed Specification (GTFS) is the de facto standard for exchanging public transport timetables and the basis for many computations, including routing. It consists of a collection of CSV files packed together into a ZIP file, each CSV file representing a public transit entity (e.g. the lines, the schedules, …). The specification mainly focuses on the structure of the various files and gives types for each field appearing in them. However, the standard is quite liberal when it comes to how the content is encoded. There exist some best practices, which are mostly followed but still leave much subject to interpretation. Moreover, some agencies publishing their timetables as GTFS data are aggregations of several smaller agencies, each of which might use different conventions. To make it even funnier, conventions also differ from one agency to the next. This means that if you want to handle this data in a unified way, and want to avoid hacky special cases being hard-coded in your tools, you need to somehow pre-process the data so that, by the time it reaches your business logic, you can use it without worrying about where it originated.

To make this possible, there are two main categories of constraints you want to ensure are respected:

  • The data you need to implement your business logic has to be present. For instance, you want all lines to have defined colors, so that you can render them nicely to your users.
  • All the data you receive must use the same conventions. For instance, some GTFS files use the standard basic route types for their lines, others use the extended types, and some even mix them inside the same dataset. Moreover, different agencies use the extended route types in different ways, and you might want to normalize all of this.

That being said, what can we do about it, and how?

Let’s write code for that!

So, we want to transform our data and check that everything needed is there, in the shape we want it to be. The first reflex as a developer is to open your favorite editor for your favorite programming language and write a tool that reads your GTFS file, applies all the checks and transformations for the known problems you want to solve, and writes the result to a new file that can then be used.

Typical transformations you want to perform on this data are:

  • If the line doesn’t have a defined color, look it up in a color map and add it to the GTFS data set.
  • If the stop name has redundant data for your context (e.g. the city name), remove it.
  • You noticed that a specific line has the wrong route type and you want to fix it.
  • There is a stop that is known to be closed and you don’t want to have it in your data at all.
  • You name it.
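To make the shortcomings concrete, here is a minimal sketch of this direct, hard-coded approach, assuming a row is simply a map from column names to string values. All names here (`GtfsFix`, `fixRow`, the color table) are illustrative, not from an actual codebase.

```scala
// A sketch of the hard-coded approach: one function that bakes in
// every known fix. Assumes rows are plain string maps.
object GtfsFix {
  type Row = Map[String, String]

  // hypothetical lookup table for missing line colors
  val colorMap: Map[String, String] = Map("U1" -> "7DAD4C")

  // Returns None when the entity should be dropped entirely.
  def fixRow(row: Row): Option[Row] =
    if (row.get("stop_id").contains("old-id")) None // closed stop: delete it
    else {
      // add a missing color from the lookup table
      val withColor =
        if (row.getOrElse("route_color", "").isEmpty)
          row.get("route_short_name")
            .flatMap(colorMap.get)
            .map(c => row.updated("route_color", c))
            .getOrElse(row)
        else row
      // strip a redundant city suffix from stop names
      Some(withColor.updatedWith("stop_name")(_.map(_.stripSuffix(" (Berlin)"))))
    }
}
```

Every new agency or newly discovered quirk means another branch inside `fixRow`, which is exactly the maintenance problem discussed next.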

This works well but suffers several problems:

  • If you handle one agency, it might be OK. If you have to work with several, you will soon need to adapt your code for each one of them, resulting in hard-to-maintain branching to handle every agency differently.
  • Soon you’ll find new cases you didn’t see before and will need to add them to your code. Not a big deal in itself, but it rapidly grows into big nested ifs, and it ain’t pretty.
  • Some of the rules you want to implement might be product decisions (for instance, normalize the way stops are displayed to the user) and you might want non-technical people to have an overview of the existing rules. Since you are a lazy developer, as we all are, you don’t want to maintain a separate documentation that needs to be kept in sync with your code.
  • Remember the new cases mentioned above? Well, sometimes upon a GTFS update, agencies fix some of these problems, or change them into different ones. This happens all the time! You need to go back to your code and adapt it: remove some conditions, change others.
  • Depending on your language, you might need to recompile the program and redeploy it in your pre-processing pipeline, which can be a heavy process and doesn’t allow for quick fixes.
  • Here it hits you: a general purpose programming language (GPL) is too generic to make the intent clear and to ease the maintenance of such highly domain-specific rules.

Wait, what did we mention in this last point? Domain specific. That rings a bell, doesn’t it? As usual with this kind of problem, a Domain Specific Language (DSL) is a nice approach to making the rules easier to maintain. For the points stated above, a DSL can help you in various ways:

  • By nature, DSLs are constrained to your domain, so you can convey more context and semantics through their constructs than through a GPL.
  • Defining a DSL forces you to define your domain exactly, ensuring that it is clear to you too.
  • DSLs are an amazing tool to have declarative ways of describing your processing, and declarative languages tend to make it easier to express transformations to apply to data based on some criteria.
  • A DSL gives you a small, stable core language that lets you express instances of your problems quite nicely.
  • Let’s be honest, it’s really fun to design your own language.

Designing the DSL

First things first: before implementing anything, it’s nice to reflect on the problem we want to solve in an abstract way. The direct approach focuses on handling every single case in your code, whereas designing a DSL is more about defining the kinds of problems you want to solve and providing tools for them, deferring the specific problem instances to whoever will use the DSL. So let’s start by defining the domain.

The domain

In this post of the series, we want to write a DSL that handles GTFS data, i.e. a well-defined set of CSV files, with a defined set of fields for each of them. Each CSV row represents one entity in the file, which will be our unit of work. First of all, this means that your DSL needs tools to:

  • define on which CSV file you are working;
  • define to which fields of a row your transformation applies.

For the sake of simplicity, let’s not type the different fields of the CSV files and consider all values as strings.

While GTFS is the primary source of data in our case, it might not be the only one. In one of the examples above, we mentioned adding missing colors using a lookup table coming from another source than the GTFS file itself. This means that your DSL needs access to some sort of context that can hold this data. In this series, we will consider the context to be some sort of arbitrarily nested dictionary (think JSON object).
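These two kinds of data can be modeled quite directly in Scala. The following is one possible encoding, not a fixed API: a row as a string map (fields are untyped, as decided above), and the context as a small JSON-like ADT with a path-based lookup. The path syntax and names are assumptions for illustration.

```scala
// A possible model for the unit of work and the external context:
// rows are maps from field names to string values, and the context
// is an arbitrarily nested, JSON-object-like dictionary.
object Model {
  type Row = Map[String, String]

  sealed trait Context
  final case class Str(value: String)                extends Context
  final case class Obj(fields: Map[String, Context]) extends Context

  // Look a value up in the context by path, e.g. List("colors", "U1").
  def lookup(ctx: Context, path: List[String]): Option[String] =
    (ctx, path) match {
      case (Str(v), Nil)          => Some(v)
      case (Obj(fs), key :: rest) => fs.get(key).flatMap(lookup(_, rest))
      case _                      => None
    }
}
```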

Now, that should be it for the kind of data we are handling. So what do we want to do with it? The problem we want to solve can be described as follows:

Define a set of rules for some GTFS file, each rule describing, for each entity in that file, a sequence of transformations to apply to it if this entity matches some known pattern. Otherwise leave the entity as it is.

This is a quite generic description of our problem, but it already gives an idea of the basic structure of our DSL. We still have two open questions:

  1. What kinds of known patterns do we want to be able to detect?
  2. What kinds of transformations do we want to apply?

Of course, if we leave these two concepts unconstrained, then we are back to our original problem: we have a generic language. In this kind of design, it’s interesting to start by identifying recurring patterns in the problem instances. You probably already faced some of these problems before starting this, so let’s use them as the base for our design. This also ensures that your DSL can solve your existing problems from the start.

In the examples mentioned above, common patterns we want to recognize include:

  • some field contains a given value;
  • some field is empty;
  • some field matches a regular expression;
  • a combination of the above.

This gives us an idea for a basic expression language for the matching part, which we will detail below.

The very same examples also give us an overview of the kind of transformations we want to apply:

  • set the value of a given field to another value (possibly from the context);
  • remove some part of the value of a given field;
  • delete an entity altogether;
  • a sequence of the above.

These examples, again, give us a good idea of the transformations we want to be able to perform, and we will detail them below.

The basic shape of a rule is as follows (in EBNF syntax):

rule  = "rule" ident "{" "when" match-expr "do" transformations "}"
ident = [a-z][a-z0-9_]*

A rule has a name (an identifier) and defines when an entity is matched and what to do when it matches. The identifier is a nice way of identifying the transformation when debugging; for documentation purposes, one might also want to add a more comprehensive description to it. The sky is the limit! Now let’s dig into the match-expr and transformations expressions.

Matching rows

So let’s keep things simple; we can always add more later. To decide which entities we want to transform, let’s base our decision only on the content of the entity itself, in isolation (i.e. without knowledge of other entities). This matching process is similar to a standard if condition, with a small set of comparison operators and the standard boolean logic operators. In this series we will use the following matching expression language.

match-expr = "field" ident "==" string
           | "field" ident "exists"
           | "field" ident "matches" regex
           | match-expr "and" match-expr
           | match-expr "or" match-expr
           | "not" match-expr
           | "(" match-expr ")"
string     = '"' { any char - '"' } '"'
regex      = '/' { any char - '/' } '/'

This grammar represents the AST and doesn’t encode any precedence or associativity in it, but let’s use the common ones:

  • and has precedence over or;
  • not has precedence over and;
  • and and or are left associative;
  • not is right associative.
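Once parsed (precedence and associativity are the parser’s concern), the grammar above maps naturally to an AST. Here is one possible Scala encoding, together with an evaluator against a row; this is a sketch, and the names are not from the actual implementation.

```scala
// An AST for the match-expr grammar, plus an evaluator over a row
// (a map from field names to string values).
object Matching {
  type Row = Map[String, String]

  sealed trait MatchExpr
  final case class Eq(field: String, value: String)   extends MatchExpr
  final case class Exists(field: String)              extends MatchExpr
  final case class Matches(field: String, re: String) extends MatchExpr
  final case class And(l: MatchExpr, r: MatchExpr)    extends MatchExpr
  final case class Or(l: MatchExpr, r: MatchExpr)     extends MatchExpr
  final case class Not(e: MatchExpr)                  extends MatchExpr

  def eval(expr: MatchExpr, row: Row): Boolean = expr match {
    case Eq(f, v)      => row.get(f).contains(v)
    case Exists(f)     => row.get(f).exists(_.nonEmpty)
    case Matches(f, r) => row.get(f).exists(_.matches(r))
    case And(l, r)     => eval(l, row) && eval(r, row)
    case Or(l, r)      => eval(l, row) || eval(r, row)
    case Not(e)        => !eval(e, row)
  }
}
```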

Transforming rows

For the purpose of this series, we will only define basic transformations and will discuss useful extensions in the wrap-up post in the end. Let’s define our transformation language using the examples given above.

transformations = transformation { "then" transformation }
transformation  = "delete"
                | "replace" "field" ident "by" string
                | "search" regex "in" "field" ident "and" "replace" "with" string

This abstract syntax (hopefully) clarifies that the transformations run sequentially in the given order. It also reads like a sentence, making it easier for non-technical people to understand.

This language oversimplifies the described model, as it doesn’t mention the external context. This will be discussed as an extension in the wrap-up post.

Validating the DSL

As mentioned above, basing the DSL concepts on concrete problem instances is a good way to validate that the DSL is expressive enough to at least describe these problems. As an example in this article, let’s write three rules:

  • all routes whose name is a “U” followed by a number and with type 400 should have type 402;
  • all “ (Berlin)” appearing in the end of a stop name should be removed;
  • stops with the identifier “old-id” should be removed.

The rules could look like this with the above syntax:

rule ubahn_type {
  when field route_short_name matches /U\d+/ and field route_type == "400"
  do replace field route_type by "402"
}

rule remove_berlin {
  when field stop_name matches /.* \(Berlin\)$/
  do search / \(Berlin\)$/ in field stop_name and replace with ""
}

rule remove_old_station {
  when field stop_id == "old-id"
  do delete
}

Of course this is not a complete validation of expressiveness, but it gives a good overview of how this DSL allows expressing these problems.

These examples also show that every rule is defined independently of the others, with no complex nesting of conditions, which makes every rule self-contained and its condition clearly defined. This also means that our DSL needs to define matching semantics for when several rules could apply. In the remainder of this series, we will decide that only the first matching rule is applied, so the order in which rules are defined matters: define the most specific cases first, and the most generic last.
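This first-match semantics can be sketched in a few lines. For brevity, rules are represented here as plain functions rather than the AST; the names (`Engine`, `applyRules`) are illustrative.

```scala
// First-match semantics: rules are tried in definition order and only
// the first one whose condition matches is applied. An unmatched row
// passes through unchanged; None means the entity was deleted.
object Engine {
  type Row = Map[String, String]

  final case class Rule(
      name: String,
      matches: Row => Boolean,
      transform: Row => Option[Row]
  )

  def applyRules(rules: List[Rule], row: Row): Option[Row] =
    rules.find(_.matches(row)) match {
      case Some(rule) => rule.transform(row)
      case None       => Some(row)
    }
}
```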

Conclusion

It’s time to conclude this first post of the series. By now, it should be clear how we can leverage a DSL to:

  • easily add or remove declarative transformation rules when new problems appear or disappear;
  • have a syntax that reads quite naturally and doesn’t require much programming experience to grasp.

The example above is of course just a toy and lacks some features to be completely usable in a production setting. Still, it gives an overview of how a DSL for business rules can be defined and, hopefully, shows how simple it can be to reach a DSL expressive enough to address your problem space.

In the next episode, we will use the concepts defined here to show how this can be implemented in Scala with fs2, and how easy such an implementation can be.
