Designing a GTFS Business Rule Engine — Part 2

Lucas Satabin · Published in Mobimeo Technology · Aug 12, 2021 · 9 min read

In the first part of this blog post series, we sketched basic ideas and proposed a simple design for a DSL describing a match/transform processing pipeline for GTFS files. This was the occasion to have a nice high-level overview of the domain under discussion in this series and allowed us to scope the problem we want to address. It is now time to get to the heart of the matter, and to dive into an implementation of the discussed concepts. In this post, we will present a nice functional way of implementing a business rule engine for the DSL described in the first episode, as well as a few underlying libraries used to do it. Fasten your seat belt, and let’s go for it!

Photo by Mika Baumeister on Unsplash

The tech stack

The code described in this post is written in Scala, and is a simplified implementation that aims to be self-contained for the purpose of this post. The code relies on the Typelevel ecosystem, and more specifically on fs2, a functional streaming library. Using fs2 allows us to leverage the sequential nature of GTFS datasets. As we shall see in the remainder of this post, this allows for expressing the implementation in a concise and natural way, making it easy to maintain, compose, and extend (isn’t that what functional programming is about, after all?).

So, first of all, let’s deal with the administrative overhead, by importing all the love we need.
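In a minimal, self-contained form, those imports could look like this:

    import fs2._
    import cats.syntax.all._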

We first make sure that fs2 streams are in scope. Then, we import all the cats extension methods into scope to get all our types on steroids. These two imports provide us with 90% of all we need for our engine.

Reading CSV files

The core of GTFS data is a set of CSV files, which means that the first thing we need is a way of reading these files and providing a stream of their rows. Even though, as a first approximation, this is an easy task (a row is a line, split at commas, and there you go), having a robust method that handles corner cases properly is crucial to be sure you can process all kinds of CSV files. Every corner case will eventually show up in the data you get from different sources. To this end, we will base our tiny library on the CSV module provided by the fs2-data library.

Using this library is as simple as a single import after having fs2 in scope.
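For reference, that single import is:

    import fs2.data.csv._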

Now that this part is done, we can get to the meat of it, and define a small function that will allow us to parse a stream of characters into a stream of CSV rows. In GTFS, the first row contains the headers, so let’s parse with this knowledge.
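Here is a minimal sketch of such a function (the parseRows name is ours; the lowlevel.rows and lowlevel.headers pipes follow recent fs2-data versions, older releases expose equivalent pipes directly under fs2.data.csv):

    // Parse a stream of characters into CSV rows, using the first row as headers.
    // Parsing may fail on invalid CSV input, hence the RaiseThrowable constraint.
    def parseRows[F[_]: RaiseThrowable](content: Stream[F, Char]): Stream[F, CsvRow[String]] =
      content
        .through(lowlevel.rows[F, Char]())    // raw rows, handling quoting and escaping
        .through(lowlevel.headers[F, String]) // interpret the first row as headers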

CsvRow is a simple class provided by fs2-data that allows for safely accessing the row fields by name or index. This function parses and decodes a stream of characters into a sequence of valid CSV rows that we can later operate on. For the purpose of our processing engine, we won’t need more structured data, but more on this later.

The RaiseThrowable context bound on F indicates that the stream might fail when parsed (in case of invalid CSV input).

In a real-world application, the CSV files would be read from within a ZIP file. To help with this, we could use fs2-io, which provides a streaming I/O API (reading bytes from files, decoding text, …). We won’t go into these details in this post, and it doesn’t change the remainder of the code. One good thing about our functional approach with fs2 is that we can compose and separate the various phases of the processing without changing its core.

In our case we can create a simple stream of characters by using a String, for instance:
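For example (a hypothetical snippet, no I/O involved):

    // A small, made-up CSV snippet turned into a pure stream of characters
    val chars: Stream[Pure, Char] =
      Stream.emits("stop_id,stop_name\nstop1,Alexanderplatz".toList)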

This comes in handy for testing without having to perform any I/O.

The data model

The purpose of this post is to present an implementation of our DSL, and we will take a tree-walking interpreter approach. We might explore different implementations in follow-up posts, but this approach has the advantage of being quite intuitive and easy to implement, as we shall see.

One of the first things we can do is to define the AST for the language described in the previous post.
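Here is a sketch of such an AST (the exact names are illustrative and may not match the original code):

    import scala.util.matching.Regex

    // A rule: a name, a matcher deciding whether a row is concerned,
    // and the transformations to apply to matching rows.
    final case class Rule(name: String, matcher: Matcher, transformations: List[Transformation])

    // The matcher expression language: two base predicates on a field,
    // combined with conjunction and disjunction.
    sealed trait Matcher
    object Matcher {
      final case class Equals(field: String, value: String) extends Matcher
      final case class Matches(field: String, regex: Regex) extends Matcher
      final case class And(left: Matcher, right: Matcher) extends Matcher
      final case class Or(left: Matcher, right: Matcher) extends Matcher
    }

    // A transformation either deletes the row or replaces a value in it.
    sealed trait Transformation
    object Transformation {
      case object Delete extends Transformation
      final case class Set(field: String, value: String) extends Transformation
      final case class SearchAndReplace(field: String, regex: Regex, replacement: String) extends Transformation
    }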

This data model follows closely the language we described in the previous post:

  • A rule consists of a name, a row matcher, and transformations to apply
  • A matcher is a simple expression language with disjunction and conjunction
  • A transformation either deletes a row or replaces the value

This is simple enough and yet allows for modelling a wide collection of transformation rules.

Now that we have defined the data model, we can write some transformation rules. Let’s see how we can encode the rules described in the previous post.
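A possible encoding of those rules, under the assumption of the ADT sketched above (rule names, field names, and regexes are illustrative):

    // Rules for stops.txt: delete stop3, and strip a trailing " (Berlin)" from stop names.
    val stopRules = List(
      Rule(
        "delete-stop3",
        Matcher.Equals("stop_id", "stop3"),
        List(Transformation.Delete)),
      Rule(
        "strip-berlin-suffix",
        Matcher.Matches("stop_name", ".*\\(Berlin\\)$".r),
        List(Transformation.SearchAndReplace("stop_name", "\\s*\\(Berlin\\)$".r, ""))))

    // Rules for routes.txt: replace route type 400 by 402.
    val routeRules = List(
      Rule(
        "fix-route-type",
        Matcher.Equals("route_type", "400"),
        List(Transformation.Set("route_type", "402"))))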

This is a bit verbose, but the ADT defined for this closely follows the abstract syntax, which makes it straightforward to map the concrete syntax from the previous post onto the data model. Of course, this is not handy in real-world applications, and we would need to define an easier way to describe the rules. More on this in a future post.

Now that the first step is done, we can dive deeper into the actual essence of our little project: the rule interpreter itself.

Interpreting the rules

Interpreting the rules is done in several steps:

  • Checking whether the rule matches the current row
  • If yes, apply the corresponding transformations
  • If no, try the next rule

This is done for every row in the CSV files, in order. Let’s describe each step individually, and then glue everything together. This is something our functional approach makes super easy to perform.

The first step is the matching step. Given a CSV row and a matcher, determine whether the row matches the criteria.
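A sketch of this matching function over the Matcher ADT sketched above:

    // Recursively evaluate a matcher against a row. A missing field never matches.
    def matches(row: CsvRow[String], matcher: Matcher): Boolean =
      matcher match {
        case Matcher.Equals(field, value)  => row(field).contains(value)
        case Matcher.Matches(field, regex) => row(field).exists(regex.matches(_)) // full-string match (Scala 2.13+)
        case Matcher.And(left, right)      => matches(row, left) && matches(row, right)
        case Matcher.Or(left, right)       => matches(row, left) || matches(row, right)
      }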

This function goes over all possible matcher cases and recursively checks whether the predicate holds for the provided row. The recursive nature of the function stems from the structure of matcher expressions, in which simple expressions can be combined into more complex ones. Both base cases simply check that the field exists in the row, and that the value within the field equals the expected value or matches the regular expression. row(field) returns None if the field doesn’t exist in the row, which results in false, i.e. no match.

Now that we know how to match a row according to a predicate, let’s dive into the interpretation of the transformations.

As usual, let’s start with the atomic function that applies a single transformation to a row. This function returns the transformed row, if any, or None if the row has to be deleted. We make use of the none and .some cats extensions, which help with type inference compared to using the Option constructors; you can have a look at this blog post for some helpful tools provided by the library. We take advantage of the power provided by the CsvRow class to concisely write the transformation on the row. You can refer to the class documentation for more details.
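A sketch, assuming the set and modify helpers on CsvRow:

    // Apply a single transformation: None means the row is deleted.
    def applyTransformation(row: CsvRow[String], transformation: Transformation): Option[CsvRow[String]] =
      transformation match {
        case Transformation.Delete =>
          none
        case Transformation.Set(field, value) =>
          row.set(field, value).some
        case Transformation.SearchAndReplace(field, regex, replacement) =>
          row.modify(field)(value => regex.replaceAllIn(value, replacement)).some
      }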
So far, we have defined the basic operations of matching a row and applying a single transformation. It is now time to assemble these functions to select and apply the first matching rule in a set of rules. As usual, having a functional programming style makes it super concise and easy to write, avoiding a lot of if-then-else statements by leveraging the expressiveness and power brought by higher-order functions.
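A sketch of this assembly, built on the two functions above (the applyRules name is ours):

    // Find the first matching rule and apply its transformations in order.
    // None means the row has been deleted by one of the transformations.
    def applyRules(rules: List[Rule])(row: CsvRow[String]): Option[CsvRow[String]] =
      rules.find(rule => matches(row, rule.matcher)) match {
        case Some(rule) =>
          rule.transformations.foldLeft(row.some) { (current, transformation) =>
            current.flatMap(applyTransformation(_, transformation))
          }
        case None =>
          row.some
      }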

The find function selects the first rule that matches, returning it, or None if none matches. If no rule matches this row, we simply return it unchanged; otherwise, the rule’s transformations are applied sequentially, in order of definition. If the row is deleted, subsequent transformations have no effect. The semantics defined here is “first match wins”, so only the first matching rule of a set is selected. If several transformations are required for potentially overlapping rows, you can define several rule sets and apply them one after the other. There are ways to make this work efficiently without actually going over the files several times to apply several sets of transformation rules. But more on that in a later post.

Also note in the snippet above that we always go over the full list of transformations, even if one of them deleted the row and no further transformation can have any effect. The combination of foldLeft and flatMap makes it work as expected, but we could short-circuit as soon as a transformation deletes a row. This could be done by using foldM or tailRecM from cats.
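For instance, inside the branch that found a matching rule, the fold in the sketch above could be replaced with cats’ foldM over the Option monad, which stops as soon as a transformation returns None:

    // Short-circuiting variant: stops at the first transformation that deletes the row.
    rule.transformations.foldM(row)((current, transformation) => applyTransformation(current, transformation))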

We now have all the machinery in place to process a single row. The last step is to iterate over all the rows in a CSV file, and create the new file out of the processing of each row. To this end, we can use the powerful fs2 operators. The good news is: there are two operators that look like they were made for this blog post, so the end result is super easy to write.
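A sketch of the resulting pipe (the transform name is ours):

    // Transform every row and drop the deleted ones.
    def transform[F[_]](rules: List[Rule]): Pipe[F, CsvRow[String], CsvRow[String]] =
      _.map(applyRules(rules)).unNone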

Say we have a stream of CSV rows, as discussed above. We can map over every row and transform it, resulting in a stream of Option[CsvRow[String]], where None represents a row that was deleted. We can then use the unNone operator, which drops every None in the stream and extracts the Some values, returning a stream of CsvRow[String] containing the result of applying the rules to each row in the original CSV stream.

The last step is to read a CSV file into a stream of CSV rows, pipe the stream into our transformation pipeline and execute it, to collect the result into a new CSV file. This can be done by a simple process function.
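A sketch of such a process function, under the assumptions above (the exact signature of encodeUsingFirstHeaders and its implicit requirements may vary across fs2-data versions):

    // Parse, transform, re-encode, and concatenate everything into a single CSV string.
    def process[F[_]: RaiseThrowable](rules: List[Rule])(content: Stream[F, Char]): Stream[F, String] =
      parseRows(content)
        .through(transform(rules))
        .through(encodeUsingFirstHeaders())
        .foldMonoid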

The encodeUsingFirstHeaders() pipe serializes the resulting rows into strings, escaping what needs to be escaped to get a valid CSV file back. This is part of the fs2-data library, which does all the work for us. We then use the Monoid instance available on strings from cats to concatenate all the encoded rows into one single string. This requires some basic knowledge of category theory, which cats users are familiar with, but allows for expressing powerful, well-defined operations in a concise way.

And our work here is done!

Show me it actually works!

Well, you don’t have to blindly trust this assertion, and it’s always better to make sure the code we wrote is actually doing what we expect it to do. In a real world project, we would write unit tests for each matcher and transformation, testing all the semantics we defined in the previous post. For the purpose of this post, we will only test it on a simple example CSV file. Let’s assume we have these two CSV files, one representing routes, the other representing stops. We will test both our rule sets on these files.
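The original sample files are not reproduced here, so let’s use these small, made-up examples instead:

    // Hypothetical stops.txt content
    val stops: String =
      """stop_id,stop_name
        |stop1,Hauptbahnhof (Berlin)
        |stop2,Alexanderplatz (Berlin)
        |stop3,Some Stop
        |stop4,Potsdam Hbf""".stripMargin

    // Hypothetical routes.txt content
    val routes: String =
      """route_id,route_short_name,route_type
        |route1,S1,400
        |route2,RE1,100""".stripMargin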

Applying the stop rules to the stops CSV file should remove the stop with identifier stop3 and drop the “(Berlin)” suffix at the end of the stop names, if present. Let’s verify this right away by calling the process function on our sample CSV file.
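A sketch of that call, using fs2’s Fallible effect to run the whole pipeline purely, without any I/O (based on the hypothetical data above):

    // Expected: stop3 is removed, the " (Berlin)" suffix is stripped from stop1 and stop2,
    // and stop4 is left untouched.
    val transformedStops: Either[Throwable, String] =
      process[Fallible](stopRules)(Stream.emits(stops.toList)).compile.string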

This looks indeed promising. This example shows that the following semantics seem to be implemented as expected:

  • The first matching rule is selected
  • Rows can be deleted
  • Regular expression replacement works
  • Rows with no matching rules are left unchanged

Similarly, we can apply the second set of rules to the route CSV file, to check that values can be replaced.
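Again a sketch, with the hypothetical routes file from above:

    // Expected: route1's route_type becomes 402, route2 is left unchanged.
    val transformedRoutes: Either[Throwable, String] =
      process[Fallible](routeRules)(Stream.emits(routes.toList)).compile.string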

As expected, the 400 is replaced by 402.

Conclusion

This is it for the second episode of this series of posts. We saw how easy and concise the implementation of the rule engine defined in the previous post can be. By now, you should have a clear overview of how this can be done, and be able to add new matchers or transformations easily. We kept things simple with non-side-effecting rules (side-effecting ones might be covered in a subsequent post), but this shows the concepts that can be used to implement this kind of processing engine.

If you want to try out the code introduced in this post, and extend it, you can do so directly from your browser by going to this scastie.
