Using Finite State Machines in Baxi’s Ride Booking System

Published in

Baxi Voice

14 min readApr 18, 2017

A Finite State Machine (also known as a Finite State Automaton) is a mathematical model of computation which describes a definite state of a system at a given point of time. This blog post describes how we overcame the typical drawbacks attributed to FSMs and used its core concepts to architect a reliable, maintainable, and (most importantly) highly scalable back-end.

tl;dr : We used an “ancient” mathematical computational model called Finite State Machines and used it to build a (modern) kick-ass core back-end system.

1. Life, the Tech Universe, and Everything…

Before diving in to the nuances of the technical considerations and the thought process that went into building a modern, production-ready system based on a primitive computing concept, here’s some background to our tech stack and our code philosophy.

1.1 The Back-End Tech Stack
Here at Baxi, we’ve come to adore JavaScript. A lot of our initial code was written in PHP, but we’ve now completely migrated out to NodeJS for better performance. Our RESTful APIs are written in NodeJS with Express. Though we use MySQL for transactional data, everything else is stored in MongoDB, which gives us a bunch of awesome features like geo-queries which we rely upon heavily.

All of our infrastructure is hosted on AWS.

1.2 Code Organisation and Structure
Here’s a quick look at the way our back-end code is structured.

1.3 Code Philosophy and Patterns
Back in the days of yore, three wise men (in all honesty, they were just the founding members of Baxi’s engineering team) pooled together their collective wisdom (gained from numerous personal projects) and defined the best practices for the core API codebase, which culminated in the code organisation above.

Here is a quick explanation of the code structure -

i) server.js — This is the main server file which the NodeJS process runs. It is the entry point anything going in to the API.
ii) middleware — This contains the middleware that runs on all HTTP requests before it hits the code base. For us, this handles authentication (to identify a user) as well as authorisation (to check if the requested resource is permitted to be accessed by the user accessing it). Being something that runs on each request, it shields our core functions from unauthenticated and unauthorised requests, and acts as a filter for malicious requests as they never go beyond this point.
iii) routes — This comprises of the routing decision layer. Simply put, it is a map of HTTP verbs (GET/PUT/POST/DELETE) against the code blocks to be executed for each route.
iv) controllers — Controllers handle validation on data and make basic decisions for orchestrating the sequence of micro-services being called.
v) state-machines — This contains the FSM definitions and related bindings. This is where all the FSM states and transitions are defined.
vi) services — This is where the magic happens. These are data driven micro-services which take a set of input parameters and return a result after processing it based on the business logic.
vii) models — Data manipulation layer. This an abstraction written over data CRUD operations in the databases.
viii) config — Contains the system-wide parameters. These are globally accessible constants which define code behaviour. It also contains API response structures and database connection strings.

2. Finite State Machines

A Finite State Machine (or a Finite State Automaton) is an abstract machine that can be exactly one of a finite number of possible states at a given point of time. A state change (also referred to as a transition) may occur in response to an external input. A Finite State Machine is sufficiently described by i) the initial state, ii) all possible states, and iii) Conditions for each possible transition.

FSMs were a rage back in the days of game development to control execution flows. As expectations and requirements grew, people become more wary of using FSMs and eventually wrote it off as a thing of the past as more and more easy to use frameworks and patterns emerged, which provide easier abstractions and do not require programmers to interact with state machines directly.

2.1 Benefits of using FSMs

2.1.1 Stability
Since FSMs describe the state of a system (which is driven by data in most of the scenarios), they can be leveraged to provide stability to a system. This means that at any given point of time, the state of the system (read: the set of data defining the system at that instant) is predictable and consistent. This simply means that one can be sure of the conditions that have led the system to be in such a state, and also of the conditions and operations that are allowed, given the current set of data (read: the current state of the system).

2.1.2 Generalisation
With Finite State Machines, generalisation of logic and validation of data become very easy to tackle. Since an FSM’s state is essentially defined by a set of data, the easiest interpretation of the a basic validation is— Given the state of the system, is the operation being attempted allowed?

Prevention is better than cure, and with this philosophy, one can just block out-of-sequence operations (as compared to letting them run, and then correcting data and initiating a rollback on receiving unexpected results). This approach looked very promising to us, since any out-of-sequence request is preemptively blocked from actual execution. With respect to the logical sequence of events while booking a ride, this implies that rides that haven’t been started can’t be ended, and rides that haven’t been ended can’t be paid for, and rides that haven’t been paid for can’t be rated. As obvious as it sounds (I’m pretty sure some of you would be like duuuhhh at this point of time), it is a big problem for us programmers. The best part of it all? This sequence-sanity can be maintained in a generic way, without writing separate validations for each sequence.

2.1.3 Readability
FSMs can be leveraged to build pragmatic and intuitive APIs. This stems from the very obvious connection between real world events and the events that trigger state changes on a Finite State Machine, and keeps APIs readable and intuitive.

For example, in Pacman, an event called MOVE could be used to actually move Pacman, and change his state from standing to moving. Similarly, an event called COLLIDE could be used to trigger a collision between Pacman and a ghost, change his state from moving to collided.

2.1.4 Maintainability
FSMs allow for very high degrees of modularity, since theoretically each event can have a totally isolated, independent piece of code. Moreover, if the state transition logic is written keeping idempotence in mind, making modifications to business logic becomes very easy. Changing the sequence of events would then become a matter of changing the wiring in the rules defined in state-machines, without having to touch the underlying business logic or having to write any new validations or checks. Very convenient, right?

2.2 Drawbacks and practical problems
The world, however, is not rainbows and unicorns. Convenience in certain aspects has to be traded off against inconvenience in other aspects. You’d always have to be inconvenienced to write chunks of code to build something that provides convenience in the longer term.

Running FSMs in production applications is, by no means, a breeze. As we found out during our quest to build a near-ideal system using pure Finite State Automatons, we crossed paths with several practical problems that have (possibly) deterred people from using a pure FSM for their core business flows.

2.2.1 Concurrency
Gone are the days when large physical servers were used to run live production code. Most modern-day systems use cloud infrastructure, where smaller Virtual Machines dominate the cyberspace and Load Balancers are used generously. Though it is a boon maintaining infrastructure, it is a slippery slope for back-end programmers, as they’d have to worry about issues arising due to high degrees of concurrency in such setups.

Imagine a simple scenario where there are two servers, (let’s call them Alice and Bob), running behind a Load Balancer. A client, due to poor network conditions, fails to receive an acknowledgement for a request that was just sent, and so sends the same API request again, resulting in the request being sent twice in quick succession (let’s call these requests R1 and R2). R1 is routed to Alice, but before Alice sends out a response, the Load Balancer receives R2. Aware of the fact that Alice is probably busy, the Load Balancer routes R2 to Bob. This would end up being problematic in the context of state machines. This is because by design, Finite State Machines are meant to operate on definite, finite number of known states, and not intermediate state where rule sets aren’t defined. Unexpected states would create unexpected problems, which can even cause race-around conditions, from which a system can not automatically recover. In a programmer’s world, that is catastrophic.

2.2.2 Scalability
Most FSM implementations are written to be used in-memory, in a single thread, and do not provide for any form of persistence in shared memory. Due to this, it is not easy to scale load horizontally simply by adding more servers to the fray. This, compounded with the problem of concurrency described earlier makes it a counter-intuitive choice for use in production systems.

2.2.3 Unorthodox application
Personally, I haven’t come across a single back-end architecture which relies entirely on a pure Finite State Machine. After encountering the problems mentioned so far, I can imagine why it would discourage people from using state machines in production.

Due to the low adoption of FSMs in modern back-end architectures, there aren’t many resources or established best practices regarding the usage of FSMs in high-traffic production systems. It was fairly difficult for us to plan everything in such a way that we could get maximum benefits out of this while avoiding redundant code at the same time. There were times when potential problems got the best of us, and ‘let’s wing it’ became our motto for tackling them.

2.2.4 Libraries and supporting resources
Almost all of the FSM implementations that we explored had some drawback or the other which prevented us from locking in on one for use on a live system. They were mostly synchronous, which would be inherently problematic because a lot of real world operations involve asynchronous database read/writes.

A quick look at most of the libraries also revealed that they were written to be reactive, as compared to something that causes events to be triggered on a system. This was, again, not in line with our idea of Finite State Machines as we were looking for something that would act on our system, and not merely react to things happening on it.

2.2.5 Integration with live systems
Design decisions are even more difficult if they’re to be made for an existing system that is already running in production for a fairly large number of users. A lot of our design and architectural decisions ended up being heavily influenced by our first set of legacy code written in PHP. This was because around 75,000 of our users were already using the PHP APIs, and upgrading them to the new FSM-based system was something that would have to be done gradually and not in a single-shot compulsory upgrade.

To move out of a widely used system, a lot of iterations are usually needed before the elusive production-perfect system can be created and deployed — a system which strikes a balance between operational convenience and legacy support. The number of iterations goes up exponentially when FSMs are involved!

2.2.6 Implementation hurdles
Since all of the FSM libraries that we explored didn’t fit in with our idea of how we’d use them, we zeroed in on one, considering ease of use as the deciding factor. However, by doing so, we also created a few hurdles for ourselves on the way. For asynchronous state transitions, explicit intervention was required to register a successful state transition. This was problematic since there would be multiple entry points in the system that would trigger transitions, and if we were to manually call transitions at each one of them, we were probably better off not using FSMs at all! Any self-respecting programmer would go to great lengths to avoid redundant code anyway. To avoid writing redundant code to manually trigger transitions at each exit point, we needed to find a generic way of doing so.

2.3 FSMs in Baxi’s ride booking system

2.4 Explanations and examples
Partly due to our legacy PHP APIs, and partly due to the potential business use-cases, we decided to have two separate state machines for the ride booking flow. One state machine would handle ride requests from a passenger’s perspective, and another state machine would handle ride requests for a driver. The reason for this was that on receiving a ride request, drivers on Baxi’s platform have an option to either accept or reject the request. In case the driver does not accept the request, a new ride request is generated for another driver. One ride request from a passenger can generate many ride requests for drivers, one for each driver in the vicinity. Each customer and each driver, at a given point of time in the course of a booking request, only has visibility and authorisation to act on their own ride FSM. The overall orchestration is written in such a way that both the customer and the driver FSMs react to events on each other.

For example, a ride cancellation can be request either from the customer or from the driver, and either of the state machines are allowed to trigger off a Cancel event on the other.

2.5 Overcoming the problems
2.5.1 Code structuring: Being smart
Circular dependency errors are a pain to debug in large codebases (like ours, even more so since we use JavaScript, which is brilliant if you like run-time crashes). To handle such errors, the trick lies in organising the code smartly. As a general thumb rule — never break the hierarchy of organisation, unless the file making an exception acts like an isolated interface. For example, services are higher in the hierarchy than models. So models should not import a file present in services. Similarly, controllers can include services or even models, but services or models can not include controllers. services can import state-machines as an exception, only if it’s an isolated service and is imported at exactly one interface. This strategy works really well in avoiding any confusion that may introduce circular dependencies in the code. Extra care was taken to ensure that the services using state machines were structured in such a way that they were used in complete isolation, and only by a controller that handled orchestration of events for the state machine.

2.5.2 Pragmatism: Using actions to initiate transitions
Designing a good API is like designing a UI for programmers. It has to be very intuitive, and is probably not done right if it requires much explanation. For this to happen in such a way that APIs are readable for both the programmer consuming the APIs and for the developer maintaining them — some common ground needs to be found. So, while designing the API, we decided to name the input parameters in such a way that they coincided with the relevant FSM event names.

The way we envisioned this to work was that these input parameters to the API would attempt to trigger off an event of the same name on the FSM. Additional parameters along with event would be used for validation and operation.

For example — to Finish a ride, the input parameters for the trip finish API would look something like this -

Sample Input for the Finish Ride API

2.5.3 Concurrency
Due to the drawbacks mentioned previously in this post, the below approach is what works best -

Lock FSM → Instruct the FSM to start an asynchronous transition → Execute business logic → Trigger transition to commit, or cancel transition to rollback → Release lock.

If a duplicate request arrives during the time in which the request is already being processed, then the lock discards the duplicate request right away.

2.5.4 Scalability
Since the FSM library of our choice was designed to retain the current state in memory, we needed to ensure that different actors (passengers and drivers) are able to interact with the system independently and yet, at the same time, maintain the synchronisation needed between their respective state machines.

For this, we decided to persist a log of the state transitions in a database, which came at the cost of a small delay incurred by the database read. But this was okay, since it a small trade-off that solved the problem of scalability for us. With this arrangement, any incoming request would only operate on an instantaneous state machine, which would always follow the construct → operate → persist the successful transition in the database → deconstruct cycle.

2.5.5 Asynchronous behaviour
The choice of library forced us in to awkward corners — from where only true JavaScript Ninjas could hole out of unscathed. The FSM library that we chose to use was reactive by design. From the cycle above, the only missing piece of the puzzle was operate. It provided no functionality out of the box that would let us do custom operations along with executing an asynchronous state transition. The only way around this was to manipulate existing functions in such a way that we could inject business logic between the parts where a state transition is triggered and where it is successfully committed… all while handling errors and exceptions in the massive code written to handle our business logic.

The snippet below shows how we pulled off this near-magical feat.

Source code for the final state machine implementation

This, in my opinion has, by far, been one of the most interesting things that I’ve ever done with JavaScript. The FSM callbacks call a generic handler, which takes the function that executes the business logic for us as an input. Then, the FSM’s callback is extracted, and replaced by the business logic function, and then once the business logic is executed, the FSM’s original callback is called. Things like logging the transition, and instructing the state machine on whether a state change needs to be registered are also handled generically here. (No redundant code, yay.) If you’re looking for a more detailed explanation, drop me a message and I’d be happy to explain it to you!

3. Conclusion
Though Finite State Machines are a core computational model, we ran in to numerous problems on our quest to implement it on a production level system. This was further complicated by a lack of reference material and resources. However, in retrospect, despite all these hurdles, using FSMs turned out to be very beneficial for us in the end. It was just a matter of taking the risk of spending time on exploring the road less travelled, and it paid off really well — I can vouch for the fact that it was one of the best design decision we have ever taken at Baxi.

As a result, maintenance and modifications to the back-end have become really easy now. New flows can be added at mind-boggling speeds. In fact, we recently introduced a new feature in our app where users are able to connect with autorickshaw drivers over a phone call through the app. Despite a fair amount of complexity, it took me a little under an hour and a half from idea to production— simply because the modular FSM design gives us the confidence and flexibility to write, test and deploy code quickly.

Using Finite State Machines in Baxi’s Ride Booking System

1. Life, the Tech Universe, and Everything…

Written by Sahil Narain