Soft Real-time, Fault Tolerant and Scalable E-Commerce Transaction Platform @ redBus powered by Erlang/OTP

An e-commerce transaction platform must manage the complete life-cycle of an item: its creation, its mutation through various intermediate states, and analytics over its live and historical data. This blog takes you through the journey of rewriting the decade-old system (formerly in Java) in Erlang/OTP. Erlang/OTP pushed us to take the concept of responsiveness even further and ensure that the overall system (including its use of external data stores) remains responsive at all times.

The Motivation

“If it ain’t broke, don’t fix it.”

Interestingly, nothing was broken, and yet we changed it. The decade-old system worked, but it "just" worked. It was getting difficult to add new features; whenever changes were made, the system would fail in interesting new ways, enough to cause anxiety among the developers and QA (and most of all the managers). Another issue was lurking: data management was broken and had to be handled manually to keep the system optimized. Because of these constraints, numerous workarounds had been built into surrounding systems to launch new features and keep the competitive edge.
There were two choices: refactor the existing codebase incrementally, or rewrite it entirely. The decision was not without its complexities, since any re-architecture would impact the rest of the systems as well. It was a tough call, but we decided to bite the bullet and attempt a rewrite.

Why move away from Java to Erlang/OTP?

There was nothing in favor of Erlang/OTP, at least on the surface.

  • Developers had no clue about Erlang/OTP (barring one)
  • NewRelic, used extensively for application monitoring within the organization, does not support Erlang.
  • Although Erlang/OTP has proven production use cases, it was a new concept within the organization.

So, why did we choose an entirely new technology to replace a proven workhorse (Java)? The short answer: "Erlang satisfies all the core requirements with the help of the battle-tested BEAM (a soft real-time virtual machine) and OTP (modules and standards)." Actor-based modeling was a big win as well, letting us build complex functionality cleanly. The language itself was a bit of a hurdle, considering most of the devs had worked with C-like languages since the beginning of their careers. Getting around this took a substantial amount of work, but that is another story.
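To make the actor idea concrete, here is a toy counter actor (illustrative only, not platform code): each counter is an isolated Erlang process that owns its own state and can only be reached through message passing.

```erlang
-module(counter).
-export([start/0, increment/1, value/1]).

%% Spawn a new counter actor holding its own private state (0).
start() ->
    spawn(fun() -> loop(0) end).

%% Asynchronous message: fire and forget.
increment(Pid) ->
    Pid ! increment,
    ok.

%% Synchronous message: ask for the value and wait for the reply.
value(Pid) ->
    Pid ! {value, self()},
    receive
        {counter_value, N} -> N
    after 1000 ->
        timeout
    end.

%% The actor's receive loop; state lives only in this process.
loop(N) ->
    receive
        increment ->
            loop(N + 1);
        {value, From} ->
            From ! {counter_value, N},
            loop(N)
    end.
```

No locks, no shared memory: all coordination happens through the mailbox, which is what makes complex concurrent functionality comparatively easy to reason about.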

The Guiding Principles

The following guiding principles shaped our discussions of the various strategies and the alternatives we chose. It is important to note that no single principle is implemented in its purest form; each is tempered to satisfy the rest.

  • Soft Real-time
  • Move forward release (rollback is not an option)
  • Strong Consistency
  • Fault Tolerant
  • Scalable
  • Blazingly fast time to market of new features
  • Clear definition of failure
  • Automatic Data Management
  • Consistent and guaranteed asynchronous event system
  • Self Healing
  • Self Monitoring
  • Live Analytics
  • Retain Old Data (no loss of old information)
  • Zero Loss Migration
  • Data Fall-back Mechanism Across Polyglot DataStores

The design and architecture of the system is the result of trying to satisfy all these requirements. Erlang/OTP is particularly well suited to many of them, namely soft real-time behavior, fault tolerance, scalability and clear error-handling semantics.

Why is soft real-time so crucial in production?

An observant reader will notice that speed of execution does not appear among the guiding principles. The performance aspect (especially raw execution speed) is intentionally left out: compute capacity keeps improving, and any modern software architecture already has to address scalability.
Huh! That still does not answer why soft real-time is so critical. If you have used a digital device (especially a computer) long enough, you know that it occasionally becomes unresponsive. For production systems, that is catastrophic. Think about running an offer on an e-commerce site while staring at a meager sales chart because the systems crumble under load. Or about systems choking on an application bug where one piece of functionality consumes most of the CPU and starves the rest. Sounds familiar? Writing such concurrent functionality correctly takes far more expert programmers in traditional programming languages than it does in Erlang.

Erlang Cowboy Process Tree

Let's look at the process tree of Cowboy, an Erlang framework for building HTTP servers. The business logic is ultimately executed within the "protocol process" Erlang process. It should be clear that every client request gets a separate Erlang process, so each request is treated independently and with the same priority. Note also that the connection acceptor (the "ranch acceptor") is itself a regular Erlang process. The diagram is an oversimplification of a production process tree (in reality there are many "ranch acceptor" processes), but I believe you get the picture.
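A minimal Cowboy 2 handler sketch (the module name and response are hypothetical, not our production code) shows the code that runs inside one of those per-request protocol processes:

```erlang
-module(ping_handler).
-export([init/2]).

%% Cowboy spawns a fresh process per connection/request; this callback
%% runs inside that "protocol process", so a slow or crashing request
%% cannot affect any other in-flight request.
init(Req0, State) ->
    Req = cowboy_req:reply(200,
                           #{<<"content-type">> => <<"text/plain">>},
                           <<"pong">>,
                           Req0),
    {ok, Req, State}.
```

The handler would be wired to a route such as `"/ping"` via `cowboy_router:compile/1` when starting the listener.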

Additionally, the system must be able to absorb far more load than its provisioned capacity, to account for unpredictable spikes in traffic. There will be some failures, but the system should remain responsive (again, soft real-time to the rescue) while continuing to serve provisioned traffic successfully. Our benchmarking suites in staging have shown that the system continues to serve successfully amid growing failures as load increases. Time and again the system has proven in production that it can absorb sudden bursts of traffic with only limited failures, where the old system would have choked for the duration of the spike.

“Whatever can go wrong, will go wrong.” — Murphy’s law

Soft real-time is an approximate guarantee, but it does make one bold statement.

“No single piece of logic can hog the entire system.”

This is why we chose Erlang/OTP over every other programming language (or framework) out there: we value our customers and treat all of them equally. No single transaction may hog the entire system. The next time you book at redBus, you get the same treatment as every other active customer; even if you arrive late to a big offer sale, we make sure you do not miss the bus :)
A lot goes into building end-to-end soft real-time systems, but let's leave that for later. That said, the goodness within Erlang/OTP allowed us to achieve soft real-time behavior even while using external systems (such as data stores). Once your base infrastructure is designed with responsiveness as a primary objective, you get into the mindset of building responsive systems and extend that logic across your entire stack.
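One way that goodness shows up in practice: wrapping an external call in its own process with a deadline. The sketch below (with a simulated datastore lookup, not our real client) returns `{error, timeout}` instead of letting a slow store block the request process:

```erlang
-module(store_client).
-export([fetch/2]).

%% Run a (simulated) datastore lookup in its own process and give up
%% after Timeout milliseconds, so one slow external call can never make
%% the calling request process unresponsive.
fetch(Key, Timeout) ->
    Caller = self(),
    Ref = make_ref(),
    Pid = spawn(fun() -> Caller ! {Ref, do_fetch(Key)} end),
    receive
        {Ref, Result} -> {ok, Result}
    after Timeout ->
        exit(Pid, kill),   %% abandon the straggler
        {error, timeout}
    end.

%% Stand-in for a real datastore call.
do_fetch(slow_key) ->
    timer:sleep(infinity);          %% simulate a hung datastore
do_fetch(Key) ->
    timer:sleep(5),                 %% simulate normal I/O latency
    {Key, value}.
```

In production one would typically use `gen_server:call/3` with an explicit timeout for the same effect, but the principle is the same: every external interaction carries a deadline.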

How do we deal with failures?

Nothing known is left to chance. That is easier said than done, but given the overall objective you would typically deal with the following:

  • Monitor as much as possible as real-time metrics that can be read back programmatically, to enable self-monitoring.
  • Ensure that you log only what is necessary.
  • Log errors on disk (with rotation) and post issues to external services like Sentry.
  • Log general statistics that need to be correlated to a statistics log file, which can then be consumed by tools like Kibana.
  • Plan for live tracing (can you switch on event tracing dynamically?).
  • Package the application in a standard package (deb, rpm or docker if you like) and use a manager like systemd to restart in case of critical failures.
Erlang Supervisor Behavior
  • Have a clean unified approach to handle unknown failures within your application. The Erlang/OTP supervisor behavior works beautifully to solve this for us comprehensively.
  • Ensure that the system handles known failures seamlessly and has an automated plan for them. This includes retrying external services on network failures, using exponential backoff (or something of the sort) to avoid flooding them, and applying backpressure to avoid the dynamo effect. Erlang/OTP's message passing got us worrying about all of these scenarios from day one.
  • Ensure that there is a data fall-back strategy to query an alternate datastore (of a different class/type) when the primary datastore fails to respond in time.
  • Nothing is permanent, so do not assume the availability of any node (application or datastore). Plan for the worst, to the point of being able to restart any node in production with minimal business loss and zero loss of eventually consistent data.
Real-time overall API HTTP codes in production system
Real-time Specific API HTTP codes in production system
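The supervisor behavior mentioned above takes only a few lines. A sketch, assuming a hypothetical `txn_worker` module (our real supervision tree is larger, but the shape is the same):

```erlang
-module(txn_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% one_for_one: if a worker crashes on an unknown failure, only that
%% worker is restarted; at most 5 restarts within 10 seconds before the
%% supervisor itself gives up and escalates to its own supervisor.
init([]) ->
    SupFlags = #{strategy => one_for_one,
                 intensity => 5,
                 period => 10},
    Children = [#{id => txn_worker,
                  start => {txn_worker, start_link, []},
                  restart => permanent,
                  shutdown => 5000,
                  type => worker}],
    {ok, {SupFlags, Children}}.
```

The "let it crash" philosophy follows from this: unknown failures kill the worker, and the supervisor restores a known-good state instead of the code trying to handle every conceivable error inline.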

Extensive Real-time Monitoring

An extensive set of metrics is built, powered by DalmatinerDB and served via Grafana. Measurements have a resolution of 1 second, and the screenshots are just an illustration of the possible visualizations. Many more metrics (both business and system) are continuously pushed to DalmatinerDB and served via Grafana. These metrics can also be queried programmatically via DalmatinerDB's DQL for automated system and business monitoring.
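As an illustration (not the production publisher), a 1-second metric ticker can be as small as a single process that samples VM counters and hands them to a caller-supplied publish fun:

```erlang
-module(metric_ticker).
-export([start/1]).

%% Sample a few VM metrics once per second and pass them to PublishFun;
%% in production the fun would push to a time-series store such as
%% DalmatinerDB instead of whatever the caller wires in here.
start(PublishFun) ->
    spawn(fun() -> loop(PublishFun) end).

loop(PublishFun) ->
    Sample = [{run_queue, erlang:statistics(run_queue)},
              {process_count, erlang:system_info(process_count)},
              {memory_total, erlang:memory(total)}],
    PublishFun(Sample),
    timer:sleep(1000),
    loop(PublishFun).
```

Because the ticker is its own process, a slow metrics backend cannot stall any business flow; at worst the ticker itself falls behind.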

Metric Publishing and Visualization

We can roll-out new features faster than ever

The functional nature of Erlang helped us a lot in keeping the codebase easy to reason about and change. The lack of strong types posed a challenge at first, but once we embraced Dialyzer it was not so bad. Tools like Elvis also helped keep the codebase sane. Erlang being functional (or functional enough) proves very valuable during code reviews and debugging sessions. The team still needs to follow good programming practice (no escaping that), but shedding the overhead of OOP let us take a lot less time than we would have otherwise.
There is no fun in releasing software faster when your test cycle is much longer and more complicated. We introduced property-based testing (say hello to PropEr) to help with exactly that. Although much is still left to be desired, all the base models are well covered for now.
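To give a flavor of property-based testing without pulling in the PropEr dependency, here is a hand-rolled sketch of the idea; PropEr would express the same invariant with `?FORALL` and also shrink counterexamples. The booking/cancellation invariant is a toy, not one of our actual models:

```erlang
-module(prop_sketch).
-export([check/1]).

%% Property under test: booking a random seat and then cancelling it
%% always restores the original free-seat list. Run N random trials.
check(0) ->
    true;
check(N) ->
    Free = [rand:uniform(40) || _ <- lists:seq(1, rand:uniform(10))],
    Seat = rand:uniform(40),
    AfterBook = lists:delete(Seat, Free),
    AfterCancel = case lists:member(Seat, Free) of
                      true  -> [Seat | AfterBook];
                      false -> AfterBook   %% seat was never free
                  end,
    case lists:sort(AfterCancel) =:= lists:sort(Free) of
        true  -> check(N - 1);
        false -> {counterexample, Free, Seat}
    end.
```

The win over example-based tests is that the invariant, not a handful of hand-picked cases, is what gets exercised, which is exactly what makes long test cycles shorter.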

We can serve massive amount of historical data forever

This is fueled both by the choice of the base data access model and by polyglot data stores, especially a time-series database (in this case RiakTS). Deliberate data duplication makes faster data access patterns possible in new and interesting ways. This is not without its challenges: data consistency (which is eventual) becomes critical. Once we solved guaranteed eventual consistency correctly, the polyglot datastores allowed us to build many complicated use cases faster than ever.

Polyglot DataStores

Are you truly soft real-time?

BEAM, the Erlang virtual machine, has a concept of reductions. Dig deep enough and you will find that it is not that hard to make your application end-to-end soft real-time once you fully embrace preemption. Additionally, you get many measurement hooks for free, which can be run in production to measure system metrics within the business flow. This has very interesting ramifications: we were able to build sufficiently simple logic that measures and feeds information back into the system, making cost prediction possible. It is not hard to build a simplified quality-of-service module on top of this; if you can derive the priority and cost of a service along with the current system resources, attaching it to the fast or slow path becomes trivial in Erlang/OTP.
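Reductions are directly observable. A small helper (illustrative, not our QoS module) measures the reduction cost of a fun, in the same unit BEAM uses when it preempts a process after a budget of a few thousand reductions:

```erlang
-module(reduction_peek).
-export([cost/1]).

%% Return the number of reductions the calling process spends
%% executing Fun. BEAM increments this counter on roughly every
%% function call, and preempts the process once a fixed budget is
%% exhausted, which is what keeps scheduling soft real-time.
cost(Fun) ->
    {reductions, Before} = process_info(self(), reductions),
    Fun(),
    {reductions, After} = process_info(self(), reductions),
    After - Before.
```

Feeding numbers like this back into routing decisions is one way to build the cost-aware fast/slow paths described above.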

Think you know what you have implemented?

The importance of documentation cannot be emphasized enough. Erlang comes with edoc, which takes some getting used to because, unlike tools for other languages, it can fail on commented-out code left in the wrong places. This was not much of an issue, since production code should not carry such leftovers anyway. Although edoc provides a nice feature set, we chose Slate for API documentation and Sphinx for design documentation. The trio (edoc, Slate and Sphinx) worked beautifully for us in keeping up with the ever-growing codebase and business scenarios.
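For reference, an edoc comment together with a Dialyzer `-spec` looks like this (the fare function is hypothetical, chosen only to show the annotations):

```erlang
-module(fare).
-export([discounted_fare/2]).

%% @doc Apply a flat discount to a fare. If the discount exceeds the
%% fare, the original fare is returned unchanged rather than going
%% negative.
-spec discounted_fare(number(), number()) -> number().
discounted_fare(Fare, Discount) when Discount =< Fare ->
    Fare - Discount;
discounted_fare(Fare, _Discount) ->
    Fare.
```

`%% @doc` feeds edoc's generated HTML, while the `-spec` gives Dialyzer something to check callers against, partially compensating for the lack of static types.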

Running in Production

There is a huge difference between writing software and writing software for production. A lot of nuances must be handled (we have already discussed a couple of them), but everything starts with a release bundle. Irrespective of your eventual choice of package (deb, rpm, or an installed app in Docker), the build and packaging must be seamless, with provision for various profiles (test, prod, etc.). Rebar3 did all the heavy lifting for us, packaging everything together as an installable tar bundle. Installation can then be as simple as decompressing that bundle into a folder and running the application via “/path/app start”, or packaging it as a deb (or rpm) package. Rebar3 does everything right and makes the pain of packaging go away.
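A minimal rebar.config fragment (illustrative, not redBus's actual file; the `myapp` name is a placeholder) showing the relx release and profile sections that let `rebar3 as prod tar` produce a self-contained, installable tarball:

```erlang
%% rebar.config (fragment). `rebar3 as prod tar` builds a release
%% tarball that can be unpacked anywhere and started with
%% bin/myapp start.
{relx, [{release, {myapp, "1.0.0"}, [myapp, sasl]}]}.

{profiles,
 [{prod, [{relx, [{dev_mode, false},
                  {include_erts, true}]}]},   %% bundle the runtime
  {test, [{deps, [proper]}]}]}.               %% test-only deps
```

Setting `include_erts` to true bundles the Erlang runtime itself, so the target machine needs no pre-installed Erlang at all.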

Final Notes

This blog is a very high-level description of the transaction platform in production at redBus. Many details are left out, both to keep the blog short and to avoid discussing business specifics. This project would not have been possible without an awesome set of open-source libraries and frameworks: Erlang/OTP, Cowboy, RabbitMQ, RiakTS, DalmatinerDB, Grafana and many more.
