Massively Scalable and Fault Tolerant E-Commerce Inventory Platform (Search) at redBus powered by Erlang/OTP

Search is the starting point in a typical E-Commerce platform, which needless to say must be highly available, scalable and sufficiently consistent to provide very low latency results accurately. There are unique challenges which must be solved comprehensively which do not exist while designing any other system. Contrary to the popular belief that “Erlang is SLOW”, this project should break some of that myth and suggest otherwise. It is needless to say that Erlang was never designed to be fast, but it do not stand in the way either. Some of the popular tricks will be discussed further in this blog.

It is comparatively easier to tune Erlang for speed (in a typical backend application) than to emulate its unique behaviour (like soft-realtime, actor model apart from many others).

Production Statistics

The following statistics are pulled out production at the time of this writing while the search cluster (4 EC2 c5.2xLarge) was at 30–40% of capacity. Note that a single search request would contain multiple buses and the application spends search and compute over the same (for transformation) before serving the result. At redBus a single search would result in multiple buses shown to the customer before he/she chooses to book one. It is important to note that the data is constantly changing because it contains seat availability, fare apart from other dynamically changing information. Instead of looking at number of requests per second, it is better to measure the system capacity in terms of cumulative number of buses shown to customers per second. Since, each search requests would vary in terms of number of search results (or buses), so a single measure of RPS (request per second) would be misleading.

  • Serving 60K+ buses / second
  • Average latency under 200 ms
  • 95th percentile search latency under 250 ms
  • 99th percentile search latency under 340 ms
  • 99.9th percentile search latency under 490 ms
CPU Usage in Search Cluster
Memory Usage in Search Cluster
95th Percentile Search Response Latency
99th Percentile Search Response Latency
99.9th Percentile Search Response Latency

The Motivation

We choose to completely rewrite the inventory platform (search) when things were seemingly working otherwise. The older system had a very strong focus on latency, but number of other aspects were not sufficiently handled. The idea of “just works most of the time” is very tricky and quickly leads to “whatever can go wrong, will go wrong”. Additionally, the curve for planning beyond 99% of cases gets steep really fast. The only way to staying ahead of the curve is to define one and continuously invest in optimal architectures and software solutions in achieving the same.

Why move away from Java to Erlang/OTP?

It may appear that Erlang/OTP is not a good choice when one of the core requirements is speed. At least this is what would appear if you are not reading between the lines. The following challenges existed while choosing Erlang/OTP as a replacement for Java.

  • There are numerous other alternatives which top the chart in terms of blazing fast programming languages.
  • Erlang is little known programming language, supposedly
  • Everyone on the team except one were new to Erlang/OTP.

The road to Erlang/OTP was challenging but there was an earlier successful attempt, so it was comparatively easier than before (see my earlier blog on Transaction Platform rewrite).

  • Erlang/OTP was running in production

Erlang/OTP had proven within the organization (in Transaction Platform) that it satisfied all the promises and more. The only stumbling block was with respect to speed of execution. At the outset it appeared to be a very difficult target, but it was soon discovered that a combination of great tooling (for performance tuning) and concurrency ensured that we beat the target.

The Guiding Principles

The following guiding principles are under consideration while discussing various strategies and chosen alternatives. It is important to note that each one of these principles is not to be implemented in their purest form primarily to satisfy the rest.

You would notice that there is a huge overlap when compared our earlier project (Transaction Platform in Erlang/OTP).
  • Very Low Latency (speed is KEY in search)
  • Get Faster with higher class of server (more CPU cores)
  • Strong Consistency
  • Fault Tolerant
  • Soft Real-time
  • Simultaneously support old and new infrastructure
  • Scalable
  • Fast time to market of new features
  • Clear definition of failures
  • Automatic Data Management
  • Self Monitoring
  • Live Analytics

Massive Hardware Reduction

The hardware utilization improved drastically within the new architecture powered by massive concurrency provided by Erlang. This in turn allowed us to drastically cut down AWS (Amazon Web Services) EC2 cost (which was also a result of certain data storage choices). Additionally, the new software rewritten in Erlang would run faster when we would vertically scale the AWS EC2 instances for higher number of cores. This was direct result of riding on top of the Erlang concurrency model.

We Get Faster With Each New Hardware Upgrade

The idea of getting faster with more CPU cores is not without its challenges. It is non-trivial for most of the applications to exploit higher number of CPU cores correctly. Erlang/OTP provides all the requisite guidelines and semantics to build a massively scalable and distributed application. This in turn allowed us to break a single search into multiple segments thereby achieving unbelievably low latency thereby beating the older Java implementation by a huge margin.

Intelligently breaking work in smaller parts is key to exploit Erlang’s massive concurrency model.

Gaining Execution Speed

You can gain speed in Erlang/OTP with the use of NIF (code written in C/C++/Rust) or breaking it into segments where each one is processed by a different actor (concurrency at work). We have used bunch of libraries to gain speed via NIFs, which are both fast and allow us to adhere to soft real-time as well. Ultimately if you are still not getting enough done then evaluate the third-party libraries and hand optimize wherever required (we did that too).

Gaining Consistent Search Results

We are exploiting AWS MySQL RDS to its core. It is not difficult to use SQL databases as Key-Value store, while exploiting its consistency and indexing properties.

Building a Cluster of Search Nodes

Do we really need a cluster?

There are multiple architectures while choosing in-memory stores, while each one has a cost associated with them. We are riding on strong distribution principles provided by Erlang/OTP to maintain a minimal in-memory distributed store with consistent and intelligent invalidation. This by par is the coolest feature while building the solution. The ease with which you could build such a solution in Erlang is simply amazing!

Why soft real-time is very important?

I have already discussed at length about the importance of soft real-time in my previous blog (see Transaction Platform in Erlang/OTP). It is very critical for the business that systems are always responsive even under heavy load.

Lazy is Good — At least most of the time

Do not cache information unless really required, but then prefer lazy loading and have strong measures to invalidate it as well (for stronger consistency). This simple decision reduces the boot-up time drastically at the expense of some slow calls while some of the metadata is loaded in instance memory.

How do we deal with failures?

Circuit breaker is a very common pattern which tries to protect downstream systems or databases, but it is very challenging to implement one correctly. There are various mechanism which must be employed before ensuring that other systems are protected from a request-flood. We use a combination of higher limits on pool (for each of external storage nodes) and individual circuit breakers to achieve the same. Additionally, Erlang/OTP provides beautiful semantics to recover from failures which allowed us to build reliable failure recovery and fallback mechanisms.

There are couple of strategies while using storage backend both for load balancing and availability. We choose to rely on build a simple solution to both balance the load on external storage systems and fallback (finding alternate when some failed) without another layer in between. As mentioned earlier, the search nodes are part of a cluster, so they balance the load on available storage backend among themselves.

Extensive Real-time Monitoring

The system generates huge amount of metric information and pushes out to DalmatinerDB in near real-time (every 5 seconds), thereby allowing us to fine-tune and understand the system a lot better. Additionally, we are in direct control of both the volume and rate of such information upload. It is needless to say that intelligent mechanism is in place to only publish the metrics which change or are required at the designated rate. This helps in maintaining very low overhead on system to capture and publish metrics in real-time.

Final Notes

I have intentionally left out bulk of details to keep this post short and focused. There is a lot of engineering which went into taking the system to production and meeting some of the performance numbers mentioned earlier. Additionally, this project is riding on the shoulders of an amazing set of open source libraries/frameworks like Erlang/OTP, Cowboy, RabbitMQ, DalmatnerDB, Grafana and many more.

References

  • Neeraj Sharma, Soft Real-time, Fault Tolerant and Scalable E-Commerce Transaction Platform @ redBus powered by Erlang/OTP. [Online]. Available: https://medium.com/redbus-in/soft-real-time-fault-tolerant-and-scalable-e-commerce-transaction-platform-redbus-powered-by-cf739d3cff57 (visited on 07/23/2018).
  • J. Armstrong and R. Virding, Erlang — an experimental telephony programming language, in XIII International Switching Symposium, 1990, pp. 4348. doi: 10.1109/ISS.1990.765711. Ericsson AB. (1999). Erlang/OTP, [Online]. Available: http://www.erlang.org (visited on 07/23/2018).
  • S. Vinoski, Concurrency with Erlang, Internet Computing, IEEE, vol. 11, no. 5, pp. 9093, 2007. doi: 10.1109/MIC.2007. 104. J. Armstrong, A history of Erlang, in Proceedings of the Third ACM SIGPLAN Conference on History of Programming Languages, ser. HOPL III, San Diego, California: ACM, 2007, pp. 6–16–26, isbn: 978–1–59593–766–7. doi: 10.1145/1238844. 1238850.
  • J. Larson, Erlang for concurrent programming, Communications of the ACM, vol. 52, no. 3, pp. 4856, 2009. doi: 10.1145/ 1454456.1454463. Erlang documentation,
  • [Online]. Available: http://www.erlang.org/documentation (visited on 07/23/2018).
  • [Online]. Available: https://propertesting.com/ (visited on 07/23/2018).