Engineering Reliability — Part 1
One of the most important features of any transactional app is Reliability. However, while engineering it, many nuances can get glossed over. This article, the first in a series of blogs, describes some of the patterns behind engineering reliability in Angel’s new trading platform.
Life of an Order
Order placement and management happens via the interaction of a few independent services, namely:
- App — which is the portal for the customer
- OMS/RMS — Order and Risk Management system; the core service which evaluates risk for the order and then submits it to the Exchange.
- Angel Trading Service — an orchestration and abstraction over the OMS/RMS system. This service abstracts the complexities/quirks of the OMS/RMS systems from the rest of the software systems
- Exchange — e.g. NSE/BSE
The following figure describes the high-level flow for placing orders (buy/sell). Order modifications follow a similar flow.
The above interaction is typical of distributed systems, where a set of services collaborate to fulfil a single feature for the customer. Since there are many moving parts, a number of incidents can occur, including:
- The App or any other Service may experience network discontinuity
- A network element anywhere in the chain may experience latency
- A machine can crash
- The App can crash
Generally these problems are intermittent, and the standard way around them is Timeout-And-Retry: if the App does not hear back with an order confirmation within a timeout, it retries the order flow.
But there is a problem here: it is possible that the order has actually been placed and only the confirmation is delayed or lost. A retry would then create a duplicate order. To protect against this, the overall system needs to be idempotent and reject duplicate orders. The following sequence diagram describes the pattern:
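The client side of this pattern can be sketched as follows. This is a minimal illustration, not the production code: `submit` stands in for the real network call, and the backoff values are arbitrary. The key point is that the same client-generated order id is reused on every retry, so the backend can recognise and reject duplicates:

```python
import time
import uuid


def place_order_with_retry(submit, order, max_attempts=3, timeout_s=2.0):
    """Submit an order, retrying on timeout with the SAME order id.

    `submit(order_id, order, timeout_s)` is a stand-in for the real
    network call; it raises TimeoutError when no confirmation arrives
    in time. Reusing order_id is what makes the retry safe: the backend
    can deduplicate instead of placing the order twice.
    """
    order_id = str(uuid.uuid4())  # generated once, reused on every retry
    for attempt in range(1, max_attempts + 1):
        try:
            return submit(order_id, order, timeout_s)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # exhausted retries; surface the failure to the caller
            time.sleep(0.05 * attempt)  # simple linear backoff between retries
```

A lost confirmation then costs only a retry, never a duplicate order, because the backend sees the same `order_id` again.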
While the above pattern is a generic recipe, there are a few details that need to be engineered. Here is an indicative list.
The code that we write in the backend services needs to be stateless. As with the computer-science concept of Pure Functions, the APIs (Application Programming Interfaces) must not depend on hidden, process-local state, so that a retried call behaves exactly like the first. To achieve this, we avoid things like mutable in-memory caches and sticky sessions (via cookies), which can cause behaviour to change on retries.
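A stateless handler, sketched minimally (the handler name and store shape are illustrative): all state comes in via the request or a shared durable store, never from module-level memory, so any replica can serve any retry and return the same answer.

```python
def handle_order_status(order_id, store):
    """Stateless request handler: every piece of state comes from the
    request (order_id) or the durable store passed in. There is no
    module-level cache, so retries hitting a different replica see
    exactly the same result.
    """
    record = store.get(order_id)
    if record is None:
        return {"status": "UNKNOWN"}
    return {"status": record["status"]}
```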
A single machine (node) failure should not result in data loss. For example, in the order store above, losing a stored “order id” would be disastrous, leading to duplicate or missing orders. To enable this we need a few things:
- The database needs to be replicated over a set of machines. There is always one elected Leader node which handles writes, so that race conditions are avoided.
- Writes (mutations) to the database need to be hardened on more than one node. Specifically, the write should be “in sync”, i.e. it should return only when the required number of machines have accepted the change.
- Reads should be able to handle a Leader failure and connect to a newly elected Leader.
- We use Enterprise Redis clusters for this. Replication and consistency are enforced via the WAIT command, and the atomic “check if not present, then insert” operation is handled by the Redis SETNX operation.
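The essential property of that dedupe check is that “check if not present” and “insert” happen as one atomic step. The sketch below mimics SETNX semantics in-process with a lock, purely to illustrate the logic without requiring a Redis server; the class name and method are illustrative, and in production this is the replicated Redis cluster, with WAIT ensuring the write reached the required number of replicas before the call returns.

```python
import threading


class OrderStore:
    """In-process stand-in for Redis SETNX: atomically record an order id
    and report whether it was already present. A non-atomic check-then-insert
    would let two concurrent retries both pass the check and both place the
    order.
    """

    def __init__(self):
        self._seen = {}
        self._lock = threading.Lock()

    def record_if_new(self, order_id, payload):
        # Atomic check-and-insert: True means first sighting (place the
        # order), False means duplicate (reject the retry).
        with self._lock:
            if order_id in self._seen:
                return False
            self._seen[order_id] = payload
            return True
```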
In the diagram above, there is an asynchronous push from the services to the app. We need a few guardrails to make this robust, including:
- Ensure that a socket connection from the app to the backend stays alive. This needs a health-check protocol (ping/pong) and handling of scenarios like automatic socket closure by the phone OS.
- Have a “pull” backup for the push, which the client can use when no push response arrives within the expected time.
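The push-with-pull-backup flow on the client can be sketched as: wait for the pushed confirmation for a bounded time, then fall back to polling the order-status API. The function names (`wait_for_push`, `poll_status`) and the timeout values are illustrative, not the actual API:

```python
def await_confirmation(wait_for_push, poll_status, order_id,
                       push_timeout_s=3.0, poll_attempts=5):
    """Prefer the pushed confirmation; fall back to polling if it is
    late or lost.

    wait_for_push(order_id, timeout_s) returns the pushed status, or
    None on timeout. poll_status(order_id) queries the backend directly.
    """
    status = wait_for_push(order_id, timeout_s=push_timeout_s)
    if status is not None:
        return status          # fast path: push arrived in time
    for _ in range(poll_attempts):
        status = poll_status(order_id)   # pull backup
        if status is not None:
            return status
    return "PENDING"  # still unresolved; show the order as in-flight
```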
Retry v/s New Order
There might be a question: how do we differentiate a new order from a retried order? All systemic retries use the same order id. In other cases, however, there is no “algorithmic” way to tell; we rely on inferring a “retry” from the product flows.
Containing Blast Radius
While the strategies above ensure that intermittent failures are handled gracefully, we still want any failure to impact the fewest customers possible. To enable this, we shard our infrastructure and map users to different shards. This is depicted below:
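The user-to-shard mapping can be as simple as a stable hash of the user id modulo the shard count. A sketch (the modulo scheme is illustrative; consistent hashing is the usual refinement when the shard count changes):

```python
import hashlib


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Map a user to a shard deterministically.

    Every request from the same user lands on the same slice of
    infrastructure, so a shard-level failure only affects the users
    mapped to that shard. SHA-256 is used (rather than Python's built-in
    hash) so the mapping is stable across processes and restarts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```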
Handling Larger Failures
Once in a blue moon, failures are severe: for example, an internet or power discontinuity at the datacenter. In this case, we need a different strategy: failing over entire systems to a different datacenter. This requires specialised engineering, from flipping DNS entries to ensuring that the backup services get primed. This is a larger topic and will be covered in a followup blog.