Engineering Reliability — Part 1

Jyotiswarup Raiturkar
Published in Angel One Square
Jul 16, 2022 · 4 min read

One of the most important features of any transactional app is Reliability. However, while engineering it, many nuances can get glossed over. This article, the first in a series of blogs, describes some of the patterns used in engineering reliability in Angel’s new trading platform.

Life of an Order

Order placement and management happen via interactions of a few independent services, namely:

  1. App — the portal for the customer
  2. OMS/RMS — the Order and Risk Management system; the core service which evaluates risk for the order and then submits it to the Exchange
  3. Angel Trading Service — an orchestration and abstraction layer over the OMS/RMS system. This service hides the complexities/quirks of the OMS/RMS systems from the rest of the software stack
  4. Exchange — e.g. NSE/BSE

The following figure describes the high-level flow for placing orders (buy/sell). Order modifications follow a similar flow.

Failure Scenarios

The above interaction is typical of distributed systems, where a set of services collaborate to fulfil a single feature for the customer. Since there are many moving parts, a number of incidents can happen, including:

  • The App or any other Service may experience network discontinuity
  • A network element anywhere in the chain may experience latency
  • A machine can crash
  • The App can crash

Generally these problems are intermittent, and the way around them is Timeout-and-Retry. Essentially, if the App does not hear back with an order confirmation within a timeout, it retries the order flow.
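As a concrete illustration, here is a minimal sketch of this loop in Go. The placeOrder stub, the timeout and the backoff values are assumptions for illustration; the important detail is that the App generates the order reference once and reuses it on every retry.

```go
package app

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// placeOrder is a stand-in for the App's call into the trading backend.
// It must be given the same clientOrderID on every attempt.
func placeOrder(ctx context.Context, clientOrderID string) error {
	// ... HTTP/gRPC call to the backend would go here ...
	return errors.New("confirmation not received") // simulate a lost confirmation
}

// placeWithRetry retries the order flow, reusing the same order id each time.
func placeWithRetry(clientOrderID string, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		// Each attempt gets its own timeout; the order id stays the same,
		// so the backend can recognise and reject duplicates.
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		err := placeOrder(ctx, clientOrderID)
		cancel()
		if err == nil {
			return nil
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * 500 * time.Millisecond) // simple backoff
	}
	return fmt.Errorf("order %s not confirmed after %d attempts: %w", clientOrderID, attempts, lastErr)
}
```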

But there is a problem here: it is possible that the order has actually been placed and only the confirmation is delayed or lost. A retry in that case would result in duplicate orders. To protect against this, we need the overall system to be idempotent and to reject duplicate orders. The following sequence diagram describes the pattern:
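To make the pattern concrete, here is a minimal sketch of the server-side check in Go. The TradingService, OrderStore and OMSClient types are hypothetical stand-ins for illustration; the essence is that the client-generated order id is recorded atomically before the order is forwarded to the OMS/RMS, and a retry that finds the id already recorded is not submitted again.

```go
package trading

import "context"

// OrderRequest and OrderAck are illustrative placeholders for the real
// order payload and confirmation.
type OrderRequest struct {
	Symbol string
	Qty    int
}
type OrderAck struct {
	OrderID string
	Status  string
}

// OrderStore is a hypothetical durable, replicated store of order ids
// (backed by Redis in this platform, as described later in the post).
type OrderStore interface {
	// PutIfAbsent records the order id atomically and reports whether it
	// was newly inserted (false means the id was already present).
	PutIfAbsent(ctx context.Context, orderID string) (bool, error)
}

// OMSClient is a stand-in for the call into the OMS/RMS system.
type OMSClient interface {
	Submit(ctx context.Context, orderID string, req OrderRequest) (OrderAck, error)
}

type TradingService struct {
	store OrderStore
	oms   OMSClient
}

// PlaceOrder is idempotent on orderID: a retry with the same id is
// acknowledged without submitting a second order to the exchange.
func (s *TradingService) PlaceOrder(ctx context.Context, orderID string, req OrderRequest) (OrderAck, error) {
	inserted, err := s.store.PutIfAbsent(ctx, orderID)
	if err != nil {
		return OrderAck{}, err
	}
	if !inserted {
		// Duplicate request: the original submission is already in flight;
		// acknowledge it instead of placing the order again.
		return OrderAck{OrderID: orderID, Status: "DUPLICATE_ACCEPTED"}, nil
	}
	return s.oms.Submit(ctx, orderID, req)
}
```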

Some Details

While the above pattern is a generic recipe, there are a few details which need to be engineered. Here is an indicative list.

Stateless Services

The code that we write in the backend services needs to be stateless. Like the computer science concept of Pure Functions, the APIs (Application Programming Interfaces) must not have any side effects. This is what makes idempotent retries safe. To achieve this, we avoid things like mutable in-memory caches and sticky sessions (via cookies), which can cause behaviour changes on retries.
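For illustration, a minimal sketch of this in Go, with hypothetical Store and Handler types: the handler keeps no instance-local state, and everything it needs lives in the request and in the shared, replicated store.

```go
package app

import "context"

// Anti-pattern: instance-local mutable state. A retry routed to a different
// instance (or to one that just restarted) will not see this cache, so
// behaviour changes depending on which machine serves the request.
var pendingOrders = map[string]string{} // avoid

// Store is a shared, replicated store (such as the Redis-backed order store
// described below); keeping all state here keeps the service itself stateless.
type Store interface {
	PutIfAbsent(ctx context.Context, orderID string) (bool, error)
}

// Handler carries no mutable state of its own, so any instance can serve
// any request, including retries.
type Handler struct{ store Store }

func (h *Handler) Accept(ctx context.Context, orderID string) (bool, error) {
	return h.store.PutIfAbsent(ctx, orderID)
}
```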

Machine Failures

A single machine (node) failure should not result in data loss. For example, in the order store above, losing a stored “order id” would be disastrous and lead to duplicate or missing orders. To enable this we need a few things:

  1. The database needs to be replicated over a set of machines. There is always one elected Leader node which handles writes, so that race conditions are avoided.
  2. Writes (mutations) to the database need to be hardened on more than one node. To be specific, the write should be “in sync”, i.e. it should return only when the required number of machines have accepted the change.
  3. Reads should be able to handle Leader failure and connect to a newly elected Leader.
  4. We use Enterprise Redis clusters for this. Replication and consistency are enforced by the WAIT command, and the atomic operation of “check if not present, then insert” is handled by the Redis SETNX operation, as sketched after this list.
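Here is a minimal sketch of such a store using the go-redis client (github.com/redis/go-redis/v9). The key naming, TTL, replica count and timeout are illustrative assumptions; the pattern is SETNX for the atomic check-and-insert, followed by WAIT so that the write is hardened on a replica before we acknowledge the order.

```go
package orderstore

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

type RedisOrderStore struct {
	rdb *redis.Client
}

// PutIfAbsent atomically records the order id (SETNX) and then waits until
// the write has been acknowledged by a replica (WAIT) before returning.
func (s *RedisOrderStore) PutIfAbsent(ctx context.Context, orderID string) (bool, error) {
	// SETNX: succeeds only if the key does not already exist.
	inserted, err := s.rdb.SetNX(ctx, "order:"+orderID, "placed", 24*time.Hour).Result()
	if err != nil || !inserted {
		return false, err // error on failure, (false, nil) on a duplicate
	}

	// WAIT: block until at least 1 replica has the write, or the timeout passes.
	// The replica count and timeout depend on the actual cluster topology.
	replicas, err := s.rdb.Wait(ctx, 1, 100*time.Millisecond).Result()
	if err != nil {
		return false, err
	}
	if replicas < 1 {
		// The key is already set on the leader, so the caller should treat
		// this error as "in doubt" rather than as "order not recorded".
		return false, fmt.Errorf("order %s not replicated in time", orderID)
	}
	return true, nil
}
```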

Reliable Push

In the diagram above, there is an asynchronous push from the backend services to the App. We need a few guardrails to make this robust, including:

  • Ensure that there is a live socket connection from the app to the backend. This needs a health-check protocol (ping/pong) and handling of scenarios like automatic socket closures by the phone OS, etc.
  • Have a “pull” backup for the push, which the client can use when there is no push response within an expected time (sketched below).
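A minimal sketch of the pull backup on the client side, assuming a hypothetical pushCh channel fed by the socket connection and a pollStatus function that calls a status API; the names and timeout are illustrative.

```go
package app

import (
	"context"
	"time"
)

type OrderUpdate struct {
	OrderID string
	Status  string
}

// awaitConfirmation waits for a pushed update and falls back to polling
// a status API when no push arrives within pushTimeout.
func awaitConfirmation(
	ctx context.Context,
	orderID string,
	pushCh <-chan OrderUpdate, // fed by the socket connection
	pollStatus func(ctx context.Context, orderID string) (OrderUpdate, error), // "pull" backup
	pushTimeout time.Duration,
) (OrderUpdate, error) {
	select {
	case upd := <-pushCh:
		return upd, nil // normal path: the push arrived in time
	case <-time.After(pushTimeout):
		// Push was delayed or lost; fall back to pulling the status.
		return pollStatus(ctx, orderID)
	case <-ctx.Done():
		return OrderUpdate{}, ctx.Err()
	}
}
```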

Retry v/s New Order

A question might arise: how do we differentiate a new order from a retried order? All systemic retries use the same order id. In other cases, however, there is no “algorithmic” way to do this; we rely on inferring a “retry” from product flows.

Containing Blast Radius

While the strategies above ensure that intermittent failures are handled gracefully, we still want the impact of a failure to affect as few customers as possible. To enable this, we shard our infrastructure and map users to different shards. This is depicted below:
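In code terms, a minimal sketch of such a user-to-shard mapping, using a stable hash of the customer id modulo the shard count; the function and shard count are illustrative, and a real deployment would also want a scheme (e.g. consistent hashing) that lets shards be added without remapping most users.

```go
package sharding

import "hash/fnv"

// ShardFor maps a customer id to one of numShards shards. The mapping is
// stable across instances, so every service routes the same user to the
// same slice of infrastructure.
func ShardFor(customerID string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(customerID)) // fnv's Write never returns an error
	return h.Sum32() % numShards
}
```

With this, an incident in one shard's infrastructure only affects the customers mapped to that shard.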

Handling Larger Failures

Once in a blue moon, failures are severe, for example when there is an internet or power discontinuity at the datacenter. In this case, we need a different strategy: we need to fail over entire systems to a different datacenter. This requires specialised engineering, from flipping DNS entries to ensuring that the backup services get primed. This is a larger topic and will be covered in a follow-up blog.
