Designing for High Availability

5 min read · Mar 14, 2019

What is Availability?

Availability is the ability of a system to remain usable, or to become usable again, after a fault occurs.

It can also be described as the capability of a system to repair failures so that the cumulative service outage does not exceed a given amount of time.

How do we measure Availability?

Availability = MTBF / (MTBF + MTTR)

MTBF = mean time between failures

MTTR = mean time to recover
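
As a rough worked example of the formula above, the sketch below computes availability and the yearly downtime it implies. The function names and the sample figures (one failure every 1000 hours, one hour to recover) are assumptions made up for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability from the formula MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_hours(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected outage per year implied by an availability figure."""
    return (1.0 - avail) * hours_per_year


# Hypothetical numbers: a failure every 1000 hours, 1 hour to recover.
a = availability(1000.0, 1.0)
print(f"availability = {a:.5f}")
print(f"downtime per year = {yearly_downtime_hours(a):.2f} hours")
```

Note that halving MTTR improves availability about as much as doubling MTBF, which is why fast recovery matters as much as rare failure.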

What is a fault in a highly available system?

In a highly available system, a fault can be a crash, incorrect timing, an incorrect response, or an omission (no response at all).

Tactics that can be applied to achieve a highly available system:

Fault detection
Fault prevention
Recovery from Fault

Fault Detection

Ping: a synchronous or asynchronous message pair exchanged between nodes. It detects whether the pinged system component is alive and responding correctly.

Monitor: a system component that monitors different parts of the system, such as memory, processes, and I/O.

Heartbeat: a periodic message exchange between the system monitor and a process; a missed heartbeat signals a possible fault.
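
A minimal sketch of the heartbeat idea, assuming each process reports to a monitor that flags any process silent for longer than a timeout. The class name, the process names, and the injectable clock are hypothetical choices for illustration, not part of any particular monitoring tool.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per process and flags the silent ones."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock      # injectable so the logic can be tested
        self.last_seen = {}     # process name -> time of last heartbeat

    def beat(self, process: str) -> None:
        """Called whenever a heartbeat message arrives from a process."""
        self.last_seen[process] = self.clock()

    def suspected_failures(self) -> list:
        """Processes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


# Simulated timeline with a fake clock (values are seconds).
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=3.0, clock=lambda: fake_now[0])
monitor.beat("worker-a")
monitor.beat("worker-b")
fake_now[0] = 2.0
monitor.beat("worker-a")    # worker-b stays silent from here on
fake_now[0] = 4.0
print(monitor.suspected_failures())
```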

Timestamp: add a timestamp to every event. This helps detect an incorrect sequence of events and is used primarily in distributed systems.

Sanity checking: checks whether the output of a given component is correct and reasonable.

Voting: redundant components take the same input and forward their outputs to voting logic. If the outputs are inconsistent, a fault is reported.
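
The voting tactic can be sketched as a simple majority voter. This is only one possible voting rule, chosen here for illustration; the function name and return shape are assumptions, and outputs are assumed to be hashable values.

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value, fault_detected) for redundant outputs."""
    if not outputs:
        raise ValueError("at least one output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        # No value is shared by more than half of the components.
        raise RuntimeError("no majority among redundant components")
    # Any disagreement at all is reported as a fault, even if outvoted.
    fault_detected = count != len(outputs)
    return value, fault_detected


print(vote([42, 42, 41]))   # one component disagrees with the other two
print(vote([7, 7, 7]))      # all components agree
```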

Exception detection: detects a system condition that alters the normal flow of execution.

Timeout: checks whether a component has failed to respond to a request within a given time limit.
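
A small sketch of timeout-based detection using Python's standard `concurrent.futures` module. The helper `call_with_timeout` and the stand-in `slow_component` are hypothetical names; in a real system the call would usually be a network request with its own timeout setting.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_seconds, *args):
    """Run func(*args); return (True, result) or (False, None) on timeout."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(func, *args)
        return True, future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not block on the stuck call; the worker thread finishes on its own.
        executor.shutdown(wait=False)


def slow_component(delay_seconds):
    """Stand-in for a component that takes delay_seconds to respond."""
    time.sleep(delay_seconds)
    return "ok"


print(call_with_timeout(slow_component, 0.5, 0.01))   # responds in time
print(call_with_timeout(slow_component, 0.05, 0.3))   # exceeds the limit
```

The timeout only tells the caller to stop waiting; the slow call itself keeps running in the background, so the component may still need to be restarted or removed from service.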

Fault prevention

Removal from service (chaos engineering): temporarily take a system component out of service to mitigate potential software failures, and make sure that the system keeps operating under those failure constraints.

Transaction: make sure asynchronous message exchanges in a highly available system follow ACID properties. This prevents race conditions when two components try to update the same data.

E.g. two-phase commit.

Predictive model: monitor the state of the system (memory, CPU, I/O) to ensure that it is operating within normal parameters. If it goes beyond them, take the necessary corrective action.

Exception prevention: prevent exceptions from occurring where possible and handle the rest in the application; use exception classes to recover gracefully.

Increase competence: expand the set of states in which the system is competent to operate. Design your system to handle more and more cases (including faults) as part of its normal operation.

Recovery from fault

Active redundancy (hot spare): also known as 1+1 redundancy, where one extra redundant node is kept active and in sync with the active node. Both nodes receive and process input in parallel. If the active node fails, the redundant node can be promoted in a matter of milliseconds.

Passive redundancy: the extra node is running but does not process input; it is synchronized with the active node periodically.

Spare: the extra node is kept out of service and is brought into service when one of the active nodes fails.

Exception handling: handle the exception so that it does not crash the system.

Rollback: roll back to a previous known working state.

Upgrade: Upgrade/patch the software to fix the fault.

Retry: assume that the fault was transient and that retrying the request will succeed. This technique works well in networks and server farms, where transient errors are expected and common.
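
A minimal retry sketch, assuming transient faults surface as a dedicated exception type. `TransientError`, the attempt count, and the exponential backoff schedule are illustrative assumptions; the original tactic only says to retry, not how long to wait between attempts.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TransientError wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise   # out of attempts: surface the fault to the caller
            # Wait longer after each failure: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying is only safe for operations that can be repeated without side effects, so non-idempotent requests need extra care (for example a request ID the server can deduplicate on).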

Ignore faulty behavior: ignore messages from a particular source that is causing the error.

E.g. traffic from a denial-of-service attack.

Degradation: work in degraded mode, i.e. maintain only the most critical functions of the system.

E.g. switch off the recommendation engine on a video streaming site.
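
The video streaming example can be sketched as follows. The function `build_page` and the two fetch callables are hypothetical; the point is only that the critical path (the stream) is allowed to fail loudly, while the optional feature (recommendations) fails quietly.

```python
def build_page(video_id, fetch_stream, fetch_recommendations):
    """Serve the critical stream; treat recommendations as optional."""
    # A failure here propagates: without the stream there is nothing to serve.
    page = {"stream": fetch_stream(video_id), "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(video_id)
    except Exception:
        # Non-critical feature failed: drop it and keep serving the video.
        page["recommendations"] = []
        page["degraded"] = True
    return page


def broken_recommendations(video_id):
    raise RuntimeError("recommendation engine is down")


page = build_page("v1", lambda vid: f"stream-for-{vid}", broken_recommendations)
print(page)
```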

Shadow: operate the failed or out-of-service component in shadow mode and monitor it for faults; fix it, then reintroduce it as an active node.

Escalating restart: recover from a fault by restarting components at varying levels of granularity.

E.g. at level 0 the threads and processes are restarted; at level 5 the whole machine is restarted.
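
One way to picture escalating restart is a loop that tries restart actions from least to most disruptive and stops as soon as a health check passes. The level names and the health check below are made up for illustration; real systems define their own levels.

```python
def escalating_restart(levels, is_healthy):
    """Try restart actions from least to most disruptive.

    levels is a list of (name, restart_action) pairs ordered by granularity.
    Returns the name of the level that restored health, or None.
    """
    for name, restart_action in levels:
        restart_action()
        if is_healthy():
            return name
    return None


# Simulated system in which only a process-level restart clears the fault.
state = {"healthy": False}

levels = [
    ("restart threads", lambda: None),
    ("restart process", lambda: state.update(healthy=True)),
    ("restart machine", lambda: state.update(healthy=True)),
]

print(escalating_restart(levels, lambda: state["healthy"]))
```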

Design considerations for a highly available system

Assign responsibilities for the tactics mentioned above (detection, prevention, recovery) in case of a crash, incorrect timing, an incorrect response, or other faults.

Which system component will be responsible for:

Logging the fault, notifying people, and raising alerts on Slack/email
Disabling the source of the fault, making it temporarily unavailable
Fixing or masking the failure
Operating the system in degraded mode

Consider the impact on coordination/communication across different components of the system in case of a fault.

Is the system still capable of logging faults and raising alerts?
Will the system work with degraded coordination/communication?
What will be the consequences of a fault?
How much information loss can the system withstand and continue to work?
Will replacing a component (communication channel, storage, process) allow the system to work?

Consider the impact on the data abstraction layer

Determine which data abstractions of the system, along with their operations and properties (read, write, update), can cause a fault.

For those data abstractions, ensure that they can be disabled, made temporarily unavailable, or fixed/masked in case of a fault.

E.g. write everything to a cache if the database is down due to a fault.

Consider which mappings from system components to artifacts (processor, storage) can be changed or reassigned in case of a fault

Which processes on the failed processor need to be reassigned?
Which processor or data store can be activated or reassigned?
How can data on failed storage be served by other storage units?
How should run-time elements be assigned to processors, communication channels, or data storage?

Consider resource limitations

Determine which system will manage all the resources, and the impact if a resource hits its saturation limit.
Determine the limits of all the resources and which resources are necessary to continue operating in case of a fault.

E.g. if a message consumer fails, the queue should be large enough to hold all the messages until it recovers.
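
For the queue example, a back-of-the-envelope sizing could look like the sketch below. The function name, the safety factor, and the sample numbers (200 messages per second, a 5-minute consumer outage) are assumptions for illustration only.

```python
import math


def required_queue_capacity(messages_per_second, max_outage_seconds,
                            safety_factor=1.5):
    """Messages the queue must hold while the consumer is down."""
    if messages_per_second < 0 or max_outage_seconds < 0 or safety_factor < 1:
        raise ValueError("rates and durations must be >= 0, safety factor >= 1")
    return math.ceil(messages_per_second * max_outage_seconds * safety_factor)


# Hypothetical load: 200 messages/s and a consumer outage of up to 300 s.
print(required_queue_capacity(200, 300))
```

The same arithmetic applies to any buffer that has to absorb load while a downstream component recovers.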

Consider the technology needed to achieve the tactics mentioned above

Determine the software that will help to achieve fault detection, prevention, and recovery.
Determine the kinds of faults the system can recover from and the complexity they introduce into the system.

Thanks for reading. You can connect with me here.

sohit kumar (@ksohit) | Twitter
