Designing for High Availability
What is Availability?
Availability is the ability of a system to remain usable after a fault occurs.
or
Availability is the capability of a system to repair failures so that the cumulative service outage period does not exceed a given time.
How do we measure Availability?
Availability = MTBF / (MTBF + MTTR)
MTBF = Mean time between failures
MTTR = Mean time to recover
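As a worked example of the formula above, here is a minimal sketch that computes availability and the yearly downtime it implies. The function names and the sample figures (an MTBF of 1000 hours and an MTTR of 1 hour) are assumptions chosen only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Hours per year the system is expected to be unavailable."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative numbers only: one failure every 1000 hours, one hour to recover.
    print(f"availability: {availability(1000, 1):.4%}")
    print(f"downtime: {expected_downtime_hours_per_year(1000, 1):.2f} hours/year")
```

With these sample figures the availability is roughly 99.9%, which corresponds to about 8.75 hours of downtime per year. Halving the MTTR has nearly the same effect as doubling the MTBF, which is why fast recovery matters as much as avoiding failures.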
What is a fault in a highly available system?
In a highly available system, a fault can be a crash, incorrect timing, an incorrect response, or an omission.
Tactics that can be applied to achieve a highly available system:
- Fault detection
- Fault prevention
- Recovery from Fault
Fault Detection
Ping: sync/async message pair exchanged between nodes to detect whether the pinged system component is alive and responding correctly.
Monitor: System component that is used to monitor different parts of the system, such as memory, processes, and IO.
Heartbeat: periodic message exchange between the system monitor and a process.
Timestamp: add a timestamp to every event, helps to detect an incorrect sequence of events, primarily used in a distributed system.
Sanity checking: checks whether the output of a given component is correct.
Voting: redundant components take the same input and forward their outputs to voting logic; if there is any inconsistency, a fault is reported.
Exception detection: detects a system condition that alters the normal flow of execution.
Timeout: checks whether a component has failed to respond to a request within a given time limit.
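To make the heartbeat and timeout tactics more concrete, here is a minimal sketch of a monitor that records the last heartbeat of each process and reports the ones that have been silent for longer than a timeout. The class name `HeartbeatMonitor`, its methods, and the injectable `clock` parameter are assumptions made for this example, not part of any particular library.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per process and flags processes that went silent."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the example can run without real waiting
        self._last_seen = {}

    def beat(self, process_id: str) -> None:
        """Record that a heartbeat message arrived from process_id."""
        self._last_seen[process_id] = self._clock()

    def suspected_failures(self) -> list:
        """Return ids of processes whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            pid for pid, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    fake_now = [0.0]  # simulated clock, in seconds
    monitor = HeartbeatMonitor(timeout_seconds=3.0, clock=lambda: fake_now[0])
    monitor.beat("worker-a")
    monitor.beat("worker-b")
    fake_now[0] = 2.0
    monitor.beat("worker-a")  # worker-b stays silent
    fake_now[0] = 4.0
    print(monitor.suspected_failures())  # ['worker-b']
```

A silent process is only suspected of having failed: a slow network can delay heartbeats, so the timeout has to be chosen with the expected message latency in mind.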
Fault prevention
Removal from service (chaos engineering): Temporarily take a system component out of service to mitigate potential software failures, and then make sure that the system still operates under those failure constraints.
Transaction: Make sure async message exchange in a highly available system follows ACID properties. This prevents race conditions when two components try to update the same data.
Eg: two-phase commit
Predictive model: Monitor the state of the system to ensure that it is operating within normal parameters (memory, CPU, IO). If it goes beyond the normal parameters, take the necessary action to fix it.
Exception Prevention: Prevent exceptions from occurring where possible and handle the rest in the application; use exception classes to recover gracefully.
Increase competence: Expand the set of states in which the system is competent to operate. Design your system to handle more and more cases (including faults) as part of its normal operation.
Recovery from fault
Active Redundancy (Hot spare): Also known as 1+1 redundancy, where one extra redundant node is kept active and in sync with the active node. Both nodes receive and process input in parallel. If the active node fails, the redundant node can be promoted in a matter of milliseconds.
Passive Redundancy: The extra node is running but is only synchronized with the active node periodically.
Spare: The extra node is out of service and becomes active when one of the active nodes fails.
Exception Handling: Handle the exception so that it does not crash the system.
Rollback: Roll back to a previously known working state.
Upgrade: Upgrade/patch the software to fix the fault.
Retry: Assume that the fault was transient and that retrying the request will fix the issue. This technique works well in networks and server farms, where transient errors are expected and common.
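The retry tactic can be sketched as follows. This is a minimal example that assumes transient faults are signalled by a hypothetical `TransientError` exception; the `retry` helper, its parameters, and the exponential backoff with jitter are illustrative choices rather than a prescribed implementation.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TransientError wait and try again, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: the fault does not look transient after all
            # Exponential backoff with jitter, so many clients do not retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to imitate a short network glitch.
        calls["count"] += 1
        if calls["count"] < 3:
            raise TransientError("temporary network glitch")
        return "ok"

    print(retry(flaky, sleep=lambda seconds: None), calls["count"])
```

Limiting the number of attempts matters: retrying forever against a component that is really down only adds load and hides the fault from the other recovery tactics.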
Ignore faulty behavior: Ignore input from a particular source that is causing the error.
Eg: requests from the source of a denial of service attack
Degradation: Work in degraded mode, i.e. maintain only the most critical functions of the system.
Eg: switch off the recommendation engine on a video streaming site
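A minimal sketch of degraded mode for the streaming example could look like this. The function `build_home_page` and the two callables it receives are hypothetical names; the point is only that a failure in the non-critical feature does not take down the critical one.

```python
def build_home_page(user_id, fetch_catalog, fetch_recommendations):
    """Return page data, dropping recommendations if that service is failing.

    fetch_catalog and fetch_recommendations are hypothetical callables standing in
    for the critical and the non-critical service respectively.
    """
    page = {"catalog": fetch_catalog(), "recommendations": [], "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except Exception:
        # Non-critical feature failed: keep serving the catalog and flag the
        # response as degraded so monitoring can pick it up.
        page["degraded"] = True
    return page


if __name__ == "__main__":
    def broken_recommendations(user_id):
        raise RuntimeError("recommendation engine is down")

    print(build_home_page("user-1", lambda: ["movie-1"], broken_recommendations))
```

A failure in `fetch_catalog` is deliberately not caught here, because the catalog is treated as the critical function in this sketch.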
Shadow: Operate the failed or out-of-service component in shadow mode, monitor it for faults, fix it, and then re-introduce it as an active node.
Escalating Restart: Recover from faults by restarting components at varying levels of granularity.
Eg: level 0, where all threads and processes are restarted; level 5, where the machine is restarted
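Escalating restart can be sketched as a loop over restart actions ordered from least to most disruptive, stopping as soon as a health check passes. The function name, the level names, and the `is_healthy` callable below are assumptions made for illustration; real restart levels depend on the platform.

```python
def escalating_restart(restart_levels, is_healthy):
    """Try restart actions from least to most disruptive until the system is healthy.

    restart_levels: list of (name, action) pairs ordered by increasing impact.
    is_healthy: callable returning True once the fault is gone.
    Returns the name of the level that restored health, or None if none did.
    """
    for name, action in restart_levels:
        action()
        if is_healthy():
            return name
    return None


if __name__ == "__main__":
    state = {"healthy": False}
    levels = [
        ("restart threads", lambda: None),  # does not fix the fault in this demo
        ("restart process", lambda: state.update(healthy=True)),
        ("restart machine", lambda: state.update(healthy=True)),
    ]
    print(escalating_restart(levels, lambda: state["healthy"]))
```

Starting with the cheapest restart keeps the outage short for the common case and reserves the machine-level restart for faults that survive everything else.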
Design considerations for a highly available system
Assign responsibilities for the above-mentioned tactics (detection, recovery, prevention) in case of a crash, incorrect timing, an incorrect response, or other faults.
Which system component will be responsible for:
- Logging the fault, notifying the people, raising alerts on slack/email
- Disabling the source of the fault, making it temporarily unavailable
- Fixing/masking the failure
- Operating the system in degraded mode
Consider the impact on coordination/communication across different components of the system in case of a fault.
- Is the system still capable of logging faults and raising alerts?
- Will the system work with degraded coordination/communication?
- What will be the consequence of a fault?
- How much information loss can the system withstand and continue to work?
- Will the replacement of a component (communication channel/storage/process) allow the system to work?
Consider the impact on the data abstraction layer
- Determine which data abstractions of the system, along with their operations and properties (read/write/update), can cause a fault.
For those data abstractions, ensure that they can be disabled, made temporarily unavailable, or fixed/masked in case of a fault.
Eg: write everything to cache if the database is down due to a fault
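The cache example can be sketched as a writer that buffers writes while the database is down and replays them in order once it is back. `DatabaseDown`, `BufferedWriter`, and the `db_write` callable are hypothetical names for this illustration, and the in-memory buffer is an assumption: it is lost if the process itself crashes, so a real system would more likely use a durable cache.

```python
from collections import deque


class DatabaseDown(Exception):
    """Hypothetical error raised by the primary store when it is unavailable."""


class BufferedWriter:
    """Writes to the database, buffering writes in memory while it is down."""

    def __init__(self, db_write):
        self._db_write = db_write  # callable(key, value); may raise DatabaseDown
        self._pending = deque()    # writes accepted but not yet persisted

    def write(self, key, value) -> bool:
        """Accept a write. Return True if everything is persisted, False if buffered."""
        self._pending.append((key, value))
        return self.flush()

    def flush(self) -> bool:
        """Replay buffered writes in order; stop at the first failure."""
        while self._pending:
            key, value = self._pending[0]
            try:
                self._db_write(key, value)
            except DatabaseDown:
                return False
            self._pending.popleft()  # remove only after a successful write
        return True

    def pending_count(self) -> int:
        return len(self._pending)
```

Replaying the oldest write first preserves the original order, so a later update to the same key is not overwritten by an earlier one when the database comes back.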
Consider which mappings from system components to artifacts (processor, storage) can be changed or reassigned in case of a fault
- Which processes on the failed processor need to be reassigned?
- Which processor/data store can be activated or reassigned?
- How can data on failed storage be served by other storage units?
- How should run-time elements be assigned to the processor, communication channel, or data storage?
Consider resource limitations
- Determine which part of the system will manage all the resources, and the impact if a resource hits its saturation limit.
- Determine the limits of all the resources and which resources are necessary to continue operation in case of a fault.
Eg: if a message consumer fails, the queue should be large enough to hold all the messages until it recovers
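For the queue example, a rough capacity estimate is the message arrival rate multiplied by the longest consumer outage the system should survive. The sketch below assumes a steady arrival rate; the function name and the `safety_factor` parameter are hypothetical additions for illustration.

```python
import math


def required_queue_capacity(messages_per_second, max_consumer_outage_seconds,
                            safety_factor=1.5):
    """Messages the queue must hold so that none are dropped while the consumer is down.

    Assumes a steady arrival rate; safety_factor adds headroom for bursts.
    """
    if messages_per_second < 0 or max_consumer_outage_seconds < 0:
        raise ValueError("rate and outage duration must be non-negative")
    if safety_factor < 1:
        raise ValueError("safety_factor must be at least 1")
    return math.ceil(messages_per_second * max_consumer_outage_seconds * safety_factor)


if __name__ == "__main__":
    # Illustrative numbers: 200 messages/second and a 5 minute consumer outage.
    print(required_queue_capacity(200, 5 * 60))  # 90000
```

The outage duration used here is effectively the MTTR of the consumer, so faster recovery directly reduces the queue capacity that has to be provisioned.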
Consider the technology to achieve the above-mentioned tactics
- Determine the software that will help to achieve fault detection, prevention, and recovery.
- Determine the kinds of faults the system can recover from and the type of complexity they introduce into the system.