Automated Disaster Recovery Using CloudWatch Alarms

Geoff
5 min read · May 21, 2024


Introduction

A highly available environment is standard best practice for maximizing the uptime of your systems. However, there are cases where this simply is not possible, especially with legacy systems where it is still necessary to operate servers as ‘Pets’ rather than the preferred ‘Cattle’.
In this blog post, I will go through how such systems can be set up for disaster recovery in an automated way that minimizes downtime, effectively making the environment highly available.

CloudWatch Alarms

HA vs DR

Just to briefly go over some key concepts as background to the method used in this post:

High Availability (HA) — Eliminates single points of failure. Introduces redundancy of resources so that if something fails, other resources can continue to serve the same needs

Disaster Recovery (DR) — Picks up where high availability fails or is not possible for some reason. Can range from restoring a backup to failing over to another environment. Requirements depend on the Recovery Time Objective (RTO), the maximum amount of time a system can be down, and the Recovery Point Objective (RPO), the amount of data loss that is tolerable (for example, an RPO of one hour means losing at most the last hour of data)

HA (Left) vs DR (Right)

The very simple diagram above shows HA vs DR for compute (EC2 instances). On the left, two instances simultaneously serve a load balancer in an HA setup, whereas on the right we switch between the two instances (failover), which applies to systems that can only be served by a single node (typical of legacy systems).

The method outlined in this blog post focuses on RTO, specifically for compute resources. In short, we will automate failover so that downtime is kept to a minimum.

HA on Legacy Systems?

A legacy system is an outdated solution that is still in use. It remains necessary to meet the needs it was designed for, but growth and innovation are limited. Some examples of things that may prevent typical high-availability setups on legacy systems:

  1. The system relies on static IP addressing — legacy systems can be tightly coupled to their IP addresses
  2. The system is stateful and can’t deal with processes on more than one node (monolithic nature)
  3. Nodes need to be managed as ‘Pets’ rather than ‘Cattle’ (this also relates back to statefulness).
    Pets — system resources that you care for, where the loss of a resource usually means some amount of system downtime
    Cattle — expendable resources that can easily be replaced

These factors, which prevent high availability, are the motivation behind using this method for automated disaster recovery.

Automated DR of compute

The diagram below shows the setup we are implementing to achieve automated disaster recovery using CloudWatch Alarms, which is the point of this blog post.

Steps that this automated setup executes:

  1. A CloudWatch alarm on the System status check is triggered when an instance fails the check (one alarm for each instance)
  2. A Lambda function is triggered and does the following:
    • Starts the failover instance
    • Associates the Elastic IPs with the failover instance (static IPs are typical in a legacy system)
    • Sets the listener weights to point to the failover target group

Sample Lambda code in Python:

import json
import boto3
import time


client = boto3.client('elbv2')
ec2_client = boto3.client('ec2')
lambda_client = boto3.client('lambda')

loadbalancer_arn = 'arn:aws:elasticloadbalancing:<region>:<accountNo>:loadbalancer/app/<albName>/<albID>'
target_group_arns = {
    'main': 'arn:aws:elasticloadbalancing:<region>:<accountNo>:targetgroup/<tgName>/<tgID>',
    'dr': 'arn:aws:elasticloadbalancing:<region>:<accountNo>:targetgroup/<tgDRName>/<tgDRID>',
}
instance_ids = {'main': 'i-<instanceID>', 'dr': 'i-<DRInstanceID>'}


# EIP allocation by network interface ID (static public IP to static private IP mapping)
# note: the two ENI keys must be distinct real IDs; the placeholders are labelled
# <primaryEniID> and <secondaryEniID> to make that explicit
eip_allocation = {
    'eni-<primaryEniID>': [  # primary instance eni 0
        {'id': 'eipalloc-<eipID>', 'ip': '10.0.0.67'},
        {'id': 'eipalloc-<eipID>', 'ip': '10.0.0.21'},
        {'id': 'eipalloc-<eipID>', 'ip': '10.0.0.71'},
        {'id': 'eipalloc-<eipID>', 'ip': '10.0.0.57'},
        {'id': 'eipalloc-<eipID>', 'ip': '10.0.0.203'},
    ],
    'eni-<secondaryEniID>': [  # secondary instance eni 0
        {'id': 'eipalloc-<eipID>', 'ip': '10.0.1.20'},
        {'id': 'eipalloc-<eipID>', 'ip': '10.0.1.245'},
        {'id': 'eipalloc-<eipID>', 'ip': '10.0.1.246'},
        {'id': 'eipalloc-<eipID>', 'ip': '10.0.1.72'},
        {'id': 'eipalloc-<eipID>', 'ip': '10.0.1.45'},
    ],
}


# return the first listener on the ALB
def listener_https():
    return client.describe_listeners(
        LoadBalancerArn=loadbalancer_arn,
    )['Listeners'][0]

# start the instance if it is stopped and wait until it is running
def start_instance(tg_instance_id):
    instance = ec2_client.describe_instances(InstanceIds=[tg_instance_id])['Reservations'][0]['Instances'][0]
    if instance['State']['Name'] != 'running':
        ec2_client.start_instances(InstanceIds=[tg_instance_id])
        time.sleep(10)
        instance = ec2_client.describe_instances(InstanceIds=[tg_instance_id])['Reservations'][0]['Instances'][0]
        while instance['State']['Name'] != 'running':
            time.sleep(10)
            instance = ec2_client.describe_instances(InstanceIds=[tg_instance_id])['Reservations'][0]['Instances'][0]
    return instance


# associate the EIPs with the failover instance's network interfaces
def associate_eips_to_failover(instance):
    enis = instance['NetworkInterfaces']
    for interface in enis:
        for eip in eip_allocation[interface['NetworkInterfaceId']]:
            print(f"AllocationId={eip['id']}, NetworkInterfaceId={interface['NetworkInterfaceId']}, PrivateIpAddress={eip['ip']}")
            ec2_client.associate_address(
                AllocationId=eip['id'],
                NetworkInterfaceId=interface['NetworkInterfaceId'],
                PrivateIpAddress=eip['ip'],
            )


# give the chosen target group all of the traffic and the other target group none
def set_target_weights(key, target_groups):
    for tg in target_groups:
        if tg['TargetGroupArn'] == target_group_arns[key]:
            tg['Weight'] = 1
        else:
            tg['Weight'] = 0


# switch the listener's default forward action to the re-weighted target groups
def switch_listener_tg(target_groups, listener):
    client.modify_listener(
        ListenerArn=listener['ListenerArn'],
        DefaultActions=[
            {
                "Type": "forward",
                "ForwardConfig": {
                    "TargetGroups": target_groups
                }
            }
        ]
    )
    print(f"target_groups switched to {listener_https()['DefaultActions'][0]['ForwardConfig']['TargetGroups']}")

# set target group weights for the listener based on the instance we are failing over to
def set_listener_weights(instance_id, target_groups, listener):
    if instance_id == instance_ids['dr']:
        set_target_weights('dr', target_groups)
    else:
        set_target_weights('main', target_groups)
    switch_listener_tg(target_groups, listener)


def lambda_handler(event, context):
    listener = listener_https()
    target_groups = listener['DefaultActions'][0]['ForwardConfig']['TargetGroups']
    print(f"target_groups before: {target_groups}")

    try:  # triggered by a CloudWatch alarm
        tg_instance_id = event['alarmData']['configuration']['metrics'][0]['metricStat']['metric']['dimensions']['InstanceId']
    except KeyError:
        print('event does not contain alarm data, nothing to do')
        return
    print('alarm triggered')
    for instance_id in instance_ids.values():
        if instance_id != tg_instance_id:  # fail over to the instance that did not trigger the alarm
            instance = start_instance(instance_id)
            associate_eips_to_failover(instance)
            set_listener_weights(instance_id, target_groups, listener)
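
To sanity-check the handler outside of AWS, you can call it with a trimmed-down event containing only the fields it reads. The payload below is a minimal sketch of the event a CloudWatch alarm Lambda action delivers (the real payload carries many more fields), and the instance ID is a placeholder:

# minimal sketch of an alarm-triggered event, only the fields the handler reads
test_event = {
    'alarmData': {
        'configuration': {
            'metrics': [
                {
                    'metricStat': {
                        'metric': {
                            'dimensions': {'InstanceId': 'i-<instanceID>'}
                        }
                    }
                }
            ]
        }
    }
}

lambda_handler(test_event, None)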

CloudWatch Alarm — Status Checks

  • System status — Monitors the AWS systems on which your instance runs. Detects underlying problems with the instance that require AWS involvement to repair, e.g. an AZ outage, a host failure, or other physical issues
  • Instance status — Monitors the software and network configuration of the individual instance. Requires your involvement to repair, e.g. incorrect networking or startup configuration, exhausted memory, or a corrupted file system

The reason we use the System status check as the trigger is that it fails when there is a problem on the AWS side (AZ outage, power failure, etc.), which is exactly the class of failure the instance cannot recover from on its own.
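
For reference, the alarm itself can also be created with boto3. The snippet below is a minimal sketch, assuming CloudWatch already has permission to invoke the Lambda function; the alarm name, function name, period, and evaluation periods are illustrative values, not part of the original setup:

import boto3

cloudwatch = boto3.client('cloudwatch')

# one alarm per instance on the System status check metric,
# invoking the failover Lambda directly as the alarm action
cloudwatch.put_metric_alarm(
    AlarmName='system-status-check-failed-main',  # illustrative name
    Namespace='AWS/EC2',
    MetricName='StatusCheckFailed_System',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-<instanceID>'}],
    Statistic='Maximum',
    Period=60,
    EvaluationPeriods=2,  # two consecutive failing minutes before alarming
    Threshold=1.0,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:lambda:<region>:<accountNo>:function:<failoverFunction>'],
)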

Conclusion

  • This pattern can be extended to use other trigger types if necessary, e.g. if you have an accurate application health check, you can use it to trigger the failover when higher sensitivity than the system status check is required
  • Failover operations may not work if they rely on the control plane being operational in the affected AZ, so be aware of this when writing your Lambda code
  • Warning - if you make a legacy system more resilient, it will decrease the drive to decommission it ;)
