Automated EC2 Recovery With AWS Backup Restore
In the previous write-up, “Getting Started with AWS Backup Plan”, we established the automated backup process. We will now look into the automated recovery procedure.
Recovery Considerations
Failure scenarios can differ greatly depending on the resource type and the regulatory requirements or SLAs we need to adhere to. Even for the same resource type, there can be different failure causes with different recovery scenarios. For example, if we are running an EC2 instance, one of its EBS volumes could fail, or the EC2 instance itself could fail. In this blog post we are looking into the latter scenario. However, the solution provides a flexible approach and can easily be adjusted.
AWS Backup - Restore
As we are building on top of the previous blog post, we already have backups in our AWS Backup Vault. Now we will look into the automated restore process in the event of a failure.
Step 1: Defining the Recovery Trigger
We are using the failure event of an EC2 instance as the trigger for our automated recovery. Therefore we need to intercept this event with an Amazon EventBridge rule. We are using Terraform to define our infrastructure as code:
resource "aws_cloudwatch_event_rule" "ec2_rule" {
… …
{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"],
"detail": {
"state": ["terminated"]
}
}
As the event target we define our restore Lambda function, which will contain all the recovery logic.
# Target for event
resource "aws_cloudwatch_event_target" "health-lambda" {
  rule      = aws_cloudwatch_event_rule.ec2_rule.name
  target_id = "TriggerRestoreLambda"
  arn       = aws_lambda_function.restore_lambda.arn
}
We also need to grant EventBridge (formerly CloudWatch Events) permission to invoke our Lambda function.
# Permission for EventBridge to trigger the Lambda
resource "aws_lambda_permission" "allow_cloudwatch_to_call_restore" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.restore_lambda.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ec2_rule.arn
}
Step 2: Finding the Latest Recovery Point
Every Lambda handler receives an event and a context object. From these we can extract the instance ID, the AWS account ID, and the AWS region:
# Get instance Id
def get_instance_id(event):
    return event["detail"]["instance-id"]

# Get AWS account Id
def get_account_id(context):
    return context.invoked_function_arn.split(":")[4]

# Get AWS region
def get_region(context):
    return context.invoked_function_arn.split(":")[3]
We are storing those values in variables so we can reuse them later and construct the ARN (Amazon Resource Name) of the failed instance:
ec2_arn = f'arn:aws:ec2:{region}:{account_id}:instance/{instance_id}'
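Putting these pieces together, a minimal sketch of the handler entry point could look like this (the handler name and the boto3 client setup are assumptions; the helper functions are the ones defined above):
import boto3

backup_client = boto3.client('backup')
ec2_client = boto3.client('ec2')

def lambda_handler(event, context):
    # Extract the identifiers of the failed instance from the event and context
    instance_id = get_instance_id(event)
    account_id = get_account_id(context)
    region = get_region(context)

    # Construct the ARN of the failed instance
    ec2_arn = f'arn:aws:ec2:{region}:{account_id}:instance/{instance_id}'

    # The recovery point lookup and restore calls (Steps 2 and 3) follow from here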
We can also extract the tags from the failed EC2 instance so that we can apply them to the recovered instance later on. This is only needed if we use tags, which is a best practice anyway.
tags = ec2_client.describe_tags(
    Filters=[
        {
            'Name': 'resource-id',
            'Values': [
                instance_id
            ]
        }
    ],
    # …
)
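The response contains a Tags list. A small helper (the name is an assumption) can reshape it into a plain dictionary that we can reuse when tagging the recovered instance later on:
def tags_to_dict(describe_tags_response):
    # Convert the describe_tags response into a simple {Key: Value} mapping
    return {tag['Key']: tag['Value'] for tag in describe_tags_response['Tags']}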
As the next step we look up the list of available recovery points for the instance. The first item in the list gives us the latest recovery point.
def get_recovery_points_by_ec2_arn(ec2_arn):
    return backup_client.list_recovery_points_by_resource(
        MaxResults=120,
        ResourceArn=ec2_arn
    )
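From that response we can pick the latest recovery point. A small helper could look like this (the helper name is an assumption; sorting by CreationDate makes the choice of the newest recovery point explicit rather than relying on the response order):
def get_latest_recovery_point_arn(ec2_arn):
    recovery_points = get_recovery_points_by_ec2_arn(ec2_arn).get('RecoveryPoints', [])
    if not recovery_points:
        raise RuntimeError(f'No recovery points found for {ec2_arn}')
    # Sort newest first so the first element is the latest recovery point
    latest = sorted(recovery_points, key=lambda rp: rp['CreationDate'], reverse=True)[0]
    return latest['RecoveryPointArn']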
Step 3: Restoring the Latest Recovery Point
In order to start the recovery procedure there are a couple of things we need to know:
- The VPC and subnet ID into which we want to deploy
- The instance type we want to use
- The security group that we want to attach to the instance
- Potentially some optional items - e.g. an IAM role if we need it on our instance
When calling the start_restore_job function we can pass those parameters as Metadata. In our example we have hard-coded these values; using a key-value store would be a more flexible and scalable approach.
response = backup_client.start_restore_job(
    RecoveryPointArn=recovery_point_arn,
    Metadata={
        'Encrypted': 'false',
        'InstanceType': 't2.micro',
        'VpcId': 'vpc-00df06351130c3cb5',
        'SubnetId': 'subnet-036b1baf8e5341fbf',
        'SecurityGroups': 'sg-0eced3a3fd5c7508e'
    },
    IamRoleArn=role_arn,
    ResourceType='EC2'
)
restore_job_id = response['RestoreJobId']
Once the restore job has been started we can see it as pending in the AWS Backup console. Shortly after that we will have a new EC2 instance running that is based on our latest recovery point.
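If we also want to follow the job programmatically, for example to kick off post-restore steps, a simple polling sketch could look like this (describe_restore_job is part of the AWS Backup API; the helper name and the polling interval are assumptions):
import time

def wait_for_restore(restore_job_id, poll_seconds=30):
    # Poll the restore job until it reaches a terminal state
    while True:
        job = backup_client.describe_restore_job(RestoreJobId=restore_job_id)
        status = job['Status']
        if status == 'COMPLETED':
            # ARN of the newly created EC2 instance
            return job['CreatedResourceArn']
        if status in ('ABORTED', 'FAILED'):
            raise RuntimeError(f'Restore job {restore_job_id} finished with status {status}')
        time.sleep(poll_seconds)
Keep the 15-minute Lambda timeout in mind; for long-running restores, reacting to the restore-completion event with a second EventBridge rule is the more robust option.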
Other Considerations
Some further considerations are:
- Recovery testing:
Being able to recover our instances or EBS volumes doesn’t mean that the data is readable or consistent. It is crucial to do regular recovery tests and validate the data. This could be a manual process where we log on and validate the files or the database. It could also be automated, although that is a more sophisticated approach.
- Tagging of recovered instances:
In our example we collect the tags of the failed instance. Ideally we want to copy those to the newly recovered instance. For this purpose we could intercept the restore-completion event and trigger another Lambda function.
- Re-usability:
In our example we hard-coded information like the VPC and subnet IDs. We could easily store them in the SSM Parameter Store instead, as sketched below.
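For the re-usability point above, a minimal sketch of reading those values from the SSM Parameter Store instead of hard-coding them could look like this (the parameter names are assumptions):
import boto3

ssm_client = boto3.client('ssm')

def get_parameter(name):
    # The parameter names used here are hypothetical; adjust them to your own naming convention
    return ssm_client.get_parameter(Name=name)['Parameter']['Value']

restore_metadata = {
    'Encrypted': 'false',
    'InstanceType': get_parameter('/restore/ec2/instance-type'),
    'VpcId': get_parameter('/restore/ec2/vpc-id'),
    'SubnetId': get_parameter('/restore/ec2/subnet-id'),
    'SecurityGroups': get_parameter('/restore/ec2/security-group-id')
}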
Source Code
The complete source code is available on GitHub: