Automated EC2 Recovery With AWS Backup Restore

Recovery Considerations

Failure scenarios can be very different depending on the resource type and the regulatory requirements or SLAs we need to adhere to. Even for the same resource type, there can be different failure reasons with different recovery scenarios. For example, if we are running an EC2 instance, one of its EBS volumes or the instance itself could fail. In this blog post we are looking into the latter scenario. However, the solution provides a flexible approach and can easily be adjusted.

AWS Backup - Restore

As we are building on top of the previous blog post, we already have backups in our AWS Backup Vault. Now we will look into the automated restore process in the event of a failure.

# Event rule that matches EC2 instance termination events
resource "aws_cloudwatch_event_rule" "ec2_rule" {
  # ...
  event_pattern = jsonencode({
    source        = ["aws.ec2"]
    "detail-type" = ["EC2 Instance State-change Notification"]
    detail = {
      state = ["terminated"]
    }
  })
}
# Target for the event
resource "aws_cloudwatch_event_target" "health-lambda" {
  rule      = aws_cloudwatch_event_rule.ec2_rule.name
  target_id = "TriggerRestoreLambda"
  arn       = aws_lambda_function.restore_lambda.arn
}
# Permission for EventBridge to trigger the Lambda
resource "aws_lambda_permission" "allow_cloudwatch_to_call_restore" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.restore_lambda.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ec2_rule.arn
}
# Get the instance id from the event
def get_instance_id(event):
    return event["detail"]["instance-id"]

# Get the AWS account id from the Lambda context
def get_account_id(context):
    return context.invoked_function_arn.split(":")[4]

# Get the AWS region from the Lambda context
def get_region(context):
    return context.invoked_function_arn.split(":")[3]
ec2_arn = f'arn:aws:ec2:{region}:{account_id}:instance/{instance_id}'
tags = ec2_client.describe_tags(
    Filters=[
        {
            'Name': 'resource-id',
            'Values': [instance_id]
        }
    ]
)
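The describe_tags response returns a flat list of entries that also carry resource metadata. To re-apply the tags to the recovered instance later, it helps to normalize them into the shape create_tags accepts; a small sketch (the aws:-prefix filter is there because AWS-reserved tags cannot be re-applied):

```python
def tags_for_create(described_tags):
    """Convert a describe_tags response into the {Key, Value} list
    accepted by create_tags, dropping AWS-reserved 'aws:' tags."""
    return [
        {"Key": t["Key"], "Value": t["Value"]}
        for t in described_tags.get("Tags", [])
        if not t["Key"].startswith("aws:")
    ]
```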
def get_recovery_points_by_ec2_arn(ec2_arn):
    return backup_client.list_recovery_points_by_resource(
        MaxResults=120,
        ResourceArn=ec2_arn
    )
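The call above can return several recovery points, so before starting the restore we need to choose one. A minimal sketch of that selection in pure Python, assuming the response shape of list_recovery_points_by_resource (most recent completed recovery point wins):

```python
def pick_latest_recovery_point(response):
    """Return the ARN of the most recent completed recovery point,
    or None if there is nothing to restore from."""
    completed = [
        rp for rp in response.get("RecoveryPoints", [])
        if rp.get("Status") == "COMPLETED"
    ]
    if not completed:
        return None
    # CreationDate is a datetime in the real API response; any
    # comparable value works for the max() below.
    latest = max(completed, key=lambda rp: rp["CreationDate"])
    return latest["RecoveryPointArn"]
```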
For the restore job we need to provide some additional metadata, such as:

  • The VPC and subnet ID into which we want to deploy
  • The instance type we want to use
  • The security group that we want to attach to the instance
  • Potentially some optional items - e.g. an IAM role if we need it on our instance
restore_job_id = backup_client.start_restore_job(
    RecoveryPointArn=recovery_point_arn,
    Metadata={
        'Encrypted': 'false',
        'InstanceType': 't2.micro',
        'VpcId': 'vpc-00df06351130c3cb5',
        'SubnetId': 'subnet-036b1baf8e5341fbf',
        'SecurityGroups': 'sg-0eced3a3fd5c7508e'
    },
    IamRoleArn=role_arn,
    ResourceType='EC2'
)
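start_restore_job returns immediately, while the restore itself runs asynchronously. In production we would rather react to the AWS Backup restore-completed event than block a Lambda, but for testing the flow end to end a simple polling helper is handy. A sketch, with the boto3 Backup client injected so it can be exercised without AWS access (the parameter names of describe_restore_job are the real API; the helper itself is an assumption):

```python
import time

def wait_for_restore(backup_client, restore_job_id,
                     poll_seconds=30, timeout_seconds=1800):
    """Poll describe_restore_job until the job finishes.
    Returns the final job description, raises RuntimeError on
    failure or timeout."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        job = backup_client.describe_restore_job(RestoreJobId=restore_job_id)
        status = job["Status"]
        if status == "COMPLETED":
            return job
        if status in ("ABORTED", "FAILED"):
            raise RuntimeError(
                f"Restore job {restore_job_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise RuntimeError(f"Restore job {restore_job_id} timed out")
```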

Other Considerations

Some further considerations are:

  • Recovery testing:
    Being able to recover our instances or EBS volumes doesn’t mean that the data is readable or consistent. It is crucial to run regular recovery tests and validate the data. This could be a manual process where we log on and validate the files or database, or it could be automated, which is a more sophisticated approach.
  • Tagging of recovered instances:
    In our example we collect the tags of the failed instance. Ideally we want to copy those to the newly recovered instance. For this purpose we could intercept the recovery event and trigger another Lambda function.
  • Re-usability:
    In our example we hard-coded information like VPC and subnet. We could easily store them in the SSM Parameter Store.
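Picking up the last point, the hard-coded VpcId and SubnetId could be read from SSM Parameter Store at restore time. A sketch with the SSM client injected; the /ec2-restore/ parameter path is an assumption, and get_parameters_by_path is paginated in the real API, so with more than ten parameters we would follow NextToken:

```python
def load_restore_targets(ssm_client, prefix="/ec2-restore/"):
    """Read restore target settings (VpcId, SubnetId, ...) from SSM
    Parameter Store instead of hard-coding them.
    The parameter path `prefix` is an assumption for illustration."""
    response = ssm_client.get_parameters_by_path(Path=prefix, Recursive=True)
    # Strip the path prefix so the keys match the Metadata keys
    # expected by start_restore_job.
    return {
        p["Name"][len(prefix):]: p["Value"]
        for p in response["Parameters"]
    }
```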

Source Code

The complete source code is available on GitHub.


Gerald Bachlmayr


Principal Cloud Architect at Cuscal Payments