DevOps Backup in Amazon EC2

Implementing a Distributed Solution for a Distributed Problem

AWS Startups
AWS Startup Collection
Oct 23, 2014

Backup and recovery are critical to protecting data and to giving your startup a platform that supports uninterrupted growth. At the same time, they can be a challenge to implement well. This article guides you through using Amazon Simple Storage Service (S3) and Amazon Elastic Block Store (EBS) snapshots for reliable backup across your Amazon EC2 workloads.

The Problem

As your startup grows, data is generated on an arbitrarily large number of endpoints: laptops, desktops, servers, virtual machines, and now mobile devices. In other words, the problem is distributed in nature. Current backup software, on the other hand, is very centralized: The general model is to collect data from many devices and store it in a single place. Sometimes a copy of that stored data is also sent to tape.

But centralized data backup comes with several significant costs:

• Productivity falls during backups as backups must be staggered to avoid overwhelming a centralized backup storage device.
• Backing up centrally creates a single point of failure that can wipe out all backups.
• Since the tool isn’t integrated with the infrastructure, the backup software becomes an additional vendor and cost.
• The backup target is often colocated with source data, offering only resilience for hardware failure, not disaster recovery (DR).
• In a disaster scenario, the same staggered I/O pattern must be applied to recovery (provided the backup wasn’t part of the disaster), significantly increasing mean time to recovery (MTTR) and impacting business continuity.

As these costs suggest, the centralized backup paradigm does not scale, often resulting in broken recovery service-level agreements (SLAs).

The Solution

You can use some of Amazon’s integrated services to create a solution that takes advantage of the scalability and distributed nature of the cloud to simplify and reduce the cost of backup: a distributed solution for a distributed problem. In this solution, endpoints act as autonomous units, backing themselves up with Amazon EBS snapshots. The snapshots are stored in Amazon S3, a highly durable (99.999999999%), regionally accessible, globally distributable, and disruptively inexpensive data store. Instead of managing the backup centrally, this approach centralizes only monitoring and alerting. Each endpoint also writes local logs, so when an alert is triggered, fine-grained logging is available on the instance for further troubleshooting.

For Amazon EC2 instances, we use snapshots, and for on-premises workloads we use file-level backups to Amazon S3.

Advantages of a Cloud-Based Backup Solution

• Simple, free, and multiplatform tooling
• Extremely durable data store
• Free de-duplication
• Nearly infinite scalability
• Failure risk is contained to a very small radius (per-device basis)
• Ability to recover from disaster within AWS for continuity of business while local data center is offline
• No prepay; pay for what you use
• No software licensing

Corresponding Disadvantages

• No more super cool tape robots ☹
• Limited by connectivity to AWS
• Requires an operational paradigm shift

DevOps Autonomous Backups: A Cloud-based Approach

Although this article identifies specific software as examples, the aim is to present an approach to backup rather than a single solution. If you prefer other software or coding languages, feel free to implement them and share.

The general premise of this approach is that you can put each endpoint in charge of backing itself up and reporting the status and metrics of those backups. Here are the steps:

  1. Set up one of the Amazon CLIs. (See the documentation for details.) These come preconfigured on Amazon Linux in EC2.
  2. Create an AWS Identity and Access Management (IAM) role with the ability to create, tag, and delete snapshots; copy them to other regions; and PUT Amazon CloudWatch custom metrics.
  3. Set up CloudWatch alarms and create an Amazon SNS topic to connect the alarms with systems and support personnel.
  4. Configure a scheduling daemon. The example in this article uses Cron and Windows Task Scheduler.
  5. Avail yourself of a local logging mechanism. The example in this article uses syslog and Windows event log.
  6. Prepare a script or executable to launch snapshots and push metrics for EC2 instances. This article provides a sample Python script, which you can download from an Amazon S3 bucket.

Step 1: Set up the AWS Command Line Interface

Although you can use the AWS Management Console to do many tasks, the AWS Command Line Interface provides another useful option for interacting with the AWS cloud. The AWS CLI page includes download links for installing the CLI on your system.

Step 2: IAM Role Configuration

We begin with the most important part, the IAM role. This role allows the instance to make API calls on its own behalf, but only the actions you preapprove. Configure your instance(s) to launch with this new role. See the Amazon EC2 User Guide for IAM policies for EC2.

For examples of the minimum policy needed, download one of these JSON files:

https://s3.amazonaws.com/AWSTools/snapshotpolicy.json
http://s3.amazonaws.com/AWSTools/cloudwatchpolicy.json
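
As a rough illustration of what such a minimum policy might contain, the sketch below builds a policy document in Python. The action list is an assumption derived from the capabilities named in the steps above (create, tag, delete, and copy snapshots, plus PUT CloudWatch metrics); it is not a copy of the linked files, so compare it against them before use.

```python
import json

# Assumed minimum permissions for the backup role; not taken from the
# linked policy files, so verify against them before relying on this.
SNAPSHOT_ACTIONS = [
    "ec2:DescribeInstances",
    "ec2:DescribeVolumes",
    "ec2:DescribeSnapshots",
    "ec2:DescribeTags",
    "ec2:CreateSnapshot",
    "ec2:DeleteSnapshot",
    "ec2:CopySnapshot",
    "ec2:CreateTags",
]
METRIC_ACTIONS = ["cloudwatch:PutMetricData"]


def build_backup_policy():
    """Return an IAM policy document (as a dict) for the backup role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": SNAPSHOT_ACTIONS, "Resource": "*"},
            {"Effect": "Allow", "Action": METRIC_ACTIONS, "Resource": "*"},
        ],
    }


def policy_json():
    """Serialize the policy so it can be pasted into the IAM console."""
    return json.dumps(build_backup_policy(), indent=2)
```

Calling policy_json() yields text you could paste into the IAM policy editor. Keeping the snapshot and metric permissions in separate statements makes it easier to tighten one without touching the other.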

Next, launch an instance with this IAM role. When this article was written, a role could not be applied to a running instance, so to cover existing instances, update your deployment methods to make this role the new default.

To do this using the AWS Management Console, create your IAM role as explained in this quick video tutorial. Creating a role in the console ensures you have an instance profile for this role when you create your instance. Then go to the EC2 console, and click Launch Instance. Follow the steps in the wizard. On the Step 3: Configure Instance Details page, choose the IAM role you created from the IAM role dropdown. Then complete the remaining steps of the wizard.

To launch an instance with your IAM role using the command line, follow the example below.
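
A minimal sketch of such a launch, written in Python with boto3 (an assumption; the article's own tooling is the AWS CLI and the older boto library). The function names and the example values in the comments are hypothetical.

```python
def build_launch_request(image_id, instance_type, instance_profile_name):
    """Build keyword arguments for EC2 run_instances with an IAM instance profile."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # The instance profile carries the IAM role created in this step.
        "IamInstanceProfile": {"Name": instance_profile_name},
    }


def launch_with_role(image_id, instance_type, instance_profile_name, region):
    """Launch one instance; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(
        **build_launch_request(image_id, instance_type, instance_profile_name)
    )
    return response["Instances"][0]["InstanceId"]
```

For example, launch_with_role("ami-12345678", "t2.micro", "backup-role", "us-east-1") would start a single instance with a role named backup-role, where all four values are placeholders.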

Step 3: Define Amazon CloudWatch Custom Metrics and Configure Notifications

Now create three custom CloudWatch metrics (snapshot size, elapsed time, and errors) so that we can build CloudWatch alarms on them.
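
One way to publish these three metrics from a backup script is sketched below with boto3 (an assumption, as are the Custom/Backup namespace, the metric names, and the InstanceId dimension). Custom metrics come into existence the first time data is put to them, so no separate creation call is needed.

```python
def build_metric_data(instance_id, size_gib, elapsed_seconds, error_count):
    """Build the MetricData list for CloudWatch put_metric_data."""
    dimensions = [{"Name": "InstanceId", "Value": instance_id}]

    def metric(name, value, unit):
        return {
            "MetricName": name,
            "Dimensions": dimensions,
            "Value": float(value),
            "Unit": unit,
        }

    return [
        metric("SnapshotSize", size_gib, "Gigabytes"),
        metric("ElapsedTime", elapsed_seconds, "Seconds"),
        metric("Errors", error_count, "Count"),
    ]


def publish_backup_metrics(metric_data, region, namespace="Custom/Backup"):
    """Send the metrics; requires boto3 and the cloudwatch:PutMetricData permission."""
    import boto3  # imported lazily so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=metric_data)
```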

Next, create the alarms that will generate notifications if something goes wrong with our backups. The thresholds will depend on your internal SLAs and backup windows. We’ll use Amazon Simple Notification Service (SNS) to send the notifications. For more information on using Amazon SNS with Amazon CloudWatch, check out this video tutorial or see the Amazon CloudWatch Developer Guide.
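
The sketch below shows one possible alarm on the error metric, wired to an SNS topic, using boto3 (an assumption). The alarm name, topic name, namespace, metric name, period, and threshold are all illustrative placeholders to be replaced with values that match your SLAs.

```python
def build_error_alarm(instance_id, topic_arn, namespace="Custom/Backup"):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": "backup-errors-" + instance_id,
        "Namespace": namespace,
        "MetricName": "Errors",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Sum",
        "Period": 3600,            # evaluate one-hour windows
        "EvaluationPeriods": 1,
        "Threshold": 0.0,          # any reported error triggers the alarm
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def create_topic_and_alarm(instance_id, email, region):
    """Create an SNS topic, subscribe an address, and attach the alarm to it."""
    import boto3  # imported lazily so the builder above works without boto3

    sns = boto3.client("sns", region_name=region)
    topic_arn = sns.create_topic(Name="backup-alerts")["TopicArn"]
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**build_error_alarm(instance_id, topic_arn))
    return topic_arn
```

Email subscriptions must be confirmed by the recipient before notifications are delivered.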

Step 4: Creating a Recurring Event to Launch Backup Script

Creating a recurring event for launching a backup script in Linux is as simple as the following four commands:
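
A small sketch of the kind of crontab entry this step calls for, generated in Python so the schedule can be templated per host. The script location and interpreter path are hypothetical, and the daily 02:00 schedule is only an example.

```python
def cron_entry(script_path, hour=2, minute=0, python="/usr/bin/python"):
    """Return a crontab line that runs the backup script once a day."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Crontab fields: minute, hour, day of month, month, day of week.
    return "{minute} {hour} * * * {python} {script}".format(
        minute=minute, hour=hour, python=python, script=script_path
    )
```

For instance, cron_entry("/opt/backup/SnapshotPruneAndLog.py") produces a line that can be added to the crontab of the user that runs the backup (for example with crontab -e).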

For Windows it’s a bit more involved, but not much. You’ll need to download Python, boto, and the Python script linked below. (If you’re using your own tooling, you can disregard boto and Python.) Extract the boto archive to a directory of your choosing.

https://github.com/boto/boto/archive/develop.zip
http://www.python.org/ftp/python/2.7.5/python-2.7.5.amd64.msi (64bit)
http://www.python.org/ftp/python/2.7.5/python-2.7.5.msi (32bit)
https://s3.amazonaws.com/AWSTools/SnapshotPruneAndLog.py

Now that the prerequisites are in place, we follow a pattern similar to our Linux steps with Windows Task Scheduler:
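
As a sketch of the Windows side, the Python below assembles a schtasks command that registers a daily task. The task name, start time, and both paths are hypothetical; the paths are assumed to contain no spaces, since the command string is passed unquoted.

```python
import subprocess


def schtasks_args(script_path, start_time="02:00",
                  task_name="EBS Snapshot Backup",
                  python_exe=r"C:\Python27\python.exe"):
    """Build the argument list for a daily Windows scheduled task."""
    # Assumes neither path contains spaces; quote them otherwise.
    command = python_exe + " " + script_path
    return [
        "schtasks", "/Create",
        "/TN", task_name,     # task name
        "/TR", command,       # program to run
        "/SC", "DAILY",       # schedule type
        "/ST", start_time,    # start time, HH:MM (24-hour)
    ]


def register_task(script_path):
    """Create the task; run from an elevated prompt on Windows."""
    subprocess.check_call(schtasks_args(script_path))
```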

Step 5: Logging

Logging is handled locally, within your backup script. In the example, the log target is whatever the system logger is configured to use: the Windows event log on Windows and syslog on Linux. For more advanced deployments or deployments using an ephemeral fleet, consider using a log shipping product like Loggly or Logstash, or simply write a cron job or Windows scheduled task to ship the logs to Amazon S3 for later parsing and lifecycle management.
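
One way a backup script might pick the local log target is sketched below using Python's standard logging module. The logger name is hypothetical, the /dev/log socket path is a Linux convention, and the fallback to standard error is an added assumption for hosts where neither target is available.

```python
import logging
import logging.handlers
import os
import sys


def build_backup_logger(name="ebs-backup"):
    """Return a logger that writes to the platform's system log."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if logger.handlers:          # already configured; avoid duplicate handlers
        return logger

    handler = None
    try:
        if sys.platform.startswith("win"):
            # Writes to the Windows event log; needs the pywin32 package.
            handler = logging.handlers.NTEventLogHandler(name)
        elif os.path.exists("/dev/log"):
            # Local syslog socket on most Linux distributions.
            handler = logging.handlers.SysLogHandler(address="/dev/log")
    except Exception:
        handler = None
    if handler is None:
        handler = logging.StreamHandler()   # fallback: standard error

    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

The script can then call build_backup_logger().info("snapshot started") at each stage, and the same messages remain on the instance for troubleshooting when an alarm fires.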

Step 6: Prepare a Script for Snapshots

You can download a sample Python script that uses boto to do the following:

• Connect to AWS endpoints from within the EC2 instance to enumerate the Amazon EBS volumes that are attached locally to the instance
• Enumerate the snapshots of each volume
• Prune the stale snapshots in accordance with the “retention” tag assigned to the instance
• Create new snapshots of each volume

Volume tags are preserved in the snapshots and the process is logged locally as well as in Amazon CloudWatch. The code is easily extended to log to other frameworks or copy snapshots to disaster-recovery regions.
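
The prune-then-snapshot workflow described above can be sketched as follows. This is not the linked script: it uses boto3 rather than boto, takes the retention period as a number of days (an assumed interpretation of the “retention” tag), and handles a single volume without pagination.

```python
from datetime import datetime, timedelta, timezone


def stale_snapshots(snapshots, retention_days, now=None):
    """Return the IDs of snapshots older than the retention period."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff]


def prune_and_snapshot(volume_id, retention_days, region):
    """Delete stale snapshots of one volume, then start a new snapshot."""
    import boto3  # imported lazily so stale_snapshots works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    existing = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": [volume_id]}],
    )["Snapshots"]

    for snapshot_id in stale_snapshots(existing, retention_days):
        ec2.delete_snapshot(SnapshotId=snapshot_id)

    new_snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="Automated backup of " + volume_id,
    )
    return new_snapshot["SnapshotId"]
```

Keeping the age check in a pure function makes the retention rule easy to test without touching AWS.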

Note: In addition to the workflow outlined above and contained in the script, be sure to use your best judgment in implementation details. For databases, it is recommended that you use the first-party tools supplied for their backup. In addition, if you are not using a journaled file system, like NTFS, or your workload does not provide transactional consistency, consider adding a file system quiesce operation (or a call to pause I/O in your application) to the backup script long enough to initiate the snapshot. With EBS snapshots, you do not need to lock I/O for the duration of the snapshot.
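
As one possible form of the quiesce operation mentioned in the note, the sketch below wraps snapshot initiation in a Linux fsfreeze call. It assumes the util-linux fsfreeze tool is installed, that the script runs as root, and that the mount point is a data volume rather than the root file system; the function name is hypothetical.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def frozen_filesystem(mount_point, run=subprocess.check_call):
    """Suspend writes to a mounted file system for the duration of the block."""
    run(["fsfreeze", "--freeze", mount_point])
    try:
        yield
    finally:
        # Always thaw, even if starting the snapshot raised an error.
        run(["fsfreeze", "--unfreeze", mount_point])
```

Only the call that initiates the snapshot needs to sit inside the with block, so the freeze lasts seconds rather than the full duration of the snapshot.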

Follow Up

After the first day, look in CloudWatch to monitor the progress of the backups. Modify your alarm for elapsed time to what you expect for your backup window based on a running average of elapsed time.
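
A rough sketch of deriving the elapsed-time threshold from recent history, using boto3 (an assumption). The namespace, metric name, dimension, seven-day window, and 1.5x headroom factor are all illustrative choices, not values from the article.

```python
from datetime import datetime, timedelta, timezone


def suggested_threshold(elapsed_seconds, headroom=1.5):
    """Average the recent elapsed times and add headroom for normal variation."""
    if not elapsed_seconds:
        raise ValueError("need at least one elapsed-time datapoint")
    average = sum(elapsed_seconds) / float(len(elapsed_seconds))
    return average * headroom


def recent_elapsed_times(instance_id, region, days=7, namespace="Custom/Backup"):
    """Fetch daily average elapsed times from CloudWatch; requires boto3."""
    import boto3  # imported lazily so suggested_threshold works without boto3

    end = datetime.now(timezone.utc)
    stats = boto3.client("cloudwatch", region_name=region).get_metric_statistics(
        Namespace=namespace,
        MetricName="ElapsedTime",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=86400,
        Statistics=["Average"],
    )
    return [point["Average"] for point in stats["Datapoints"]]
```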

Recovery with EBS Snapshots

The recovery of a snapshot is very simple and similar to the process on a typical storage area network (SAN) appliance. To begin, identify the snapshot you want to recover. If you’ve adhered to our prerequisites and side notes, you might do this with a simple script like this:
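
A minimal sketch of such a lookup, written with boto3 (an assumption). It presumes snapshots carry a tag holding the host name, as recommended in the best practices; the tag key Name is a hypothetical choice, and pagination is omitted.

```python
def latest_snapshot(snapshots):
    """Return the most recent completed snapshot, or None if there is none."""
    completed = [s for s in snapshots if s.get("State") == "completed"]
    if not completed:
        return None
    return max(completed, key=lambda s: s["StartTime"])


def find_snapshots_for_host(host_name, region, tag_key="Name"):
    """List your own snapshots tagged with the given host name; requires boto3."""
    import boto3  # imported lazily so latest_snapshot works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "tag:" + tag_key, "Values": [host_name]}],
    )["Snapshots"]
```

Passing the result of find_snapshots_for_host to latest_snapshot gives the newest usable restore point for that host.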

You can also browse snapshots in the EBS console to find the one you want.

From here we just need to create a volume from the snapshot and attach it to an instance to recover the disk from its previous state.
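
Creating the volume might look like the following boto3 sketch (boto3 and the function names are assumptions). The volume must be created in the same Availability Zone as the instance it will be attached to.

```python
def build_create_volume_request(snapshot_id, availability_zone):
    """Build keyword arguments for EC2 create_volume from a snapshot."""
    return {"SnapshotId": snapshot_id, "AvailabilityZone": availability_zone}


def restore_volume(snapshot_id, availability_zone, region):
    """Create a volume from the snapshot and wait until it is available."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    volume = ec2.create_volume(
        **build_create_volume_request(snapshot_id, availability_zone)
    )
    volume_id = volume["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    return volume_id
```

The returned volume ID is the value needed for the attach step.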

The previous command line should return a response like the following. Take note of the volumeId returned.

If the volume you are recovering is a root or bootable volume, you will first need to shut down the instance you want to attach it to and detach the current root volume.

Now attach that volume to an instance to complete the recovery. Just make sure to use an unused device.
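
A sketch of the attach step with boto3 (an assumption), including a helper that picks an unused device name. The helper assumes a Linux instance and the /dev/sdf through /dev/sdp range commonly recommended for EBS volumes; Windows instances use different device names.

```python
def first_unused_device(used_devices):
    """Pick the first free device name in the /dev/sdf to /dev/sdp range."""
    for letter in "fghijklmnop":
        candidate = "/dev/sd" + letter
        if candidate not in used_devices:
            return candidate
    raise RuntimeError("no free device name between /dev/sdf and /dev/sdp")


def attach_restored_volume(volume_id, instance_id, region):
    """Attach the volume to the instance on an unused device; requires boto3."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    used = {m["DeviceName"] for m in instance.get("BlockDeviceMappings", [])}

    device = first_unused_device(used)
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return device
```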

Be sure to tag the volume with the same relevant tags that its parent snapshot had, as this does not happen automatically.
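
Copying the tags across could be sketched like this with boto3 (an assumption; the function names are hypothetical). Tags whose keys begin with aws: are reserved and cannot be set by users, so the helper skips them.

```python
def copyable_tags(snapshot_tags):
    """Return the snapshot's tags, minus reserved aws: keys."""
    return [
        {"Key": tag["Key"], "Value": tag["Value"]}
        for tag in snapshot_tags
        if not tag["Key"].startswith("aws:")
    ]


def copy_tags_to_volume(snapshot_id, volume_id, region):
    """Apply the parent snapshot's tags to the restored volume; requires boto3."""
    import boto3  # imported lazily so copyable_tags works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    snapshot = ec2.describe_snapshots(SnapshotIds=[snapshot_id])["Snapshots"][0]
    tags = copyable_tags(snapshot.get("Tags", []))
    if tags:
        ec2.create_tags(Resources=[volume_id], Tags=tags)
    return tags
```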

Your recovery is complete.

Best Practices: Reducing Operational Complexity

Build self-describing infrastructure. By tagging all resources within your environment, you make your infrastructure easier to support and query, and you enable detailed billing and permissions. This example assumes you observe this practice and tag Amazon EBS volumes and their corresponding snapshots with the host name or other relevant information about the instance they descend from. A configuration management database (CMDB) is another appropriate way to manage this data, but replicating the data is preferred for resolving discrepancies or in case access to one source is lost.

For more articles visit: http://aws.amazon.com/architecture

Nic Branker
Solutions Architect
