Enjoying an AWS Aurora main course?… How about ordering some healthy RDS backup sides beautifully served up by Terraform!

Laurent Allegre
Airwalk Reply
5 min read · Feb 15, 2022

Many teams run MySQL or PostgreSQL databases, often for critical software applications or internal systems. Increasingly, companies are migrating, or planning to migrate, these databases to the cloud to reduce the complexity and overhead of achieving high availability and resilience, and of managing backups and upgrades.

In this article, we are going to describe how to leverage Terraform to deploy some additional Aurora (RDS) backup features.

We will start by going through a summary of the standard Aurora backup capabilities, and the reasons some companies might need additional backup options. We will follow up with some engineering considerations, and finally, describe the solution chosen to expand on some RDS API options that are not typically deployed with the standard Aurora service.

What are the Aurora backup standard features?

AWS provides some pretty awesome backup features out of the box! As a quick summary, we get:

  • Automatic cluster volume backups to the Aurora service’s backend S3 storage
  • Continuous and incremental backups
  • No performance impact or interruption of service
  • Retention period between 1 and 35 days
  • Choice of backup window
  • Restore from backup data to any point within the retention period
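The retention period and backup window from the list above map directly onto arguments of the standard Terraform `aws_rds_cluster` resource. As a minimal sketch (identifiers and values here are illustrative):

```hcl
resource "aws_rds_cluster" "example" {
  cluster_identifier = "example-aurora-cluster"
  engine             = "aurora-postgresql"
  master_username    = "dbadmin"
  master_password    = var.master_password # see the Secrets Manager note below

  # Keep continuous/incremental automated backups for the maximum 35 days
  backup_retention_period = 35

  # Daily window (UTC) during which automated backups are preferred
  preferred_backup_window = "02:00-03:00"
}
```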

So why would some businesses need anything more?

This of course depends on which regulations and laws your business must comply with; examples include:

  • Align with a national regulatory records retention schedule
  • Implement a disaster recovery strategy
  • Implement a Cloud exit strategy
  • Export Database data for analytics

How can we extend the Aurora service to meet the additional backup requirements?

Aurora is part of the managed database service Amazon Relational Database Service (Amazon RDS). The AWS RDS API exposes a variety of methods through programming and command line interfaces.

Terraform can of course provision, scale and modify RDS resources, which enables us to manage RDS instances and clusters declaratively.
However, at the time of writing, some RDS interfaces and methods are not available as standard Terraform ‘resources’, nor are they necessarily suitable to be provisioned by Terraform.
For instance, RDS operations on the Aurora Clusters such as reboot_db_instance would more likely fall under the umbrella of database maintenance and operations, and for this reason, many teams will understandably decide that other tools or processes should be used.

Implementation considerations

For this particular task our main engineering considerations are:

  • Support Aurora infrastructure deployment options, giving consumer teams maximum flexibility and choice for their databases
  • Secure implementation, yet simple to manage and support
  • Automated and cloud native
  • Running costs, bearing in mind that some businesses run a variety of projects, from small to very large databases
  • AWS RDS documentation extract: Amazon RDS is asynchronous, which means that some interfaces might require techniques such as polling or callback functions to determine when a command has been applied.

Implementation solution

A good starting point to deploy Aurora with Terraform can be found in this Terraform registry GitHub link. You should review and discuss all the arguments available in the aws_rds_cluster and aws_rds_cluster_instance resources. At a minimum, we’ll use encryption through a KMS key, and refrain from declaring a plain text password for the Aurora master user, preferring a data source referring to a password stored in Secrets Manager (see this article for a possible Secrets management solution).
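Sketching that minimum, the pieces fit together as follows (the secret name and resource identifiers are illustrative, not prescriptive):

```hcl
# Password is read from Secrets Manager rather than declared in plain text.
# "aurora/master-password" is a hypothetical secret name.
data "aws_secretsmanager_secret_version" "aurora_master" {
  secret_id = "aurora/master-password"
}

# Customer-managed KMS key for storage encryption.
resource "aws_kms_key" "aurora" {
  description = "CMK for Aurora storage encryption"
}

resource "aws_rds_cluster" "aurora" {
  cluster_identifier = "team-aurora-cluster"
  engine             = "aurora-postgresql"
  master_username    = "dbadmin"
  master_password    = data.aws_secretsmanager_secret_version.aurora_master.secret_string

  storage_encrypted       = true
  kms_key_id              = aws_kms_key.aurora.arn
  backup_retention_period = 14
}
```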

The AWS SDK supports a good range of language-specific APIs for AWS services, and a popular choice is to use Lambda functions to make calls to the AWS APIs. However, since some RDS actions are asynchronous, it may be difficult, for example, to predict how long creating and exporting snapshots could take across all possible situations and workloads.

Another option is to use an AWS Systems Manager (SSM) Automation runbook, which is the approach described in the rest of this article.

The following actions can be used in a runbook:

  • aws:executeAwsApi: calls and runs AWS API operations.
  • aws:waitForAwsResourceProperty: pauses the automation until a specific resource property or event state is reached before continuing.
  • aws:assertAwsResourceProperty: asserts that a resource is in a specific state before a given step proceeds.

With Terraform, we can easily provision an SSM automation document from a template file, the IAM policies and role to execute the automation, and the resources that trigger the execution of the automation steps.

Implementation with Terraform

Let’s assume our database teams have requested the ability to create manual Aurora snapshots that can be kept beyond 35 days.
We will create an SSM Automation runbook template named ssm_rds_create_snap.yaml.tpl.
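A minimal sketch of what that template could contain is shown below. The `${cluster_id}`, `${db_instance_id}` and `${sns_topic_arn}` placeholders are rendered by Terraform’s templatefile function; step names, parameter names and the SNS message are illustrative:

```yaml
# ssm_rds_create_snap.yaml.tpl (sketch)
schemaVersion: '0.3'
description: Create a manual Aurora cluster snapshot
assumeRole: '{{ AutomationAssumeRole }}'
parameters:
  AutomationAssumeRole:
    type: String
    description: IAM role ARN used to run the automation
  SnapshotId:
    type: String
    description: Identifier for the manual snapshot
mainSteps:
  # Steps 1-2: validate that the cluster and instance are available
  - name: assertClusterAvailable
    action: aws:assertAwsResourceProperty
    inputs:
      Service: rds
      Api: DescribeDBClusters
      DBClusterIdentifier: '${cluster_id}'
      PropertySelector: '$.DBClusters[0].Status'
      DesiredValues: [available]
  - name: assertInstanceAvailable
    action: aws:assertAwsResourceProperty
    inputs:
      Service: rds
      Api: DescribeDBInstances
      DBInstanceIdentifier: '${db_instance_id}'
      PropertySelector: '$.DBInstances[0].DBInstanceStatus'
      DesiredValues: [available]
  # Step 3: create the manual snapshot
  - name: createClusterSnapshot
    action: aws:executeAwsApi
    inputs:
      Service: rds
      Api: CreateDBClusterSnapshot
      DBClusterIdentifier: '${cluster_id}'
      DBClusterSnapshotIdentifier: '{{ SnapshotId }}'
  # Step 4: wait until the snapshot is available, then notify via SNS
  - name: waitForSnapshotAvailable
    action: aws:waitForAwsResourceProperty
    inputs:
      Service: rds
      Api: DescribeDBClusterSnapshots
      DBClusterSnapshotIdentifier: '{{ SnapshotId }}'
      PropertySelector: '$.DBClusterSnapshots[0].Status'
      DesiredValues: [available]
  - name: notifySuccess
    action: aws:executeAwsApi
    inputs:
      Service: sns
      Api: Publish
      TopicArn: '${sns_topic_arn}'
      Message: 'Snapshot {{ SnapshotId }} of ${cluster_id} is available'
```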

Taking a look at the above SSM automation template, we see:

  • The ‘parameters’ section containing variables, such as the RDS instance ID and Cluster Identifier, passed through the Terraform templatefile function.
  • The automation steps: the first two steps validate the availability of the cluster and the RDS instance. The third step creates the snapshot, and the fourth waits until the snapshot is available. Success or failure messages are sent to an SNS topic that the team can subscribe to.

And we’d refer to the automation template in a Terraform block, like so:
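As a sketch, using the aws_ssm_document resource (the referenced cluster, instance and SNS topic resources are hypothetical names for this example):

```hcl
resource "aws_ssm_document" "rds_create_snapshot" {
  name            = "rds-create-aurora-snapshot"
  document_type   = "Automation"
  document_format = "YAML"

  # Render the runbook template, injecting the identifiers it needs
  content = templatefile("${path.module}/ssm_rds_create_snap.yaml.tpl", {
    cluster_id     = aws_rds_cluster.aurora.cluster_identifier
    db_instance_id = aws_rds_cluster_instance.aurora.identifier
    sns_topic_arn  = aws_sns_topic.rds_backup_events.arn
  })
}
```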

It’s worth noting that in order to run an automation workflow, we pass the ARN of an AWS Identity and Access Management (IAM) service role, which must be configured with permissions to execute the automation as well as permissions to invoke the RDS and SNS services.

A good starting point is to review the AWS built-in AmazonSSMAutomationRole and customise a least-privilege IAM policy that you can attach to the automation_assume_role for the creation of snapshots and exports to S3.
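One hedged sketch of such a role and policy, scoped to the snapshot actions used by the runbook (the `Resource = "*"` on the RDS statement should be tightened to specific cluster and snapshot ARNs in practice, and the SNS topic name is illustrative):

```hcl
resource "aws_iam_role" "automation_assume_role" {
  name = "rds-snapshot-automation"

  # Allow the SSM service to assume this role when running the automation
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ssm.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "snapshot_permissions" {
  name = "rds-snapshot-permissions"
  role = aws_iam_role.automation_assume_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "rds:DescribeDBClusters",
          "rds:DescribeDBInstances",
          "rds:CreateDBClusterSnapshot",
          "rds:DescribeDBClusterSnapshots"
        ]
        Resource = "*" # tighten to specific ARNs in production
      },
      {
        Effect   = "Allow"
        Action   = ["sns:Publish"]
        Resource = aws_sns_topic.rds_backup_events.arn
      }
    ]
  })
}
```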

Conclusion

Having the flexibility to look under the bonnet and tune the engine as we wish is what AWS has given us from the early days with their strong focus on APIs.
On AWS, there are several options available to teams implementing a solution, and very often there is more than one way to achieve your goals.

Sometimes, a simple, Cloud native and cost effective solution might be what your customers need, as they may prefer a resilient approach to the critical processes that handle their Aurora database data.

Written by Cloud and DevOps Engineer Laurent Allegre at Airwalk Reply.
