Managing auto-scaling of RDS on AWS using the AWS CLI and Bash

Published in

funambol-techblog

6 min readAug 1, 2019

Cloud services enable you to increase or decrease the capacity of your deployment within minutes, but how can you scale a relational database automatically keeping predictable performance and costs under control?

In this post we will show how Funambol manages the vertical scale up and scale down of all its MySQL databases running on RDS and handles peak traffic using the AWS Command Line Interface, AWS CloudWatch alarms and some simple logic that can be even written in a Bash script.

Scaling Amazon Relational Database Service (RDS)

AWS has a very informative article about Scaling Your Amazon RDS Instance Vertically and Horizontally that summarizes all the options you have to scale RDS or Amazon Aurora. You can manually perform vertical scaling with a simple push of a button and the wide selection of instance classes and types allows you to find the one that best fits your production database.

But while scaling your web applications automatically might be straight forward, using for example Amazon EC2 Auto Scaling or containers on ECS/EKS, how can you automatically scale a RDS instance to optimize the costs and allocate extra CPU & memory according to the load of your application?

There is nothing available out of the box for RDS but it’s quite easy using AWS native services — CloudWatch and SNS — to build basic autoscaling on RDS.

Wait, what about Aurora Serverless?

Aurora Serverless (a service available since summer 2018) vertically auto-scales your MySQL database out of the box but it currently has significant limitations that can be blockers for many deployments. For example, it is available for MySQL 5.6 only and it does not allow most changes to the engine configuration, with default values that can be incompatible with your deployment. No MySQL 5.7, no event scheduler, no way to connect to the database from outside the VPC: a promising new service but not a solution for every deployment just yet.

Choose carefully what to monitor!

The very first step to auto-scale a RDS instance is to define a metric in CloudWatch that is significant to monitor the traffic of your production system, whatever it is — the CPU, write IOPS, free memory or something else significant for your scenario. Let’s for example scale up the database according to the following rule:

scale up when the average CPU for latest 15 minutes is above 75%

We can as well define multiple alarms and metrics that can trigger the scaling of our database. For example, to be on the safer side, we can perform the scale down on a combination of multiple alarms:

scale down when the average CPU for the lastest 24 hours is below 15% and the average CPU for the latest 2 consecutive hours is below 30%

In this scenario we will need one new CloudWatch alarm for scaling up and two alarms for scaling down. The 15%, 30% and 75% are just examples, you can find your own values monitoring the traffic of your instance for a couple of weeks.

Creating the alarms

To create the alarms, we now assume we have a Multi AZ MySQL RDS instance called demo-rds01 and a SNS topic demo-sns.

You can of course subscribe to the SNS topic (SMS, email) to be notified by any change in the alarms or you can use it to trigger a Lambda function that directly modifies the database. As alternative, you can skip the notifications, avoid the push approach and use either a crontab on a EC2 or a scheduled Lamda function to trigger a change.

Let’s now create the three alarms we discussed above and use the AWS CLI to achieve that:

aws cloudwatch put-metric-alarm \
    --alarm-name "demo-rds01-alarm-75" \
    --metric-name "CPUUtilization" \
    --namespace "AWS/RDS" \
    --statistic "Average" \
    --period 300 \
    --evaluation-periods 3 \
    --threshold 75.0 \
    --comparison-operator "GreaterThanOrEqualToThreshold" \
    --dimensions "Name=DBInstanceIdentifier,Value=demo-rds01" \
    --alarm-actions "<arn demo-sns>"aws cloudwatch put-metric-alarm \
    --alarm-name "demo-rds01-alarm-30" \
    --metric-name "CPUUtilization" \
    --namespace "AWS/RDS" \
    --statistic "Average" \
    --period 3600 \
    --evaluation-periods 2\
    --threshold 30.0 \
    --comparison-operator "LessThanThreshold" \
    --dimensions "Name=DBInstanceIdentifier,Value=demo-rds01" \
    --alarm-actions "<arn demo-sns>"aws cloudwatch put-metric-alarm \
    --alarm-name "demo-rds01-alarm-15" \
    --metric-name "CPUUtilization" \
    --namespace "AWS/RDS" \
    --statistic "Average" \
    --period 86400 \
    --evaluation-periods 1\
    --threshold 15.0 \
    --comparison-operator "LessThanThreshold" \
    --dimensions "Name=DBInstanceIdentifier,Value=demo-rds01" \
    --alarm-actions "<arn demo-sns>"

We see that the status of the new alarms changes almost immediately from “INSUFFICIENT_DATA”:

$ aws cloudwatch  describe-alarms --alarm-names "demo-rds01-alarm-75" --query "MetricAlarms[].StateValue"
 [
     "INSUFFICIENT_DATA"
 ]

to “OK”:

$ aws cloudwatch  describe-alarms --alarm-names "demo-rds01-alarm-75" --query "MetricAlarms[].StateValue"
 [
     "OK"
 ]

as soon as CloudWatch has parsed enough data. So far so good, the database is now in available state with an average CPU below the threshold. But what should we do when the status changes to “ALARM”?

Keep it simple! Let’s support in this example only m5 instances and scale up and down between db.m5.large and db.m5.12xlarge according to the load of the database.

Lambda function or script using the CLI?

We can work with AWS Lambda. That is of course the recommended approach to avoid a single point of failure as an EC2 instance to run a script. Having already an EC2 instance running in a size one autoscaling group, we want to make instead a simpler example using the Amazon CLI. We can now develop a very simple Bash script, something similar to:

cpu_status_alarm=$(aws cloudwatch describe-alarms --alarm-names  "demo-rds01-alarm-75"  --query "MetricAlarms[].StateValue" | grep 'ALARM' | wc -l)current_instance=$(aws rds describe-db-instances --db-instance-identifier "demo-rds01" --query "DBInstances[0].DBInstanceClass" | sed 's/^"\(.*\)"$/\1/')instance_status=$(aws rds describe-db-instances --db-instance-identifier "demo-rds01" --query "DBInstances[0].DBInstanceStatus" | grep 'available' | wc -l)   rds_status_available=$(aws rds describe-db-instances --db-instance-identifier $rds_endpoint | jq .DBInstances[0].DBInstanceStatus | grep available | wc -l)if [ "$cpu_status_alarm" = "1" ] && [ "$rds_status_available" = "1" ]; then                    if [[ "db.m5.4xlarge" == "$current_instance" ]]; then                       
      new_instance_type="db.m5.12xlarge"         
   elif [[ "db.m5.2xlarge" == "$current_instance" ]]; then
      new_instance_type="db.m5.4xlarge"
   elif [[ "db.m5.xlarge" == "$current_instance" ]]; then
      new_instance_type="db.m5.2xlarge"     
   elif [[ "db.m5.large" == "$current_instance" ]]; then
      new_instance_type="db.m5.xlarge"                 
   fi         aws rds modify-db-instance --db-instance-identifier \
    "demo-rds01" --db-instance-class "$new_instance_type" \
    --apply-immediatelyfi

and add it to the crontab of the EC2 the new bash file:

*/5 * * * * /home/funambol/aws-mgmt-scripts/rds-demo-autoscaling

so the script is executed every few (5 in the example) minutes. This is a simplified example of the full code required. In a similar way we can define a function in Bash to perform the scale down. Or we can introduce further logics, for example to make sure that at most one single scale down is allowed every 24 hours or that no scale down is performed if there are long running queries. Still with just a few lines of code we are able to implement a basic and reliable (vertical) autoscaling of a production database.

What about horizontal scaling? As RDS can handle managed read replica, a similar approach can be used to add and remove read replica to the cluster.

Limitations

What’s wrong with the approach? Why is not available out of the box on RDS?

Even with a Multi-AZ RDS deployment, the failover of a RDS instance introduces DNS changes and up to 2–3 minutes of unavailability of the database. That’s a key difference between scaling Amazon RDS or Amazon Aurora versus the new Amazon Aurora Serverless where the auto-scaling is transparent to the web application, without a traditional failover between a master and a slave using a CNAME. Considering that, the tuning of the thresholds to scale up and down should be done trying to minimize the number of changes to the instance and avoid multiple downtimes every day.

Furthermore, not all instances are born equal: the simple scenario described above does not work well on T2 or T3 burstable instances where some logics on number of credits available should be added, to avoid having an almost dead database that never triggers a scale up ALARM.

Final thoughts

Do not rely only on this autoscaling approach to size your RDS: there are many other factors that can affect the configuration of your database and you should not rely on a single CPU metric. But this simple logic using CloudWatch alarms and Amazon CLI can still help you optimize the size of your RDS, keep costs under control and almost all mitigate outages in production when unexpected load affects the database.

Managing auto-scaling of RDS on AWS using the AWS CLI and Bash

Written by Renato Losio