Monitoring Amazon RDS in Production: Beyond AWS CloudWatch

May 18, 2016

Introduction

Amazon’s Relational Database Service (RDS) is one of the most popular database services in the world, used by 47% of companies on AWS according to 2nd Watch’s 2015 AWS Scorecard. In part one of this blog series, I described the top 10 challenges of monitoring Amazon RDS when dealing with larger scale production deployments.

Amazon CloudWatch is the default service Amazon provides to monitor RDS, but it has many known limitations. First, it lacks advanced analytical features such as calculated fields and dynamic alert thresholds. More powerful analytics help you monitor and alert on the most meaningful information.

Second, databases are often part of a complex system, and you often need to correlate metrics across your stack to solve problems or identify service-level trends. To get a complete view, you’ll also need to send custom metrics and events, which can take extra work to instrument and configure.

Third, you occasionally need to look back to see changes over time and to put patterns and issues into a meaningful context. This requires visibility from a month or even a year ago. CloudWatch only gives you two weeks of retention for your metrics data, which often isn’t enough to discern what’s a normal change from a performance concern.

SignalFx provides real-time cloud monitoring and intelligent alerting for all the services across your modern stack. It performs analytics on metrics as they stream from RDS, plus any custom metrics you designate, aggregated with metrics from the rest of your cloud infrastructure and services in your environment, with 13-month retention to see changes over time.

RDS Dashboard

SignalFx gives you a built-in RDS dashboard right out of the box so that you can monitor the metrics that matter to performance without the guesswork or painful trial-and-error. Here is an overview of the information we give:

Instances

Keeping track of the instances you deployed in AWS is absolutely essential to ensure system availability, performance, and cost-effectiveness.

# DB Instances

Amazon makes it really easy to add instances, but it’s expensive to leave them running when you don’t need them. For example, someone on your team might have added a read replica to run some analytics jobs and then forgotten to shut it down. Furthermore, you might have scripts in Elastic Beanstalk or other places that add new instances automatically or behave unexpectedly. It’s a good idea to set an alert to fire if the number of instances goes beyond a normal level.

# DBs by Class

One of the easiest ways to scale your database is by increasing the size of your instance. Do this too many times and, eventually, you’ll start paying big bucks for XL instances. Keep an eye on these instance types to make sure you’re not overspending.

Engine Names for All DB Instances

This gives you an overview of how many different engines your team is using. It’s nice to standardize your team on a single engine, but you’ll see it here if a team is using a different database like the newer Aurora.
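The three instance charts above come down to counting and grouping. Here is a minimal sketch in plain Python, assuming instance records shaped like the entries in the `DBInstances` list that boto3’s `describe_db_instances` call returns. The sample records, the `summarize` helper, and the `max_expected` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical sample records; only the keys used below are included.
instances = [
    {"DBInstanceIdentifier": "web-primary", "DBInstanceClass": "db.m3.xlarge", "Engine": "mysql"},
    {"DBInstanceIdentifier": "web-replica", "DBInstanceClass": "db.m3.large", "Engine": "mysql"},
    {"DBInstanceIdentifier": "analytics", "DBInstanceClass": "db.r3.xlarge", "Engine": "aurora"},
]


def summarize(instances, max_expected):
    """Count instances overall, by class, and by engine, and flag a high count."""
    return {
        "count": len(instances),
        "by_class": Counter(i["DBInstanceClass"] for i in instances),
        "by_engine": Counter(i["Engine"] for i in instances),
        # True when there are more instances than the team normally runs.
        "too_many": len(instances) > max_expected,
    }


summary = summarize(instances, max_expected=2)
print(summary)
```

With a normal level of two instances, the forgotten analytics replica pushes the count to three and sets `too_many`.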

Read Performance

Read performance is important for web applications because content is displayed more often than it is edited. Additionally, read latency directly affects user experience: the lower it is, the faster pages load. Optimizing reads can have a big impact on performance and cost-effectiveness.

These charts are percentile distributions. Percentiles are often more useful than averages because outliers can misrepresent the typical performance. The graphs plot the minimum, P10 (or tenth percentile), P50 (otherwise known as median), P90, and maximum. The most prominent color in the screenshot above is pink, which is the maximum.
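To see why percentiles can say more than an average, here is a small Python sketch. It uses a simple nearest-rank percentile, which is an assumption for illustration and not necessarily how the charts compute theirs. The latency samples are made up.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical read latencies in milliseconds, with one slow outlier.
latencies_ms = [4, 5, 5, 6, 6, 7, 7, 8, 9, 250]

summary = {
    "min": min(latencies_ms),
    "p10": percentile(latencies_ms, 10),
    "p50": percentile(latencies_ms, 50),
    "p90": percentile(latencies_ms, 90),
    "max": max(latencies_ms),
    "mean": statistics.mean(latencies_ms),
}
print(summary)
```

In this sample the single 250 ms outlier pulls the mean above the P90 value, while the median stays at 6 ms, which is much closer to what a typical request experiences.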

ReadIOPS

This shows how many disk read I/O operations per second your database performs. Reads from disk take longer than reads served from memory, so you want this number to be as low as possible. If it’s too high, consider adding more RAM to your instances. You can also add I/O capacity by switching to SSDs with provisioned IOPS storage.

ReadLatency

Read latency is the amount of time it takes to respond to a read request such as a select statement. The lower this is, the faster pages load and transactions execute. For most simple reads, you want this to be in the tens of milliseconds. If it’s too high, take a look at your slow query log to determine which queries are taking the longest and then tune them for better performance.

ReadThroughput

How many bytes per second is your database reading from disk? Read throughput will be high if your application reads large volumes of data per request or few of your reads are served from cache. It can be lower if you have smaller data sizes, complicated queries that generate high latency, or slow magnetic disks.

Write Performance

Monitoring write performance is important to make sure that updates are immediately available, or at least not far behind real-time. Also, if your database is heavily indexed, writes can be more resource-intensive. If you make heavy use of the query cache, lots of writes can raise your read latency, because each write invalidates cached results for the tables it touches.

WriteIOPS

This shows how many disk write I/O operations per second your database performs. If you need more, consider switching to SSDs with provisioned IOPS.

WriteLatency

Write latency is the time it takes to complete a write operation in the database, such as an insert, update, or delete. Low latencies are important for real-time applications. If you use read replicas, also check your ReplicaLag to make sure they are not too far behind.

WriteThroughput

The number of bytes per second that you are writing into the database is write throughput. This can be lower if you use table indexes, table locking, foreign key constraints, or slow magnetic disks.

System Metrics

CPUUtilization

On your specific instance, this is the percentage of CPU that your database is consuming. If you’re consistently hitting 100% there could be a negative impact on your read or write latency. You might benefit from larger instances with more CPUs or additional read replicas or shards.

DatabaseConnections

This is the number of database connections. Check your instance type for the limit on the number of connections allowed. It’s a good idea to set an alert to fire before you hit the limit. You might want to check the connection pool size in your application servers or add additional capacity.
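As a rough illustration of alerting before the limit, the check can be as simple as comparing the current connection count against a fraction of the limit. The `connections_alert` helper and the 80% default are hypothetical; pick a fraction that leaves your team enough time to react.

```python
def connections_alert(current, limit, warn_fraction=0.8):
    """Return True once connections reach warn_fraction of the allowed limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return current >= warn_fraction * limit


# Hypothetical numbers: 85 open connections against a limit of 100.
print(connections_alert(85, 100))
```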

NetworkReceiveThroughput

It’s important to monitor the number of bytes per second you receive through the network interface. Each instance type has a set capacity for network throughput. Keep in mind that snapshots and replication can also use up your network capacity.

NetworkTransmitThroughput

How many bytes per second do you send through the network interface? Each instance type has a set capacity for network throughput. Keep in mind that snapshots and replication can also use up your network capacity.

SignalFlow Analytics

SignalFlow is the analytics engine powering your SignalFx service, enabling charts and dashboards to populate with high metric resolution, even for more complex or derived metrics. It calculates new metrics at write time to help you better understand the current state of your system in real time. Because of SignalFlow’s unique data processing and analytics capabilities, SignalFx users can set dynamic thresholds (e.g., percentile, variance, rate-of-change) for alert detection that respond to service-level changes and trends. As a result, you can actually predict the future state of your infrastructure based on correlation across elements and historical comparison so you can proactively address performance patterns hours or days before they become critical issues.

On its own, CloudWatch only provides basic data rollups like min, max, average, sum, and sample count. Like most other metrics monitoring systems, CloudWatch limits alerts to static thresholds, which can easily become a source of dreaded false-positive alerts. With CloudWatch as one of many data sources for SignalFx, you get access to a much wider range of calculations like timeshift, integrate, and dozens more. SignalFlow can filter and aggregate across multiple dimensions. It also performs these calculations in real time, so alerts fire when they are supposed to, not some unpredictable number of minutes after the fact.
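To make the difference between static and dynamic thresholds concrete, here is a sketch in plain Python. It is not SignalFlow and not how SignalFx implements its detectors; it only illustrates one common approach, in which the threshold is the mean of a trailing window plus a multiple of its standard deviation. The function name, window size, multiplier, and sample data are all assumptions.

```python
import statistics


def dynamic_threshold_alerts(values, window=5, k=3.0):
    """Return indices whose value exceeds mean + k * stdev of the preceding window."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        threshold = statistics.mean(history) + k * statistics.pstdev(history)
        if values[i] > threshold:
            alerts.append(i)
    return alerts


# Hypothetical CPU utilization samples (percent): load doubles, settles, then spikes.
cpu = [20, 21, 19, 20, 22, 21, 20, 40, 41, 39, 40, 42, 41, 40, 95]

print(dynamic_threshold_alerts(cpu))
```

On this sample the check fires at index 7, when the load first doubles, and again at index 14, on the spike. A static threshold of 30% would keep firing for every sample after the load change, and a static threshold of 90% would miss the change entirely.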

Most importantly, SignalFx can use calculations to trigger alerts on service-level patterns, so you’re notified with enough time to fix an impending issue well before it affects performance or availability. For example, SignalFx’s built-in dashboard for RDS automatically uses trend analysis to determine disk space and the number of days of capacity you have left. Schedule time to provision more capacity with plenty of runway or fire an alert when remaining capacity hits a practical point of intervention.

In the dashboard below, you can see the amount of free storage space left to MySQL, the rate of change in the past hour, and the number of days left on disk at that rate. Thankfully, the current rate of change is low, so we have several thousand days of capacity left on the disk.

Here’s how you can set this calculation up as a SignalFlow program. Focus on a worst-case scenario in which you assume the rate of change in the last hour will continue for the next 24 hours. In the screenshot below, line A is the minimum free storage space over the past hour. Line B is the maximum rate of change over the past hour. Line C is the minimum number of days left on disk based on the rates in the last hour calculated as A/(B*24). Exclude rates of change less than zero because you should only be interested in increasing storage needs that could cause your disk to get full. Also, cap the number of days at 10,000 to make the chart easier to read. (Of course you should customize this calculation to your own needs.)
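The same arithmetic can be sketched outside SignalFlow. The Python below is an illustration under stated assumptions, not the program from the dashboard: it takes an hour of evenly spaced FreeStorageSpace samples, treats the rate of change as consumption in bytes per hour (positive when the disk is filling), and uses the hypothetical names `days_of_disk_left` and `cap_days`.

```python
def days_of_disk_left(free_bytes, cap_days=10_000):
    """Worst-case days of disk left from one hour of free-space samples.

    free_bytes: samples in time order, evenly spaced across the past hour.
    """
    if len(free_bytes) < 2:
        raise ValueError("need at least two samples")
    intervals_per_hour = len(free_bytes) - 1

    # A: minimum free storage space over the past hour.
    a = min(free_bytes)

    # Consumption per interval scaled to bytes per hour.
    rates = [
        (prev - cur) * intervals_per_hour
        for prev, cur in zip(free_bytes, free_bytes[1:])
    ]

    # Exclude rates below zero (space being freed); zero is also dropped
    # here to avoid dividing by zero.
    growth = [r for r in rates if r > 0]
    if not growth:
        return cap_days

    # B: maximum rate of change over the past hour.
    b = max(growth)

    # C = A / (B * 24), capped to keep the chart readable.
    return min(a / (b * 24), cap_days)


# Hypothetical samples taken every 15 minutes (units are arbitrary here).
print(days_of_disk_left([1000, 990, 985, 985, 980]))
```

In this example the fastest 15-minute drop is 10 units, or 40 per hour, so the worst case is 980 / (40 * 24), just over one day. A flat or growing free-space series returns the cap.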

The pattern shown above will work for any metric where you can predict the future growth. It’s best to fire alerts as you approach capacity limits, including table size limits, database connection limits, and IOPS limits for your storage, rather than when you hit them. That gives you enough time to react before the limit is reached.

Aggregated Metrics

Databases rarely operate in isolation, which is why troubleshooting and capacity planning often require you to correlate behavior with your other systems. While CloudWatch is good at collecting data from AWS systems, it doesn’t easily correlate that data with metrics from your applications and other in-house systems, which modern infrastructure monitoring requires. CloudWatch does offer a CLI and an API for custom metrics, but you still have to do the hard work of integrating.

SignalFx makes it easier with support for a variety of applications and monitoring agents including collectd, Graphite, Ganglia, and more. With validated plugins for a wide range of data sources, you can send metrics from your open-source web and application servers, your custom app components, and on-prem systems.

For example, you may want to see if an increase in traffic results in an increase in your ReadIOPS. You can see that, while the requests go up, the ReadIOPS do not. That means you can add more traffic without having to worry about running out of storage I/O capacity, at least in the near term.
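One way to put a number on "requests go up but ReadIOPS do not" is a correlation coefficient between the two series. The sketch below computes a plain Pearson correlation in Python. It is only an illustration with made-up samples, and it assumes both series are sampled at the same timestamps.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples: requests per second climbing steadily.
requests = [100, 150, 200, 250, 300]
flat_read_iops = [51, 48, 52, 48, 51]        # hovers around 50 regardless of traffic
tracking_read_iops = [50, 75, 100, 125, 150]  # grows in step with traffic

print(pearson(requests, flat_read_iops))
print(pearson(requests, tracking_read_iops))
```

A value near zero, as in the first case, suggests traffic is not driving disk reads. A value near one, as in the second, suggests storage I/O will grow with traffic and deserves a closer look.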

SignalFx gives you a more complete picture of your operating environment so you can quickly see the impact of changes, get intelligent alerts on meaningful, service-level trends, and take action when it matters most. Get started now with a free trial!

Read the original post on the SignalFx blog.