AWS Aurora’s SRE Golden Signals

Part of the How to Monitor the SRE Golden Signals Series

AWS Aurora is an increasingly popular database engine option, essentially a high-performance upgrade of AWS RDS. Fortunately for us, it also exposes very useful metrics that make it easier to monitor.

For reference, you can also check out the MySQL’s SRE Golden Signals article.

As noted in the MySQL’s SRE Golden Signals article, all the good Golden Signals require you to have a connection to the database, which is annoying and, for RDS, requires an agent or some code running on a VM somewhere.

Fortunately, Aurora provides what we need through AWS CloudWatch, so once you can pull metrics from there, you are set to go.

A nice thing, this Aurora

Mapping our signals to Aurora, we see:

  • Request Rate — Queries per second which CloudWatch has as Queries. If you need to break out read vs. write queries, there is both SelectThroughput and DMLThroughput (for inserts, updates, and deletes).
  • Error Rate — Aurora has useful metrics for some types of errors, though real SQL errors still require the Performance Schema (see below).
    For login failures, you can get the LoginFailures item, which includes users unable to log in due to reaching max connections, plus password failures (which can signal a hack attempt).
    Another useful metric is BlockedTransactions, which I think means blocked by locks, so any big rise in this means you have locking issues.
    If you turn on the Performance Schema, you can get a global error count that includes SQL, syntax, and almost all other errors returned by MySQL. This is a cumulative counter, so you need to apply delta processing to turn it into a rate. The query is:
    SELECT sum(sum_errors) AS error_count
    FROM performance_schema.events_statements_summary_by_user_by_event_name
    WHERE event_name IN ('statement/sql/select', 'statement/sql/insert', 'statement/sql/update', 'statement/sql/delete');
  • Latency — Aurora provides this directly in CloudWatch via two metrics: SelectLatency and DMLLatency. The former is probably the most important as it’s usually where you’ll see app performance issues first, so if you can only alert on one, use that.
  • Saturation — Aurora can provide disk queue depth via DiskQueueDepth, but probably not InnoDB queue depth (see MySQL’s SRE Golden Signals on InnoDB). Aurora also directly provides a metric for running or active queries, called ActiveTransactions.
  • Utilization — There are many ways Aurora can run out of capacity, but the easiest approach is to watch the underlying CPU % and I/O rates, measured by CPUUtilization and ReadIOPS. WriteIOPS is also available but is usually less indicative of problems, whereas reads will jump due to higher load or worse SQL.
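As a rough illustration of the mapping above, here is a minimal Python sketch that builds the query list you could hand to CloudWatch's GetMetricData API. The metric names come from the list above and the AWS/RDS namespace is where RDS and Aurora metrics are published. Everything else is an assumption for illustration: the function name build_metric_queries, the 60-second period, the Average statistic, the per-instance DBInstanceIdentifier dimension, and the instance name.

```python
# Golden signal -> CloudWatch metric names, taken from the list above.
GOLDEN_SIGNAL_METRICS = {
    "request_rate": ["Queries", "SelectThroughput", "DMLThroughput"],
    "error_rate": ["LoginFailures", "BlockedTransactions"],
    "latency": ["SelectLatency", "DMLLatency"],
    "saturation": ["DiskQueueDepth", "ActiveTransactions"],
    "utilization": ["CPUUtilization", "ReadIOPS", "WriteIOPS"],
}


def build_metric_queries(instance_id, period_seconds=60, stat="Average"):
    """Build a MetricDataQueries list for CloudWatch GetMetricData.

    Hypothetical helper: the period, statistic, and dimension used here
    are assumptions and should be adjusted to your own setup.
    """
    queries = []
    for signal, metric_names in GOLDEN_SIGNAL_METRICS.items():
        for name in metric_names:
            queries.append({
                # Ids must start with a lowercase letter.
                "Id": f"{signal}_{name.lower()}",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/RDS",
                        "MetricName": name,
                        "Dimensions": [
                            {"Name": "DBInstanceIdentifier", "Value": instance_id},
                        ],
                    },
                    "Period": period_seconds,
                    "Stat": stat,
                },
                "ReturnData": True,
            })
    return queries


if __name__ == "__main__":
    # "my-aurora-instance" is a placeholder instance name.
    for query in build_metric_queries("my-aurora-instance"):
        print(query["Id"], "->", query["MetricStat"]["Metric"]["MetricName"])
```

With boto3 installed and credentials configured, the resulting list could be passed as the MetricDataQueries argument of the CloudWatch client's get_metric_data call, together with a StartTime and EndTime.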

As you can see, Aurora is far easier to monitor than MySQL or even RDS, as it provides direct metrics for most of the things we care about.
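The one signal that does not come straight from CloudWatch is the Performance Schema error count, which is a cumulative counter and needs delta processing. A minimal Python sketch of that step follows. The function name counter_rate and the choice to return None on a counter reset are assumptions for illustration, not part of any library.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Convert two samples of a cumulative counter into a per-second rate.

    Times are in seconds (for example, values from time.time()).
    Returns None when no sensible rate can be computed.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        # Samples are out of order or taken at the same instant.
        return None
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, e.g. after a server restart or a
        # TRUNCATE of the summary table; skip this interval.
        return None
    return delta / elapsed


if __name__ == "__main__":
    # Two hypothetical samples of the error count taken 60 seconds apart.
    print(counter_rate(100, 0.0, 160, 60.0))
```

Skipping the interval on a reset avoids reporting a large negative or bogus spike. The next pair of samples will produce a normal rate again.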

PostgreSQL Aurora

All of the above is focused on MySQL, since that’s where our experience is, though looking at the Aurora metrics, most or all of these appear to apply to PostgreSQL as well.

Main Article on How to Monitor the SRE Golden Signals including links to other services such as Linux, PHP, Load Balancers, and more.