MySQL’s SRE Golden Signals

Part of the How to Monitor the SRE Golden Signals Series

MySQL is a key element of many, if not most, modern online systems, so its performance is of the utmost interest, especially when it’s consumed as a service such as AWS RDS.

The gold standard for MySQL monitoring is Vivid Cortex, whose SaaS platform can capture and analyze all of these signals and much more. But if you don’t use Vivid Cortex, or want to include these signals in your own core / integrated monitoring system, here is how.

All of these methods require a good MySQL connection, as there is no good way to get the signals without one (though you can get some data from the logs, and Vivid Cortex uses a protocol analyzer). Please make sure your monitoring user has very limited permissions.

Getting MySQL’s Golden Signals varies in complexity based on the version you are running, and on whether you run MySQL yourself or use RDS. I apologize in advance for all the pieces & parts below.

A tricky thing, this MySQL

Note that MySQL Latency is fundamentally hard to get accurately, and there are tricky issues with some of the data, including how the Performance Schema works. The methods below outline a couple ways; there are several others, all of which have pros and cons.

None are great, but they give general latency you can use to know how things are doing. Be aware the methods below will not capture prepared statements or stored procedures (both of which are probably rare for most people, depending on your programming language, app server, and framework).

What about RDS?

Everything below should apply to AWS MySQL RDS as long as the Performance Schema is on — you must turn this on via your AWS RDS Instance Parameter Group and then sadly restart your DB.

Mapping our signals to MySQL, we see:

  • Request Rate — Queries per second, most easily measured by summing MySQL’s status variables com_select, com_insert, com_update, and com_delete, then applying delta processing to that sum to get queries per second.
    Different monitoring tools get these differently. Usually via SQL like:
    SHOW GLOBAL STATUS LIKE 'com_select';
    SHOW GLOBAL STATUS LIKE 'com_insert';
    Or you can sum all four at once to get a single number:
    SELECT sum(variable_value)
    FROM information_schema.global_status
    WHERE variable_name IN ('com_select', 'com_update', 'com_delete', 'com_insert');
    You can also monitor the status variable Questions, but this includes lots of extra non-work things that add noise to your data. Less useful.
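The delta processing mentioned above can be sketched in a few lines. This is an illustrative helper, not from any particular monitoring tool; it assumes you sample the summed com_* counters on a fixed interval:

```python
def rate_per_second(prev_count, curr_count, interval_seconds):
    """Turn two cumulative counter samples into a per-second rate.

    MySQL status counters reset to zero on a server restart, so a
    negative delta means the sample should be discarded.
    """
    delta = curr_count - prev_count
    if delta < 0 or interval_seconds <= 0:
        return None  # counter reset (restart) or bad interval
    return delta / interval_seconds

# Two samples of the summed com_* counters, taken 60 seconds apart:
qps = rate_per_second(1_200_000, 1_260_000, 60)  # -> 1000.0
```

The same function works for any of the cumulative counters discussed below.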
  • Error Rate — We can get a global error rate, which includes SQL, syntax, and almost all other errors returned by MySQL. We get this from the Performance Schema, using the per-user event table instead of the global event table, because we truncate the latter below for latency measurements, and the user table is small enough to be fast and not to auto-purge due to size. This is a counter, so you need to apply delta processing. The query is:
    SELECT sum(sum_errors) AS error_count
    FROM performance_schema.events_statements_summary_by_user_by_event_name
    WHERE event_name IN ('statement/sql/select', 'statement/sql/insert', 'statement/sql/update', 'statement/sql/delete');
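sum_errors is also a cumulative counter, so the same delta treatment applies. One useful derived signal is the error ratio, errors per query over the interval; a sketch with illustrative names:

```python
def error_ratio(prev_errors, curr_errors, prev_queries, curr_queries):
    """Fraction of queries in the interval that returned an error."""
    error_delta = curr_errors - prev_errors
    query_delta = curr_queries - prev_queries
    if error_delta < 0 or query_delta <= 0:
        return None  # counter reset (restart) or an idle interval
    return error_delta / query_delta

# 30 new errors across 60,000 new queries -> 0.05% error ratio:
ratio = error_ratio(10, 40, 100_000, 160_000)  # -> 0.0005
```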
  • Latency — We can get the latency from the Performance Schema. As noted above, there are complications to getting decent data:
    We need to use the events_statements_summary_global_by_event_name table, as data from the other potentially useful tables (e.g. events_statements_history_long) is deleted automatically when a thread/connection ends (which is very often on non-Java systems like PHP). And the oft-mentioned digest table is mathematically difficult to use due to overflows and complex clock/timer calculations.
    But the events_statements_summary_global_by_event_name table is a summary table, so it includes data from the server start. Thus we must TRUNCATE this table whenever we read it — this is an issue if you have other tools using it, but it seems this cannot be avoided.
    The query is below and gets latency for SELECTs only, as getting data for all query types unfortunately involves much more complex math or queries. There are two statements: one to get the data, and one to then truncate the table:

    SELECT avg_timer_wait/1e9 AS avg_latency_ms
    FROM performance_schema.events_statements_summary_global_by_event_name
    WHERE event_name = 'statement/sql/select';

    TRUNCATE TABLE performance_schema.events_statements_summary_global_by_event_name;
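The read-then-TRUNCATE sequence can be wrapped in a single sampling function. This is a sketch assuming any DB-API 2.0 cursor (e.g. from PyMySQL or mysql-connector); the FakeCursor is a demonstration stand-in only. Note avg_timer_wait is reported in picoseconds, so dividing by 1e9 yields milliseconds:

```python
TABLE = "performance_schema.events_statements_summary_global_by_event_name"

def sample_select_latency_ms(cursor):
    """Read average SELECT latency in ms, then reset the summary table.

    avg_timer_wait is in picoseconds; dividing by 1e9 gives milliseconds.
    """
    cursor.execute(
        f"SELECT avg_timer_wait/1e9 FROM {TABLE} "
        "WHERE event_name = 'statement/sql/select'"
    )
    row = cursor.fetchone()
    cursor.execute(f"TRUNCATE TABLE {TABLE}")  # reset for the next interval
    return float(row[0]) if row and row[0] is not None else None

class FakeCursor:
    """Demonstration stand-in for a real DB-API cursor."""
    def execute(self, sql):
        self.last_sql = sql
    def fetchone(self):
        return (2.5,)  # pretend the average SELECT latency is 2.5 ms

latency_ms = sample_select_latency_ms(FakeCursor())  # -> 2.5
```

With a real connection, you would pass `connection.cursor()` instead of the fake, and call this once per monitoring interval.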

  • Saturation — The clearest saturation signal is queue depth, which unfortunately is hard to get from MySQL. So:
    The best way is to parse the SHOW ENGINE INNODB STATUS output (or the InnoDB status file). We do this in our systems, but it’s messy (see the concurrency setting below). If you do this, the two values you want are “queries inside InnoDB” and “queries in queue”. If you can only get one, get the queue, as that tells you InnoDB is saturated.
    Note the above output depends on innodb_thread_concurrency being greater than zero. A non-zero setting used to be very common, but with recent versions people are setting it to 0, which disables these two stats in the status output; in that case you have to use the global Threads_running below.
    If innodb_thread_concurrency = 0, or you cannot read the status output, we have to look at the more easily obtainable THREADS_RUNNING global status variable, which tells us how many threads are actively running (i.e. not sleeping), though we can’t see whether they are queued for CPU, locks, IO, etc. This is an instantaneous measurement, so it can be quite noisy; you may need to average it over several monitoring intervals. Two ways to get it:
    SHOW GLOBAL STATUS LIKE 'Threads_running';
    Or:
    SELECT sum(variable_value)
    FROM information_schema.global_status
    WHERE variable_name = 'THREADS_RUNNING';
    Note Threads_connected itself is not a good measure of saturation, as it’s often driven by saturation issues on the client, such as PHP processes (though that can also be caused by a slow DB).
    You can also compare Threads_connected to the connection limit (max_connections), but this is really only useful when the system is out of connections, which should never happen (modern servers should set the max higher than any possible client count, e.g. 8192+) and is an emergency, as the DB is refusing connections.
    On AWS RDS, you can also get DiskQueueDepth from CloudWatch, which may be helpful if you have IO-limited workloads. For non-RDS, see the Linux Section for more IO info.
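The averaging over several monitoring intervals suggested above for the noisy Threads_running gauge can be done with a small rolling window; a minimal sketch (class and window size are illustrative):

```python
from collections import deque

class RollingAverage:
    """Average of the last `window` samples of a noisy gauge."""
    def __init__(self, window=5):
        self.samples = deque(maxlen=window)  # drops the oldest sample

    def add(self, value):
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)

smoother = RollingAverage(window=3)
# Successive Threads_running samples of 2, 12, 4 average to 2.0, 7.0, 6.0
```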
  • Utilization — There are many ways MySQL can run out of capacity, but it’s easiest to start with underlying CPU and I/O capacity, measured by CPU % and IO utilization %, both of which are a bit tricky to get (and to be taken with a grain of salt). See the Linux Section.
    One useful Utilization measurement we use is row reads, which after delta processing gives you Row Reads Per Second. MySQL can do up to about one million per CPU core, so if your DB has little IO (data in RAM), this can give you a good sense of utilization, though using Linux CPU is more accurate if you can get it and combine it with MySQL data. For row reads, two ways:
    SHOW GLOBAL STATUS LIKE 'Innodb_rows_read';
    Or:
    SELECT variable_value FROM information_schema.global_status
    WHERE variable_name = 'INNODB_ROWS_READ';
    On AWS RDS, you can also get CPUUtilization from CloudWatch, which may be helpful if you have CPU-limited workloads.
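The row-reads rule of thumb above can be turned into a rough utilization fraction. The ~1M rows/sec/core ceiling is the approximation stated above, and the function and constant names are illustrative:

```python
ROW_READS_PER_CORE = 1_000_000  # approximate per-core ceiling noted above

def row_read_utilization(prev_rows, curr_rows, interval_seconds, cpu_cores):
    """Estimate utilization as a fraction of the row-read ceiling."""
    delta = curr_rows - prev_rows
    if delta < 0 or interval_seconds <= 0:
        return None  # counter reset (restart) or bad interval
    rows_per_second = delta / interval_seconds
    return rows_per_second / (ROW_READS_PER_CORE * cpu_cores)

# 120M rows read over 60s on a 4-core instance -> 2M rows/s -> 0.5
util = row_read_utilization(0, 120_000_000, 60, 4)  # -> 0.5
```

Treat the result as a rough gauge only; as noted, actual Linux CPU data is more accurate when available.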

As you can see, MySQL is challenging and there are many more items you can read, both to get the above signals, and to get a broader sense of DB health for alerting, tuning, and so on.

Next Service: Linux Servers