Graphing MySQL Query Histograms

Vincent Werner
AVM Consulting Blog
2 min readMay 26, 2024

One of the best features within MySQL performance schema is query histograms, and much like in photography, they provide a great deal of information that can be used to improve the effectiveness of optimization efforts. There is one problem though, the data is stored in a way that is hard to assess directly and needs to be transformed first.

The most effective transformation from the raw data stored in the two histogram tables is to convert them to a graph.

Histogram tables

In MySQL, we have two histogram tables inside performance_schema:

events_statements_histogram_by_digest
events_statements_histogram_global

These two tables collect the histogram information of queries (histogram_by_digest) and the aggregated histogram of all queries run in the database, regardless of schema, which might be less than ideal if we have multiple schemas in the database.

These tables use the same fields to represent the histogram, with histogram_by_digest adding extra fields to identify the query. The data we need is stored as:

BUCKET_NUMBER            Bucket number
BUCKET_TIMER_LOW Minimum picoseconds for the bucket
BUCKET_TIMER_HIGH Maximum time (not included) for queries in bucket
COUNT_BUCKET Queries in the bucket
COUNT_BUCKET_AND_LOWER Queries accumulated
BUCKET_QUANTILE Quantile or percentile represented including this

We can use just two fields to draw the histogram:

SELECT bucket_timer_high/100000000 time,
count_bucket calls
FROM events_statements_histogram_global
ORDER BY bucket_number ASC;

The above query produces the data we need sorted by time and converted to milliseconds.

Assuming we have the data collected in an array of dictionaries with both fields, we can use matplotlib to draw:

..
import matplotlib.pyplot as plt
..
# NOTE: This code is adapted from a script, and only shown as example
def create_graph_histogram(data, title):
plt.figure(figsize=(16,12))
yval = [row['calls'] for row in data]
xval = [str(round(row['time'],2)) for row in data]
yval.pop()
xval.pop()
plt.xlabel(f"Layency in ms between 0 and {max(xval)} ms")
plt.ylabel("Calls")
plt.title(f"Latency Histogram for {title}")
plt.gca().xaxis.set_major_locator(plt.MaxNLocator(10))
plt.plot(xval,yval)
plt.savefig(f"{title}.png")

This generates a graph like:

A graph showing the latency histogram of a specific digest

The above code makes some decisions to improve the visualization of the graph:

  • Removes the last data point as that one covers a very large chunk of time that is empty in most cases (xval and yval.pop)
  • Limits the ticks in the x axis to 10, otherwise matplotlib will try to use all data points, leading to weird graphs.
  • It sets a relatively large size for the graph so there should be enough place for the tics and legends.

Additionally, there are other optimizations we can add to make the graph more expressive:

  • Use bucket_quantile to only draw until 99.5 of the calls have been drawn (adding bucket_quantile < 0.995 to the previous query)
  • When drawing a histogram for a query, we can also gather the explain plan of the query, as well as other data from the performance schema.

--

--