Identifying Database Performance Issues

Andrew DeAngelis
Paperless Parts Tech Blog
7 min read · May 17, 2024

Part 1: Monitoring Red Herrings

Monitoring is a fundamental way to assess and manage the performance of an application, and there are myriad tools available for the job. At Paperless Parts, we use Datadog and AWS CloudWatch as our primary performance monitoring tools. While these are powerful and insightful platforms, they are not without their issues. In this article we walk through an example of a database performance issue and show how easy it is to be misled when identifying its root cause.

Our example is an IOPS spike that drove a slowdown in database queries. Slower queries, in turn, slowed the services issuing them, resulting in an overall slowdown of our system.

We start by looking at our Datadog metrics. The image below shows a chart of the amount of time our Postgres database spends serving each requesting service. The bars are broken down by the actions a service performs and aggregated into time windows. This view is helpful for identifying when and where problematic queries may occur.

Looking only at the chart above, one could think the largest segments in the bars (i.e. service requests to Postgres) are driving the IOPS issue. You could even partially confirm this suspicion by looking at the traces that make up the largest segments. In the image below, we can see a trace with two queries that are quite long; the trace’s overall resource execution time sits at the 98th percentile (p98). So, we might have a couple of queries that are no longer efficient given the size or structure of the data we are working with. Is this the cause of our database performance issue? Maybe.

However, if we look at the timing of the IOPS spike, we can see the spike precedes the longer-running operations. In the image below, our read operations jump to over 5,000 operations per second at around 12:52, a few minutes earlier than the trace we evaluated above (12:55). So, it appears the IOPS spike was the cause of the longer operation times we observed, not the other way around.

To better understand the cause of the IOPS spike, we can use AWS CloudWatch. CloudWatch provides more granular information about database activity, such as wait events. A wait event occurs when a database session is waiting for a resource to become available or for an operation to finish. The image below shows which wait events our database sessions experienced during the IOPS spike.

We see a large increase in IO:DataFileRead and LWLock:BufferIO wait events. IO:DataFileRead indicates a session is waiting on data to be read from disk. LWLock:BufferIO indicates one database process is waiting on another that is reading the same page into the shared buffer; it must wait for that I/O to finish before it can access the page.
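
If you have direct access to the database, you can also watch these wait events live rather than only through CloudWatch. The query below is a minimal sketch, assuming a role with access to pg_stat_activity; it lists the sessions currently waiting on DataFileRead or BufferIO.

-- List sessions currently waiting on disk reads or buffer I/O, the same
-- wait events CloudWatch reports as IO:DataFileRead and LWLock:BufferIO.
SELECT pid,
       wait_event_type,
       wait_event,
       state,
       now() - query_start AS running_for,
       left(query, 80)     AS query_snippet
FROM pg_stat_activity
WHERE wait_event IN ('DataFileRead', 'BufferIO')
ORDER BY running_for DESC;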

The IO:DataFileRead waits build up at the same time we see IOPS increase. So, our root cause is likely a query that ran at the beginning of that build-up. We can adjust this view to see the SQL queries behind the wait events. In the image below, a SELECT statement on part_feedbackitems is the predominant driver of IO:DataFileRead waits as IOPS increases. Was this query the cause of the IOPS spike? Possibly.

The queries begin right as IOPS starts to increase and continue through the majority of the rise. The timing lines up, and these queries make up the majority of the disk-read wait events at the onset of the spike. However, much like the red herring above, we want more information before we consider this the cause of the issue.

We can use the Dimensions view in CloudWatch to get more information about each of these SQL queries. In the image below we see attributes of each query, like the number of calls per second, block hits, block reads, etc. This gives us a better understanding of how expensive each query is; a high number of block reads, for example, indicates the process is reading a lot from disk. The most expensive operation in terms of disk reads is the SELECT COUNT: a single call makes 1,907 block reads from disk alongside 169,375 block hits from cache, with an average latency of 879 milliseconds per call. That is an expensive query.
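
The Dimensions view is essentially surfacing counters Postgres also tracks itself. As a rough cross-check, assuming the pg_stat_statements extension is enabled (the column names below follow PostgreSQL 13+, where total_exec_time replaced total_time), you can rank statements by how much they read from disk.

-- Rank statements by shared blocks read from disk, with cache hits and
-- average latency (in milliseconds) alongside.
SELECT calls,
       shared_blks_read AS block_reads,
       shared_blks_hit  AS block_hits,
       round((total_exec_time / NULLIF(calls, 0))::numeric, 1) AS avg_latency_ms,
       left(query, 80)  AS query_snippet
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 10;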

But what about our SELECT statement on part_feedbackitems? We only see “-” characters in each of the columns; AWS shows a “-” when the query has only a single sample during that period. We can go back into Datadog and search for the query during the time of the IOPS spike to evaluate its latency. Unfortunately, we cannot get the same level of detail in Datadog as we can in CloudWatch, but we can still observe overall latency. In the image below, the latency over the same time period only spikes to around 7 milliseconds, and there are only 52 total requests. So, it’s unlikely this query is driving the IOPS spike: it is relatively quick, and there aren’t many of them.

Let’s dive deeper into the COUNT query we observed above. We can copy the query from AWS and run EXPLAIN ANALYZE on it. The image below shows a portion of the results. One of the most problematic parts of the plan is a scan over 2.8 million rows with an estimated row width of 63 bytes, meaning we pull over 170 MB (2.8M × 63 bytes) of data just scanning the table. Multiple table joins and other operations account for additional memory use on top of that. This certainly looks like a candidate for contributing to excessive IOPS, especially if most of those pages are not already in the shared buffer cache.
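
If you are following along with a query of your own, adding the BUFFERS option makes the disk-versus-cache behavior explicit in the plan output. The shape below is only illustrative; the table and join names are placeholders, not the actual COUNT query.

-- Illustrative only: placeholder tables and filters, not the real query.
-- In the output, "shared read" blocks came from disk and "shared hit"
-- blocks came from the buffer cache; at 8 kB per block, those counts
-- translate directly into I/O volume. For scale, the real plan's scan of
-- 2.8M rows at a 63-byte width is roughly 176 MB of data.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM some_large_table t
JOIN related_table r ON r.some_large_table_id = t.id
WHERE t.created_at >= now() - interval '90 days';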

The COUNT query appears to be our problem. To confirm, we want to see when the queries occurred and how that lines up with the IOPS issue. In the image below, one of the queries occurs at 12:51:43 (AWS uses UTC), the exact time our IOPS spike began. Given how expensive the query can be and when it runs, it is a likely candidate for our IOPS issue.
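
One way to make this kind of timestamp alignment easier next time is to have Postgres log slow statements as they finish. Below is a minimal sketch for self-managed Postgres; on RDS the equivalent parameter is set through the DB parameter group rather than ALTER SYSTEM, and the resulting log lines are shipped to CloudWatch Logs.

-- Log any statement slower than 500 ms along with its duration, so slow
-- executions can be lined up against the IOPS chart by timestamp.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();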

It’s important to note we have queries before and after the highlighted time, as well as many other queries running at the same time, all of which can contribute to IOPS increases. For example, an expensive query like this could evict much of the shared buffer cache, forcing other queries to read from disk instead. To truly confirm the diagnosis, we will need to tune the query and see what impact that has (coming in Part 2).
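
The cache-eviction theory can also be spot-checked. Assuming the pg_buffercache extension is installed, the query below (adapted from the standard example in the Postgres documentation) shows which relations currently occupy shared buffers; running it before and after the expensive query gives a sense of how much of the cache gets displaced.

-- Which relations are holding shared buffers right now?
SELECT c.relname,
       count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;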

That concludes our investigation in Part 1. We hope you were able to see how monitors and metrics can be misleading; it is easy to mistake the impact of an issue for its cause. To identify the root cause, we often need several tools. Here, our Datadog metrics were a good resource for identifying when and where we had an issue, but they didn’t give us granular enough data to pinpoint the root cause. Leveraging AWS CloudWatch, we got lower-level data like block reads and hits. Using SQL’s EXPLAIN ANALYZE, we got a clear story about what the query was doing. Finally, going back to AWS CloudWatch, we were able to align an execution of the query with our IOPS spike. In Part 2, we’ll examine how to tune this query and assess whether doing so alleviates the IOPS issue.
