Analyze wait statistics using QPI library
Wait statistics are information that might help you to understand why the SQL query duration is long and identify the queries that are waiting for something in database engine. As an example SQL query might be executing slower if it needs to wait for some resource (page that should be loaded into the memory), lock that should be released, or simply because Database Engine cannot process requests due to some bottleneck (e.g. already using all CPU or entire IO bandwidth to files).
If you believe that your queries are running slower than expected, you would need to see what are they waiting for.
In this post, I will show to you how to identify why the workload is waiting and what are the queries that are waiting on some resources.
Azure SQL Database Managed Instance enables you to find why the queries are waiting for some resources using the following views:
- sys.dm_os_wait_stats that returns instance level wait statistics
- sys.dm_exec_session_wait_stats that provide wait statistics for the session
- sys.query_store_wait_stats that returns query/plan-level wait statistics.
These information can be found using DMO/Query store. However, to make the the analysis easier, I’m using free/open-source Query Performance Insight library. This library is not a prerequisite, but it makes analysis easier because it has predefined views that join all necessary Query Store views to fetch the information. Since this library is open-source you can copy-paste the query examples from the views and use them even without installing the whole library.
In addition, it takes the snapshots of the wait stats and enables you to see wait statistics in the past.
To install QPI library go to the installation section and download the SQL script for your version of SQL Server (it supports Azure SQL/SQL Server 2016+ because it depends on Query Store views).
Analyze wait statistics
First thing that you need to do is to take a snapshot of the wait statistics or at least reset them because sys.dm_os_wait_stats collects wait statistics since the instance start or the last time you reset the stats. In QPI, you can use the following procedure to reset the wait statistics:
This procedure will reset wait statistics on your instance and the Managed Instance will start collecting new wait statistics.
While your workload is running, you can read the wait statistics values from the qpi.wait_stats view and sort them by the wait time in seconds in order to get the dominant wait times:
This view returns wait statistics and also their categories. Here you can see that the main wait statistic on my instance is INSTANCE_LOG_RATE_GOVERNOR categorized as Log Rate Governor category. Managed Instance enables you to insert up to 22MB/s in General Purpose tier and 48MB/s in Business Critical tier per instance (check the documentation because these characteristics might be changed in the future), so REBUILD INDEX operation that generates more data than the log can accept is waiting for available log throughput.
If you don’t know what is some wait statistic, you can follow the URL in info column and go to sqlskills site to find more details.
The category is important because if you want to find the queries affected by the wait type you need to use the category information to identify the queries that wait for this category.
There is a mapping logic between wait types and wait categories that is documented on SQL Server documentation and these rules are added in this library. This is important because this is the only link between global wait statistics and query store wait statistics.
If you want to see top queries affected by this wait category, you can use qpi.query_wait_stats view and filter the wait statistics per category:
In this case you can see that I’m hitting log rate limit on the Managed Instance and that this is probably caused by rebuild index statement. If you have multiple databases, you need to run this query in each of them because Query Store is configured per database.
If you think that IO is causing performance issues you should take a look at this post that explains how to measure and troubleshoot IO performance.
You can put any other wait statistic category in this query to find other queries that are affected by the wait type.
These views can enable you to easily troubleshoot the issues on your instance and drill down to the queries that might cause problems or the queries that are affected by some wait type.
QPI is open-source library that is not official part of SQL Server/Azure SQL Db scripts. If you find any issue in this library you can file a bug or submit the PR with the change.