How to identify why workload performance on Azure SQL Managed Instance are different than SQL Server?
When you migrate your databases from SQL Server (on-premises or Azure VM) to Azure SQL Managed Instance, the first thing that you might want to do is to compare the performance of your database in new environment with the performance of the original database on the source SQL Server.
In some cases, you might see different performance of some queries or entire workload on Managed Instance compared to the performance of the source SQL Server. Managed Instance is fully-managed SQL Server database engine with built-in management operations that guarantee that automatic backups will be always taken and that your database will be 99.99% available, and some of the settings and configuration might cause performance differences.
In this article you can find some guidelines that can help you to identify the root causes of the performance differences and to optimize performance of your Managed Instance.
Compare environment settings
The first thing that you need to do is to verify that your on-premises SQL Server and Managed Instance and databases that you migrated have the similar settings and characteristics. Check do you have the same number of cores (Managed Instance uses hyper-threaded 2.4GHz processors), similar amount of memory (5.1GB/core on Managed Instance), etc.
There is also a large set of properties that might be different and that can cause the performance difference such as recovery model, compatibility level, enabled encryption, etc. In this story you can find a script that compares the settings on your SQL Server and Managed Instance as shown in the following figure:
In this case, different MAXDOP and encryption might cause the performance differences. It would be good to align the settings in order to make apple-to-apple comparison.
You don’t need to take some immediate action whenever you see some difference or assume that any difference listed here can the cause the performance issue. If possible, try to modify the settings on Managed Instance to match SQL Server in order to make apple-to-apple comparison. If this is not possible, just keep in mind these differences because they might indicate the issue in correlation with some of the findings in the following sections.
There might be a lot of different settings that are returned in this report, but you should try to find some of them that are different and that can impact performance either on Managed Instance of SQL Server.
Identify potential issues
Make sure that your Managed Instance and SQL Server do not have some underlying issues that might cause the performance problems. Here are some scripts that you can use:
- Check sys.dm_db_tuning_recommendations view or open Query Store/Top Regressed Queries report in SQL Server Management Studio to see are there some SQL query plan regressions. We are recommending to always enable AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN) on your Managed Instance databases.
- Run community tools such as SQL Tiger BPCHECK or First Responder Kit sp_Blitz to identify the potential issues such as a lot of plan compilations, query plans with timeouts, a lot of virtual log files, etc. Use different and multiple tools if possible because any tool might help you to discover some potential issue.
If there is some issue on Managed Instance that don’t exists on SQL Serveror vice versa — this might be a reason for the performance differences.
Try to resolve as much as possible issues before continuing investigation and re-running performance tests.
If you are using General Purpose Managed Instance, make sure that you apply best practice to optimize storage.
Check CPU/Memory usage
If you are sure that you are comparing apples-to-apples and that there are not hidden issues either on Managed Instance or SQL Server, check the CPU and memory usage both on SQL Server and Managed Instance
You can use Server/Performance dashboard reports in SQL Server Management studio, track CPU usage in Azure portal, or use
master.sys.server_resource_stats view to monitor CPU usage on Managed Instance.
The increase of CPU usage might be expected if you see things like encryption (TDE) in the environment differences because encryption uses additional CPU to encrypt the pages (although the differences should not be big). Sometimes it is important to correlate differences from the first step with the findings in the following sections.
Managed Instance is using 5.1GB/core RAM and this might not match your SQL Server environment memory. However, this might not be the key factor for the performance difference. You should check is there a difference in memory usage and not total amount of memory.
One of the most common wrong metrics that you can monitor is does the Managed Instance use 100% of available memory. Some people believe that this is an issue (like hitting 100% CPU), but this is expected and desired behavior. Managed Instance should use as much as possible memory to cache the pages from disk into the buffer pool. The only case where you will not see near 100% usage of memory is the case where you have the databases much smaller that the available memory size so all of them can fit into the memory.
The better metric for memory usage monitoring is ‘Page Life Expectancy’ performance counter:
select object_name, cntr_value
where counter_name = 'Page Life Expectancy'
and object_name like '%Buffer Node%'
If you run this query both on Managed Instance and SQL Server (several time to get the average value) and the results show that the page life expectancy is smaller on Managed Instance, then maybe you don’t have enough memory.
If you want to learn more about the expected value for PLE read David Williams or Jonathan Kehayias articles, but for the quick check it would be good enough just to compare these values. You can also find more information how to identify does your Managed Instance needs more memory here.
Check why Managed Instance is waiting
If Managed Instance and SQL Server are using the same amount of resources and your queries are still slower, then you need to investigate does the workload waiting for some resources.
Wait statistics in Managed Instance are easy way to troubleshoot performance differences because they are telling you why the database engine is waiting for some resources instead of executing the workload. There are several types of wait statistics that you can analyze:
- Global/Instance-level wait statistics that shows what are the global statistics that are holding-up your workload. You can use sys.dm_os_wait_stats DMV to analyze these statistics.
- Query-level wait statistics (from Query Store) that show why individual queries are waiting.
- Session-level wait statistics using sys.dm_exec_session_wait_stats view.
You can find how to analyze wait statistics in this article. Some of the most common wait statistics that might slow the workload are:
- WRITELOG — meaning that Managed Instance is waiting to write records in log (.ldf file). This is expected wait statistic because databases are always writing log records and they need to wait for the records to be persisted. If this statistic is high, maybe your workload requires more IO/throughput on log file and in this case you can increase the log file size on the database to get better IO performance.
- PAGEIOLATCH_* statistics such as PAGEIOLATCH_SH or PAGEIOLATCH_EX that are indicating that the database engine is waiting to fetch or save data pages from memory to data file. In General Purpose tier, you might fix this issue by pre-allocating the data file.
- INSTANCE_LOG_RATE_GOVERNOR means that managed instance is balancing amount of data that you are inserting/updating and amount of data that will be backed-up. Currently, Managed Instance allows you to load up to 22MB/s on General Purpose Instance and 48MB/s on Business Critical instance. This wait statistics appears if you are reaching the the log rate cap on Managed Instance. Log-intensive operations such as REBUILD INDEX or BULK INSERT might be affected by this limit.
- HADR_ wait statistics on Business Critical (for example hadr_sync_commit) are indicating that secondary replicas are slowing down primary because they need to catch-up the changes. This wait statistics ensures that your updates are properly replicated to all redundant nodes.
Try to identify some dominant wait statistics in Managed Instance that do not appear in SQL Server and try to find what is causing the wait statistics such as IO characteristics on some files or some specific queries. Depending on the results, you should check IO file performance or analyze the queries that are causing the issues.
Check file IO performance
One of the most common reasons that are limiting workload in any database engine (in cloud or on-premises) are IO performance of the underlying storage system. Therefore, is it important to measure IO performance (latency, IOPS, throughput) on your files, especially if you see WRITELOG and PAGEIOLATCH as dominant wait statistics. You should check file IO statistics on the following databases:
- Database where you are running the queries.
- TEMPDB especially if you are using temp tables or there are spills in your query plan.
In this article you can see how to measure IO characteristics/performance of database files. In the article is explained the script that you can use to measure IO performance based on virtual io stats DVM:
In you notice that IOPS/throughput is close to the limits defined in Managed Instance documentation, then you are hitting the limits of the infrastructure where the Managed Instance is running. In addition, if you see a big difference between
write_io_latency this is indication that you are hitting the performance issues.
You can fix IO performance issues if you are hitting the limits of individual files on General Purpose instance by pre-allocation data or log files. In General Purpose instance, performance of the files depends on the file size as described below:
If you see that you are hitting the limit of the Azure remote files, you can increase the size of the file to get better performance.
NOTE: Increase the file size only if you know that this is a bottleneck. If you don’t see WRITELOG or PAGEIOLATCH wait statistics in the wait statistics analysis, increasing the file size might not help.
If the environment settings are similar and there is no significant difference in wait/io stats, it is time to investigate are there any SQL query plans that are causing the issues. The goal of this analysis is to identify the queries that are fast on SQL Server and slow on Managed Instance.
You should also do this part if you notice that there is some dominant wait statistics or CPU usage difference and if you identify the queries that are causing the difference. In this article you can find how to identify the queries that are causing dominant wait statistics.
When you identify the queries that are running differently on SQL Server and Managed Instance, the first thing you need to do is to verify that the SQL execution plans are the same.
If the SQL plans are identical then probably you are hitting some limit or there is some background job that is eating your resources while your workload is running. If the plans are different, that might be the cause of changed query performance and you should investigate why they are different.
If the plans are not the same, then one plan use more IO, CPU or some other resources.
Take a look again at the differences in environment settings and try to find some things like compatibility_level, MAXDOP, trace flags that could cause the differences.
Save the plans as
.sqlplan files and compare the .sqlplans in SQL Server Management Studio as shown below:
The things that you need do track are:
- Check are there some warnings in the plan that are causing the issues.
- Follow the thick lines in the plan and see the properties of the operators on these paths to compare the estimated and actual number of rows and other statistics, find the operators with a lot of TEMPDB spills, and other differences. If there are differences in the estimated/actual, you might need to rebuild statistics (or indexes) on the affected tables. Check IO statistics on TEMPDB because the speed of spills might depend on these parameters.
- Try to find some bigger difference in the SQL plan shape like an operator that don’t exists in SQL Server plan. It is important to investigate why Managed Instance choose different plan, and if needed you should force different plan or apply some query hint.
- Check is the .sql plan slow due to compilation timeout. Compilation timeout means that Query Optimizer cannot find the optimal plan and it is returning the best plan he can get, although it might not be the optimal one.
Sometimes you might fix the performance issue by forcing the good plan, changing compatibility level, legacy cardinality estimator, or MAXDOP. In addition, you might consider adding some indexes such as column-store indexes that can improve performance of your workload especially if you have not used them if you used older versions of SQL server.
Check background jobs
Finally, you should check is there some additional workload that is affecting performance. The scripts that are rebuilding indexes or loading data might impact performance. In addition, automatic backups on Managed Instance use some resources that might impact performance.
In this article you can see how automatic backups affected performance of HammerDB performance test:
Wait until any background job finishes to continue the performance test in order to get the reliable results, or implement similar backup process on SQL Server instance to guarantee similar reliability.
In some cases you cannot avoid background jobs. If you have Managed Instance with 100 database you will always have some backup running in the background because all of them must be backed-up in 5–10min.
Performance troubleshooting and identification of the issues that affect performance is not an easy task. In this article you can see tips that can help you to identify the reasons that might cause the performance differences and some guidelines that can help you to improve performance of your workload.