Increasing data file size might improve performance on the General Purpose Managed Instance tier
Azure SQL Database Managed Instance is a fully managed SQL Server database engine hosted in the Azure cloud. In the General Purpose architecture of Managed Instance, the SQL Server database engine runs on a stateless compute VM that contains only tempdb, while all other database files are placed on remote Azure Premium disks. This architecture ensures high availability and minimizes the chance of data loss, since the Azure Premium disks are internally replicated.
One of the characteristics of Azure Premium disks is that they provide better performance if your database files are bigger:
In some cases, you might experience performance issues because your data files are small and don't get enough IOPS/bandwidth. In that case, you can easily fix the performance issues by increasing the data file size.
In this story, you will see how to identify that performance issues are caused by IO bottlenecks on Azure Premium disks due to small data files, and how to improve the performance of your workload by increasing storage size.
In this experiment, I’m using the following tools/libraries:
- HammerDB for generating a TPCC-like workload,
- the Query Performance Insights (QPI) library for analyzing performance on Managed Instance.
HammerDB on small data files
I have restored a 100 GB TPCC database generated by HammerDB. The database has one data file smaller than 128 GB, so it gets the minimal guaranteed performance characteristics (500 IOPS and 100 MB/s).
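If you are not sure how big your data files are, a quick way to check is to query the standard sys.database_files view in the context of the database (note that the size column is reported in 8 KB pages, not bytes):

```sql
-- Run in the context of the tpcc database.
-- size is reported in 8 KB pages, so convert to GB.
SELECT name, type_desc,
       CAST(size AS bigint) * 8 / 1024 / 1024 AS size_gb
FROM sys.database_files;
```

Any ROWS file whose size_gb is below 128 falls into the lowest Premium disk performance category.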
I have configured HammerDB with 100 users running against the database, and I'm seeing the following results:
Performance of this workload varies between 10K and 20K transactions per minute, which is not great.
One thing that I can do is analyze why the Managed Instance is not running faster. Once I have installed the QPI version for Managed Instance, I can analyze wait statistics using the following queries:
exec qpi.snapshot_wait_stats; -- Take the baseline
-- <wait while HammerDB is running>
select * from qpi.wait_stats; -- Get the wait stats
Now I can see that the top wait statistics are in the PAGEIOLATCH category:
PAGEIOLATCH indicates that Managed Instance is waiting for data pages from the files placed on remote storage. This might indicate an IO storage bottleneck, and I can verify this by looking at the file statistics.
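If you prefer not to install QPI, a rough equivalent is to query the sys.dm_os_wait_stats DMV directly. Keep in mind that it returns cumulative values since the instance started, rather than the interval between two snapshots, and the benign wait types filtered out below are just a small illustrative subset:

```sql
-- Top waits accumulated since instance start (not an interval snapshot)
SELECT TOP (10)
       wait_type,
       wait_time_ms / 1000.0 AS wait_time_s,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'SQLTRACE_BUFFER_FLUSH', N'BROKER_TO_FLUSH')
ORDER BY wait_time_ms DESC;
```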
To check whether there is an issue with file IO, I can use the following queries:
exec qpi.snapshot_file_stats; -- Take the IO baseline
-- <wait while HammerDB workload is running>
select * from qpi.file_stats; -- Get the file stats
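QPI's file statistics are built on top of the sys.dm_io_virtual_file_stats DMF. A minimal QPI-free sketch of per-file latency (cumulative since instance start, so less precise than the snapshot/diff approach above) could look like this:

```sql
-- Average per-file IO latency computed from cumulative counters
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name AS file_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms / vfs.num_of_reads END AS avg_read_latency_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id;
```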
The results of the second query are shown below:
The interesting things to notice here are:
- IOPS on the data file is 489. This is very close to the 500 IOPS limit for the smallest files on Premium disks (see the table above).
- Total read and write latency is very high (around 2 seconds), which is much higher than the expected underlying storage IO latency of 3–4 ms. The difference between total latency and IO latency indicates that Managed Instance is throttling the workload and queuing IO requests in order to avoid exceeding the IO limits.
IOPS close to the disk limit and high latency are indicators that Managed Instance is trying to push data to the data files but is limited by file IO performance.
You will never get values that exactly match the limits in the table above, because the queries return average IO values over some period of time (between the snapshot and the query), while Azure Premium storage measures runtime IOPS/throughput usage and throttles requests if the limit is reached. However, values close to the limits are an indication that you have IO issues.
Pre-allocate data file
To fix this issue and get better IO characteristics from my storage, I will increase the size of the data file:
ALTER DATABASE [tpcc1000] MODIFY FILE ( NAME = N'tpcc1000', SIZE = 540 GB)
540 GB is slightly bigger than the 512 GB limit, which puts my data file in the P30 category (5000 IOPS and 200 MB/s).
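After the ALTER completes, you can verify that the new size was actually allocated; the query below assumes the same logical file name used in the ALTER statement:

```sql
-- Verify the new allocated size; size is in 8 KB pages
SELECT name,
       CAST(size AS bigint) * 8 / 1024 / 1024 AS size_gb
FROM sys.database_files
WHERE name = N'tpcc1000';
```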
When I look at HammerDB again, I’m getting the following results:
Performance of my workload increased from 10K–20K to 450K–550K transactions per minute.
If I look at my wait statistics again, I’m getting the following results:
My top wait statistic is not PAGEIOLATCH anymore; the top wait is now WRITELOG.
If I look at the file statistics again I can see the following results:
Now I'm getting 2800 IOPS on the data file and a write latency of 10–50 ms, which is much lower than in the previous case. It is also much closer to the underlying IO latency, which indicates that the workload doesn't hit the Azure Premium disk limits on the data files.
The architecture of the General Purpose tier on Managed Instance introduces a dependency on file size, where bigger files provide better performance. This might cause some unexpected performance issues if you have small files that receive a lot of IO requests, but you can easily fix the issue by pre-allocating your files. In this case, I'm getting roughly 50x better performance by increasing the size of the file, but the actual improvement will vary with your workload.
Do not immediately increase the file size just because you believe you might have a performance issue on Managed Instance. Make sure that you analyze wait statistics and file IO statistics to confirm that IO limits on data files are actually the bottleneck before you increase the file size.
If you would like to learn more about Managed Instance performance tips & tricks, see the following articles: