Running HammerDB on a TPCC96 database on a General Purpose Azure SQL Managed Instance
HammerDB is a popular, easy-to-use tool for benchmarking database performance. In the previous article, I used 100 HammerDB users to measure workload performance on a TPCC database with 5000 warehouses, and the conclusion was that in this configuration we can reach up to 200K transactions/minute, and that the main bottleneck is probably that the 40GB of memory available on an 8-core Managed Instance is not enough to cache the data of a 500GB database.
I wanted to repeat the same experiment with a smaller database that can fit into the memory of an 8-core Managed Instance, so in this experiment I'm using a much smaller, 16GB TPCC database with 96 warehouses.
The parameters of the experiment are the same as in the previous one (8-core General Purpose instance, 100 users in HammerDB). Like in the previous article, I'm using the Query Performance Insight (QPI) library to more easily measure workload performance (whenever you see a query that uses a view or procedure in the qpi schema, it is a custom view from this library).
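As an illustration, checking the instance waits with this library can be as simple as querying one of its views. This is an assumed usage sketch; the view name follows the qpi schema convention mentioned above, but check the library documentation for the exact names in your version:

```sql
-- Assumed QPI usage: read the current wait statistics snapshot.
-- The view name (qpi.wait_stats) is taken from the library's qpi schema;
-- verify it against the version of the library you installed.
SELECT * FROM qpi.wait_stats;
```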
Although I knew that I should pre-allocate and grow the data files to get better performance in TPCC workloads, I started with the original, unmodified database to see whether pre-allocation is necessary even on smaller databases. The results I get when I run 100 virtual users on this database are shown in the figure below:
Performance varies between 15K and 30K transactions/minute, which is much less than the 200K I got on the bigger TPCC5000 database, so I started investigating the performance issues.
CPU usage is between 5% and 10%, with peaks up to 20%, so there is no CPU pressure and the workload is waiting most of the time instead of working. If we analyze the wait statistics (as described in this article), we can see that PAGEIOLATCH_* waits are dominant, meaning that the Managed Instance is probably waiting for pages to be fetched from the remote storage.
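Independently of the QPI library, the dominant waits can be read directly from the standard DMV; a minimal sketch (the list of excluded idle wait types is illustrative, not exhaustive):

```sql
-- Top waits since the instance started, excluding some common idle waits
SELECT TOP 10
    wait_type,
    waiting_tasks_count,
    wait_time_ms / 1000.0        AS wait_time_s,
    signal_wait_time_ms / 1000.0 AS signal_wait_s
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'BROKER_TASK_STOP', N'XE_TIMER_EVENT')
ORDER BY wait_time_ms DESC;
```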
To verify this assumption, I'm looking at the file statistics using the scripts explained in this article:
The important thing here is that the read/write latency on the data file is 1–2 seconds, much higher than the read/write latency (<5ms) of the underlying IO system (Azure Premium disk storage) for the same file. This indicates that the remote storage itself is serving pages with the expected latency (below 10ms), but that IO requests are queued in the database engine and wait to be served. We can also see that the data file uses 459 IOPS, which is close to the 500 IOPS limit for a P10 file described below, so it is probably hitting the IOPS limit of the file on the remote storage. The database engine knows that the IOPS on this file is close to the limit, so it queues the requests in order to avoid throttling on the remote storage side.
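The per-file latency figures above can also be derived from the standard DMV. This is a sketch showing averages since instance start, not the interval-based numbers that the QPI scripts compute:

```sql
-- Average read/write latency per database file
SELECT
    DB_NAME(vfs.database_id)             AS [database],
    mf.name                              AS [file],
    mf.type_desc,
    vfs.num_of_reads + vfs.num_of_writes AS io_count,
    vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id = vfs.file_id
ORDER BY avg_write_latency_ms DESC;
```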
File performance of a General Purpose Managed Instance depends on the file size, as shown in the table below; the smallest files (less than 128GB, i.e. the P10 type) get up to 500 IOPS and 100 MB/s.
Note that the IOPS values reported by DMVs will probably never match the limit from the table above, because DMVs show average values taken between the snapshot and the SELECT query (unless you wait a very short time and manage to catch the peak usage). However, the IOPS value is close to the limit, and the large difference between the latency observed by the database engine and the underlying IO latency is a clear indication that the storage limit has been reached.
Apparently, even small databases need pre-allocation to fix performance issues under heavy load. In the following step, I will repeat the experiment with a bigger data file.
Pre-allocating data file
In the General Purpose tier, I can improve the performance of a file by increasing its size. The following statement pre-allocates the data file to 140 GB (any value greater than 128GB works), which moves it into the P15 category:
ALTER DATABASE [tpcc96]
MODIFY FILE ( NAME = N'data_0', SIZE = 140GB )
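After the resize, the new size can be verified from the database file catalog view; sys.database_files reports the size in 8KB pages:

```sql
-- Check current file sizes for the database (size column is in 8KB pages)
SELECT name, type_desc,
       size * 8 / 1024 / 1024 AS size_gb
FROM sys.database_files;
```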
Increasing the data file size immediately improved HammerDB performance from 20K to 300–450K transactions per minute:
In addition, the wait statistics confirm that the PAGEIOLATCH_* bottleneck is gone:
File IO statistics still show some latency issues, and I see that the IOPS for the data file is close to 1100 (the limit for P15), which again indicates a storage issue:
I could increase the data file size even further to improve IO performance, but it might not improve overall workload performance because PAGEIOLATCH_* no longer shows up as something slowing down the workload. Since I see WRITELOG as the dominant wait statistic, together with high latency on the log file, I will try to fix performance by pre-allocating the log file. This makes more sense because the high IO latency on the log file might indicate a more severe problem than the IO performance of the data file.
Increasing log size
Similar to the previous case, I'm increasing the size of the log file to try to get better performance:
ALTER DATABASE [tpcc96]
MODIFY FILE ( NAME = N'log', SIZE = 140GB )
This change improved performance even further, and I'm getting ~500K transactions per minute:
Looking at the wait statistics, I can see that WRITELOG is still the dominant wait statistic:
WRITELOG waits cannot be completely removed because the database engine constantly writes records to the log file, so it will always need to wait for them to be persisted there. However, the goal is to minimize these waits as much as possible.
When I look at the file statistics, I can see that there is no longer high latency on the log file, and the latency on the data file has also improved:
There is still some latency on the data file, higher than the underlying IO latency, but since PAGEIOLATCH_* is not a dominant statistic, there is probably no point in increasing the data file further. Just to be sure, I increased the data file and got better IO results (smaller latency), but without any big effect on overall HammerDB performance.
Therefore, I can conclude that pre-allocating data and log files to sizes above 128GB is good enough to get 500K transactions/minute on a 96-warehouse database with 100 HammerDB users.
Memory and performance counter analysis
To compare results with the analysis done in the previous article, where I experimented with the TPCC 5000 database, I'm also analyzing memory usage and performance counters.
With pre-allocated data and log files, the Managed Instance uses around 70% of CPU (which is much better than the initial usage of up to 20%), and this value drops to 10% while log backups are taken (I see the same performance drops when backups are taken as in the TPCC5000 article).
When looking at the memory, I can see the following results:
9GB of data is in Managed Instance memory, out of the 16GB total size of the database. I cannot explain why all 16GB are not in memory, but this is much better than the previous TPCC5000 case, where just 20GB out of 500GB was cached in memory.
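The amount of cached data per database can be estimated from the buffer descriptors; a sketch using the standard DMV (note that scanning this DMV can itself be slow on instances with a lot of memory):

```sql
-- Approximate buffer pool usage per database (each cached page is 8KB)
SELECT
    DB_NAME(database_id) AS [database],
    COUNT(*) * 8 / 1024  AS cached_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_mb DESC;
```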
In the TPCC5000 experiment I got a page life expectancy below 100 (meaning that pages are flushed from memory less than two minutes after loading), and this was one of the main reasons for the slow performance of the HammerDB workload. In this case, page life expectancy is much better than in the previous experiment:
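Page life expectancy can be read directly from the standard performance counters DMV:

```sql
-- Page life expectancy in seconds (per NUMA node and for the buffer manager)
SELECT object_name, counter_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy';
```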
Finally, to compare the performance of the workloads, I retrieved information about batch duration:
Most of the batches finish in under 2ms, which is better than in the previous case.
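One way to get a batch-duration histogram is through the Batch Resp Statistics performance counter object (available in recent SQL Server versions; this is a sketch of one way to retrieve it, not necessarily the query used above):

```sql
-- Histogram of batch elapsed times, bucketed by duration range
SELECT counter_name, cntr_value AS batch_count
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Batch Resp Statistics%'
  AND instance_name = N'Elapsed Time:Requests'
ORDER BY counter_name;
```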
Performance on the smaller TPCC database (96 warehouses) is much better than on the large database with 5000 warehouses and can reach up to 500K transactions/minute. However, the difference is not proportional to database size: the 5000-warehouse database could still handle up to 200K transactions per minute.
The main reason for the performance improvement is that an 8-core Managed Instance with 40GB of memory can efficiently cache a large portion of the database, so page life expectancy is much higher than in the previous case.
An important finding is that pre-allocation of data/log files is still necessary even for small databases if they are under heavy usage (100 concurrent users) and require high performance.