Running HammerDB TPCC 1000 test on General Purpose Managed Instance
One of the most important tests that you need to run on your database is a performance test that shows how many queries, requests, or transactions it can handle.
In the previous articles I used HammerDB to run a simulation of the TPCC workload with 100 users on an 8-core General Purpose Azure SQL Managed Instance. I used the following databases in these experiments:
- A big TPCC database with 5000 warehouses (~500GB) that could handle ~200K transactions/min. The biggest issue in this experiment was that an 8-core Managed Instance has 41GB of RAM, so only a small portion of the database could be kept in memory.
- A small TPCC database with 96 warehouses that could handle ~500K transactions/min. This database was small enough to fit into the RAM allocated to the Managed Instance.
The interesting finding in these experiments was that I kept hitting the IO limits of the files placed on Azure Premium Storage, and that I had to increase the sizes of both the data file and the log file and pre-allocate them in order to achieve optimal performance.
In this article, I’m running the same experiment with a TPCC database containing 1000 warehouses (~100GB) in order to see what the performance would be in this case. Like in the previous articles, I’m using the Query Performance Insight (QPI) library to more easily measure the performance of the workload (whenever you see a query that uses a view or procedure in the qpi schema, it is a custom view from this library).
Although I know from every previous HammerDB/TPCC experiment that I had to pre-allocate the file sizes, I’m running the first test on the original ~100GB database just to see what the performance would be if I don’t change anything.
The results that I’m seeing in HammerDB are shown in the figure below:
Without any pre-allocation of the files, I’m getting between 10K and 25K transactions/min. If I analyze the wait statistics and file statistics while HammerDB is running, I get the following results:
The dominant wait statistics are PAGEIOLATCH waits, meaning that the Managed Instance is waiting to read/write data in the data file. If I look at the file statistics, I can see high write/read latency (up to 1.5 sec), which indicates an IO throttling issue. Note that the IO latency of the remote storage itself is still fine (3–4 ms); however, many requests are sitting in the queue waiting to be processed. I can also see that I’m hitting the 500 IOPS limit, which is the maximum for the smallest class of Azure Premium Storage files (P10).
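The wait and file statistics above come from the QPI views; a minimal sketch of the same analysis using only the standard DMVs (which those views wrap) could look like this:

```sql
-- Top waits since the statistics were last cleared (qpi.wait_stats
-- wraps a similar query with snapshot/diff logic).
SELECT TOP (10) wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- Average per-file IO latency, computed from the virtual file stats DMV.
SELECT mf.name AS [file], mf.type_desc,
       vfs.io_stall_read_ms  * 1.0 / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
       vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(DB_ID('tpcc1000'), NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id;
```

Keep in mind that both DMVs are cumulative since instance startup, so for a running workload you would take two snapshots and diff them, which is exactly what the QPI library automates.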
In order to fix the IO issue, I can increase the size of the data file so that it moves into a higher storage class and gets more IOPS.
I can increase the data file size using the following T-SQL statement:
ALTER DATABASE [tpcc1000]
MODIFY FILE ( NAME = N'tpcc1000', SIZE = 140 GB)
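On a General Purpose Managed Instance, each database file is placed on an Azure Premium disk sized to the next tier boundary, so growing a file past a boundary buys more IOPS (roughly: up to 128GB is P10 with 500 IOPS, up to 256GB is P15 with 1100 IOPS, up to 512GB is P20 with 2300 IOPS, up to 1TB is P30 with 5000 IOPS). The current file sizes can be checked before and after the change with:

```sql
-- size is reported in 8KB pages, so divide by 128 to get MB
SELECT name, type_desc, size / 128 AS size_MB
FROM sys.database_files;
```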
When I executed this statement, I immediately noticed that HammerDB was processing more transactions, with the speed increasing to ~100K transactions/min:
Analysis of the wait/file stats shows that the PAGEIOLATCH bottleneck is gone:
Now I can see that my new dominant wait statistic is WRITELOG, which might imply that I now have a problem with log file IO. Looking at the read/write latency for the log file, I can see that it is very high, which is evidence of a log file IO problem.
As in the previous step, the solution is to pre-allocate the log file. Increasing the log file size to 140GB (or any value above 128GB) also improved the performance of the HammerDB workload:
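The statement is analogous to the one used for the data file. The logical log file name used below (tpcc1000_log) is an assumption; check sys.database_files for the actual name:

```sql
-- Pre-allocate the log file above 128GB to move it out of the P10 class.
-- NOTE: the logical name tpcc1000_log is a guess for illustration.
ALTER DATABASE [tpcc1000]
MODIFY FILE ( NAME = N'tpcc1000_log', SIZE = 140 GB)
```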
If I analyze the wait statistics again, I can see that PAGEIOLATCH is once more the dominant statistic. It seems that the log file can now successfully serve all requests, but communication with the data file is again the main bottleneck:
High read/write latency also confirms that the pressure is on the data file again. I tried increasing the data file size to 260GB to move to the P20 class, but without any effect: the results were the same, with similar wait/file statistics. It seems that P20 file IO performance is still not good enough for this workload.
Therefore, I have increased the data file size to 530GB in order to get P30 performance:
This change helped, and I’m now getting 400–500K transactions/min.
Analysis of the wait/file stats shows that PAGEIOLATCH is no longer the dominant statistic and that WRITELOG is again the top wait statistic.
By looking at the write/read latency on the log file, I see that the values are at the expected 4–5ms. Therefore, I can assume that WRITELOG is an expected wait statistic here, not one caused by IO performance issues.
Just to try my luck and test this hypothesis, I increased the data and log files even further, although the wait statistics were not showing file IO as an issue. However, this had no significant effect: HammerDB still runs at 400–500K tpm. IO latency is better, but it was not a bottleneck:
Looking at these results, I can see that increasing the file size makes sense only if IO is the issue. If the combination of wait/file statistics doesn’t show an IO issue, increasing the file size will improve raw IO performance but without any effect on the workload.
To summarize the final state of the workload, I also analyzed CPU/memory usage. CPU usage is around 70–80%, so it is not a bottleneck. Looking at the memory, I see the following results:
The TPCC 100GB database is using a 24GB buffer pool, meaning that ~25% of the database is always cached.
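The buffer pool usage per database can also be checked directly from the buffer descriptors DMV (note that scanning it can be slow on instances with a lot of memory):

```sql
-- Each buffer descriptor represents one 8KB page cached in the buffer pool
SELECT COUNT(*) * 8 / 1024 AS buffer_pool_MB
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID('tpcc1000');
```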
Performance counter analysis shows that the page life expectancy is around 300–400:
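The same counter can be read with a simple query against the performance counters DMV:

```sql
-- Page life expectancy, in seconds, from the Buffer Manager counters
SELECT object_name, cntr_value AS page_life_expectancy_sec
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%';
```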
Page life expectancy is acceptable and much better than the value of 70 that I got in the TPCC 5000 case. An interesting finding is that the TPCC 96 database (14GB) had a much higher page life expectancy but similar transaction throughput.
This was the third performance experiment with 100 HammerDB users on a TPCC database. The conclusion is that if the amount of RAM on the Managed Instance is enough to cache at least 25% of the database, we can expect up to 500K transactions/min (the experiments with the small and medium-size databases). A big database that cannot fit into RAM causes low page life expectancy, and performance can drop 2–3x compared to the ideal case. However, even the performance of a big database is not bad: 200K transactions/min might be a good result for a half-TB database and 100 users.
In all cases it was necessary to pre-allocate both the data and log files to get optimal performance. For the TPCC workload, the IO limits of the remote storage are the first thing that you might hit, and pre-allocation will help. Note that you should pre-allocate files only if the wait/file statistics show that there is a problem. If there are no issues with latency and wait statistics, increasing the file size might not improve performance.