Running HammerDB on TPCC5000 database on General Purpose tier of Managed Instance

HammerDB is a popular benchmark tool that simulates TPCC workload and that a lot of people use to benchmark different environment. In this article I will explain how to run HammerDB Managed Instance and analyze performance on Managed Instance.

WARNING: HammerDB is not official standard and don’t have precise specification of prerequisites for running the tests. You might get the different results if you repeat the same test that depend on the number of core of your client VM, network bandwidth between Managed Instance and VM, number of other databases on Managed Instances, etc. There are many factors that can affect the results that are outside of the scope of this article. Do not rely on the performance numbers described in this article as performance targets based on official benchmark.

Environment settings

In this experiment, I’m using the following settings on Managed Instance:

  • 8 core/40GB RAM General Purpose Managed Instance
  • TPCC database generated by HammerDB with 5000 warehouses.

There parameters are important because you will get different performance characteristics on smaller/bigger databases, or Managed Instances with more cores, and the differences would not be proportional to the size or number of cores.

In addition, I’m using open-source Query Performance Insights library that helps me to analyze performance on Managed Instance. This is not mandatory, but makes analysis easier.

In this experiment, I’m using default HammerDB settings without keying and thinking time. the only custom option is the number of virtual users that is set to 100 because I’m using 12.5 users/core and I have an 8-core instance:

In order to avoid potential problems with file size caps I’m pre-allocating data file to be greater than 512GB because this is common hidden issue in HammerDB TPCC workload.

Try to restore fresh TPCC database before your testing. I noticed that performance slowly drop after few hours of running HammerDB, so always start with the clean database.

Results

I’m running HammerDB against TPCC5000 database and as result I’m getting 160K — 200K transactions per minute:

CPU usage is around 50% so there is no CPU pressure. As another interesting metric, I’m looking at the distribution of batch requests/second from performance counters collected on Managed Instance:

select name, percentage = 100 * value/sum(value) over(partition by instance_name)
from qpi.perf_counters
where name like 'Batches%'
and instance_name = 'CPU Time:Requests'
order by name

The result of this query shows the distribution of batch response times where I can see that half of the requests finish in a less than 1 millisecond, and the most of the batches are under 2–5ms:

Distribution of batch response times

In order to find out is there something that is holding-up the workload I’m analyzing wait statistics on Managed Instance using qpi.snapshot_wait_stats procedure that cleans accumulated wait statistic, and qpi.wait_stats that returns values of the wait statistics since the last snapshot:

Wait statistics when running HammerDB on TPCC5000 database

The dominant wait statistic in this case is PAGELATCH_EX that indicates that multiple requests are competing for the same page in memory.

Although wait statistics don’t show any dominant PAGEIO wait statistic, I will analyze file statistics to see are there any IO bottlenecks. qpi.snapshot_file_stats procedure that takes a snapshot of accumulated file statistics and qpi.db_file_stats view that returns values of the file IO statistics for the current database since the last snapshot. The results are shown in the following picture:

Values for write_io_latency and read_io_latency are in expected range (5–10ms) but total latency is greater than 100ms. This is not the biggest issue since PAGEIOLATCH_* wait statistics are not dominant but it might be related to the fact that on Managed Instance IO characteristics depend on the size of data file:

IO performance depends on the file size

Although the wait statistics don’t show IO as dominant IO dominant issue, I will try to increase data file size above 1024 GB to see what would happen:

ALTER DATABASE [tpcc5000]
MODIFY FILE ( NAME = N'tpcc5000', SIZE = 1100 GB)

With this setting, my data file is moved from P30 category to P40 category and it is getting more IOPS/throughput.

As a result, I see that IOPS and throughput are doubled (from 1500 to 3000 IOPS and 18MB/s to 36MB/s), and that IO latency is additionally decreased to expected 5–10ms for General Purpose instance, which is better:

However, when I look at the HammerDB results, I’m not getting some improvements. Hammer is still working with 150K — 200K transaction per minute:

The explanation can be found in wait statistics shown below:

Wait statistic after increasing data file size

As you can see, the dominant wait statistic is still PAGELATCH_EX meaning that although I improved IO performance of the storage, this didn’t affected performance. In this case, 100 concurrent users are still competing for the memory pages and better IO in this case would not help.

I’m taking few additional metrics to get more insights in the system. By looking at master.sys.server_resource_stats view, I see that CPU usage varies between 60 and 70% so the CPU is not a bottleneck. If I look at the memory used per database I see that tpcc5000 database use most of the 25GB of the buffer pool that is available, but still it cannot get more than 24GB to cache the data, which might not be enough for a database with 500GB data.

Another interesting parameters is the value of the Buffer Node: Page life expectancy that is in this experiment always below 100:

Low page life expectancy value

This is much below common boundary value of 300 and indicates that there is a problem because expected time that page resides in memory is much lover than expected and this might indicate that there is not enough memory for this type of workload.

Impact of log backup

One interesting thing that I’m noticing is that performance of the workload drops in some periods of time and I’m getting around 40K transactions per minute.

If I look at the ongoing log backup requests using qpi.bre view I can see that there is an ongoing BACKUP LOG statement. I’m reading data from the view where I can see ETA for the end of LOG BACKUP. I left HammerDB running after I executed this query to see would the performance go back to expected 150-200K/min once the LOG BACKUP finishes:

Impact of log backup on performance

In addition to performance drop, I could see that CPU usage also drops to 10% in this period. The interesting finding here is that ETA for finishing log backup 2019–04–25 08:31:52 matches the time on the HammerDB graph where I’m getting good performances again. At 08:32 in this graph HammerDB returned back to 200K transactions per minute.

Graphs in HammerDB are not accurate because they are collecting performance information every 30 sec. This value could be decreased but then HammerDB will affect performance.

If I analyze wait statistics while the log backup is running, I could see the following results:

In this case, BACKUP LOG takes most of the IO resources and the dominant wait types are BACKUPIO and WRITELOG. Performance of the HammerDB drops and there are no queries that competes for memory pages.

If you are seeing that your performance drops, take a look at the backups and new wait statistics because this would be the common cause of performance degradation.

Conclusion

In this article, I have demonstrated how to run the HammerDB on your database hosted on Managed Instance and we can expect between 160K and 200K transactions/minute with 5000 warehouses TPCC database is the data file is properly pre-allocated.

The main bottleneck for this type of workload and environment is small amount of memory which is not sufficient for 500GB database that causes very low page life expectancy. Note that you might get different results with different size of the database, especially if you use smaller database that can fit into the Managed Instances buffer pool. Also, you might notice that log backups might have high impact on performance so you should check is there some backup running while you are executing your workload.

You might see the results of the same experiment on much smaller TPCC 96 warehouse database in this article.

More important, in this article you can find the scripts that can help you to what are the bottlenecks in your workload and how to identify what can be improved. As an example, although increasing the data file size might be beneficial for your workload, in some cases it would not help if this is not your dominant issue.