Google Cloud Platform CPU Performance in the eyes of SPEC CPU® 2017 — Part 2

Federico Iezzi
Google Cloud - Community
Feb 6, 2023

In the first part, we talked about the reasons behind all of this, what SPEC is, the benchmark suite, PKB, how to set up the system, and concluded by running a sample test.

In this second part, I’d like to dig deeper into the details of the PKB wrapper, which embeds lots of (hopefully 🤣) well-thought-through choices.

GCP Region

europe-west4, the GCP region in the Netherlands, is one of the few regions offering most (if not all) services. From a CPU standpoint, europe-west4 has it all. Although running PKB in different regions is easy, I wanted to minimize the number of variables, hence why everything runs there.

Why R̶H̶E̶L̶9̶ I mean why Rocky Linux 9?

Red Hat Enterprise Linux is one of the most (if not the most) enterprise-grade Linux distros available. It has a large ecosystem and has been the most used OS for official SPEC result submissions (well, at least for the past 3 years).

Rocky Linux is the spiritual successor of CentOS

I didn’t specifically choose RHEL and instead went for Rocky Linux: first, Rocky Linux (like CentOS back then) is open-source and designed to be 100% bug-for-bug compatible with RHEL. Secondly, CentOS was a furnace for many open-source projects, contributing to and enabling the great RHEL ecosystem. Using Rocky Linux is, IMO, a proper way to support it. The following article tries to sum all of this up:

Taking Rocky Linux 9.1 was merely a choice of going with the latest available version, built on recent technologies like the Linux Kernel 5.14, glibc-2.34, and binutils-2.35.2, and built with recent tools like GCC 11.2.
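You can verify that stack from a shell on the VM (the rocky-release file is specific to Rocky and other RHEL-family systems):

```shell
cat /etc/rocky-release       # e.g. Rocky Linux release 9.1 (Blue Onyx)
gcc --version | head -n1     # distro GCC 11.2 (not the 12.2 built later)
ldd --version | head -n1     # glibc 2.34
uname -r                     # kernel 5.14.x
```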

Building the compilers?

When performing benchmarks with the aim of really comparing things, it’s critically important to reduce the number of variables to a bare minimum. I wanted to use the most recent GCC version (12.2 at the time of this work), compiled the same way, across the board.

Using the latest (stable) GCC also brings lots of benefits around the available CPU µarch optimizations. This is admittedly not representative of the real world, but hey, I wanted to ensure the latest architectures (Intel Ice Lake, AMD Milan, etc.) get the best possible optimization, especially considering how slow AMD is on this front.
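Building the same GCC everywhere can be sketched as a vanilla upstream build (the prefix, language list, and paths are my illustration, not necessarily what the wrapper does):

```shell
#!/usr/bin/env bash
set -euo pipefail
VER=12.2.0

# fetch and unpack the release tarball
curl -LO "https://ftp.gnu.org/gnu/gcc/gcc-${VER}/gcc-${VER}.tar.xz"
tar xf "gcc-${VER}.tar.xz"
cd "gcc-${VER}"

# pulls matching GMP/MPFR/MPC/ISL sources into the tree
./contrib/download_prerequisites

mkdir build && cd build
../configure --prefix="/opt/gcc-${VER}" \
             --enable-languages=c,c++,fortran \
             --disable-multilib
make -j"$(nproc)"
make install
```

The same binary toolchain then compiles SPEC on every machine type, keeping the compiler out of the list of variables.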

Which flags for SPEC CPU?

Here, I have to admit, I was heavily inspired by Dr. Ian Cutress and Andrei Frumusanu, both former AnandTech editors and cornerstones of the performance analysis world.

For the base executions, I also wanted to ensure a level playing field for (I have to admit here) most CPUs:

  • x86-64 architectures prior to Intel Haswell: -Ofast -fomit-frame-pointer -march=x86-64-v2 -mtune=corei7-avx. This effectively enables x86 optimizations for a core equivalent to Sandy Bridge (an Intel architecture from 2011).
  • x86-64 architectures from Intel Haswell onwards: -Ofast -fomit-frame-pointer -march=x86-64-v3 -mtune=core-avx2. This effectively enables optimizations for a core equivalent to Haswell (the first Intel architecture providing AVX2). Bear in mind that until AMD Zen4 (Genoa), AVX512 was not available from both vendors, and even worse, AVX512 is deeply fragmented within Intel. Linus Torvalds had an opinion about it 🤣 a few years ago:
Linus Torvalds: ‘I Hope AVX512 Dies a Painful Death’
  • AArch64: -Ofast -fomit-frame-pointer -march=armv8.2-a

For the peak executions, all CPUs instead get -Ofast -march=native -flto, enabling all possible optimizations available (hence why it’s important to have a recent compiler).

How many times to run a benchmark?

Certainly not once, and twice could be biased, so three seems a good option. Of course, the more the better, but there is also a limit to what my credit card can take 😭

With three iterations, some architectures take as long as 33 hours to complete, while a few take just over 10 hours. On average, from start to finish, we need about 15 hours for the single-thread results and 25 hours for the multi-thread scenario.

Where are my results?

From this point of view, the PKB wrapper is a bit of a garage solution.

The root directory is /root/tmp, to which the CPU architecture name is appended, like pkb.n1-snb for Intel Sandy Bridge or pkb.n2-clx for Intel Cascade Lake. Two sub-directories are created, respectively, for the single-thread (1T) and multi-thread (nT) executions.

If you’re looking for the results during the execution, connect to the target system (IAP is enabled), and you will find them at /scratch/cpu2017/result.

Why 16 vCPU?

Because 16 vCPU will be bound to 8 physical cores and 8 is the luckiest number in China 😆

All jokes aside, in GCP every CPU architecture is available in a 16 vCPU shape, which is the real deal when comparing results. Not every machine family enables SMT, and this is actually great for testing core throughput. When SMT is not available (at this point, T2D and T2A), both 8 and 16 vCPU are tested. Google doesn’t publicly advertise the NUMA affinity nor the Cluster-on-Die/Sub-NUMA-Cluster/NUMA-nodes-per-socket configuration. We just have to hope the configuration is good enough to map CPU and memory as close as possible. Choosing a small (but not too small) machine layout like 16 vCPU should give (or at least I hope) enough affinity.
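To see how the vCPUs actually map onto cores and NUMA nodes from inside the guest, the standard Linux tooling is enough (lscpu and the sysfs topology files are generic Linux interfaces, nothing PKB-specific):

```shell
# threads per core, cores per socket, and NUMA layout as the guest sees them
lscpu | grep -E 'Thread|Core|Socket|NUMA'

# the SMT sibling of vCPU 0 (e.g. "0,8" on a 16 vCPU, 8 physical core shape)
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
```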

Another option was to choose the biggest available machines; the problem was literally an economical one. The more cores there are, the longer the test takes, and consequently the higher the cost. If I find a way, I’ll work on this.

A great deal of time was put into comparing 16 threads vs. 16 cores (and you can already guess which one delivers the most).

THP, what are these? Why enabled?

THP stands for Transparent HugePages. Traditionally, memory in computers is organized in chunks (pages) of 4kB. HugePages is a rather new concept (as in, from the ’90s) where memory can be allocated in much bigger chunks (2MB or 1GB), wasting fewer CPU cycles on TLB lookups when locating and retrieving data, allocating fewer virtual-memory mappings, and thus improving performance. THP tries to do all of this without the typical HugePages downsides, such as static memory allocation and the need for programs explicitly written to take advantage of it. Here follows the official SPEC answer:

Here follows also the official Red Hat documentation:
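On a RHEL-family guest, the current THP mode and its actual usage can be checked through sysfs and /proc (standard kernel interfaces; the commented echo line requires root):

```shell
# current mode; the bracketed value is the active one, e.g. [always] madvise never
cat /sys/kernel/mm/transparent_hugepage/enabled

# how much anonymous memory is currently backed by huge pages
grep AnonHugePages /proc/meminfo

# force it on system-wide (as root; RHEL 9 kernels default to "always" anyway)
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
```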

How to run single- and multi-thread executions? And why?

This is handled by the PKB CLI as well as the SPEC config file.

First off, it’s fundamental to capture single-thread performance to understand the IPC of a given CPU. This also represents the best possible test conditions, because nothing stresses the memory subsystem nor the various CPU cache layers, and even from a thermal point of view we shouldn’t hit any limit (aka higher clocks).

Multi-thread testing, on the other hand, gives a different perspective, stressing the memory and putting as much load as possible on the CPU. Under perfect circumstances (aka no hardware or software limits), we can see how well the overall system — made of CPU, memory subsystem, OS, its tuning, and especially the CFS scheduler config — lets the workload scale. SPECrate tests take the parallelization approach of running multiple independent copies of each benchmark (OpenMP, a set of libraries supporting shared-memory multiprocessing, is instead what the SPECspeed suites use).

Two executions

Let’s also touch base on that: PKB is gonna be executed twice, the first time for a full INT and FP Rate run in single-thread mode, and the second time, still a full INT and FP Rate execution, but this time multi-threaded, leveraging all the available vCPUs.
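In runcpu terms, the two passes boil down to something like the following sketch (the flags are standard runcpu options, but the config label and copy counts are illustrative of the wrapper’s behavior, not copied from it):

```shell
# pass 1: single-thread, one copy of each intrate/fprate benchmark
runcpu --config=gcc-rocky9 --copies=1 --iterations=3 intrate fprate

# pass 2: multi-thread, one copy per available vCPU
runcpu --config=gcc-rocky9 --copies=16 --iterations=3 intrate fprate
```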

Bringing it all together?

The wrapper itself is just shy of 120 lines, many of which are duplicated due to the dual execution.
