Google Cloud Platform CPU Performance in the eyes of SPEC CPU® 2017 — Part 3

Federico Iezzi
Google Cloud - Community
10 min read · Feb 9, 2023

In this third and nearly final part of the SPEC challenge, we’re going to deep-dive into the results, so let’s get to it.

First, let’s check the platform characteristics:

  • The tests span 15 different machine types;
  • As mentioned in Part 2, we’re comparing machines with eight physical cores/16 vCPU (and in some cases, 16 physical cores);
  • Rocky Linux 9.1 across the board using kernel 5.14.0–162.6.1.el9_1.0.1;
  • We have all the major Intel CPU architectures released since Intel Sandy Bridge all the way to Intel Ice Lake (and hopefully soon Sapphire Rapids with C3). Regarding AMD, the starting point is Zen 2/Rome. There is also ARM here, represented by Ampere Altra;
  • SPEC CPU 2017 runs single- and multi-thread, in the Rate flavor for both Base and Peak.
All CPUs compared with their respective CPUID, NUMA topology, SMT layout, and clock speeds
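For reference, here is a hedged sketch of how such runs are typically launched with SPEC CPU 2017’s `runcpu` driver (the config file name is hypothetical and depends on the local setup; the flags shown mirror the single-/multi-thread Rate runs with Base and Peak tuning described above):

```shell
# Single-thread Rate run (1 copy), Base and Peak tuning.
# "gcc-rocky9.cfg" is a hypothetical config file name.
runcpu --config=gcc-rocky9.cfg --tune=base,peak --copies=1 intrate fprate

# Multi-thread Rate run: one copy per vCPU (16 on most machines tested here).
runcpu --config=gcc-rocky9.cfg --tune=base,peak --copies=16 intrate fprate
```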

From the table above, we can immediately see the relatively low clock speeds of Intel Skylake and AMD Rome. Notably, the single-core turbo of Ampere Altra is the worst of the entire bunch while, at the same time, it has the best base clock available. AMD Milan has the same clock speed across the board (keep this in mind, because the SPEC results will tell a very different story). Lastly, the CPU with the highest clock speed (as if that still means something these days) is C2 Intel Cascade Lake.

SPEC estimated scores

Before going into the performance numbers, I’d like to point out that all SPEC CPU® 2017 numbers are estimated scores, since they have not been officially submitted to and reviewed by the SPEC body.

SPEC CPU® 2017 single-thread estimated scores

In this first table, we can already start drawing some conclusions. First off, we have both Integer and Floating Point results in Single-Thread mode. These results are for the Base (aka same compiler flags across the board) variant.

  • We can see that the N1 machines have the worst performance of the entire group; they are also the oldest architectures;
  • Meanwhile, C2D Milan is the overall best, losing only in FP 507.cactuBSSN_r and 538.imagick_r;
  • N2 Cascade Lake and N2D Rome are kind of average performers. Arguably, AMD has superior FP performance while tying in INT results (essential in any circumstance);
  • N2 Ice Lake, which comes at no additional cost, is a much better solution than Cascade Lake, scoring strong INT and FP results (+18% and +29%, respectively); these are some serious generational improvements;
  • We can see a similar pattern with N2D Milan, which also comes at no additional cost and is a far superior solution to N2D Rome, scoring +21% INT and +20% FP;
  • Choosing between N2 and N2D comes down to whether the workload takes advantage of Intel AVX512. From a cost perspective, N2D delivers (ever so slightly) higher performance and costs nearly 13% less (much more on this later);
  • Comparing C2 and C2D, we can see how AMD Milan delivers +19% INT and even +35% FP;
  • N2D and T2D, both using AMD Milan, are so little apart that the difference could be down to natural run-to-run variance;
  • Say we’re stuck on N1 (for instance, because of a marketplace solution not certified on any newer machine family, or the need for GPUs): it’s a dire situation. Skylake is the overall best, but between the highest and lowest performers there are only a few percentage points (+4% for INT and +9% for FP), and these CPUs are so slow that they will affect pretty much anything running on them. There is no such thing as 100% GPU offload; something still goes through the CPU, at the very least the loading of the datasets, and that will happen rather slowly. I’d like to stress that this is NOT due to Intel but rather to the system setup and the CPU SKUs chosen by Google (the low clock speed is the obvious issue here, going down with each generation; Skylake has about a 15% lower clock than Sandy Bridge);
  • It’s interesting to note that, between the best (C2D Milan) and worst performers (N1 Haswell for INT and Sandy Bridge for FP), there is a huge +73% in Integer and +92% in Floating Point results, while C2D costs 5% LESS than N1, which is both bizarre and an opportunity (the price/performance comparison, which follows at the end, is based on the europe-west4 list price).

The Peak results, despite compiling the benchmark with all possible optimizations, don’t change the picture much (and this will be a trend).

  • N1 is the lowest performer and C2D is the highest performer of the bunch;
  • N2 Ice Lake is the second best, showing great INT and FP performance (and also having access to AVX512);
  • At the very least, there are no real benefits from AVX512 in SPEC CPU 2017 (supported here by both N2 and C2).

In case you wonder about the performance difference between Base and Peak tests, here are the numbers:

  • N1 Skylake shows a +4% INT and a +9% FP;
  • N2 Cascade Lake loses 3 percentage points in INT and gains 8% in FP, while N2 Ice Lake gains +10% and +9%;
  • N2D Rome +11% INT and +12% FP while N2D Milan +9% for both INT and FP;
  • T2A Altra gains +7% INT and only +2% FP. This could suggest either that SPEC CPU is not well optimized for ARM (which is not the case) or that deeper compiler tuning has less effect on ARM (here I don’t have enough experience to have an opinion);
  • C2 Cascade Lake gains +5% INT and +7% FP while C2D Milan +10% INT and +9% FP;
  • With deeply optimized code, C2D is the overall top performer, but not in every single category: in FP, Ice Lake takes the lead on some tests, including 511.povray_r and 538.imagick_r, which would suggest a good fit for ray-tracing and image manipulation, while in INT it takes the lead on 520.omnetpp_r, which underlines good performance in communication-heavy network applications.

SPEC CPU® 2017 multi-thread estimated scores

This table shows the multi-thread results with the Base compiler flags.

  • The pattern established before holds: C2D is still the overall top performer, but this time around the lead is much wider. While in the single-thread results C2D Milan leads N2 Ice Lake by +12% and +16%, in multi-thread the lead is +21% and +18% (always INT and FP). This confirms AMD’s better SMT implementation and the superior memory bandwidth available in Zen 3; indeed, other architectural differences, like the bigger L3 cache and a wider core, also contribute to the better results;
  • Given the lack of SMT, T2D (still based on AMD Milan) and especially T2A deliver some of the worst results;
  • Don’t forget that here we’re always comparing 8 physical cores (which, in all machine types besides T2D and T2A, translates to 16 vCPUs). From a cost standpoint, T2D and T2A are nearly half the cost of the others. A 16-threads-vs.-16-cores comparison follows later.

This table shows the multi-thread results with the Peak compiler flags; no major changes from the previous picture.

SPEC CPU® 2017 overall rank estimated scores

This graph shows the same results as before, this time aggregated into overall INT and FP results. Single-thread shows the IPC performance, and here the top performer is without any doubt AMD Milan (used for N2D, T2D, and C2D), followed by Intel Ice Lake (used in N2).

Same as above, but for multi-thread. The results don’t change much: AMD Milan (N2D and C2D; not T2D, due to the lack of SMT) achieves great results, still followed by Intel Ice Lake.

SPEC CPU® 2017 16 threads vs. 16 cores estimated scores

This is an interesting one and an eye-opener for me. Up until this point, the comparison was with the same number of physical cores. What happens when we compare threads and cores (and thus align the price tag)?

As hinted previously: double the physical cores, well, double the performance 😆

This is the single most important hidden gem of this work. Let’s take three reference systems:

  • c2d-standard-16, our top performer, is allocated on 8 physical cores and, thanks to SMT, exposes 16 vCPUs, at an hourly cost of 0.79 USD;
  • t2d-standard-8 is allocated on 8 physical cores and, due to the lack of SMT, exposes just 8 vCPUs, at an hourly cost of 0.37 USD;
  • And last, t2d-standard-16 is allocated on 16 physical cores and exposes 16 vCPUs, at an hourly cost of 0.74 USD.

We can already see how T2D, for the same number of physical cores, is priced at less than half and, for the same number of threads, costs nearly the same. The performance figures drastically change the picture:

  • t2d-standard-16 delivers +32% more INT and +22% more FP performance than c2d-standard-16 while costing 6% LESS. This is astonishing!
  • At the same time, t2d-standard-8 delivers -30% INT and -25% FP performance compared to c2d-standard-16, at about 53% lower cost. Yet again, another impressive number.
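The cost deltas quoted above follow directly from the hourly list prices; here is a quick sketch (the prices are the europe-west4 figures cited in this article):

```python
# Hourly list prices (USD, europe-west4) as quoted above.
prices = {
    "c2d-standard-16": 0.79,  # 8 physical cores, 16 vCPUs via SMT
    "t2d-standard-8": 0.37,   # 8 physical cores, 8 vCPUs (no SMT)
    "t2d-standard-16": 0.74,  # 16 physical cores, 16 vCPUs
}

def cost_delta(machine: str, baseline: str = "c2d-standard-16") -> int:
    """Percentage cost difference vs. the baseline (negative = cheaper)."""
    return round((prices[machine] / prices[baseline] - 1) * 100)

print(cost_delta("t2d-standard-16"))  # -6  → ~6% cheaper
print(cost_delta("t2d-standard-8"))   # -53 → ~53% cheaper
```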

As shown by the above graph, T2A Ampere Altra follows a similar trend: despite its “only” Zen 2-comparable IPC, it shows GREAT multi-thread performance, achieving greater Integer performance and slightly worse Floating Point results, all for 15% less cost.

Remember, there are no bad products, only bad prices.

SPEC CPU® 2017 per-core throughput estimated scores

The per-core throughput results give a slightly different overall picture. While the single-core results, by their nature, test the system at its best (and highest clock speed), in multi-thread the clock speed is first reduced to the all-core turbo frequency, and then other factors affect the overall performance (cache thrashing, LLC utilization/partitioning aka CAT, cTDP config, parallel efficiency, OS tuning, the memory subsystem, etc.). In such a context, the results may look different.

  • C2D Milan dominates this chart as well;
  • Surprisingly, Ampere Altra delivers somewhat bad results, actually the worst of the entire bunch. Some are so ugly that the overall performance is even lower than Intel Sandy Bridge, released in 2011. Of course, SPEC CPU is known for being as memory-intensive as it is CPU-heavy. Perhaps Intel and AMD have had much longer to tune their IMCs and prefetch/caching algorithms than ARM? Certain details are simply not reported by Google (or cloud providers in general), such as the type and speed of the memory and the DPC configuration. Furthermore, the system partitioning (or lack thereof) can also affect performance a lot. Just look at the C2D results: the same CPU is used across three different configurations (N2D, T2D, and C2D), and it shines by a leading margin only in the C2D config (which is surely as much software tuning as the underlying hardware setup);
  • The entire N1 stack, combined with the relatively low clock speeds, delivers nearly consistent results 🙃 which is not a quality in this case. N1 is consistently bad and expensive (more on this later).
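The per-core normalization behind these charts can be sketched as follows (the machine names and scores below are hypothetical placeholders, not the measured results):

```python
# Hypothetical multi-thread Rate scores and core counts, purely to
# illustrate the per-core throughput metric: score / physical cores.
machines = {
    "example-a": {"score": 64.0, "physical_cores": 8},
    "example-b": {"score": 96.0, "physical_cores": 16},
}

def per_core_throughput(name: str) -> float:
    m = machines[name]
    return m["score"] / m["physical_cores"]

# More cores can mean a higher total score but lower per-core throughput
# (all-core turbo, shared caches, memory bandwidth contention).
print(per_core_throughput("example-a"))  # 8.0
print(per_core_throughput("example-b"))  # 6.0
```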

One picture is worth a thousand words — Price to Performance

The above graph takes the price tag and divides it by the sum of the INT and FP results. The higher the column, the worse the price/performance ratio. The red line shows deeply optimized code. Now, I have to say: if you have a workload that is heavily optimized for AVX512, the above graph could look different, arguably very different. In every other situation:

  • N1 should be avoided, period;
  • If you need GPUs make sure to use A2, which uses Intel Cascade Lake in the same C2 configuration;
  • AMD Milan is a great solution and the overall top performer;
  • T2D is just crushing it, and the same goes for T2A;
  • T2D/T2A, with half the vCPUs, deliver comparable performance at a significantly lower cost *;
  • E2 (which we didn’t talk about much in this series) can have a good price/performance ratio but, keep in mind, you don’t get to control the underlying CPU and, therefore, the throughput can vary from barely good to decent.
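The metric behind the graph (hourly price divided by the sum of the overall INT and FP scores, where higher means worse value) can be sketched like this; the scores used are hypothetical placeholders:

```python
# Price/performance metric used in the graph: hourly price divided by
# the sum of the overall INT and FP scores. Higher = worse value.
# The scores below are made up purely for illustration.
def price_per_point(hourly_usd: float, int_score: float, fp_score: float) -> float:
    return hourly_usd / (int_score + fp_score)

# A cheaper machine with similar scores yields a lower (better) ratio.
print(round(price_per_point(0.79, 10.0, 12.0), 4))  # 0.0359
print(round(price_per_point(0.74, 11.0, 11.0), 4))  # 0.0336
```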

For nearly all workloads, the GCP Tau machine families (T2D on x86 and T2A on AArch64) represent, at the time of writing, the price-to-performance sweet spot.

*There is another caveat/benefit to underline here: Google Cloud specifically positions the C2 and C2D machines for deterministic and predictable performance. On top of this, a real vNUMA topology is exposed. Throughout my tests, I indeed noticed a certain level of variation (something like ±10%) between different executions, while C2 and C2D remained fairly constant over time. For predictability, go for C2/C2D.

In the next, and last part, I’ll share the Google Sheet with all the results, wrappers, and original SPEC output logs for further deep dives.
