Intel Sapphire Rapids on Google Compute Engine C3 instance and SPEC CPU® 2017

Federico Iezzi
Google Cloud - Community
7 min read · Mar 3, 2023

The newest GCE C3 machine family represents a major milestone for Google Cloud, and definitely more than an incremental upgrade of the computing infrastructure. CPU-wise, C3 is powered by the 4th Gen Intel Xeon Scalable processor (codename Sapphire Rapids). Alongside the generational improvements that Sapphire Rapids brings, Google, together with Intel, has developed a custom IPU (Infrastructure Processing Unit), enabling new classes of workloads thanks to the offload capabilities it exposes.

Intel Sapphire Rapids

Sapphire Rapids is not just the latest Intel architecture: it also marks a shift in how Intel builds CPUs, moving to a more modular design with several tiles interconnected by EMIB (Embedded Multi-die Interconnect Bridge). Sapphire Rapids is also characterized by a plethora of hardware offloads (many of which may be available on GCP) for specialized workloads.

As with any true new generation, there is a healthy uplift of improvements: a new CPU architecture based on Golden Cove, supported by a wider and faster DDR5 memory subsystem, and PCIe 5.0, which in turn enables CXL.

The best place to learn more about Sapphire Rapids is ServeTheHome, where Patrick Kennedy has written a deep dive into its capabilities and performance.

Intel IPU E2000

This is the next piece of the C3 recipe. These devices used to be called SmartNICs, then DPUs (Data Processing Unit — arguably the correct term here as well), and more recently IPUs (Infrastructure Processing Unit). The theory goes that a SmartNIC does much more than just handle Ethernet packets. Originally, highly specialized offloads were baked into fixed hardware functions (such as MPLS and VXLAN offloads); later, parts of the hardware became programmable (e.g. in-hardware packet filtering, a programmable pipeline for specialized offloads). Ultimately, the need to waste as few resources as possible, freeing up the expensive host CPU cores, pushed things further, and the DPU was born: a NIC that is literally a mini-computer, capable of exposing specialized devices such as NVMe and Ethernet ones, all while ensuring line-rate performance and programmability for pretty much any given scope.

But not just that: a DPU also has some general-purpose cores, typically ARM- or MIPS-based (take a look at the Mellanox BlueField), onto which other, more complex and less generic tasks can be offloaded. What Intel and Google did here, to me, closely resembles the AWS Nitro idea, where the physical host is just a pool of CPU and memory resources controlled by the DPU. I don’t have any specific Nitro knowledge, but it seems the E2000 has all-around support for other use cases like RDMA and cryptography, plus support for community-driven frameworks like DPDK (Intel also quotes SPDK and IPDK, but I don’t have any first-hand experience there).

For all the in-depth details, please refer to this awesome Medium post by Intel:

Performance improvements

Google quotes a 20% uplift compared with the Cascade Lake-based C2 instances, and an astonishing +80% for block storage (a gain owed as much to Hyperdisk as to the IPU E2000):

I’d like to focus on three performance areas:

  1. General purpose SPEC CPU® 2017;
  2. Network latency;
  3. Storage performance.

SPEC is the subject of this first installment, while network and storage will follow.

SPEC estimated scores

Before going into the performance numbers, I’d like to point out that all SPEC CPU® 2017 numbers are estimated scores, since the results have not been officially submitted to and reviewed by SPEC.

General purpose SPEC CPU® 2017

Just like the previous SPEC CPU challenge (Parts 1, 2, 3, and 4), this C3 preview follows the same format, and all data is recorded in a Google Sheet, available at the following link:

Now, right off the bat, there are four topics to touch upon:

  1. The GCP C3 Machine Family is currently under PREVIEW, which means capabilities and performance can change before the GA release;
  2. Bugs and unexpected behaviors are part of the PREVIEW experience. Speaking of which, I was unable to run the SPEC Peak tests due to an error when compiling the codebase with -O3 -march=native, which enables AVX512 optimizations. The error reads: Error: no such instruction: `vmovw %eax,%xmm0'. Notably, vmovw belongs to the new AVX512-FP16 extension, which suggests the system assembler predates it (a minimal reproducer sketch follows this list);
  3. All the platform characteristics are the same ones used in the former challenge described above, the only exception being the RHEL 9 kernel, which got a slight bump in minor release (5.14.0-162.6.1.el9_1.0.1.x86_64 vs. 5.14.0-162.12.1.el9_1.0.2.x86_64). If you’re curious about the differences, see the following changelog -> https://access.redhat.com/downloads/content/kernel/5.14.0-162.12.1.el9_1/x86_64/fd431d51/package-changelog (Red Hat subscription required);
  4. I wrote a new PKB patch to support GCP’s visible cores option. This was instrumental in modeling the c3-highcpu-22 down to the usual layout of 16 vCPUs and 8 pCPUs.
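For the record, here is a minimal, hypothetical reproducer of the Peak-build failure from item 2. Nothing below comes from the SPEC harness: the _Float16 snippet and flags are illustrative, and the sketch simply checks whether the local gcc/as pair can assemble AVX512-FP16 code on the host it runs on.

    import os
    import subprocess
    import tempfile

    # Tiny function that, on a Sapphire Rapids host with a new-enough GCC,
    # may be compiled with AVX512-FP16 instructions such as vmovw/vcvtss2sh.
    SRC = "_Float16 to_half(float x) { return (_Float16)x; }\n"

    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "fp16.c")
        with open(src, "w") as f:
            f.write(SRC)
        # -march=native on C3 enables avx512fp16 code generation; a GNU
        # assembler older than the compiler then fails with the very
        # "no such instruction: `vmovw ...'" error quoted above.
        result = subprocess.run(
            ["gcc", "-O3", "-march=native", "-c", src, "-o", os.devnull],
            capture_output=True, text=True)
        print("build OK" if result.returncode == 0 else result.stderr.strip())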

Here follow the SPEC results:

  • We can see how C3 is, across the board, a better option than C2;
  • The geomean results for single-thread suggest +12% INT and +23% FP performance improvements, which is a pretty healthy jump (a worked geomean example follows this list);
  • The geomean results for multi-thread suggest wider improvements, with +20% INT and +24% FP. This may be due to the better memory subsystem (DDR5 vs. DDR4 and 8 vs. 6 memory channels) and different cache layouts (2 MiB L2$/core vs. 1 MiB L2$/core and 1.875 MiB L3$/core vs. 1.375 MiB L3$/core) of Sapphire Rapids compared with Cascade Lake. This is somewhat in line with the results provided by Google, although it is remarkable that we can gain more performance than what’s officially quoted;
  • There are nearly no regressions: besides 544.nab_r (a molecular-modeling workload, important for scientific computation), all the others can be categorized as run-to-run variation.
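As a refresher on how these percentages are derived: SPEC reports a per-benchmark ratio against the reference machine, the suite score is the geometric mean of those ratios, and the uplift is the ratio between two geomeans. A minimal Python sketch, with made-up placeholder ratios rather than the measured C2/C3 scores:

    from math import prod

    def geomean(ratios):
        """Geometric mean, as used for SPEC suite scores."""
        return prod(ratios) ** (1.0 / len(ratios))

    # Hypothetical per-benchmark ratios, NOT the measured results.
    c2_int = [4.1, 5.3, 3.8, 6.2]
    c3_int = [4.7, 5.9, 4.4, 6.9]

    uplift = geomean(c3_int) / geomean(c2_int) - 1.0
    print(f"estimated INT geomean uplift: {uplift:+.1%}")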

The percentage of performance uplift for both single- and multi-thread follows:

When C2D is also thrown into the mix, the final result is a bit different:

  • The geomean results for INT see C2D’s AMD Milan leading by about +12% over C3 (and +19% over C2);
  • Sapphire Rapids takes the INT lead on 500.perlbench_r and 520.omnetpp_r, respectively representative of Perl (still used under the hood by many Linux utilities, although fewer recent ones nowadays) and of the simulation of a large 10 Gb Ethernet network;
  • A similar story follows for the FP results, where the geomean shows C2D’s AMD Milan at +23% over C3 and +35% over C2;
  • Sapphire Rapids takes the FP lead on 507.cactuBSSN_r, 508.namd_r, 510.parest_r, 511.povray_r, and 538.imagick_r. These are important for many mathematical and scientific simulations as well as image manipulation and rendering.

The multi-thread results are a bit different; let’s get to them:

  • C2D AMD Milan is still the best overall performer;
  • Sapphire Rapids sees its biggest jumps in 502.gcc_r, 505.mcf_r, and 523.xalancbmk_r on the INT side, and in 510.parest_r, 527.cam4_r, and 554.roms_r on the FP side. These matter for the GNU C Compiler (used by the majority of the open-source communities and many closed-source ones), route planning, XML-to-HTML translation, biomedical imaging, atmosphere modeling, and ocean modeling (which may imply some nice CFD results);
  • C3 is only able to take the lead over C2D for INT in 500.perlbench_r and 523.xalancbmk_r, and for FP in 511.povray_r and 538.imagick_r;
  • For the FP results, again, we see great numbers for image manipulation and rendering.

The per-benchmark SPEC comparison between C3 and C2D also follows:

Time to sum up the SPEC results

Overall, C3 represents a significant step in the right direction, with notable improvements in pure performance. Additionally, it boasts the most comprehensive set of AVX512 instructions available, making it an appealing option for those looking to take advantage of newer AI, ML, and deep-learning acceleration features, such as AMX and bfloat16.
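A quick, illustrative way to check which of these features a given VM actually exposes to the guest (the flag names are the ones Linux prints in /proc/cpuinfo; what C3 exposes may still change between Preview and GA):

    # Linux /proc/cpuinfo flag names for the ISA extensions discussed above.
    FEATURES = ["avx512f", "avx512_bf16", "avx512fp16",
                "amx_tile", "amx_bf16", "amx_int8"]

    with open("/proc/cpuinfo") as f:
        flags_line = next((line for line in f if line.startswith("flags")), "")
    flags = set(flags_line.split())

    for feature in FEATURES:
        print(f"{feature:12s} {'yes' if feature in flags else 'no'}")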

While I have yet to analyze the storage and network results, I am confident that the GCP C3 is an excellent choice for any software that can leverage the features provided by Sapphire Rapids. On the other hand, the absence of AVX512 in C2D is a limitation that cannot be ignored, even though its impact is subjective and workload-dependent.

However, AMD Milan in the C2D variant delivers exceptional performance results and still dominates the top of the charts. Moreover, it comes with a very competitive price tag.

In the upcoming chapters, I’ll discuss the network and storage results in detail. Until then, I hope you have found this information useful.
