[Updated] HPC Leadership Where it Matters — Real-World Performance

Steven Collins
Nov 5 · 7 min read

[Update on November 6] We received feedback on our original blog and appreciate the community’s passion about performance and the accuracy of benchmarks. Intel is committed to always providing fair, transparent, and accurate performance results. Taking the community’s feedback into account, we have updated this blog with data for the most recent GROMACS version, 2019.4, and found no material difference from the earlier data posted for version 2019.3.

On the GROMACS results, our testing used the 2019.3 version released in June with best known optimizations applied to both platforms. This included proactive enabling of 256b AVX2 for AMD. Since our original testing, GROMACS 2019.4 has been released; it automates the AMD build options for their newest core, including autodetecting 256b AVX2 support. We have now tested using GROMACS 2019.4 and found no material difference in the performance geomean of the five GROMACS workloads (difference of 1.08%). The 2019.4 results are in line with our previous 2019.3 results.

There are setting differences between the platforms. These differences (e.g. threads per core, turbo on/off, Intel SNC, and AMD NPS) are intentional to achieve the highest performance for each platform. We also reviewed our published configurations and found an unintended typo. The configuration originally showed one thread per core for AMD, when our testing in fact used two threads. For Xeon 9200, the configuration previously showed two threads per core for all workloads, when in fact two of the five workloads used one thread per core. The corrected configuration details are listed below.

We appreciate the passion and engagement of this community for bringing issues to our attention. Intel is committed to always providing fair, transparent, and accurate performance results and would not intentionally mislead. We will continue to share our latest technology developments and findings with you.

Updated configuration details:
GROMACS 2019.4: Geomean (5 workloads: archer2_small, ion_channel_pme, lignocellulose_rf, water_pme, water_rf): Intel® Xeon® Platinum 9282 processor: Intel® Compiler 2019u4, Intel® Math Kernel Library (Intel® MKL) 2019u4, Intel MPI 2019u4, AVX-512 build, BIOS: HT ON, Turbo OFF, SNC OFF, 2 threads per core for: ion_channel_pme, lignocellulose_rf, water_rf. 1 thread per core for: water_pme, archer2_small; AMD EPYC™ 7742: Intel® Compiler 2019u4, Intel® MKL 2019u4, Intel MPI 2019u4, AVX2_256 build, BIOS: SMT ON, Boost ON, NPS 4, 2 threads per core.

Conventional wisdom typically suggests that “more is better”; more time, more money, more horsepower while driving on the autobahn. However, it’s important to take a holistic look to determine if “more” is always “best.”

Datacenter operators and researchers, especially those involved with high performance computing (HPC), are some of the most demanding users of technology. More performance at their disposal can accelerate the process of solving some of the world’s toughest challenges, from weather simulation to drug discovery to improved safety.

Just as adding more people to a meeting does not always lead to greater productivity, “more cores” will not always guarantee “more performance.” Performance is a function of many things, not just a single vector. More processor cores add compute, but overall system or workload performance depends on other factors, including:

· The performance of each core
· Software optimizations leveraging specific instructions
· Memory bandwidth to ensure feeding of the cores
· Cluster-level scaling deployed

HPC-Optimized Performance

To address the insatiable demands of HPC and the need for higher application performance, we introduced the Intel® Xeon® Platinum 9200 processor family in April 2019. Xeon Platinum 9200 is targeted for the most demanding compute and memory bandwidth workloads. Using the high-performance Xeon Scalable core, it not only improves compute density with double the cores but also doubles the memory bandwidth(1), enabling nearly all HPC software to see performance gains. It has the highest two-socket Intel architecture FLOPS per rack along with the highest DDR4 native bandwidth of any Intel Xeon platform. The Xeon Platinum 9282 offers industry-leading performance on real-world HPC workloads across a broad range of usages(2).

For a quick recap, the Xeon Platinum 9200 consists of two Xeon dies in a package with 4 UPIs per socket to ensure only a single hop between any two dies in a 2S system. There are multiple SKUs available, ranging from 32 cores to 56 cores per processor with TDPs of 250W to 400W. Each processor has 12 DDR4 memory channels. The Xeon Platinum 9200 is available as an integrated solution, the Intel® Server System S9200WK data center block for HPC. This allows system providers to easily configure a custom solution for end customers with minimal effort to adopt this new processor.
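As a rough illustration of what 12 DDR4 channels can mean for theoretical bandwidth, the sketch below multiplies channel count, transfer rate, and the 8-byte (64-bit) DDR4 data bus width. The function name and the 2933 MT/s transfer rate are assumptions made for illustration; the text above does not state the memory speed, and sustained bandwidth in practice is lower than this theoretical peak.

```python
def peak_dram_bandwidth_gbs(channels, megatransfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth per socket in GB/s (10^9 bytes/s)."""
    return channels * megatransfers_per_s * bus_bytes / 1000.0


# Assumed transfer rate in MT/s; not stated in the text above.
ASSUMED_MTS = 2933

twelve_channels = peak_dram_bandwidth_gbs(12, ASSUMED_MTS)
six_channels = peak_dram_bandwidth_gbs(6, ASSUMED_MTS)

print(f"12 channels: {twelve_channels:.1f} GB/s per socket")
print(f" 6 channels: {six_channels:.1f} GB/s per socket")
print(f"ratio: {twelve_channels / six_channels:.1f}x")
```

At a fixed transfer rate, doubling the channel count doubles the theoretical peak, which is the kind of relationship behind a "double the memory bandwidth" comparison, assuming the processor being compared against has half as many channels.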

The HPC segment is broad, with compute requirements that vary by workload. The 56-core Xeon Platinum 9282 delivers 8% to 84% better performance (31% higher geomean) than AMD’s 64-core Rome-based system (7742) on leading real-world HPC workloads across manufacturing, life sciences, financial services, and earth sciences(2).

Some of the applications and results shown above are a geomean of several specific workloads, all with differing characteristics and sensitivities. Drilling into the details of these workloads provides further insight into performance. For example, Xeon Platinum 9282 leads AMD Rome 7742 by 13% on a geomean of 14 ANSYS® Fluent® workloads. Across those 14 different CFD simulations, Xeon’s results range from 2% lower to 36% higher(2).
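To make the relationship between a geomean and its per-workload spread concrete, here is a minimal sketch of how a geometric mean of speedup ratios could be computed. The ratios are hypothetical placeholders, not measured results, and the helper name `geomean` is our own.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-workload speedup ratios (1.0 means parity); NOT measured data.
ratios = [0.98, 1.05, 1.10, 1.20, 1.36]

summary = geomean(ratios)
print(f"geomean speedup: {summary:.3f}")
print(f"per-workload range: {min(ratios) - 1:+.0%} to {max(ratios) - 1:+.0%}")
```

A geomean is commonly used for ratios because it treats a 2x gain and a 2x loss symmetrically, but a single summary number can hide workloads that sit well above or below it, which is why the per-workload range is worth reporting alongside it.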

The performance of specific applications is sensitive to different attributes. For example, AVX-512 is a set of 512-bit extensions to Intel’s instruction set architecture (ISA), available in Xeon Platinum 9200 and other Intel Xeon Scalable processors. AVX-512 increases the vector width, which allows applications to perform more floating point operations per clock cycle. Several HPC applications take advantage of AVX-512 and see a performance boost, including VASP, NAMD, GROMACS, LAMMPS, and FSI applications. Some HPC applications are compute bound, some are memory bound, and others are both. Depending on the bottleneck, increasing only compute or only memory bandwidth may not result in higher performance. Xeon Platinum 9200 not only increases compute with more cores but also increases memory bandwidth with more channels, and it includes AVX-512 extensions for software developers to take advantage of(1).
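The compute-bound versus memory-bound distinction can be sketched with a simple roofline-style estimate: attainable throughput is the smaller of the compute peak and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The sketch below is illustrative only. The function names, the default of two FMA units per core, and the node-level peak and bandwidth figures are all assumptions, not values from this blog.

```python
def flops_per_cycle(vector_bits, element_bits=64, fma_units=2):
    """Peak FLOPs per cycle per core for a given vector width.

    A fused multiply-add counts as 2 FLOPs per lane. fma_units=2 is an
    assumed value for illustration.
    """
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_units


def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline estimate: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


print("512-bit vectors:", flops_per_cycle(512), "DP FLOPs/cycle/core")
print("256-bit vectors:", flops_per_cycle(256), "DP FLOPs/cycle/core")

# Hypothetical node: 3000 GFLOPS peak, 250 GB/s memory bandwidth.
PEAK, BW = 3000.0, 250.0
for name, intensity in [("stencil-like kernel", 0.25), ("dense matrix multiply", 20.0)]:
    perf = attainable_gflops(PEAK, BW, intensity)
    bound = "memory bound" if perf < PEAK else "compute bound"
    print(f"{name}: {perf:.1f} GFLOPS ({bound})")
```

Under this model, a kernel with low arithmetic intensity gains nothing from wider vectors or more cores until bandwidth rises, while a high-intensity kernel gains nothing from extra bandwidth, which is the reasoning behind raising both together.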

Higher Performance with Lower TCO

More application performance absolutely has value, but the price paid for that performance also matters. Cluster-level total cost of ownership (TCO) is a function of several factors: performance of each node, number of nodes required to complete a job, cost of the interconnect (fabric, switches, cabling), operational costs (e.g. power, space), and software.

In general, higher node performance drives lower TCO, as fewer nodes are required for a fixed performance level. With the added performance of the Xeon Platinum 9200, fewer nodes are required, driving lower node acquisition cost and lower fabric, switching, and cabling costs. The Xeon Platinum 9200 series, with a higher TDP (250W to 400W) than AMD’s Rome 7742 processor (225W), does consume more power and therefore has a higher power cost, but this is more than offset by the lower number of nodes required. The TCO for any HPC user is a complex question, often unique to their specific applications, infrastructure, and cost structure. As with performance, we believe a holistic view and evaluation of TCO is needed, with performance as a key driver.
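The node-count argument can be sketched as simple arithmetic: the number of nodes is the target performance divided by per-node performance, rounded up, and total cost is the sum of acquisition, per-node fabric, and energy costs. This is a deliberately simplified model that omits software, space, and cooling. The class and function names are our own, and every number below is a hypothetical placeholder, not a real price or measurement.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeType:
    name: str
    relative_perf: float         # per-node throughput, arbitrary units
    node_cost: float             # acquisition cost per node
    fabric_cost_per_node: float  # share of switches, cabling, adapters
    power_watts: float           # average power draw per node


def cluster_tco(node, target_perf, years, cost_per_kwh):
    """Return (node_count, total_cost) for a fixed performance target."""
    nodes = math.ceil(target_perf / node.relative_perf)
    capex = nodes * (node.node_cost + node.fabric_cost_per_node)
    hours = years * 365 * 24
    energy = nodes * node.power_watts / 1000 * hours * cost_per_kwh
    return nodes, capex + energy


# Hypothetical inputs for illustration only.
higher_perf = NodeType("higher-perf node", 1.31, 30000.0, 2000.0, 1200.0)
baseline = NodeType("baseline node", 1.00, 25000.0, 2000.0, 900.0)

for node in (higher_perf, baseline):
    count, total = cluster_tco(node, target_perf=100.0, years=4, cost_per_kwh=0.10)
    print(f"{node.name}: {count} nodes, total cost {total:,.0f}")
```

Which option comes out ahead depends entirely on the inputs, so a model like this is best used to test how sensitive the result is to node price, power draw, and per-node performance for a specific site.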

There are many factors to consider when selecting the right processor to power a high-performance computing system. While adding cores may improve compute on some applications, overall performance and TCO depend on multiple attributes. More processor cores do not always translate to higher performance, nor do they always translate to better TCO. For decades, Intel has worked closely with our HPC ecosystem partners to ensure they have the right platforms to meet their system requirements. This is demonstrated by the number of Intel-based systems on the Top500 list of the world’s most powerful supercomputers.

Industry Adoption

Customers choose Intel because of the value our Xeon platform delivers, and the Intel Xeon Platinum 9200 is no different. Ecosystem partners include Atos, HPE/Cray, Lenovo, Inspur, Sugon, H3C, and Penguin Computing. HPE recently announced its Apollo 20 server featuring the Intel Xeon Platinum 9200 processor, which is targeted for use in data-intensive industries, including oil and gas, finance, manufacturing, and life sciences. Penguin Computing is currently building a Xeon Platinum 9200-based system at Lawrence Livermore National Laboratory, which I’m happy to say will be unveiled at Supercomputing 2019 (SC’19). In addition, HLRN (North German Supercomputing Alliance) announced in April 2019 that it has selected Xeon Platinum 9200 for its next-generation supercomputer to enable significant computation gains and improved efficiency.

1 — Cores and memory bandwidth comparing Intel Xeon Platinum 9200 with Xeon Platinum 8200. See www.intel.com or Intel® Xeon® Scalable Processor product page for details.

2 — For configuration details about the performance of the Xeon Platinum 9282 vs AMD Rome 7742, visit www.intel.com/benchmarks (Intel® Xeon® Scalable Processors — claim #31).

Legal Notice

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available security updates. No product or component can be absolutely secure.

Refer to https://software.intel.com/en-us/articles/optimization-notice for more information regarding performance and optimization choices in Intel software products.

Intel Advanced Vector Extensions (Intel AVX) provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Performance at Intel

Intel’s blog to share timely and candid information specific to the performance of Intel technologies and products. Intel’s fellows and engineers will also use this blog to share their latest technical updates and discuss how they are pushing performance forward.

Written by Steven Collins, Intel Datacenter Performance Director
