“CPU-Limited” Games & Benchmarks

Osvaldo Doederlein
15 min read · Nov 24, 2024


Every CPU launch season is inevitably followed by gamers complaining about contrived benchmarks and reviewers defending their methodology. This year delivered a bumper crop of both. I agree with the reviewers, but I understand some of the concerns and have shared a few myself — on Twitter, where nuance goes to die. Now I want to explore this properly, in detail, with an open mind. Later on, I also discuss the related problem of new CPUs failing to improve game performance — or modern games failing to use faster CPUs.

I planned to start by quoting some frustrated gamers when a perfectly timed gift arrived. In an interview with Hot Hardware about Intel's poor Arrow Lake-S launch, Robert Hallock opined:

“For me as a personal buyer, these [1080p] benchmarks are totally irrelevant… I have a 1440p ultrawide and two 4K monitors… I have to go out of my way to find those results in reviews, that is more relevant to how I use my PC.”

Check the video for context: that comment was invited by the interviewers; Hallock never implies that reviewers are wrong; and he’s not dismissing the tests in ARL-S’s launch reviews. But it's a great example of many gamers’ frustration with benchmarks that seem disconnected from real-world play.

Reviewers Make their Case

Let's start with Daniel Owen's recent series of videos. Daniel's teaching skills deliver helpful diagrams and explanations: see The Bottleneck Paradox and the follow-up Are CPU reviews all WRONG?. For me, this is a great way to understand this kind of problem. A good abstract model represents generalized behavior well, unlike a set of results that might be cherry-picked or incomplete.

Next we have Ancient Gameplay's THIS is how to PROPERLY read the "1080P" CPU benchmarks! Resolution doesn't matter. Fabio Pisco makes a similar argument to Daniel's, but illustrates it with benchmarks, which makes things concrete if that's what it takes to convince you.

I will highlight a minor point here, made by multiple reviewers, but Fabio gives me the perfect screenshot: the popularity of 1080p as a real-world gaming resolution. Even to the extent that the Steam survey is useful despite its many well-known deficiencies, it only shows single-factor breakdowns. What are the dominant display resolutions for users of recent, midrange or better hardware? I find it hard to believe that the majority of gamers rocking, say, a 13700K use a 1080p monitor. Even more so if we exclude the monomaniac players of competitive multiplayer games. By the way, I just learned from Monitors Unboxed that 1440p is the new 1080p.😉

That takes us to their main channel, Hardware Unboxed, with two recent pieces. In How We Test: CPU Benchmarks, Misconceptions Explained (video), HUB makes their case by looking back at a 2017-era review where they had compared CPUs using the then-halo GTX 1080 Ti, now re-testing with a midrange (for 2017) GTX 1070 plus some current GPUs and CPUs.

In the left chart, even at 1080p, pairing those older CPUs with the "balanced" GTX 1070 would greatly underestimate the performance gap not only with a contemporary halo GPU but also with the future entry-level RTX 3060. On the right, testing the same game at 4K, most of the performance delta disappears for the other CPUs even with an RTX 4090. It's a very interesting approach that also demonstrates the predictive value of testing at 1080p.

The same article/video has similar charts for a few other games, and in all the others the 4K test shows more of a difference. But the selection is small. Their latest, Ryzen 7 9800X3D, Really Faster For Real-World 4K Gaming?, has a better selection of 14 games, the latest hardware, and 4K tests run with upscaling — which is how the majority of users with 4K monitors actually play.

We can find a game like ACC where 4K/Balanced shows great CPU scaling. ACC is the best-case outlier, but see also Jedi Survivor, Hogwarts Legacy, Homeworld 3, CS2, Space Marine 2, Watch Dogs Legion — 50% of all games in their test suite show significant gains at "4K". Still, the gain is even higher at 1080p, not to mention other tests where only 1080p reveals an advantage. Overall, HUB explains well why those 4K tests are too much extra work and not really needed to evaluate CPUs, even when they show framerate gains.

Even if effort and time were not problems, the need for upscaling in really-real-world 4K testing is itself a problem: upscalers have a GPU cost, so the test is not as good at ensuring CPU focus as a native test at the same rendering resolution. And each combination of upscaler and scaling factor has a different GPU cost. Those factors can add a lot of noise to the results. I'd rather have dedicated upscaler benchmarks when one of them is updated, or when a new GPU can run its home-turf upscaler better.

Synthetic Benchmarks

A CPU-limited test is necessary to isolate the subject under test and measure its performance without GPU performance as a variable. That's standard engineering practice for benchmarking, and for testing in general.

This is a great argument for skewed testing configurations, but it's an even stronger argument for proper benchmarks: code that was designed to be a benchmark. In previous posts I made a case for synthetics, but those are all in the GPU-centric category, so if you use them for CPU testing you get results like this from Craft Computing:

Some gaming benchmarks include CPU-load subtests, but they're few and weak; for example, 3DMark Time Spy's CPU portion is a single test, a Boids physics simulation. There are many great synthetic CPU benchmarks… but none is designed for games. By the way, the best "synthetics", contrary to popular belief, use code taken from real apps; they are not microbenchmarks (small hot loops written specifically to be a benchmark). But there are exceptions even among 3DMark's CPU tests.
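As an aside, to give a feel for what a Boids-style CPU workload looks like, here is a minimal C++ sketch: classic separation/alignment/cohesion steering over a brute-force neighbor scan. This is only an illustration with made-up parameters, not 3DMark's actual test code:

```cpp
// Minimal Boids-style CPU workload sketch (illustration only, not 3DMark's code).
// Each boid steers by three classic rules: separation, alignment, cohesion.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

struct Vec2 { float x = 0, y = 0; };
static Vec2 add(Vec2 a, Vec2 b) { return {a.x + b.x, a.y + b.y}; }
static Vec2 sub(Vec2 a, Vec2 b) { return {a.x - b.x, a.y - b.y}; }
static Vec2 scale(Vec2 a, float s) { return {a.x * s, a.y * s}; }

struct Boid { Vec2 pos, vel; };

void step(std::vector<Boid>& boids, float dt) {
    std::vector<Vec2> accel(boids.size());
    for (std::size_t i = 0; i < boids.size(); ++i) {          // brute-force O(n^2) neighbor scan
        Vec2 sep{}, ali{}, coh{};
        int neighbors = 0;
        for (std::size_t j = 0; j < boids.size(); ++j) {
            if (i == j) continue;
            Vec2 d = sub(boids[j].pos, boids[i].pos);
            float dist2 = d.x * d.x + d.y * d.y;
            if (dist2 > 25.0f) continue;                      // only nearby boids matter
            sep = sub(sep, scale(d, 1.0f / (dist2 + 1e-4f))); // separation: push away
            ali = add(ali, boids[j].vel);                     // alignment: match velocity
            coh = add(coh, boids[j].pos);                     // cohesion: seek local center
            ++neighbors;
        }
        if (neighbors > 0) {
            ali = scale(ali, 1.0f / neighbors);
            coh = sub(scale(coh, 1.0f / neighbors), boids[i].pos);
        }
        accel[i] = add(add(scale(sep, 1.5f), scale(ali, 0.5f)), scale(coh, 0.3f));
    }
    for (std::size_t i = 0; i < boids.size(); ++i) {
        boids[i].vel = add(boids[i].vel, scale(accel[i], dt));
        boids[i].pos = add(boids[i].pos, scale(boids[i].vel, dt));
    }
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> u(-50.0f, 50.0f);
    std::vector<Boid> boids(2000);
    for (auto& b : boids) b = {{u(rng), u(rng)}, {u(rng) * 0.01f, u(rng) * 0.01f}};

    auto t0 = std::chrono::steady_clock::now();
    for (int frame = 0; frame < 100; ++frame) step(boids, 1.0f / 60.0f);
    auto ms = std::chrono::duration<double, std::milli>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("100 simulation steps: %.1f ms\n", ms);
}
```

A real benchmark would at least spread the outer loop across threads and use spatial partitioning, but even this toy captures the flavor: lots of floating-point math and memory traffic, very little of the branchy game-logic work that actually limits real games.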

My plea: please UL, un-deprecate the 3DMark API Overhead test. This benchmark is heavy on the CPU (see above / Ryzen 9 9900X), and virtually all the CPU-side work it does is in the GPU driver. That makes it a great test of multithreaded CPU efficiency for modern DX12 or Vulkan games with a heavy load of draw calls. We would only need some tweaks to make the test less GPU-intensive, if possible, so it is always CPU-limited.

With this test, performance depends on the GPU driver as much as on the CPU, but that is something we want as one element of "CPU for gaming" reviews. All real-game benchmarks already depend on this factor, so scores can only be compared between tests locked to the same GPU driver version, as well as the same game patches, BIOS version, and Windows updates.

Games are Not Special?

Shouldn’t we simply test CPUs with proper CPU benchmarks, even if those aren’t gaming-specific? I can over-simplify that argument as a meme to piss off every reviewer, and then try to make the best case for it.

The argument: games are not special snowflakes whose code looks nothing like any other application. They tend to live at the extreme of low-level code optimization, but that's true of plenty of other software: OS kernels, database servers, engineering tools, media processing. And there are many benchmarks that specialize in measuring that kind of code, usually labeled HPC, server, encoding, simulation, etc.

To the extent that a particular game can be a snowflake, reviewers have a selection problem: they cannot test every game, or even every popular game, and nobody knows what's inside most games' CPU code.

There's little science in picking games and scenes for benchmarks. Still, there are good criteria: games that deliver very reproducible results, sampling the major genres and engines. There are also concessions: popular games the audience wants, games that are easier to use as a benchmark. Overall, well-honed empirical methods can work well. That methodology seems weak compared to benchmarks based on research into the representativeness of coding techniques, dataset sizes, concurrency, etc., designed with help from CPU, OS, and compiler makers, and open to peer review. That said, this is a hobby and the content must be fun to watch and relatable. If I want scientifically rigorous performance results, I'll check ICCD or SIGGRAPH papers, with a math textbook at hand.

Having argued for both sides, perhaps we can ask for a better balance. Most CPU-for-gaming reviews include productivity benchmarks like 7-Zip or Blender, so why not add one good CPU synthetic; for example, SPEC CPU 2017 or y-cruncher. Something like Cinebench looks cool in a video and is popular, but it shouldn't be your primary test for CPU IPC, multithreading, or analyses of clocks, power, or thermals. Reproducibility also helps: a few reviewers publish videos of full benchmark runs, or save-game files, so anyone can try to test the same scenes.

Flopping Games

If you go by most gaming reviews, 2024 was mostly disappointing for new CPUs, but… have you considered that it's the games that "flopped", not the new hardware? Maybe it's not the new CPUs failing to increase gaming performance; it's the games failing to take full advantage of better CPUs.

This section will focus on Zen 5. Intel's ARL launch is still fresh as I write; its performance should improve somewhat with the promised firmware fixes, but it is not likely to become a good generational upgrade for games either.

Reality check: Zen 5's pretty good IPC gains, around +15%, are proven by a vast number of CPU benchmarks. And I'm not even playing the AVX-512 card, which takes Zen 5's gen-over-gen average close to +30%.

Why didn't we see a big gen-on-gen improvement for games? Well, for one thing, the typical recent-ish (2020+) game EXE/DLL on your PC is compiled to be compatible with 2012-era Sandy Bridge or Bulldozer. Brand-new AAA titles I see now list 2017's Coffee Lake and Zen 1 CPUs as the minimum. And that could just mean "needs this much CPU power", not "well optimized for this CPU or newer". Developers want to target a wide audience, and consoles establish a red line — not much effort, I suspect, goes into CPU-specific optimization that's no good for the major last-gen consoles.

Games can have multiple code paths: if the CPU has AVX2, do this; otherwise, run that slower but more compatible code. Some of that comes gratis from libraries, including the C/C++ standard libraries, although games often avoid most third-party code for reasons such as avoiding heap allocation. And per-CPU optimizations might only benefit a minority of customers.
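Here is a minimal sketch of that dual-path pattern, assuming a GCC/Clang toolchain on x86-64; the function names and the toy workload are mine for illustration, not from any shipping game:

```cpp
// Runtime CPU-feature dispatch sketch (assumes GCC or Clang on x86-64).
#include <cstddef>
#include <cstdio>
#include <immintrin.h>
#include <vector>

// Portable fallback: runs on any x86-64 CPU the game's minimum spec allows.
float sum_scalar(const float* v, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

// AVX2 path: compiled for AVX2 regardless of the global flags,
// and only ever called after the runtime check below.
__attribute__((target("avx2")))
float sum_avx2(const float* v, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) acc = _mm256_add_ps(acc, _mm256_loadu_ps(v + i));
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);
    float s = 0.0f;
    for (float lane : lanes) s += lane;
    for (; i < n; ++i) s += v[i];   // leftover elements
    return s;
}

float sum(const float* v, std::size_t n) {
    if (__builtin_cpu_supports("avx2")) return sum_avx2(v, n);  // "if we have AVX2, do this"
    return sum_scalar(v, n);                                    // "otherwise, the compatible code"
}

int main() {
    std::vector<float> v(1 << 20, 1.0f);
    std::printf("sum = %.0f\n", sum(v.data(), v.size()));
}
```

Real engines usually resolve the choice once at startup into a function pointer instead of checking on every call, but the shape of the code is the same.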

In Extreme SIMD: Optimized Collision Detection in Titanfall, see how this game didn't use SSE4 or AVX1, available even on low-end (!!!) PCs circa 2014, despite developer Respawn putting in the hard work of CPU SIMD optimization. They could have written separate code paths for better CPUs. But they didn't, and that's the norm. That's why PC-gaming reviewers dismiss the utility of AVX-512 in Zen 4+: no game is likely to use that ISA today, or for years to come, since Intel dropped that ball in client SKUs for generations.

Compilers can help: just use the right flags, like -march=x86-64-v4, and the code generator spits out better code for modern CPUs. But this is limited; compilers are still quite dumb in areas like auto-vectorization. And even to the extent that compilers do help, games rarely take advantage, because producing separate EXEs/DLLs for each CPU level creates complications from building to testing to support.
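For illustration, here is roughly what the compiler-assisted options look like, assuming GCC (or a recent Clang) on a Linux toolchain; the function and build lines below are a sketch, not a recipe any particular game uses:

```cpp
// Sketch of compiler-driven per-CPU code generation (GCC or recent Clang on Linux assumed).
//
// Option 1: raise the baseline for the whole binary, dropping older CPUs:
//   g++ -O2 -march=x86-64-v3 game.cpp   // requires AVX2-class CPUs (Haswell/Zen 1 and newer)
//   g++ -O2 -march=x86-64-v4 game.cpp   // requires AVX-512-class CPUs
//
// Option 2: function multi-versioning. The compiler emits several versions of the
// annotated function and the dynamic loader picks the best one for the running CPU,
// so a single binary keeps the old baseline. The function below is purely illustrative.
#include <cstddef>
#include <cstdio>
#include <vector>

__attribute__((target_clones("default", "avx2", "avx512f")))
void scale_positions(float* xs, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)   // a plain loop the compiler can auto-vectorize
        xs[i] *= factor;
}

int main() {
    std::vector<float> xs(1 << 16, 1.0f);
    scale_positions(xs.data(), xs.size(), 1.5f);
    std::printf("xs[0] = %.1f\n", xs[0]);
}
```

Multi-versioning avoids shipping separate binaries, but it only covers the functions you annotate, and it still multiplies the build and test matrix, which is exactly the complication studios tend to avoid.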

Why do synthetic CPU benchmarks, and even many real-application benchmarks, scale so much better than games? In part because they're optimized for the latest CPUs. Many are distributed as source code that each reviewer can — and should — build with the latest compiler and optimal flags. Others are professional applications that ship with those optimizations. For example, an Intel whitepaper reveals that Adobe Lightroom already used AVX2 & AVX-512 in 2019, and that Intel helped optimize it. (Their target users are expected to buy Xeons; SIMD-lobotomized CPUs are for peasants.)

The typical AAA game is not like Lightroom. The final few months of development are often a crunch to ship something that kinda works. Then 1–2 months of patches, hopefully with minor performance fixes. Then the team moves on to the next project and that code is never touched again. Unless it's the next GaaS hit: years of patches… adding more crap to buy. Incidentally, one reason I love the modern trend of releasing Remasters and Remakes of classic games is that these can be great opportunities to focus on technical updates, catching up with new CPUs and GPUs.

Games are special snowflakes in one sense — no direct competition. When Elden Ring’s performance is garbage I can choose to play a different game, perhaps even another “souls-like”; but there aren’t straight replacements for the same game in the way that you can choose another photo editing app, database etc.

If all of the above is true, why did previous CPU generations manage to score better gains in game benchmarks? One big factor is exactly where the CPU bottlenecks are. Single-threaded scalar performance always helps, but it's getting harder to improve with every new generation. Game engines are now very bandwidth-intensive, so faster RAM and bigger caches help. Bandwidth improved a lot in recent generations, with the DDR3 → DDR4 → DDR5 transitions and much bigger caches culminating in AMD's X3D. Look at the 9800X3D right now, again making most games a lot faster.

But these factors are flattening. AMD's X3D cache hasn't grown since Zen 3, SRAM is not scaling well, DDR5 is mature with actual performance moving only incrementally, and GDDR progress is even slower. The economics-driven move to chiplets costs latency, now for Intel too. The 9800X3D's average +11% over the 7800X3D owes a lot to the new below-CCD packaging unlocking higher clocks, a one-time fix. Maybe the next iteration will add a second layer of cache, but there's a curve of diminishing returns for cache size and size/latency tradeoffs. The low-hanging fruit has been picked; the only paths forward now are uphill, at night, in the snow.

Games and CPUs: The way forward

If you want your future AAA games to scale to those new 480Hz monitors and beyond, we'll need better CPU scaling. The most obvious opportunity is better multithreading. This has improved a lot since the bad ol' days of mostly single-threaded games; modern engines now often make good use of CPUs with up to 6-8 cores. But if you have a higher-end CPU, you never see anything even close to an all-core full load. In fact, even with 8 cores a true 100% CPU load is extraordinarily rare.

The test above, from CapFrameX, is close to a best-case scenario. This is the UE5 City Sample, but you rarely see any game with better CPU usage. Notice that the Intel 285K doesn't have hyperthreading, which simplifies scheduling and the accounting of true load. The engine does a very good job keeping high load on all the P-cores. It also puts significant work on the E-cores, which, unlike in previous Intel CPUs, are not useless for games.

But there's no full load. In yet another Hardware Unboxed video, Editor and Steve Impersonator Balin McKinley mentions that gamers often don't get how a game can be CPU-limited when load is way below 100%. Balin offers memory bottlenecks as an explanation, but there's much more to it.

Games — in fact, most code — are very difficult to break down into lots of tiny tasks that can be executed in parallel with good scaling and constantly high hardware utilization. Most of the game code that's embarrassingly parallel is already executed on the GPU side. What remains for the CPU is, by definition, the work that's hard to parallelize, or that needs OS services such as I/O, or that is simply unsuited to GPU architectures, e.g. too "branchy".
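Amdahl's law puts a number on that: if a fraction p of a frame's CPU work can run in parallel and the rest is serial, the speedup on n cores is 1/((1-p) + p/n), and the average core utilization is that speedup divided by n. A tiny sketch with made-up numbers:

```cpp
// Amdahl's law sketch: speedup and implied average CPU load when a fraction p
// of the per-frame CPU work parallelizes (the numbers here are made up).
#include <cstdio>

int main() {
    const double p = 0.75;                      // hypothetical parallel fraction of frame work
    const int core_counts[] = {6, 8, 16, 24};
    for (int n : core_counts) {
        double speedup = 1.0 / ((1.0 - p) + p / n);
        double utilization = speedup / n;       // average busy fraction across all n cores
        std::printf("%2d cores: speedup %.2fx, average CPU load %.0f%%\n",
                    n, speedup, utilization * 100.0);
    }
}
```

With that hypothetical 75% parallel fraction, an 8-core CPU averages under 40% load while the game is fully CPU-limited, and 24 cores sit below 20%; exactly the pattern the task manager shows.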

Gaming enthusiasts and reviewers love to complain about bad optimization. But most modern engines already make a big effort to use many cores. In the GDC presentation Parallelizing the Naughty Dog engine using fibers, we can see that in detail for one such engine.

Despite extensive optimization for CPU parallelism, with a new engine that issued ~1,000 jobs per frame in TLoU Remastered, rendering was still too slow on the 6 CPU cores available to games on the PS4. The red ellipses mark big utilization holes in each core. The largest problem: dependencies between the left-side game logic phase (e.g. scene management, input, AI, physics) and the right-side rendering logic (e.g. culling, LOD, draw calls).

Naughty Dog solved that by interleaving frames: the game logic phase for frame N can overlap the rendering phase of frame N-1, and also overlap the GPU rendering for frame N-2, a third phase not shown in the slide. This trick managed to fill the big holes, keeping frame times below 16ms to hit 60fps while allowing each of the three phases to take as much as 16ms. However, this solution has a cost in extra input-to-pixel latency that can be as bad as 2 frame-times. It also greatly increases memory pressure; half of that presentation is just about memory management optimizations needed to make the overlapped rendering viable.
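Here is a minimal sketch of that interleaving idea: two plain threads and a one-slot queue stand in for the fiber-based job system, with game logic for frame N running while frame N-1 is still being rendered. The names and the fake 14 ms workloads are made up for illustration:

```cpp
// Minimal frame-pipelining sketch: game logic for frame N overlaps render work
// for frame N-1. Two plain threads and a one-slot queue stand in for a fiber
// job system; the names and fake workloads are illustrative.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <thread>

struct FramePacket { int frame; /* culled draw lists, matrices, ... */ };

class FrameQueue {                        // bounded to one in-flight frame
public:
    void push(FramePacket p) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return !slot_.has_value(); });
        slot_ = p;
        not_empty_.notify_one();
    }
    FramePacket pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return slot_.has_value(); });
        FramePacket p = *slot_;
        slot_.reset();
        not_full_.notify_one();
        return p;
    }
private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::optional<FramePacket> slot_;
};

void fake_work(int ms) { std::this_thread::sleep_for(std::chrono::milliseconds(ms)); }

int main() {
    FrameQueue queue;
    const int kFrames = 5;

    std::thread render_thread([&] {
        for (int i = 0; i < kFrames; ++i) {
            FramePacket p = queue.pop();
            fake_work(14);                // pretend draw-call submission takes ~14 ms
            std::printf("rendered frame %d\n", p.frame);
        }
    });

    for (int n = 0; n < kFrames; ++n) {   // "game logic" on the main thread
        fake_work(14);                    // pretend AI/physics/scene update takes ~14 ms
        queue.push({n});                  // hand frame N to the renderer, then
    }                                     // immediately start simulating frame N+1
    render_thread.join();
}
```

With the one-slot queue, the simulation and render phases each get nearly a full frame budget yet run concurrently, roughly doubling throughput versus running them back to back, at the cost of the extra frame of input-to-pixel latency described above (and one more once the GPU works on frame N-2 in parallel).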

If you know the reputation of the later TLoU Part 1 port for PC, my gushing over this engine may look odd. But check above: on the left, from Digital Foundry's launch review, is CPU usage on a PS5-equivalent-ish 6-core Ryzen 5 3600; they describe it as "incredibly impressive" CPU utilization. The center graph is from the same DF review on an Intel Core i9 12900K, and the graph on the right is PCGH's proper benchmark on a Ryzen 7 7800X3D. The two latter tests have GPU usage in the mid-80s; the game is clearly CPU-limited, but overall CPU usage is again far from full load.

The PC is a very hard target, even for an excellent game engine. You want:

  • Both low concurrency overhead for entry-level PCs with 6 cores and super-fine-grained concurrency to take full advantage of a halo CPU with 16-24 cores.
  • Much higher expectations of framerate, rendering resolution, and rendering quality. All at the same time. Including features such as more advanced Ray Tracing that can be heavy on both CPU and GPU.
  • Run on an Operating System that is not specialized for gaming. Some of the extra overhead is there for good reasons, such as security, support for an infinitude of third-party hardware, or backwards compatibility going back decades. Some of it is there for bad reasons, mostly feature bloat.
  • A much more complex memory hierarchy. PCs span a vast range of memory sizes and speeds, cache sizes, monolithic vs. chiplet packaging with different latencies and internal buses, plus the need to constantly pump gigabytes of data from DRAM to segregated VRAM.
  • Significant extra CPU burden for asset loading and decompression, while current-gen consoles have dedicated hardware for that.

The final two items are not a problem simply because they consume some CPU cycles; they also create opportunities for new CPU and GPU utilization holes when jobs have to wait longer for the data they need. To make things worse, PC gamers using keyboard & mouse are much less tolerant of any tradeoff that adds input latency. In fact, we double down on low latency with technologies like Nvidia's Reflex and AMD's Anti-Lag 2.

Summary

This stuff is complicated. Faster CPUs might fail to make games faster because most improvements we get these days require extra effort from game code for full benefit and to scale across a wide range of hardware. Higher clocks and faster memory access help most games, but you can’t get a lot of those raw-speed improvements in every CPU generation.

Reviewers and PC enthusiasts punish games that perform well on average but have poor 1% lows or rare stutters, while gamers with high-end PCs feel entitled to 4K/Ultra locked at 120fps minimum. Modern games have been forced to become sophisticated multithreaded engines and soft real-time systems with increasingly tight latencies. The complexity of these engines has two major consequences. First, fewer studios can afford to make and maintain these things, resulting in consolidation into a few big general-purpose engines like Unreal. Second, complexity begets bugs, and this is part of the reason many AAA releases are buggier than older games, not only at launch but even after significant patching.
