The case for Synthetic GPU Gaming Benchmarks

Osvaldo Doederlein
Jan 11, 2023 · 7 min read


There is a constant debate among PC hardware & gaming enthusiasts about the merits of GPU benchmarks and the reviewers that perform them. Meta-analyses that average the scores of several outlets help, but they mostly dilute any errors or outliers in the aggregate score. And I’m not convinced that all this effort and confusion are the best way to rank gaming hardware. I've taken first steps discussing this issue and proposing alternatives here and here, but more was needed.

The problem

Quoting myself in a recent Twitter thread (edited for concision & clarity):

"Each reviewer benchs in different ways. See Watch Dogs Legion 1440p/raster. HUB: 4070ti 115fps; 7900 XTX and even the XT beats the 4090. TPU: 4070ti 146fps; 4090 beats the 7900 XTX by 14%. MASSIVELY different results. This game has a built-in benchmark, HUB uses it but TPU refuses to use it. Also HUB tests on a 5800X3D + DDR4/3200, vs. TPU’s 13900K + DDR5/6000. If the difference of CPU wasn’t bad enough, HUB limits DDR4 speed to the official JEDEC max, but TPU doesn’t (the max for DDR5 is 5600)."

You can argue that one of the reviews is just wrong, but your guess is as good as mine. For one thing, there's not enough transparency: reviewers don't share full video captures of every benchmark or their automation scripts, so there is no direct, objective way to judge the quality of their work.

The solution: Synthetic benchmarks?

There are specialized benchmarks that represent gaming performance with a focus on GPU rendering. Let's keep it simple and only mention 3DMark, which is, in my opinion, by far the best in this category.

3DMark contains two kinds of tests. Some are microbenchmarks of isolated features such as Mesh Shading. Others are synthetic gaming benchmarks intended to represent the rendering engines of typical games. Again, I'll keep it simple and only consider the latter.

Writing one benchmark that fairly represents every game is a tall order, but 3DMark offers some variety. Let's list its “game-like” tests with their scores on my system, alongside the scores of real games I chose that have comparable rendering engines. All scores are for 1440p, since that's the only resolution supported by every test.

  • Fire Strike Extreme: DX11, Raster. 173fps.
    Batman Arkham Knight: 235fps.
  • Time Spy: DX12 L11, Raster. 175fps.
    Red Dead Redemption 2: 141fps.
  • Port Royal: DX12 L12, Raster + RT (reflections & shadows). 69fps.
    Shadow of the Tomb Raider: 96fps.
  • Speed Way: DX12 Ultimate, Raster + RT (GI, PBR). 58fps.
    Dying Light 2: 66fps.
  • DirectX Raytracing: DX12 Ultimate, Pure RT (path-traced). 50fps.
    Quake II RTX: 75fps.

This selection covers the last 10 years of Windows gaming. The tests are limited to a single API, but I find that acceptable: performance-wise, DirectX and Vulkan are extremely close; no other API has been relevant for many years; and in a benchmark, we don't want the abstraction overhead needed to support multiple back-ends for different graphics APIs.

Real games track their paired 3DMark tests quite well. A single caveat: Quake II RTX seems too fast. The game is old and relatively easy to render on the latest GPUs, even with path tracing, due to its very limited geometry. But there's no better alternative. Portal with RTX is a poor choice for many reasons: no built-in benchmark, too experimental, rendering bugs on RDNA GPUs, and next-gen path tracing beyond the 3DMark DXR test. Not a big deal yet, since path tracing is found exclusively in remasters of ancient games, produced by modders or by GPU vendors. I'm ignoring that fifth test.

I’m focused on reasonably current technology (the last 10 years); that excludes ancient esports titles that run at many hundreds of FPS on the latest GPUs. If you care about those, 3DMark’s Night Raid is a good extra test, despite using DX12 (DX9 would be ideal) and contemporary rendering techniques. It’s targeted at laptops and tablets with integrated graphics, but it’s simply lightweight: its graphics tests average 850fps at 1440p on a 7900 XTX. I’m not using this test either; we should stop judging new hardware by games that almost require emulation to run.

Meta-analysis: 3DMark versus 3D Center

How well do those four 3DMark tests track real-game benchmarks from major reviewers? I decided to build a meta-benchmark; here's how.

First, I chose six GPUs representing the last two generations from the two major vendors: 6900 XT, RTX 3090, 7900 XT, RTX 4080, 7900 XTX, and RTX 4090. I rank the GPUs by their Graphics scores in each 3DMark test, normalized to the 6900 XT (so that GPU always scores 100%). Then I average the four tests into a single ranking per GPU, also relative to the 6900 XT.
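The normalization and averaging step can be sketched in a few lines of Python. The scores below are hypothetical placeholders to illustrate the method, not real 3DMark results:

```python
# Sketch of the ranking method: normalize each test's Graphics score
# to the 6900 XT baseline (baseline = 100%), then average the per-test
# percentages into one ranking per GPU.
# All numeric values are hypothetical placeholders, not real 3DMark data.

BASELINE = "6900 XT"

# scores[test][gpu] = Graphics score (placeholders)
scores = {
    "Time Spy":  {"6900 XT": 20000, "RTX 4090": 36000},
    "Speed Way": {"6900 XT": 5000,  "RTX 4090": 10000},
}

def normalized_ranking(scores, baseline=BASELINE):
    """Average each GPU's per-test scores relative to the baseline GPU."""
    gpus = next(iter(scores.values())).keys()
    ranking = {}
    for gpu in gpus:
        rel = [test[gpu] / test[baseline] * 100 for test in scores.values()]
        ranking[gpu] = sum(rel) / len(rel)
    return ranking

print(normalized_ranking(scores))
# The baseline GPU is 100.0 by construction; with these placeholder
# scores the RTX 4090 averages (180 + 200) / 2 = 190.
```

The same function extends directly to the four tests and six GPUs used in the article; only the `scores` table grows.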

Then I take data for the same GPUs from the 3D Center Launch Analysis, which averages many reputable reviews. They update this meta-analysis frequently; I'm using the update for the 7900 XT & XTX because it's the most recent one that includes all the GPUs of interest.

In the 3DMark database, I filter the scores for CPU=13700K. This is one of the top gaming CPUs, good for limiting CPU bottlenecks, but it's more popular than halo parts like the 13900K, so the 3DMark database has many more entries for it. I also keep only results with the CPU clock between 5.3GHz and 5.5GHz, to exclude outliers like extreme overclocks or poorly configured systems. The goal is good distributions, like the picture above. I collect the Graphics scores, not the Overall scores, since the latter include CPU-bound subtests in some tests.
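The filtering described above is a simple predicate over database entries. A minimal sketch, where the field names and sample values are my own assumptions rather than the real 3DMark export format:

```python
# Sketch of the result filter: keep only entries with the chosen CPU
# clocked inside the accepted range, and collect their Graphics scores.
# Field names ("cpu", "cpu_clock_ghz", "graphics_score") and all values
# are hypothetical, not the actual 3DMark database schema.

def filter_results(entries, cpu="Core i7-13700K", min_ghz=5.3, max_ghz=5.5):
    return [
        e["graphics_score"]
        for e in entries
        if e["cpu"] == cpu and min_ghz <= e["cpu_clock_ghz"] <= max_ghz
    ]

entries = [
    {"cpu": "Core i7-13700K", "cpu_clock_ghz": 5.4, "graphics_score": 20100},
    {"cpu": "Core i7-13700K", "cpu_clock_ghz": 6.1, "graphics_score": 24500},  # extreme OC: excluded
    {"cpu": "Core i9-13900K", "cpu_clock_ghz": 5.4, "graphics_score": 20900},  # wrong CPU: excluded
]

print(filter_results(entries))  # only the first entry survives: [20100]
```

The surviving scores then feed the normalization step, so outliers never touch the averages.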

With this method, the 3DMark and 3D Center rankings are very close but not identical. The most notable fact about the gaps between the two rankings is that they differ by GPU brand:

  1. For all Radeon GPUs, 3DMark mirrors the 3D Center averages with near-perfect precision, in both Raster and RT.
  2. For the RTX GPUs, 3DMark underestimates performance relative to 3D Center for the 3090 and 4080, but overestimates for the 4090.

The major discrepancies are at the extremes: Raster for the RTX 3090, RT for the RTX 4090. If I had to guess, causes may include the limited cache of the RTX 3090 (3DMark doesn't feature any test with open views, long draw distances, and heavy asset streaming), or the big generational improvement of ray tracing in the RTX 4090, which should maximize any gap between benchmarks. But even for the NVIDIA GPUs those gaps are small, and 3DMark provides a very good proxy for 3D Center.

Conclusions

The four main 3DMark tests are an excellent proxy for relative scoring: they show the same trends as reviewers testing real games, and they track the relative performance of different GPUs very well. The rankings from the best reviewers are still considered superior, but the 3DMark tests are close enough that I would easily make a GPU purchase decision based on the Time Spy and Speed Way scores alone.

Notice that there’s no strong, objective reason to pick 3D Center’s ranking as the Gold Standard. "Testing real games" is a tempting argument, but the flipside is my whole soapbox of methodological problems: varied and arbitrary game selections, some reviewers performing manual tests, bias toward games better optimized for one vendor, noise from different testing platforms (CPU/RAM/etc.), and games with well-known CPU/engine bottlenecks and other performance badness that are nevertheless added to averages. The 3DMark tests are clear of all such problems, so I could easily make the case that it’s the red bars in my chart that contain the One True Ranking, and any mismatch in the blue bars shows error from real-game reviews.

Having said all that, 3DMark is not perfect. I already pointed at a possible blind spot for memory-intensive games; here are a few more suggestions. Avoiding bias is a moving target: the tests are “developed with expert input from AMD, Intel, Microsoft, NVIDIA” — but that was before Ada, RDNA 3, and Arc. It’s also past due for an “Extreme” variant of all the ray tracing tests at 4K: you can make a custom run, but only the standard preset produces certified, searchable results, and 4K might also require higher-detail assets. Finally, the dedicated tests of upsampling systems are great, but I'd love an option to also enable any of the big three (DLSS 2/3, FSR 2, or XeSS) in all tests featuring ray tracing, with searchable results filterable by upsampling system and version.

Reviews and benchmarks based on actual games still have value, for example for readers who are focused on the specific games used by the reviewer, or on very similar games in some niche (e.g., car racing). These reviews are also useful for pointing out other performance factors such as CPU bottlenecks, ray tracing overhead, or frame pacing. Personally, I find these kinds of analyses more useful than the relentless scoring of dozens of games, which takes a lot of work but often reads like filler content.


Osvaldo Doederlein

Software engineer at Google. Husband, Father. Likes science fiction, gaming, PC hardware, tech in general.