Benchmarking the GPU Benchmarks

Osvaldo Doederlein
10 min read · Jan 8, 2023


In my benchmark & overclocking pieces on the Radeon 7900 XTX, I used the built-in benchmarks of several popular games, and I tested every game at its absolute maximum settings.

Now I’m benchmarking the same games with “potato” settings — the lowest configurations possible. I want to investigate the scalability of game engines and the implications for playing games and for testing GPUs.

I also want to investigate the quality of these built-in benchmarks, in light of recent debates about manual vs. automated execution and benchmark selection. See for example this meta-review: TechPowerUp scored the 7900 XTX only 16% below the 4080 in RT, but most other outlets said 25%. All of them are “right”; they just tested different games, or tested some games with different benchmarks.

Results

Graphics settings have a wide range of impact across the 11 games in my suite, so my relative scores — the improvement in framerate from top to potato settings — are all over the place. I will comment on the results and also on the quality of each game’s built-in benchmark.
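
To make the arithmetic explicit: each relative score is just the framerate gain going from maximum settings to potato settings, expressed as a percentage. A minimal sketch (the function is mine, just for illustration; the fps pairs are taken from the results below):

```python
# Relative score: framerate improvement going from maximum ("top") settings
# to minimum ("potato") settings, as a percentage.

def scaling_pct(fps_max_settings: float, fps_potato_settings: float) -> float:
    """Improvement from top settings to potato settings, in percent."""
    return (fps_potato_settings / fps_max_settings - 1) * 100

print(f"Callisto Protocol: +{scaling_pct(72, 193):.0f}%")      # -> +168%
print(f"Batman Arkham Knight: +{scaling_pct(242, 281):.0f}%")  # -> +16%
```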

Callisto Protocol
This is a severely CPU/engine-limited game, but that problem is mostly tied to its heavy use of ray tracing. Turn RT off and Callisto scales well, shooting from 72fps to 193fps, a healthy +168%.

The benchmark is good, with a fair representation of the game’s graphics and animation. It’s missing combat, but combat is not very heavy visually in Callisto, where the horror and ambiance get by with modest effects. Well, except that you can sometimes kill enemies (or be killed) in spectacularly gory ways that throw blood around like a farm’s irrigation sprinkler; I wish the benchmark had some of that.

In a unique problem, Callisto’s benchmark uses a locked aspect ratio, so it won’t correctly represent any resolution that isn’t 16:9. To me, this is a bug.

Update: I later found that Callisto’s built-in benchmark also suffers from extremely bad frame pacing, which doesn’t happen at anywhere near the same scale during normal gameplay. It’s odd that the benchmark misrepresents the game’s performance this badly, even if only in frame-time consistency.

The ultrawide view of this scene was not allowed by sanitary authorities.

Shadow of the Tomb Raider
SOTR runs well enough at 110fps with max settings, including its entry-level RT, and it scales well with faster GPUs. It also scales normally with lowered settings: +112%, to 233fps.

The benchmark is reasonable. No action, but this is a game where you mostly humiliate every IRL alpinist and shoot arrows. The benchmark could feature some of the hydraulic contraptions in the tombs, but I guess those large moving structures aren’t much harder to render with the game’s largely static, baked-in illumination (even with RT). I suppose the built-in benchmark is designed not to spoil any cool discovery.

Quake II RTX
A hard case for this particular study, since the “potato setting” here is the original OpenGL renderer from 1997. That game engine is now old enough to get so drunk it would pay $2,000 for a GPU. The difference between the worst presentation the original engine can produce and the best the RTX remaster can is an astronomical 6,554%: 4,163fps compared to 62fps.

The classic idTech timedemo feature was built for simpler times. It renders a fixed sequence of frames (exactly 687 for “demo1”), which is ideal for reproducibility: no noise from frame pacing. But at the lowest settings, the test finishes in 0.2s on the 7900 XTX! That causes a large error margin in measurement — a 10% difference from worst to best score across many runs. With that caveat, it’s a good benchmark that represents game action well.
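
The reason for that error margin: a fixed amount of timing noise (timer resolution, driver and OS overhead) is a big slice of a 0.2s run but negligible on a long one. A minimal sketch of the effect, assuming a hypothetical ±5ms of jitter per run, with the 687-frame count and the fps figures from above:

```python
# Why a 0.2s timedemo run has a large error margin: the same small timing
# jitter is a big fraction of a short run and negligible in a long one.
# The +/-5ms jitter below is an assumption for illustration, not a measurement.

FRAMES = 687  # frames rendered by Quake II's "demo1"

def fps_range(true_fps: float, jitter_s: float = 0.005) -> tuple[float, float]:
    """Lowest/highest fps reported if the measured time is off by +/- jitter_s."""
    true_time = FRAMES / true_fps
    return FRAMES / (true_time + jitter_s), FRAMES / (true_time - jitter_s)

lo, hi = fps_range(4163)  # potato settings: the whole demo lasts ~0.165s
print(f"short run: {lo:.0f}..{hi:.0f} fps, ~{100 * (hi - lo) / 4163:.0f}% spread")

lo, hi = fps_range(62)    # max RT settings: the demo lasts ~11s
print(f"long run:  {lo:.2f}..{hi:.2f} fps, ~{100 * (hi - lo) / 62:.2f}% spread")
```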

The lowest and highest settings are essentially distinct games. But since remasters/remakes of old classics are increasingly common, I find Quake II RTX a useful window into the generational jump in rendering complexity and performance, even if a bit extreme due to the age of the original game and the radical path-traced remastering.

Benchmarking the menu screen. Yes, 7,300 FPS even if just in the menu is totally worth 300W and 1,700rpm.

Guardians of the Galaxy
This heavy game exactly doubles performance, from 78fps to 156fps; another title that scales very well with rendering settings.

The benchmark, unfortunately, is poor, with zero representation of combat or of this game’s other dynamic, hectic visuals. This may not make a big difference in scores: I tested playing an old savepoint from a heavy fight near the endgame, and performance ranged from the low 70s to 100fps, averaging even a bit higher than the benchmark. But this random, manual check is not a replacement for a proper benchmark.

Horizon Zero Dawn
A modest range here: Horizon moves only +32%, from 157fps at the best possible settings to 208fps at the worst. This is a pure-raster game, so the range is not unreasonably short, but it is short nonetheless.

The benchmark is very poor, truly among the worst: it looks like one of those real estate websites with drone-filmed property fly-throughs, a long and tedious sightseeing tour of imposing scenery populated by NPCs busily doing nothing. It doesn’t show any of the machines that are the focus of most of the action; it doesn’t even show the protagonist, ubiquitous in this third-person game.

Check this out! The elevators work. Very secure. HOA fee is just 950 shards a month.

DOOM 3
This is a real “old game” (there is a remaster, but I don’t use it in this series). No RT again. The graphics quality settings make no difference: performance from worst to best presentation is identical within the margin of error.

This is not unexpected. First, the game is very single-threaded, engine-limited to about 400fps on my system. Second, the rendering work is a walk in the park for any current hardware; options like 2004-era shadows, bump maps, or AA are a rounding error in GPU effort. That’s a pity, because the timedemo is a pretty good benchmark with lots of action, and unlike Quake II RTX’s, it isn’t accelerated at high framerates.

Dying Light 2
Back to a modern game: here we move a fantastic +274%, from the hardest ray-traced setting at 53fps to the crudest raster at 194fps.

The benchmark seems only passable: there are some scenes with fights against zombies, but at a distance — I don’t understand why a first-person game is benchmarked without any first-person perspective. There is also none of the running/parkour action that’s the main reason I bought this game.

Batman Arkham Knight
This final chapter of the Arkham series moves from 242fps to 281fps, only +16% going from the highest to the lowest settings. That’s a really poor range, even without RT. It’s close to DOOM 3, except that Arkham is a 2015 game with DirectX 11 graphics that were state-of-the-art at the time and are still a feast today. We can blame the limited scaling in part on the CPU/engine constraints of most games of that age or older, and in part on the fact that a pure-raster game is just too easy for any current-generation high-end GPU.

Another built-in benchmark that could be a lot better: it shows many of the game’s elements, but not the game’s perspective or any significant action.

The bat-scores on top are not FPS, they’re how many squats Bruce does at 4am.

Red Dead Redemption 2
Modern but CPU-limited, RDR2 scales a modest +39%: 121fps to 169fps. One could argue that’s OK because there’s no RT on the max-settings side, but this game features a very sophisticated presentation; I would go as far as calling it the swan song of rasterization — I don’t see how a game can look much better than this without ray tracing. With that in mind, I consider its settings-based scaling poor.

The built-in benchmark is excellent: it shows a good variety of levels, weather systems, and good ol’ Western gunfire. It’s a pity that some big reviewers prefer ad-hoc benchmarks for RDR2 because the built-in one is “too long”. Red Dead is a lesson in how to make a game benchmark.

Wolfenstein: Youngblood
A reminder that my max settings for this game don’t include ray tracing on Radeon (the game only supports NVIDIA’s early RT extensions). The always-raster engine scales only +25%, from 343fps to 427fps, for much the same reasons as Batman Arkham Knight.

There are two benchmarks, each a simple flythrough of one level. This co-op game combines a first-person perspective (always carrying a massive gun) with a third-person view of your companion, yet neither shows up in the benchmarks. Also MIA is the fun-as-hell combat against hilarious dieselpunk war machines. It’s like a PG-13 trailer for an R-rated movie.

Hitman 3
As many reviewers observed, Hitman was updated (or burdened) with a ray tracing implementation that adds shadows and reflections at a heavy cost. This explains why the game jumps from 52fps to 324fps, a massive +525%.

The benchmark is a flythrough of several spectacular-looking levels, full of oligarchs gathering and spending their money. But that’s alright, no? This is a game where you mostly walk calmly into those parties and shiv the Sheikh when nobody’s looking.

Birthday party setup when your kid loves Minecraft and you are filthy rich.
But your luck won’t last: Agent 47 (wearing a wig) approaches, unsuspected.

Scaling: Rendering vs. Upscaling

Low rendering settings helped back in the day, but today we have advanced upscaling technology, right? Even before upgrading from my last-gen GPU, for games that needed some tuning and supported FSR 2, I always tried to use only that and keep all other settings on High/Ultra.

This logic makes sense when targeting high resolutions (minimum 1440p) and when a limited use of upscaling (“Performance” mode) is sufficient to deliver the desired framerate. More aggressive upscaling, or any of it at lower resolutions, has a much more noticeable impact on image quality, and many gamers are still well served by fine-grained rendering settings that make a difference. Not everyone cares about everything; if you hate motion blur or lens flares, or you are completely unimpressed by high-quality hair rendering, disabling such features is a 0% loss of subjective image quality. Smart upscaling is great, but it’s never zero-loss great.

Conclusions

Game engine scaling is an excellent feature for gamers, allowing those of us with older or entry-level GPUs (and consoles) to enjoy the latest games, even if not in their full glory. And considering how much time I still put into the 1993 DOOM, a great game can be massive fun without the ultimate bling. Scaling is also useful for, say, competitive esports gamers willing to sacrifice a lot of visual quality in return for the highest possible framerates.

Rendering settings have widely different impacts on performance, from virtually nothing (e.g. antialiasing methods) to huge (badly-optimized RT). Digital Foundry’s optimized settings are a great resource for users of midrange or low-end GPUs, and sometimes even for high-end ones.

Many games in my list feature ray tracing as an option, greatly expanding the range from the lowest to the highest settings. As we enter year 4 RTE (Ray Tracing Era), the growing pains of this technology are still present. Big AAA titles are multi-year projects, so even large franchises like Hitman are sometimes still taking their first steps with RT, often delivering sub-optimal results. Hopefully that factor will disappear over the next few years, with more game engines written from the ground up for RT instead of shoehorning the feature onto a legacy architecture.

For GPU benchmarking, these results also show that it still makes sense to report separate Raster and RT scores for each game that features both. I didn’t do finer-grained testing to show it, but comparing even the highest Raster settings to the lowest RT should often result in a large performance gap: on early or entry-level RT-capable GPUs because ray tracing is too hard for them, and on the latest halo products because rasterization is too easy for them and sometimes becomes CPU/engine-limited.

I tried to judge the quality of each built-in benchmark by how well it represents the visuals of the game, and some are much better than others by that criterion. But as I commented for Guardians of the Galaxy, that’s not necessarily a problem for producing a score that’s true to the performance of actual gameplay — you mostly need a higher degree of trust in the developers who chose what to include in the test. I would still urge studios to ship built-in benchmarks that use the same perspective and behaviors as the actual game and sample all its main activities.

In the debate about methodology, I land strongly in favor of standard benchmarks for everything: it’s the only way to have reasonably equivalent results across reviewers, or across first-party and independent benchmarks. Most of my tests have scores within 1% of the average over multiple runs. In the absence of a minimally passable built-in benchmark, automation should be used; I am 100% with Dr. Cutress. I have heard counter-arguments from other reviewers based on the difficulty of keeping automation scripts working across game patches, but those excuses would not be taken seriously in any other domain of computer technology.
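
For reference, a minimal sketch of what “within 1% of the average” means in practice; the per-run fps values below are made up for illustration:

```python
# Repeatability check: the worst deviation of any run from the mean of all
# runs, relative to that mean. The fps values are hypothetical examples.

def max_deviation_pct(scores: list[float]) -> float:
    """Largest deviation of any run from the mean, as a percentage."""
    mean = sum(scores) / len(scores)
    return max(abs(s - mean) for s in scores) / mean * 100

runs = [121.3, 120.8, 121.9, 121.1, 120.6]  # average fps of five runs
print(f"mean = {sum(runs) / len(runs):.1f} fps, "
      f"worst deviation = {max_deviation_pct(runs):.2f}%")  # well under 1%
```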
