Teraflops vs. Performance — The Difficulties of Comparing Videogame Consoles
Blast Processing is still around, I see.
It turns out the Console Wars of the 1990s never died, and once again a subset of enthusiastic fans from all sides is comparing arbitrary box stats to measure fun. I’m happy to see kids of all ages are still energised about evangelising their favourite games and machines, but as usual, the field is rife with misinformed marketing buzzwords and oversimplifications of complex concepts.
We’ve been going through this for generations, from our obsession with bits, megahertz, Blast Processing, polygon counts, all the way through to refresh rates, net tick rates (so God help me…), negative latency, and now, teraflops.
Each generation of console release has brought its own challenges when making comparisons. Before the Xbox One and PlayStation 4, we’d been comparing systems of entirely different architectures — PowerPC vs Intel x86, RISC chipsets, 68000, 6502 — as well as custom video cards, on-board co-processors, and hugely varying memory models. Some systems released in the same year were miles apart in terms of raw processing power, but in reality, comparisons of how they performed in real-world scenarios still weren’t so simple.
A quick look back at PS1 vs Saturn vs N64, Xbox vs GameCube vs PS2, Xbox 360 vs PS3, and Xbox One vs PS4 shows countless examples of cross-platform games that should have run faster on one particular system when looking at the specs alone, but the benchmarks proved otherwise. We’re now in an era where major consoles share the same underlying architecture, and some main system components use the same chips from the same manufacturer — some even the same model numbers — differing slightly in clock speeds, implementation, firmwares, and APIs. If we’ve not seen a slew of reliable and predictable FPS differences between previous machines so far, we’re certainly not going to suddenly see them moving forward. The gap is closing, not widening.
So, why is this? Why would a system that looks significantly more powerful than another on paper not yield a predictable result in real-world game benchmarks?
This is a very quick summary of the complications involved in judging game performance based on raw stats of its host machine, from the experience of a game engine programmer.
To preface, I take no side in any debate. I am a cross-platform developer whose job would just be easier if every console was the same machine with a different logo slapped on the box. This article is to help you judge for yourself what’s important when debating the horsepower of a videogame system, and why picking any one hardware statistic to use as a benchmark won’t give you the answers you need.
Who am I?
I’m an ex-Crytek, ex-Traveller’s Tales programmer and have been programming for around 15 years, with 24 titles to my name from the LEGO series, to Homefront: The Revolution, to ports of Goat Simulator, TimeSplitters 2, and Rust, plus a few indie titles, and some games not yet announced. I’ve also dabbled a lot with retro machines like the SEGA Master System, Mega Drive, Saturn, Dreamcast, Xbox, PS1, PS2, and recently the SNES and N64.
My main interests are in game engine architecture and optimisation, and I’ve worked with many mainstream engines including CryEngine, Unreal Engine 3 and 4, Unity, and other proprietary frameworks, all at the source level.
Which car would win a race: a Mitsubishi Evo 2.2L, or a Subaru Impreza 2.0L? Who knows. 2.2 is a bigger number than 2.0, but there are hundreds, if not thousands, of factors other than cylinder capacity that would affect the result. Each engine has many different systems driving and supporting that raw power, and all need to work together in harmony to deliver the right fuel mix, time sparks to the microsecond, and control air flow, all whilst managing temperature, torque distribution, traction control, and a whole list of minute details. Each race team will tune the car differently, constantly profiling and optimising based on conditions. The result of the race will also vary by driver, track layout, terrain type, weather conditions, and race rules.
A 10% difference in cylinder size isn’t going to result in a 10% difference in performance. Even if you bore out the cylinders in the same car, that alone isn’t enough to see a worthwhile improvement; you’d need to re-tune all of the supporting systems to properly handle the change, including re-flashing the ECU and adjusting the gear ratios and fuel-air mix. You would only be able to determine the final gain by getting the car back onto rollers and performing another test, and even then the gain won’t scale linearly.
So, which computer would give the highest FPS in a game: one with a 2.0 teraflop processor in its GPU, or one with a 2.2 teraflop processor in its GPU? Now we need to look at different clock speeds, bus bandwidth, cache sizes, register counts, main and video RAM sizes, access latencies, core counts, northbridge implementation, firmwares, drivers, operating systems, graphics APIs, shader languages, compiler optimisations, overall engine architecture, data formats, and hundreds, maybe thousands of other factors. Each game team will utilise and tune the engine differently, constantly profiling and optimising based on conditions.
A 10% increase in any one of those individual stats isn’t going to give you a 10% increase in performance. The machine as a whole would need improving in balance to see an overall benefit, and you would only be able to determine the final gain by profiling each game; even then it won’t scale linearly. If this was a consistent and predictable thing to calculate from raw box stats, we wouldn’t need Digital Foundry.
A side note worth mentioning is the comparison of consoles vs their equivalent PC hardware, and why this is an unfair method. Consoles are dedicated gaming machines, and every part of their specs, operating systems, APIs, and ecosystems is fine-tuned for that end goal. The parts of the system available to developers are for developers exclusively. They are able to optimise code to run as fast as possible on that very specific hardware revision, with no interruptions from other programs or the operating system, no concern for how the game interacts with other processes running in the background, and none of the other complications of running on a shared, multitasking OS.
Another flawed PC comparison is directly comparing the chipset of a console’s GPU with the equivalent standalone graphics card. Even if model numbers match up, these chips are soldered directly to the mainboard, run at different clock speeds, and are paired with different RAM specs and bus types, as well as different drivers and graphics APIs to communicate with them. Every one of these factors affects their performance.
So, teraflops, then
Aren’t processor speeds measured in gigahertz? Where do teraflops come into all of this? For the uninitiated, a teraflop is one trillion floating-point operations per second, and it typically describes the raw mathematical throughput of the GPU, ignoring non-maths instructions and operations like branching. This is an important statistic in computing architecture dating back to when machines were small enough to measure in plain flops, and it’s a key selling point when choosing the right tools for many scientific computing applications such as supercomputing, algorithmic work on large data sets, CUDA, folding, and bitcoin mining, to name a few.
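As a back-of-envelope illustration, the headline teraflops figure is usually derived from shader core count, clock speed, and the fact that a fused multiply-add counts as two floating-point operations. The core count and clock below are made-up numbers for the sketch, not any particular console’s:

```python
def theoretical_tflops(shader_cores: int, clock_ghz: float, ops_per_cycle: int = 2) -> float:
    # ops_per_cycle = 2 because a fused multiply-add (FMA) counts as
    # two floating-point operations per core, per clock cycle
    flops_per_second = shader_cores * clock_ghz * 1e9 * ops_per_cycle
    return flops_per_second / 1e12  # convert to teraflops

# Hypothetical GPU: 2,304 shader cores at 1.825 GHz
print(round(theoretical_tflops(2304, 1.825), 2))  # 8.41
```

Note that this is a theoretical peak: it assumes every core issues an FMA every single cycle, which a real game workload never sustains.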
For games programming, it still matters, but its application is a bit different. A GPU with a high teraflops value is optimal at ripping through raw numbers, if that’s its sole job. The system feeding it the data needs to be doing so in a way that allows it to get on with the job uninterrupted, but a game will continually stop and start the work, pausing to change the working parameters, re-upload buffers, fetch results, and invalidate caches, and will do so at an incredibly high frequency, again and again, 60 times per second. This isn’t an optimal way to make use of its raw throughput — but that’s to be expected, and the GPU manufacturers know what they’re doing — so other, gaming-specialised parts of GPU technology exist to try and make up for the wild and unpredictable behaviour of game rendering work.
The GPU Pipeline
A gaming GPU has a pipeline; the work to be performed begins life as raw mesh data, textures, lighting information, and other input buffers, and in several stages that data gets transformed into the real colour output for each individual pixel, through a series of shaders. This involves a lot of fetching data from video memory at extremely high frequency — texture sampling, table lookups, instancing buffers — to name a few. At 4K resolution, one full-screen, uninterrupted draw means a shader is run 8.3 million times, each of these requiring several data lookups from either video memory, cache, local registers, or combinations of all. This work, at least, can be parallelised, so core counts and number of compute units come into play here, to an extent.
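That 8.3 million figure is simply the 4K pixel count. A sketch of the arithmetic, where the lookups-per-pixel figure is an illustrative assumption rather than a measured one:

```python
width, height = 3840, 2160        # 4K UHD resolution
pixels = width * height           # shader invocations for one full-screen pass
lookups_per_pixel = 4             # hypothetical: e.g. albedo, normal, lightmap, shadow samples

print(pixels)                     # 8294400, i.e. ~8.3 million
print(pixels * lookups_per_pixel) # ~33 million data fetches for a single pass
```

And that is one pass; a real frame runs many passes, each with its own shaders and inputs.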
For each of those 8.3 million slices of computation, in each individual part of this pipeline, those teraflops are going to play a key role if, and only if, the shaders and input data are written in a manner optimised for raw mathematical throughput. In reality, the rest of the engine needs to be architected in a way that makes such an optimisation practical in this context, but maintaining a constant and consistent stream of data to the GPU in exactly the right layout is generally not feasible.
Instead, video memory access speed — and any functionality present to mitigate memory access, like local cache, register count, and compiler optimisation — starts to become very important for the type of data and computational bursts used in game rendering. There is a lot of data moving around, and the management of that memory is costly. Smaller quantities of memory also result in a juggling operation: trying to keep the relevant parts of data in video RAM, with anything not needed right now either back in main RAM, cached to disk, or dumped entirely.
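That juggling act can be sketched as a simple residency scheme: when video RAM fills up, the least-recently-used asset gets evicted back to main RAM or disk. This is a toy model for illustration only, not how any specific driver or engine actually behaves:

```python
from collections import OrderedDict

class VideoMemory:
    """Toy VRAM residency model: least-recently-used assets are evicted first."""
    def __init__(self, capacity_mb: int):
        self.capacity = capacity_mb
        self.resident = OrderedDict()  # asset name -> size in MB

    def touch(self, asset: str, size_mb: int):
        # Using an asset marks it most-recently-used
        if asset in self.resident:
            self.resident.move_to_end(asset)
            return []
        evicted = []
        while sum(self.resident.values()) + size_mb > self.capacity:
            name, _ = self.resident.popitem(last=False)  # evict the LRU asset
            evicted.append(name)  # it would be paged back to main RAM or disk
        self.resident[asset] = size_mb
        return evicted

vram = VideoMemory(capacity_mb=100)
vram.touch("level_textures", 60)
vram.touch("character_textures", 30)
print(vram.touch("cutscene_textures", 50))  # ['level_textures'] gets evicted
```

Every eviction and re-upload here is memory traffic that has nothing to do with raw compute throughput, yet the frame can’t proceed without it.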
Low latency memory movement versus raw computation speed is a delicate balance.
How do programmers measure performance?
To look for bottlenecks, we need to see the whole picture. All first party platforms, and all major engines, come with a suite of profiling tools to look at whole game frames and the activity of the various chips involved, and show a thorough breakdown of where the game is choking.
From this data we can see a day in the life of a game frame, and the amount of time taken up by each individual operation. Here’s an example of a single frame, from Unity’s Mega City demo, running through its profiler:
Not shown are the parallel operations running on other cores. Here’s a very truncated, high level view of events, vaguely in order:
Collect gamepad input, reset UI, update streaming system, update texture streaming, update all game scripts, tick physics simulation, process collision, process raycasts, check results of threaded jobs, update object positions, check bounding box/trigger box collisions, update sound effects, update streaming audio, update light positions, tick animation system, compute HD lighting intensities, tick cinematics timeline, update constraints, sort objects into render batches, update sound channels, update UI layout, C# garbage collection, update lighting probes, clear buffers for rendering, update transform hierarchies, cull visible objects from camera, cull visible lights, cull visible shadows, update post effect volumes, copy global rendering params, and finally preparing all GPU data: light data, depth prepass data, opaque buffer data, HDR data, lightmap data, reflection data, SSAO data, shadow data, deferred pass data, exposure data, motion blur data, bloom data, procedural drawing, and final render target blitting command.
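To put the list above in perspective, all of that CPU-side work has to fit inside a fixed frame budget before the GPU even gets fed. A toy budget check, with entirely made-up stage timings rather than figures from any real profile:

```python
# Hypothetical per-stage CPU timings in milliseconds; illustrative only
frame_ms = {
    "input & scripts": 2.6,
    "physics & collision": 3.0,
    "animation & audio": 2.0,
    "culling & batching": 2.2,
    "GPU data preparation": 4.0,
    "misc (GC, streaming, UI)": 1.5,
}
total = sum(frame_ms.values())
budget = 1000.0 / 60.0  # 16.67 ms per frame at 60 FPS

print(f"{total:.1f} ms used of a {budget:.2f} ms budget")
```

Blow past that budget in any single stage and the whole frame is late, regardless of how many teraflops are waiting downstream.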
By comparison with other games and engines, this is a relatively simple game frame. Other profiles will fare differently, and I would expect many to saturate more of the other CPU cores with, for example, physics and network ops, but they would still have enough non-rendering work scheduled to prevent a constant and consistent stream of data being fed to the GPU.
Using another tool, RenderDoc, we can dig into the GPU side of this frame in detail. A RenderDoc capture allows us to view each individual stage of the rendering process, the state of the GPU at any time, the output at each stage, and the raw mesh, texture, shader, and other buffer inputs for each model, material, and rendering pass. It also shows a neat timeline of events:
A typical GPU frame, here: a series of over 23,000 operations. Each operation introduces a change to the GPU’s state and input parameters, and an interruption to its processing, which is one of the key complications when trying to achieve saturation of its raw processing power.
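Because every state change is an interruption, engines sort their draw calls to minimise them. A toy illustration, with invented materials and counts, of why draw order matters:

```python
def state_changes(draw_order):
    # Count how many times the GPU must switch material state,
    # including the initial bind
    changes = 1
    for prev, cur in zip(draw_order, draw_order[1:]):
        if cur != prev:
            changes += 1
    return changes

draws = ["metal", "wood", "metal", "glass", "wood", "metal"]
print(state_changes(draws))          # 6: every draw switches state
print(state_changes(sorted(draws)))  # 3: one bind per material
```

Real engines sort on far more than material (shader, textures, depth, render target), but the principle is the same: fewer interruptions, better GPU saturation.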
A game involves significantly more work than just running a constant mathematical equation on a GPU and displaying the output. A game frame as a whole is complicated, and each individual micro-operation and the hardware components involved could be a bottleneck at any time.
There is a tremendous cost in preparing the work the GPU needs before it can run, and if the GPU sits idle at any point during that time, even for a couple of milliseconds, it’s bored and its power is temporarily useless. If a GPU’s power is increased beyond the rest of the system’s ability to feed it data, it doesn’t matter whether that’s by 1 teraflop or 500 teraflops: it won’t result in a single FPS difference. It’s a careful balance.
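In a deliberately crude pipelined model where CPU and GPU work overlap, the slower side gates the frame rate, which is exactly why extra GPU power past the CPU’s feed rate buys nothing:

```python
def fps(cpu_ms: float, gpu_ms: float) -> float:
    # With CPU and GPU frames overlapped, throughput is gated by
    # whichever side takes longer per frame
    return 1000.0 / max(cpu_ms, gpu_ms)

print(fps(16.0, 12.0))  # 62.5 FPS, CPU-bound
print(fps(16.0, 6.0))   # still 62.5 FPS: doubling GPU speed changed nothing
```

Real systems have more stages and more subtle stalls than this two-stage model, but the bottleneck behaviour is the same.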
The problem with varying specs
I mentioned that the gap between systems is closing with each generation, and from a cross-platform developer’s point of view, that’s a good thing. Optimisation is incredibly time consuming, and needs doing for each system in turn. If one system is of an entirely different architecture or contains helpful hardware that another doesn’t, whole sections of optimisation may be specialised and impossible to port across to another machine. For a first party developer, this is a non-issue, but it also throws comparison out of the window — there is no way to compare one highly optimised first party title running on System A with a completely different highly optimised first party title running on System B.
We saw this with the SPUs of the PlayStation 3. Making effective use of those extra chips required reworking entire engine subsystems, effectively branching off whole chunks of the renderer and parallel job systems into specialised versions. It was a maintenance headache.
We also see this not just with raw processing power, but also peripherals. The Wii U was an interesting machine, and its first party titles made exceptional use of its touch screen controller, but it was common to see cross-platform titles phoning it in with a quick mini map. Maintaining a specialised version of the game suited to just one machine usually isn’t worth the time or effort of the team needed to design, maintain, QA, and bug fix those features.
Game performance has bamboozled gamers, developers, and hardware designers, and it will continue to do so, because a gaming system is greater than the sum of its parts, and there are more factors external to the hardware that affect a game’s smooth running than there are internal. No one statistic is a measure of a console’s power; there are too many variables, and no single calculation to produce a result. It varies per game, per engine, per firmware, per development team, and per patch, and it always will.
Take a leaf from Digital Foundry’s book — their methods of diagnosis really hold all the performance answers here. Fair and thorough benchmarks run on real field tests of real games, side by side. If you wanted to know which particular games run best on which particular consoles and setups, this is how you’d go about it, but as always, performance is only one piece of the puzzle. There are plenty of other factors to consider when choosing a console.
Put your teraflop away. And wash your hands.