The Highs & Lows of Frame Counting

Osvaldo Doederlein
9 min read · Feb 10, 2024


I recently came across Duoae’s rants (Part 1, Part 2) about FPS reporting. Another poor fellow who, like me, has a background in data science and gets annoyed by the methodology of gaming tech channels. I share that feeling but don’t envy pro reviewers: summarizing that much data into easy visualizations is hard. If you need to compare 25 benchmarks in one nice, readable chart, there’s little recourse but reducing each test to a tiny set of numbers. But maybe we can make progress within those constraints.

Benchmarks often include the Average FPS, the primary indicator of overall performance, and the 1%-Lows to highlight worst-case behavior that would be hidden by the average. But both metrics have limitations.

Principling Frametimes

Let me start my takes with some sample data. (Please zoom images!)

(I’m sorry for the non-zero-based vertical axes in some charts, usually a major sin; here I’m overlaying data lines with different domains in the same chart for easy comparison, where only the shape of each line matters.)

That’s Talos Principle 2, 4K max settings, FSR3/Native + Frame Generation, Ryzen 7900X & Radeon 7900 XTX. On display: raw frametimes (940 samples in 10s) and their 100ms-quantized average of 10.58ms; the 100ms- and 1s-quantized FPS over time; and the 100ms-quantized FPS Average of 94fps. Bear with me, this quantization business will be explained soon.

Only the frametimes are raw data; everything else needs some discretion, like the choice of smoothing or quantization factor. And this chart takes a lot of space; can we just say the game runs at 94fps with 1%-Low of 87fps?

That would be an easy question if performance were perfectly smooth and stable, but let’s dig into that stuttering event at ~3.5s. The raw frametimes:

…, 10.85, 10.88, 11.10, 14.11, 8.23, 14.53, 8.54, 11.65, 11.39, 11.63, …

Two frames jump to 14ms; we also have two dips to 8ms. This “zig-zag” appears because our raw GPU telemetry does not contain frame rendering times: those numbers are deltas between frame presentation times. With some effort in analysis, we have a more complete picture.

Blue bars are raw frame (presentation) times. Red bars show the error, or delay, between each frame’s relative presentation time and a deadline or target, presumed to be 10.8ms (the average of the last samples before the stutter). The game engine’s pacing logic is trying to show a new frame every 10.8ms. But one frame takes 14.11ms to present(*), blowing its deadline by 3.73ms. Because I’m using FSR3/FG the next frame is fast (interpolated), but the one after that is rendered and takes 14.53ms to present, 3.29ms delayed.

(*) This game is making 98fps with FG, so pre-FG performance should be 49fps with rendering time near 20ms: that stutter might be a 24ms frame. Or a little higher if the frame pacer adds a tiny buffer to smooth normal variability.
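To make the delay computation concrete, here is a minimal sketch in Python using the sample frametimes quoted above and the presumed 10.8ms target; the exact figures depend on where the schedule is anchored, so this only approximately reproduces the red bars.

```python
# Accumulate presentation times and compare each against a fixed pacing
# schedule of one frame every 10.8 ms (the presumed pre-stutter target).
from itertools import accumulate

frametimes = [10.85, 10.88, 11.10, 14.11, 8.23, 14.53, 8.54, 11.65, 11.39, 11.63]
TARGET_MS = 10.8

present_times = list(accumulate(frametimes))                    # absolute presentation times
deadlines = [TARGET_MS * (i + 1) for i in range(len(frametimes))]

for ft, t, d in zip(frametimes, present_times, deadlines):
    print(f"frametime {ft:6.2f} ms  delay {t - d:+6.2f} ms")    # positive = late
```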

The first important finding here is that there aren’t real “frametime dips” (except with FG: interpolated frames are always much faster than rendered ones). The two “fast” frames that follow each stutter, with times of 8.23ms and 8.54ms respectively, are presented very close to their deadlines; in fact they have small delays of 1.16ms and 2.63ms. If any frame completes much faster than expected, the pacer simply waits longer before presenting it. However, every frame beyond that stutter will exceed the old deadline: as you can see in the first chart, the game stabilizes at a higher average frametime.

Frame pacing is not static; it should adjust its presentation deadlines to keep animation smooth, adapting to rendering performance. When the frame pacer observes that 14.11ms peak, it bumps the target to 11.6ms(*). This produces the yellow bars: we can still see two peaks, but the second one causes a smaller delay against the slightly higher deadline, and all frames after the stutter land exactly at their new ideal presentation targets.

(*) I cannot inspect or debug the game so I am deducing both pacing targets, simply picking numbers that make the presentation times perfectly spaced in each section of the data (before and after the stuttering event).
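And a variation for the yellow bars, where the schedule spacing steps up after the pacer reacts. Both targets are the deduced 10.8ms and 11.6ms; exactly when the new target kicks in is my assumption.

```python
# Same delay computation, but the pacing target changes to 11.6 ms for the
# frames that follow the observed 14.11 ms peak.
frametimes = [10.85, 10.88, 11.10, 14.11, 8.23, 14.53, 8.54, 11.65, 11.39, 11.63]
targets = [10.8] * 4 + [11.6] * 6   # old target through the first peak, new target after

present = deadline = 0.0
for ft, tgt in zip(frametimes, targets):
    present += ft
    deadline += tgt
    print(f"frametime {ft:6.2f} ms  delay {present - deadline:+6.2f} ms")
```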

From the gamer’s subjective point of view, this is what happens:

  1. There are two frames with abnormal time, both ~3.5ms above average. This can be perceived as a 2-frame stutter (or “2-ish” because of FG).
  2. All other frames are perfectly smooth, but after the stutter the game’s average framerate drops from 92fps (10.8ms pacing) to 86fps (11.6ms). That’s a 7% drop, but it’s not too jarring and the new framerate is still very good, so it’s not likely to be noticed as a separate, additional problem.

Some readers will notice that the analysis above is reminiscent of derivatives: we are looking at the difference between each value and a baseline, and with dynamic frame pacing the baseline is roughly the previous value (or a moving average).
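For the code-minded, a tiny sketch of that view (my formulation, not anything pulled from the capture tooling): each frame’s deviation from a moving average of the preceding frametimes, a rough stand-in for a dynamic pacing baseline.

```python
def deviations(frametimes_ms, window=8):
    # Difference between each frametime and the average of the previous
    # `window` frames; the window size is an arbitrary choice here.
    out = []
    for i, ft in enumerate(frametimes_ms):
        prev = frametimes_ms[max(0, i - window):i]
        baseline = sum(prev) / len(prev) if prev else ft
        out.append(ft - baseline)
    return out
```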

Theorizing Quantums

Let’s now consider the 1%-Lows metric. In our benchmark the average is 94fps; that was calculated specifically from the 100ms-quantized FPS data. The 1%-Low is 87.49fps, also with the same quantization. What’s that all about? Suppose we have samples like these, in milliseconds:

[8, 11, 8], [6, 9, 5, 10], [9, 11, 10], [14, 13]

There are different ways to prepare this data for metrics like averages and percentiles. How can you calculate the average?

  1. Use raw data directly: SUM(8 + 11 + … + 13) / 12 = 9.5ms.
  2. Use quantized data: for example, let’s quantize by 30ms windows. Our samples above are already grouped that way, with the first bracket covering [0ms–30ms), the second [30ms–60ms), etc. First we take the average of each bracket, then we average those per-bracket averages. Final average = 10ms!

Option 1 is simpler, but wrong. Well-performing sections of the test will have more samples per unit of time than slower sections, so smaller values are over-represented in any metric that gives the same weight to every sample.
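A minimal sketch of both calculations, using the toy samples above and the 30ms window from the example (each frame is assigned to the window in which it completes, which matches the grouping shown):

```python
samples = [8, 11, 8, 6, 9, 5, 10, 9, 11, 10, 14, 13]   # frametimes in ms
WINDOW_MS = 30

# Option 1: raw average, one vote per frame (over-weights fast sections).
raw_avg = sum(samples) / len(samples)                   # 9.5 ms

# Option 2: quantized average, one vote per time window.
buckets, elapsed = {}, 0
for ft in samples:
    elapsed += ft                                       # frame completion time
    buckets.setdefault(elapsed // WINDOW_MS, []).append(ft)
bracket_avgs = [sum(b) / len(b) for b in buckets.values()]
quantized_avg = sum(bracket_avgs) / len(bracket_avgs)   # 10.0 ms

print(raw_avg, quantized_avg)
```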

For performance data such as frametimes, quantization is the only choice. You can’t compute an average, percentile, minimum, or anything else directly off the raw samples. Moving averages are not a solution either: each output would depend on a variable number of inputs, again biased toward fast-performing sections. You can use moving averages, as long as you calculate them from quantized values.

The same goes for percentiles, such as the popular 1%-Low. You should always start from quantized data: exactly one value per unit of time. Your spreadsheet or code should produce a value of zero for any time bracket that has no samples at all, e.g. because of a severe stutter.
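To make that concrete, here is a sketch of one plausible pipeline; it is built on my assumptions (each window’s FPS as the inverse of the mean frametime of the frames completing in it, zero for empty windows, the 1%-Low read off the sorted quantized series), not necessarily the exact method behind the numbers above.

```python
import math

def quantized_fps(frametimes_ms, window_ms=100.0):
    # One FPS value per time window; 0 fps for windows with no frames at all,
    # so a severe stutter still drags the percentile down.
    total = sum(frametimes_ms)
    n_windows = max(math.ceil(total / window_ms), 1)
    windows = [[] for _ in range(n_windows)]
    elapsed = 0.0
    for ft in frametimes_ms:
        elapsed += ft
        windows[min(int(elapsed // window_ms), n_windows - 1)].append(ft)
    return [1000.0 * len(w) / sum(w) if w else 0.0 for w in windows]

def one_percent_low(fps_series):
    # One common definition: the value at the 1st percentile of the series.
    # (Some reviewers average the slowest 1% of values instead.)
    ordered = sorted(fps_series)
    idx = max(math.ceil(0.01 * len(ordered)) - 1, 0)
    return ordered[idx]
```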

Surviving Real Stutters

Speaking of 1%-Low, for those readers still with me, let’s go back to game benchmarking. The single stutter event in our game is short, and small when we interpret frame presentation times correctly. I didn’t perceive any stuttering while running that test. But that’s about to change now.

This new capture is from (what else?) Jedi Survivor. Test: 4K Epic including RT, FSR2/Quality. I booted the game, loaded a save in Koboh, then started playing, with the outcome captured above. To be fair, this is a worst-case “cold run”: an empty shader cache except for what’s compiled at startup, exiting the Pyloon’s Saloon, some jumps with the grappling hook, fast camera movements. But many games survive this kind of test much better.

The chart doesn’t need much explanation, especially the frametime line. The 100ms-quantized FPS line is also a big mess. Notice I’m using v-sync, so I expect some kind of frame pacing to happen even though the game never gets close to my display’s 144Hz. There’s no Frame Generation this time, yet we still see frametimes jumping up and down: the average is 13.70ms, but raw samples go as low as 8.03ms and as high as 128.25ms.

The proof of the pudding is subjective experience: when I executed this test, it was Bad. Dark Side of the Force Bad. But the 1s-quantized FPS line doesn’t look too bad; if I had smoothed it into a curve, it would only show the framerate falling from ~100fps to ~50fps, without any abrupt change. Empirically, 100ms is a much better quantization factor than 1s: humans are very sensitive to events at the 100ms scale and even smaller.

Framerate: Stay Smooth

We should now validate our findings with a game that doesn’t show stuttering. I picked Dying Light 2, which is not free of that problem if you push it, but very stable for a heavy ray-traced game. Test: 4K Max settings, native rendering; a bit too hard on my GPU, but intentional, to get higher frametimes than the other tests. I also captured a 4X longer test of 40s; still, the number of samples is significantly smaller than in the other two tests. All those changes are meant to avoid the quantization factors that the other tests share (number of samples per time window / per percentile). The game is heavy on platforming; I tested it by running and jumping a lot.

All performance lines show constant, significant variation as I run into new areas or turn the camera toward views that are harder or easier to render, but there’s never a noticeable frametime peak.

The bar chart above is similar to what you see in many game-tech channels, starting with the Average and ending with the 1%-Low for each game. But the extra bars in the middle show the 100ms- and 1s-Minimum FPS values too. In Talos Principle 2 and Dying Light 2 all values are very close; the 1%-Low ties the 100ms-Min (it’s even higher in DL2) and is only 3% below the 1s-Min. But in my Jedi Survivor torture test, both Minimums are abysmal and the 1%-Low is a whopping 44% lower than the 1s-Min.

Some reviewers use bar plots with error bars, which is great, but I think the 1s-Min is also very useful as a point of reference for understanding the 1%-Low. A 1%-Low that’s very close to the 1s-Min is arguably insignificant.
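Building on the hypothetical helpers sketched earlier, that sanity check could look roughly like this; the 10% threshold is arbitrary, my number rather than anything derived from the charts.

```python
def lows_summary(frametimes_ms):
    fps_100ms = quantized_fps(frametimes_ms, window_ms=100.0)
    fps_1s = quantized_fps(frametimes_ms, window_ms=1000.0)
    return {
        "avg_fps": sum(fps_100ms) / len(fps_100ms),
        "low_1pct": one_percent_low(fps_100ms),
        "min_100ms": min(fps_100ms),
        "min_1s": min(fps_1s),
    }

def low_is_noteworthy(summary, threshold=0.10):
    # Treat the 1%-Low as a separate finding only when it sits well below the 1s-Min.
    return summary["low_1pct"] < (1.0 - threshold) * summary["min_1s"]
```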

Statistics can lie, but sometimes the lie is mostly a matter of precision and presentation. We have to pick which of many possible metrics to show and how to calculate them, with empirical choices for factors like quantization. But the result can still be useful. GPU/game reviews are not scientific papers; they don’t need the ultimate rigor in statistical analysis, as long as their creators are mindful of the limitations of their methods before making big generalizations about game optimization or hardware performance.

In comparative benchmarking, if two GPUs have close-enough averages but one has higher 1%-Lows, that one is always better; in both results, though, the 1%-Low only matters if it’s significantly below the 1s-Min. That comes with the obvious condition that all numbers are rock-solid reproducible across repeated test runs, with acceptable error bars, like any other result worth publishing.

To end with a plug, one of my favorite hobbies is complaining about the use of manual ad-hoc benchmarks; see my posts about built-in benchmarks and synthetic benchmarks. Both have excellent precision and reproducibility that I doubt most manual tests can match, even with experience and good methodology; but that’s material for another rant.

