Inside the fastest font renderer in the world
Over the past couple of years on and off, I’ve been experimenting with pushing the limits of software font rendering performance. The result is font-rs, a font renderer written in Rust, almost an order of magnitude faster than the industry-standard FreeType. At this point, it’s more of a technology demo than a production-quality library, but it’s certainly far enough along to analyze its speed. In this post, I’ll talk about exactly how it achieves such remarkable performance.
First, the numbers. This is a measurement of how long it takes to render “g” from Roboto Regular, at pixel/em settings between 1 and 200. Measurements were done on a 2.9 GHz Xeon E5–2690. The comparisons are FreeType 2.6.5, font-rs using only pure Rust, and a version of font-rs with SIMD speedup.
With the SIMD speedup, font-rs is approximately 7.6x faster than FreeType in larger sizes (keep in mind that 42 pixels/em is the default for xxhdpi Android devices). Even at a relatively small size of 12 pixels/em, it’s over 4x faster. The speedup is even bigger (over 6x at 12 pixels/em) when measured on a CJK typeface (the majority of CPU cycles spent rendering fonts in the world).
Font rendering, where does the time go?
There are basically 3 CPU-intensive components to font rendering: parsing the binary outline from the font, drawing into an accumulation buffer, and integrating the accumulation buffer, storing the results into an 8 bit per pixel bitmap.
During drawing, the accumulation buffer stores the finite difference (in scan-line order) of the actual signed area. All pixels fully inside or outside the glyph (not touched by the edge) are zero. More importantly, in the first difference, the sum of the signed area is just the sum of the contribution of each of the edges.
For more details on the signed area approach, see How the stb_truetype Anti-Aliased Software Rasterizer v2 Works.
Here’s a graph showing the breakdown:
One interesting observation is the contrast between asymptotic complexity and associated constant factors. The accumulation buffer is O(n²), so allocating it and filling it with zeros is also quadratic. But it’s such a trivial operation that it hardly contributes. Drawing the path is O(n), but with a large constant factor because it’s doing complicated exact-area calculations for each pixel crossed by the outline. Finally, integrating the accumulation buffer is also O(n²), but insignificant at sizes up to 60 pixels/em or so. For very large glyphs, the quadratic parts would dominate.
Parsing: no allocation
Font-rs takes about 460ns to parse the binary font data and convert it into a sequence of lines and quadratic Bezier curves. There are 30 such, so it’s about 15ns per path element.
Usually, binary parsers are written as a function that accesses the binary font data, and produces a data structure (often a vector) containing the result. However, just collecting the result into a vector per contour would add 870ns, almost tripling the parsing time. At normal text sizes (where total rendering time is ~5µs), this added time would be significant.
By contrast, in Rust it is very idiomatic to write the parser as an iterator, an object with a `next` method called repeatedly. The state for these iterators is usually allocated on the stack, rather than on the heap. Thus, parsing takes place with absolutely no heap allocations.
The font-rs parser is actually written as two iterators, composed together. The first parses the binary font data into a sequence of on-curve and off-curve points, and the second converts that into lines and beziers. It’s possible to write Rust-style iterators in other languages, but it’s not especially common to see. Note, though, that there is a library for C++, intended to eventually become part of the standard.
Drawing into the accumulation buffer: dense and immediate
Different renderers use different strategies of how to store the accumulation buffer, and when to sequence drawing operations into it. Many renderers (including FreeType) use a sparse representation, trying to exploit the fact that many entries remain zero. Further, many renderers try to order the rendering by scanlines, or at least stripes. This requires some reordering, as the outlines are not stored in scanline order.
The approach of font-rs is to represent the accumulation buffer as a simple array, one 4-byte float per pixel. Thus, addressing the buffer is one machine instruction, with no branches. In addition, all drawing operations come straight from the parser into the buffer.
Sparseness is always a tradeoff. At a certain point, not having to touch empty parts of the structure exceeds the overhead from the sparse representation itself. Where this tradeoff crosses over depends on the problem and the size of the datasets. However, dense representations have a huge advantage when data-parallelism is an option, as we’ll see in the next section.
Final integration: SIMD
The integration consists of four operations, applied to every pixel: cumulative sum, absolute value (to implement the nonzero winding rule), clamping to [0, 1], and conversion to an 8-bit pixel. There are SIMD techniques for cumulative sum, and the other three are embarrassingly parallel. In addition, since cumulative sum is usually latency-limited, interleaving the two can be even faster than doing them separately.
Another innovation in font-rs is to just run cumulative sum over the entire buffer, rather than restarting a loop at every line. Thus, the inner loop runs very “hot” with basically no overhead due to branch mis-prediction. In addition, the width doesn’t need to be aligned to any power-of-2 boundary.
I wrote the inner loop in about 15 lines of SSSE3 intrinsics. Since Rust doesn’t (yet) have SIMD intrinsic support, I wrote those in C. Fortunately, dropping a C function into a Rust codebase is pretty easy. At larger sizes, the speedup is dramatic, more than double. At smaller sizes, the impact is not so great, dominated by the O(n) path drawing.
Possibly font-rs becomes production-ready. This would depend on community interest. Another definite possibility is to adapt the ideas into existing renderers such as FreeType. In any case, font-rs shows what performance is possible. It’s also a strong demonstration that Rust is an appealing language for creating “fastest in class” modules. It has essentially the same performance as C++ for low-level fragments of code, but better support in the language and standard library for stitching these low-level fragments into higher level constructs that match the semantics of the problem being solved.
The future of fast rendering may not be software. New advances in GPU-based rendering, particularly Multi-channel signed distance fields, may yield even higher performance. However, those techniques usually require a time-consuming preprocessing step, and nontrivial memory to hold the intermediate bitmaps. For the task of going from a binary font to rendered pixel data, software renderers such as font-rs still deliver the absolute lowest end-to-end latency.