Any developer with a few years of experience writing client-side applications is acutely aware of how complicated text rendering can be. At least that’s what I thought until 2010 when I started writing libhwui, an OpenGL backend for Android 3.0’s 2D drawing API. I then realized that text gets even more complicated when you’re trying to use a GPU to draw it on screen.
Text and Android
Android’s hardware accelerated font renderer was originally written by a co-worker on the Renderscript team and was then improved upon and optimized by several engineers, including my good friend Chet Haase and me. You can easily find many tutorials on how to render text with OpenGL, but most, if not all, articles focus on games and conveniently avoid dealing with difficult issues.
The approach described here is by no means novel but I thought it would be convenient for some developers to get a high-level overview of how a complete GPU-based text rendering system can be implemented. This article also describes a few optimizations that are easy to implement.
A common way to render text with OpenGL is to compute a texture atlas that contains all the needed glyphs. This is often done offline using fairly complex packing algorithms to minimize waste in the texture. Creating such an atlas obviously requires knowing ahead of time which fonts (face, size and various other properties) and glyphs will be used by the application at runtime.
Ahead-of-time font texture generation is not a practical solution on Android. The UI toolkit has no way of knowing in advance what fonts and glyphs applications will need; applications can even load custom fonts at runtime. This is a major constraint but only one of many that Android’s font renderer must work with:
- It must build its font cache at runtime
- It must be able to handle a large number of fonts
- It must be able to handle a large number of glyphs
- It must minimize texture waste
- It must be fast
- It must work equally well on low-end and high-end devices
- It must work perfectly with any driver/GPU combo
Implementing the font renderer
Before we examine how the low-level OpenGL font renderer works, let’s start with the high-level APIs directly used by applications. These APIs are important to understanding how libhwui works.
There are 4 main APIs that applications use to lay out and draw text:
- android.widget.TextView, a View that handles layout and rendering
- android.text.*, a collection of classes to create stylized text and layouts
- android.graphics.Paint, to measure text
- android.graphics.Canvas, to render text
Both TextView and android.text are high-level implementations on top of Paint and Canvas. Up until Android 3.0, both Paint and Canvas were implemented directly on top of Skia, a software rendering library. Skia provides a nice abstraction of Freetype, a popular Open Source font rasterizer.
As of Android 4.4 things are a little bit more complicated. Both Paint and Canvas use an internal JNI API called TextLayoutCache that handles complex text layouts (CTL). This API relies on Harfbuzz, an Open Source text shaping engine. The input of TextLayoutCache is a font and a Java UTF-16 string and its output is a list of glyph identifiers with their x/y positions.
TextLayoutCache is the key to properly support many non-Latin locales such as Arabic, Hebrew, Thai… I will not explain how TextLayoutCache and Harfbuzz work but I strongly recommend you learn about CTL if you ever want — and you should — to properly support non-Latin locales in your applications. This particular problem is rarely, if ever, discussed in tutorials about rendering text with OpenGL. Drawing text can be a lot more complicated than simply placing one glyph after another, from left to right. Some languages, Arabic for instance, go right to left and some, like Thai, even require glyphs to be positioned above or below previous glyphs.
All this means that when you call Canvas.drawText(), directly or indirectly, the OpenGL renderer does not receive the arguments you send, but an array of numbers — the glyph identifiers — and an array of x/y positions.
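To make this concrete, here is roughly what the renderer’s input could look like. Note that this `ShapedText` structure and its `glyphCount` helper are hypothetical, invented for this article; they do not match the actual TextLayoutCache types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// What the OpenGL renderer actually receives for a drawText() call:
// glyph identifiers (indices into the font, not characters) and the
// position computed for each glyph by the text shaping engine.
// This is an illustrative shape, not the real TextLayoutCache output.
struct ShapedText {
    std::vector<uint16_t> glyphIds;
    std::vector<float> positions;   // interleaved x0, y0, x1, y1, ...
};

// Number of glyphs, derived from the interleaved positions array.
size_t glyphCount(const ShapedText& text) {
    return text.positions.size() / 2;
}
```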
Rasterization & caching
Every draw call to the font renderer is associated with a single font. A font is used to cache individual glyphs. Glyphs are in turn stored in cache textures (a cache texture can contain glyphs from multiple fonts). A cache texture is an important object that holds multiple buffers: a list of free blocks, a pixel buffer, a handle to the OpenGL texture and a vertex buffer (the mesh).
The data structures used to store all these objects are fairly simple:
- Fonts are stored in an LRU cache in the font renderer
- Glyphs are stored in a map in each font (the key is the glyph identifier)
- Cache textures track free space with a linked list of blocks
- A pixel buffer is an array of uint8_t or uint32_t (for alpha and RGBA caches)
- A mesh is a buffer of vertices with two attributes: x/y positions and u/v coordinates
- A texture is a GLuint handle
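Here is a heavily simplified sketch of how these structures relate to each other. The names, fields and types are approximations written for this article, not libhwui’s actual headers.

```cpp
#include <cstdint>
#include <list>
#include <map>
#include <vector>

typedef uint32_t GLuint;   // stand-in for the real GL type

// A rectangular region of free space inside a cache texture.
struct CacheBlock {
    uint32_t x, y, width, height;
};

// One vertex of a glyph quad: x/y position and u/v texture coordinates.
struct TextureVertex {
    float x, y, u, v;
};

// A cache texture owns the pixel buffer, the GL texture handle, the free
// block list and the client-side mesh that buffers quads between flushes.
struct CacheTexture {
    std::list<CacheBlock> blocks;       // free space, tracked per column
    std::vector<uint8_t> pixelBuffer;   // uint32_t for the RGBA caches
    GLuint textureId = 0;
    std::vector<TextureVertex> mesh;    // up to 2048 quads
};

// Where a cached glyph lives: which texture, and where inside it.
struct CachedGlyph {
    CacheTexture* texture;
    uint32_t x, y, width, height;
};

// A font maps glyph identifiers to cached glyphs; the renderer keeps
// the fonts themselves in an LRU cache.
struct Font {
    std::map<uint32_t, CachedGlyph> glyphs;
};
```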
When the font renderer initializes, it creates two types of cache textures: alpha and RGBA. Alpha textures are used to store regular glyphs; since fonts do not contain color information, we only need to store anti-aliasing information. The RGBA caches are used to store emojis.
For each type of cache texture, the font renderer creates several instances of CacheTexture, of various sizes. The size of the caches can be different from device to device but here are the default sizes (the number of caches is hard coded):
- 1024x512 alpha cache
- 2048x256 alpha cache
- 2048x256 alpha cache
- 2048x512 alpha cache
- 1024x512 RGBA cache
- 2048x256 RGBA cache
When a CacheTexture object is created, its underlying buffers are not automatically allocated. The font renderer allocates them as needed, except for the 1024x512 alpha cache, which is always allocated.
Glyphs are packed in the textures in columns. Whenever the renderer encounters a glyph that is not cached, it asks each CacheTexture of the appropriate type — in the order listed above — to cache that glyph.
This is where the list of blocks gets used. That list contains, for a given cache texture, the list of currently allocated columns plus all the available empty space. If a glyph fits in an existing column, it is added at the bottom of the occupied space in said column.
If all columns are occupied, a new column is carved out of the left side of the remaining space. Since few fonts are monospaced, the renderer rounds the width of each glyph to a multiple of 4 pixels (by default). This is a good compromise between column reuse and texture packing: the packing is not optimal, but the implementation is fast.
All glyphs are stored in the textures with an empty border of 1 pixel around them. This is necessary to avoid artifacts when the font textures are sampled with bilinear filtering.
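The packing scheme can be sketched in a few dozen lines. This `CacheTexture` is a simplification for the article, assuming exact column-width matches and a hard-coded rounding of 4 pixels; it is not the real libhwui implementation.

```cpp
#include <cstdint>
#include <list>

// A free block is a rectangular region of the cache texture. Columns are
// carved from the left edge of the remaining free space, and each column
// tracks how much vertical space is still available.
struct CacheBlock {
    uint32_t x, y, width, height;
};

class CacheTexture {
public:
    CacheTexture(uint32_t width, uint32_t height) {
        // Initially the whole texture is one free block.
        mBlocks.push_back({0, 0, width, height});
    }

    // Tries to reserve space for a glyph; returns true and writes the
    // glyph's position on success. Dimensions are padded by the 1-pixel
    // border that prevents bilinear sampling artifacts.
    bool fitGlyph(uint32_t glyphW, uint32_t glyphH,
                  uint32_t* outX, uint32_t* outY) {
        uint32_t w = glyphW + 2;            // 1-pixel border on each side
        uint32_t h = glyphH + 2;
        // Round the column width to a multiple of 4 pixels so columns can
        // be reused by glyphs of slightly different widths.
        uint32_t colW = (w + 3) & ~3u;

        for (auto it = mBlocks.begin(); it != mBlocks.end(); ++it) {
            // An existing column: same rounded width, enough space left.
            if (it->width == colW && it->height >= h) {
                *outX = it->x;
                *outY = it->y;
                it->y += h;
                it->height -= h;
                return true;
            }
        }
        // No column fits: carve a new one out of the left side of the
        // remaining free space (kept as the last block in the list).
        CacheBlock& free = mBlocks.back();
        if (free.width >= colW && free.height >= h) {
            CacheBlock column{free.x, h, colW, free.height - h};
            *outX = free.x;
            *outY = 0;
            free.x += colW;
            free.width -= colW;
            mBlocks.push_front(column);
            return true;
        }
        return false;   // the caller will try the next cache texture
    }

private:
    std::list<CacheBlock> mBlocks;
};
```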
It is also important to know that when text is rendered with a scale transform, the transform is forwarded to Skia/Freetype. This means that glyphs are stored transformed in the cache textures. This improves rendering quality at the expense of performance. Fortunately, text is rarely animated with a scale, and when it is, only a few glyphs are affected. I have run extensive tests and I couldn’t find a realistic use case where performance was an issue.
There are other paint properties that affect how glyphs are rasterized and stored: fake bold, text skew, text scale X (which is handled differently from the Canvas transform matrix), style and stroke width.
There are other ways to handle text on the GPU. Glyphs could for instance be rendered directly as vectors but doing so is rather expensive. I’ve also looked into signed distance fields but simple implementations suffer from precision issues (curves tend to become “wobbly”).
That said, I recommend you take a look at Glyphy, an Open Source library from Harfbuzz’s author, that expands on the signed distance field technique and fixes the precision issues. I haven’t looked at it in a while but last time I did, the shader cost was prohibitive for use on Android.
Pre-caching
Caching glyphs is an obvious thing to do but pre-caching is even better. Since libhwui is a deferred renderer (as opposed to Skia’s immediate mode), all the glyphs that will be drawn on screen are known at the beginning of a frame. During the sorting of the display list operations (for batching and merging), the font renderer is asked to pre-cache as many glyphs as possible.
The main advantage of doing this is to completely avoid, or at least minimize, the number of texture uploads mid-frame. Texture uploads are expensive operations that can stall the CPU and/or the GPU. Even worse, modifying a texture during a frame can create severe memory pressure on some GPU architectures.
ImaginationTech’s PowerVR SGX GPUs use a deferred tiling architecture that has many interesting properties but that forces the driver to make a copy of each texture that you modify during the frame. Since font textures are fairly large, it can be easy to run out of memory if you’re not careful with your texture uploads.
This actually happened with an application on Google Play. This app is a simple calculator that simply draws buttons with math symbols and numbers. The font renderer was however at some point incapable of rendering the first frame without running out of memory. Because the buttons were drawn one after the other, each one of them would trigger a texture upload, and thus a copy of the entire font cache. The system simply did not have enough memory to hold that many copies of the cache.
Because the textures used to cache glyphs are fairly large, they are also sometimes reclaimed by the system to give other applications more RAM.
Whenever the user hides the current application, the system sends the application a message asking it to release as much memory as possible. The obvious thing to do is to destroy the largest cache textures. On Android, the large textures are considered to be all the cache textures except the first one ever created (1024x512 by default).
Textures also get flushed when no space can be found in any of the caches. The font renderer keeps track of the fonts with an LRU but doesn’t do anything interesting with it. If needed, flushing the caches could be made smarter by flushing seldom used textures instead. This has not proved necessary so far but it’s a potential optimization to keep in mind.
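The trim behavior is simple to sketch. The `onTrimMemory` name and the structure below are invented for this example and are not libhwui’s actual code.

```cpp
#include <cstddef>
#include <vector>

// A cache texture's backing store; only the bookkeeping matters here.
// This is an illustrative structure, not the real CacheTexture class.
struct CacheTexture {
    int width, height;
    bool allocated;
};

// When the system asks the application to release memory, every cache
// texture except the first one ever created (1024x512 by default)
// releases its backing buffers. They are re-allocated lazily when a
// glyph needs them again.
void onTrimMemory(std::vector<CacheTexture>& caches) {
    for (size_t i = 1; i < caches.size(); ++i) {
        caches[i].allocated = false;
    }
}
```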
Batching & merging
Android 4.3 introduced batching and merging of drawing operations, an important optimization that drastically reduces the number of commands issued to the OpenGL driver.
To help with merging, the font renderer buffers text geometry across multiple draw calls. Each cache texture owns a client-side array of 2048 quads (1 quad = 1 glyph) and they all share a single index buffer (stored as a VBO on the GPU). When a text draw call is issued inside libhwui, the font renderer fetches the appropriate mesh for each glyph and writes the positions and u/v coordinates into it. Meshes are sent to the GPU at the end of a batch (as decided by the deferred display lists system) or when a quad buffer is full. It is possible for multiple meshes to be required to render a single string, one per cache texture.
This optimization is easy to implement and greatly helps with performance. Since the font renderer uses multiple cache textures, there are pathological cases where most glyphs in a string are part of one texture and some are part of another. Without the batching/merging optimization, a draw call would be issued to the GPU every time the font renderer needs to switch to a different cache texture.
I have actually seen this problem occur in a test app I was using for the font renderer. The app was simply rendering the “hello world” string with different styles and sizes and in one particular case, the letter “o” was stored in a different texture than the other glyphs. This would cause the font renderer to draw “hell”, then “o”, then “w”, then “o”, then “rld”. That’s 5 draw calls and 5 texture binds when only 2 of each are actually necessary. The renderer now draws “hell w rld”, then both “o” glyphs together.
Optimizing texture uploads
I mentioned earlier that the font renderer tries to upload as little data as possible when updating the cache textures by tracking the dirty rectangle in each texture. There are unfortunately two limitations with this approach.
First, OpenGL ES 2.0 does not allow the upload of an arbitrary sub-rectangle. glTexSubImage2D lets you specify the x/y and width/height of the rectangle to update inside the texture but it assumes that the stride of the data in main memory is the width of that rectangle. This can be worked around by creating new CPU buffers of the appropriate size but it requires knowing ahead of time how big the dirty rectangle will be.
A good compromise is to upload the smallest band of pixels that contains the dirty rectangle. Since that band is always as wide as the texture itself we can end up wasting quite a bit of bandwidth but it’s better than updating the entire texture.
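Computing that band is trivial. This hypothetical helper assumes a tightly packed, one-byte-per-pixel alpha buffer.

```cpp
#include <cstddef>
#include <cstdint>

struct Rect {
    uint32_t left, top, right, bottom;   // right/bottom are exclusive
};

// Given the dirty rectangle of a cache texture, returns the full-width
// band of rows to upload with glTexSubImage2D when the API cannot
// express an arbitrary stride (OpenGL ES 2.0). The band wastes the
// pixels left and right of the dirty rect but avoids re-uploading the
// rows above and below it.
Rect uploadBand(const Rect& dirty, uint32_t textureWidth) {
    return {0, dirty.top, textureWidth, dirty.bottom};
}

// Number of bytes glTexSubImage2D will read for that band, assuming a
// one-byte-per-pixel alpha cache texture.
size_t uploadBytes(const Rect& band) {
    return static_cast<size_t>(band.right - band.left) *
           (band.bottom - band.top);
}
```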
The second problem is that texture uploads are synchronous calls. This can lead to fairly long CPU stalls (up to about a millisecond or more depending on the size of the texture, the driver and the GPU). This doesn’t matter too much when pre-caching works as expected but the issue can be felt by the user when using font-heavy applications or locales that use many glyphs (such as Chinese).
OpenGL ES 3.0 thankfully offers a solution to these two issues. It is now possible to upload a sub-rectangle using a new pixel store property called GL_UNPACK_ROW_LENGTH. This property specifies the stride of the source data in main memory. Be careful though: this property affects the global state of the current OpenGL context.
CPU stalls during texture uploads can be avoided by using pixel-buffer objects, or PBOs. Like all buffer objects in OpenGL, PBOs reside in the GPU but can be mapped in main memory. PBOs have many interesting properties but the one we care about is that they enable asynchronous texture uploads after they are unmapped from main memory. The sequence of operations becomes:
glMapBufferRange → write glyphs to buffer → glUnmapBuffer → glPixelStorei(GL_UNPACK_ROW_LENGTH) → glTexSubImage2D
The call to glTexSubImage2D now returns immediately instead of blocking the renderer. The font renderer currently maps the entire buffer in main memory and even though it doesn’t seem to cause performance issues it would probably be a good idea to try and map only the range required to update the cache texture.
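The upload path can be sketched with stubbed GL entry points. Since the real functions require a live OpenGL ES 3.0 context, the stubs below only record the call order; their signatures are simplified for the example and are not the actual GL prototypes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Call log used in place of a real GL context (example only).
std::vector<std::string> gCalls;

// Stubs standing in for the OpenGL ES 3.0 entry points; real signatures
// differ, only the names and the ordering matter for this sketch.
void* glMapBufferRange(size_t offset, size_t length) {
    gCalls.push_back("glMapBufferRange");
    static std::vector<unsigned char> storage;   // pretend PBO memory
    storage.resize(offset + length);
    return storage.data() + offset;
}
void glUnmapBuffer() { gCalls.push_back("glUnmapBuffer"); }
void glPixelStorei(const char* pname, int) { gCalls.push_back(pname); }
void glTexSubImage2D() { gCalls.push_back("glTexSubImage2D"); }

// Asynchronous upload of a dirty region through a PBO: write the glyph
// pixels into the mapped buffer, unmap, set the source stride, then let
// glTexSubImage2D read from the PBO without stalling the CPU.
void uploadDirtyRegion(const unsigned char* pixels, size_t count, int stride) {
    unsigned char* mapped =
        static_cast<unsigned char*>(glMapBufferRange(0, count));
    for (size_t i = 0; i < count; ++i) mapped[i] = pixels[i];
    glUnmapBuffer();
    glPixelStorei("GL_UNPACK_ROW_LENGTH", stride);
    glTexSubImage2D();   // returns immediately; the GPU reads from the PBO
}
```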
These two OpenGL ES 3.0 specific optimizations appeared in Android 4.4.
Drop shadows
Text is commonly rendered with drop shadows, a fairly expensive operation. Since neighboring glyphs will blur into each other, the font renderer cannot pre-blur glyphs independently. There are many ways blurring could be implemented, but to minimize blending operations and texture sampling during a frame, drop shadows are simply stored as textures and survive across multiple frames.
Since applications can easily max out the GPU, we decided to rely on the CPU to blur text. The easiest and most efficient way to do this is to use Renderscript’s C++ API. It requires only a few lines of code and takes advantage of all the available cores. The only trick is to specify the RS_INIT_LOW_LATENCY flag when initializing Renderscript to force it to execute the work on the CPU.
Future optimizations
There is one optimization that I wish I had time to implement before I left the Android team. Text pre-caching, asynchronous and partial texture updates are all important optimizations but rasterizing glyphs remains an expensive operation. This can easily be seen in systrace (enable the gfx tag and look for precacheText events).
An easy way to optimize the pre-caching pass is to use worker threads to perform glyph rasterization in the background. This technique is already used to rasterize complex paths that are not rendered as OpenGL geometry.
It should also be possible to improve the batching and merging of text rendering operations. The color used to draw a piece of text is currently sent to the fragment shader as a uniform. This reduces the amount of vertex data sent to the GPU but has the unfortunate side-effect of generating more command batches than necessary: a batch can only contain text of a single color. If the text color was instead stored as a vertex attribute fewer batches could be issued to the GPU.
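A small model shows how much batching this could save. The structures and the consecutive-run merging rule below are simplifications invented for the example.

```cpp
#include <cstdint>
#include <vector>

// A piece of text: some glyphs and the paint color they are drawn with.
struct TextRun {
    int glyphCount;
    uint32_t color;
};

// With the color passed as a uniform, consecutive runs can only share a
// batch when they use the same color: each color change starts a batch.
int batchesWithUniformColor(const std::vector<TextRun>& runs) {
    int batches = 0;
    uint32_t current = 0;
    bool first = true;
    for (const TextRun& run : runs) {
        if (first || run.color != current) {
            batches++;
            current = run.color;
            first = false;
        }
    }
    return batches;
}

// With the color stored as a vertex attribute, every run can go into
// the same batch regardless of its color.
int batchesWithVertexColor(const std::vector<TextRun>& runs) {
    return runs.empty() ? 0 : 1;
}
```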
You can visit libhwui’s GitHub if you want to take a closer look at the font renderer’s implementation. You can start with FontRenderer.cpp where most of the magic happens. Its supporting classes can be found in the font/ sub-directory. You might also find PixelBuffer.cpp useful. It’s a simple abstraction of a pixel buffer that can be backed either by a CPU buffer (a simple uint8_t array) or a GPU buffer (a PBO).
You will notice the use of several configuration properties in the source code. They are all described in Android’s Performance Tuning documentation.
This article is only a brief introduction to Android’s font renderer. There are many details about the implementation that I ignored or that would merit further explanation, so do not hesitate to ask me questions.