Building a blazing fast ETC2 compressor
TL;DR — We’ve released a fast, open source ETC2 compressor :ETC2Comp. If you’re interested in how it works, read further ;)
Those of you who follow along with my ramblings know that my early career was steeped deep in the gaming industry where I focused dominantly, on Performance, GPUs, Tooling, and Data Compression. I’ve continued this work at Google, where (surprise!) I spend a lot of time talking to developers about those exact topics.
That’s why, over the past few months, I’ve been keeping a very keen eye on the VR space: VR is a BIG thing. It’s one of those platform shifts that we, as an industry, experience every decade or so, causing lots of new awesomeness, but also creating lots of new problems that need solving.
VR, Textures and ETC2
One of the big problems that’s been on my mind about the potential for VR on mobile devices has been ETC textures. See, standard high-end games use a lot of these things, and VR will only cause this to explode in order to provide more immersive experiences for people. Basically, as content developers strive to create more immersive, high-detail experiences in VR, the texture size and quantity will increase.
Now, if you’ve ever worked with ETC2 textures for games or VR on mobile devices ( ETC2 is a standard format for OpenGLES 3.0), you know that there’s one massive problem: Encoding a high quality ETC2 texture with low error rates) takes a long time.
To give an example of the times involved when compressing these textures, let’s take a look at the Mali GPU Texture Compression tool, which happens to be one of the more popular ETC2 texture compressors used in video game development today. Below is a chart showing how long Mali took in slow mode to encode a random set of 100 textures (using Mali with the “slow” flag).
If you consider the time it takes to encode a JPG file as “normal” then the test above shows that very few of the textures actually fell in the space of “normal” time:
- The average encoding time was a whooping 640 seconds, that is, 10 minutes!.
- The median time was 409 seconds.
- The fastest encoding of a texture in the set was 1.36 seconds.
- The slowest encoding took 4,780 seconds, which is 1.3 hours.
My experiments show that the average high-end Unity game contains around 500–1500 textures (depending on the style of the game and it’s visual needs), and if you’re expecting that to increase by 2x-4x (like I am) with the movement to VR, this chart suddenly looks like a very large problem with respect to production times. Taking 640 seconds for each of the 1500 textures for a build isn’t my idea of a fast turn-around.
Introducing ETC2Comp: a fast ETC2 encoder
In order to address this problem of texture compression times, a few of us at Google (Romain Guy, Paul Rashidi, and myself) have been working with the engineers at Blue Shift Inc. on creating a fast, stable, high-quality open source ETC2 compression tool.
So today, we’d like to introduce ETC2Comp, a blazing fast ETC2 texture encoder.
ETC2Comp can be used as a standalone executable on OS X, Linux, and Windows, as well as a library embedded in existing engines/tools.
How much faster is ETC2Comp when compared to Mali? Let’s chart the same random set of images and put the two compressors head-to-head:
For the ETC2Comp set:
- The average time was 10 seconds.
- The median time was 5 seconds.
- The fastest time was 2.7 seconds.
- The slowest time was 91 seconds.
Only for 8 of the 100 textures Mali encoded faster than ETC2Comp. The rest showed a clear win with the new encoder. The worst case for ETC2Comp was significantly faster than the average case for Mali.
Note that ETC2Comp was run with command parameters effort=40 and threads=8. I’ll describe what the ‘effort’ value means later in this article.
Now, if you’re worried that you might be trading faster performance for worse compression quality, let’s take a look at the PSNR of the images for the set:
The Gist: ETC2Comp produces high-quality images in significantly fast time.
UPDATE: Since the launch of this article, I’ve been asked for the logscale version of the encoding-time graph; here it is:
Sanity check: What is Mali doing?
So wait, what’s going on here? Some of the Mali numbers seem absurd, can we figure out why its performance is so skewed?
[WARNING] We don’t have access to the Mali source code, so we are not at all sure how the code is actually working. I want to make 100% clear that we can only infer what Mali does, so anything said here is just opinion and observation. [WARNING]
With that in mind, the team observed the following behavior with Mali:
- Mali appears to be single-core, single-threaded.
- The tool appears to have an exhaustive encoding strategy for ETC1 images (in slow settings mode).
- On ETC2 images, the encoding search space is too large to search exhaustively, but a lot of image quality comes from ETC2, so Mali appears to be doing a partial brute-force search for ETC2 blocks.
- In ETC2 mode, Mali doesn’t find all the potential combinations, and when it does, it takes a long time to get there.
- In our tests, the tool exhibited a few repeatable bugs that generate bad/corrupted images.
To be clear, Mali has been a de-facto standard in the games industry since ETC became a standardized format; and we have to give it a lot of respect for that. It’s held up the flag of texture compression for the better part of a decade now.
Why is ETC2Comp so much faster?
The folks over at Blue Shift Inc have been brewing high-quality texture compressors for a long time now; so they were able to put a bit of dark magic into the encoder to get the speed gains.
At a high level, the speed gains come from three areas:
1) A directed block-search: Rather than a brute-force search, ETC2Comp uses a much more limited, targeted search that is based on block type.
2) A full-effort setting: Mali contains only fast/slow modes, while ETC2Comp contains a scalar value which lets you adjust how much effort you want in search-space exploration.
3) Highly multi-threaded code. In this case, 2 threads are better than 1.
#1 is the dark magic part, which I’d like to dig into a bit more.
0. Understanding the problem
ETC2s format supports a different type of encoding for each 4x4 block in the image. For example Planar mode, Differential mode, Individual Mode, H mode, T mode, etc. ( “ETC2: Texture Compression using Invalid Combinations” goes into more detail about the blocks and how they work.)
The challenge in creating a high-quality ETC2-encoded texture is figuring out, for each block of the image, which block-format will yield the best visual result.
For a 1024x1024 image, that’s 256x256 blocks, where each one can independently be one of 10 separate formats. In total, that’s 10⁶⁵⁵³⁶ potential combinations for the encoder to search through in order to find the best-possible-quality image.
Simply put, the search space is much too large to attempt to test exhaustively, and this is the largest issue that directly relates to the performance of our encoder; if we want to improve the time it takes to compress textures, we need to minimize the search space.
1. Generating the proto-block-chains, offline
Rather than searching the whole space for each block, ETC2Comp instead defines a set of directed graphs through the potential space. Each graph defines a sequential ordering of modes, based on how likely the mode is to fit a specific block-archetype.
The block-archetype setup allows us to take advantage of something we explicitly know about ETC2 encoding: Given a type of input block, it’s more likely to find an optimal solution in certain block-formats than others. For example, a block all the same color should immediately be encoded in Planar mode; there shouldn’t be a need to check the other 9 combinations. A block with alpha will have a different encoding than one without alpha.The same goes for other types of blocks as well; there’s a bias towards what type of encoding will be a best fit.
When we encode an image, each block of 4x4 pixels is fitted to one of the block-archetypes, and thus assigned a chain of potential modes to test against (more on this later).
ETC2Comp calculated these archetype blocks and their potential proto-block-chains offline. A huge texture suite was ran through the encoder system, and the matched results were hard-coded into the encoder as a snapshot of the state of the universe.
You can think of this stage as a massive training step: A bunch of input data is fed into a system, and the configurations come out, which are then used by the encoder.
To be clear: This step only happened once (offline, during development), and the results of this data is built into the encoder itself. This stage is not run when you encode a texture.
2. Block scoring, sorting, and assigning
Let’s say we have the following 16x16 pixel input texture:
This texture contains 4x4 ETC2 blocks (of 4x4 pixels each)
Once the encoder receives this image, the first thing it does is subdivide it into blocks, and calculate each block’s visual score (using MSE). The blocks are then sorted by their visual score, such that the worst-looking blocks are sorted towards the front of the list.
This sorting process, based on visual score, happens several times in the encoding process, once for each pass through the image (we’ll talk about this in a little bit).
Each block is then tested against the block archetypes, and assigned a format-chain that best matches it. The block will keep this chain through the entire encoding process, regardless of where it may end up in the sorting process over time.
3. Block testing
For each block in our list, our goal is to iterate properly through its proto-chain, and determine which mode is the best match for this block. When the encoding starts, each block defaults to the first format in its proto-chain. As iteration occurs, the source block is encoded using the next mode in the chain, and the visual result is compared to the current-best found version.
Once the encoder tests the current block against it’s next mode (and either found to be the best version, or not) the encoder moves on to the next block in our list.
Encoding of the blocks happens in passes. It’s important to note that the encoder will only test ONE mode for the block during each pass.
This detail is important, because the order of blocks in the proto-chain was predetermined in the offline computation step, with the most likely matches towards the front of the chain. If we find a really good match, we may not need to touch this block any further (and get a huge speed boost).
Once all the blocks in the image have been tested, we go back to step 2, where all the blocks are re-sorted, and assigned their updated quality value (if it has changed). After the re-sort, the new “worst-looking blocks” are at the front and can be addressed with priority.
The result of this iteration stage is that a single pass will improve the quality of the image overall, spending the most time on the worst blocks (towards the front of the list).
5. The effort parameter
Having the worst blocks at the front of the list is an important step, because it allows us to optionally stop testing blocks that may look “good enough” and not test them any further. (In general, once a block hits a “goodness” threshold from a visual perspective, any improvements will be minimal compared to the cost of more block testing, which could be expensive.)
To control the amount of testing, the encoder’s primary parameter is the EFFORT param. This parameter defines a percentage of how many of the blocks you want to test against in each pass, before the pass finishes.
For example, passing an effort value of 45% means that only the worst 45% of blocks will be processed on each pass.
Setting the effort value to 100% produces a setup where all the blocks of the image are tested on each pass (but that’s kinda overkill).
In my tests of ETC2Comp, setting the effort level to 60% results in roughly the same visual quality as Mali (in slow mode), but is significantly faster. Basically, on each pass, only the 60% worst-looking blocks are touched; we’re saving 40% performance for each pass through the image.
6. Knowing when to stop
The effort parameter controls how many of the worst looking blocks are tested in each pass.
The encoder will continue to work until all blocks in the error threshold for this pass have exhausted their entire format-chain of options. At that point, it quits.
Basically, if in the current pass, all the worst-looking blocks have exhausted their proto-chains, then they can’t look any better than this, and future re-sorts will result in the same ordering of blocks. As such, we’ve reached the best quality we’re going to find, so no need to continue.
7. Threading the hell out of it.
On each pass, the blocks which are going to be worked on (determined by their visual quality and the current effort level) are passed off to a thread pool, where each block can do its test in an independent and parallel manner. This is a basic fork-and-join model of threaded computing. It’s not really fancy, but it does allow ETC2Comp to scale to 32-core machines without throwing a fit. (Note that you can specify how many threads you want the tool to use with a command line parameter.)
We built ETC2Comp to empower game and VR developers with a new tool to minimize their development burdens. It’s got a beautiful mix of solid engineering, cutting-edge techniques, and a bit of dark-juju magic. The combination of which will hopefully give you some help with your next project.
We invite you to jump over to the github repo, grab the source code, play around with it, and let us know what you think.
But wait, there’s more!
Ever use WebP? Interested in how that works?