BB-ANS coding

Benjamin Bodner
Published in BeyondMinds · 14 min read · May 4, 2020

“Decoding” the State-of-the-art Coding Method

Images generated by decoding 15%, 30%, 60% and 100% of the compressed data, using Integer Discrete Flows. Image from IDF

1. Introduction

I’ve been fascinated by the field of lossless compression for quite some time, and recently got a great opportunity to dive deep into it and learn about an exciting recent breakthrough. As part of our job, Hai Morgenstern and I started playing around with lossless compression algorithms and stumbled across the state-of-the-art algorithm called BB-ANS [Townsend 2019a]. In several recent works, BB-ANS-based methods have achieved record-breaking compression rates for image compression, up to 30% better than the widely-used lossless compression algorithm, bzip2! [Townsend 2019a, Kingma 2019, Ho 2019]

Neural networks are at the heart of this breakthrough, as their ability to capture complex relationships in real-world data enables this new method to be used for a wide variety of applications (such as images, videos, and more). BB-ANS was hard to grasp at first: it comprises several algorithms, some highly theoretical and some counter-intuitive, which interact brilliantly. After an extensive survey based on seminal papers that date back to the 90’s(!), we are happy to share our understanding of this elegant algorithm and to try to give the complete picture using pseudo-code explanations and simplifying schematic diagrams.

Demonstration of lossless and lossy image compression. Image from Bangtan

The rest of the blog is arranged as follows. First, some background regarding entropy coding, and in particular Huffman coding, is provided. This is followed by the algorithms that construct BB-ANS [Townsend 2019a]; namely, ranged asymmetric numeral systems (rANS) [Duda 2013], streaming rANS, the bits-back (BB) argument [Frey and Hinton 1997], and chaining BB [Frey 1998]. In particular, chaining BB is demonstrated with an rANS stream, which gives the ultimate BB-ANS [Townsend 2019a].

2. Entropy coding

Lossless compression methods aim to compress files (such as images, videos, and text) to the minimum number of bits, without losing any information in the process. Entropy coding is a lossless compression scheme, which aims to achieve this goal by utilizing the probabilities of the objects we wish to compress (known as “symbols”).

Figure 1: Assignment of binary identifiers for different letters, using a Huffman tree. Image from GatVidyalay

The symbols to be encoded (or decoded) are drawn from a finite set of symbols sᵢ, and occur with known probabilities. Compression performance is typically evaluated based on two metrics, compression rate and compression speed, which we will use to compare the different methods surveyed in this blog. The optimal compression rate that any entropy coding method can achieve is known as the Shannon entropy of the data, which can be calculated from the probabilities of the symbols (check out this blog for more details). In Huffman coding, a well-known entropy coding method, each symbol in a sequence is encoded individually, using a prefix-free coding scheme. Encoding one symbol at a time makes Huffman coding the fastest of the entropy coding methods. However, the prefix-free coding scheme also leads to sub-optimal compression rates when the probabilities of occurrence of some symbols are not negative powers of two (i.e., 1/2, 1/4, 1/8, …). This is in contrast to methods that encode symbols into a state, such as the rANS coding presented in the next section.
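To make this concrete, here is a small Python sketch (our own toy example, not taken from any of the papers) that computes the Shannon entropy of a symbol distribution and the expected Huffman code length. Because the probabilities below are not negative powers of two, Huffman lands noticeably above the entropy.

```python
import heapq
import math

def shannon_entropy(probs):
    """Optimal bits/symbol achievable by any entropy coder."""
    return -sum(p * math.log2(p) for p in probs.values())

def huffman_lengths(probs):
    """Code length (in bits) assigned to each symbol by Huffman coding."""
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code length inside them.
        merged = {s: l + 1 for s, l in {**c1, **c2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Probabilities that are NOT negative powers of two -> Huffman is sub-optimal.
probs = {'a': 0.6, 'b': 0.3, 'c': 0.1}
lengths = huffman_lengths(probs)
expected = sum(probs[s] * lengths[s] for s in probs)
print(f"Shannon entropy : {shannon_entropy(probs):.3f} bits/symbol")  # ~1.295
print(f"Huffman expected: {expected:.3f} bits/symbol")                # 1.400
```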

3. rANS coding

A recent entropy-based method for coding, Asymmetric Numeral Systems (ANS), was proposed by Jarek Duda in 2009 [Duda 2013]. This method, which is at the heart of BB-ANS, achieves better compression rates than Huffman coding, at comparable speeds! In this blog, we will focus on ranged ANS (rANS), covered well in reference [Tatwawadi 2017]. The key idea of rANS coding is to model a sequence of symbols as a state x, which is expressed as a single integer. This state is updated every time a new symbol is encoded/decoded, using computationally efficient operations (discussed below), making it a desirable coding scheme for deployment in real-world applications.

Figure 2: Illustration of changes in message length, when encoding/decoding symbols. Image from Bit-Swap

In rANS coding, a sequence of N symbols, s₁, s₂, …, s_N, is encoded sequentially into the state x, starting from an initial state x₀. This initial state is typically set as x₀ = 0, and can be set differently to encrypt communicated data (please refer to [Duda 2013] for more details). The symbols are assumed to be i.i.d. with probability p(s), which, in this setting, is known to both the sender and the receiver. Algorithm 1 demonstrates the encoding of the i’th symbol, sᵢ, onto the state at the previous discrete index i−1, x_{i−1}, resulting in an updated state, xᵢ.

As this coding method is entropy-based, p(s) is used for constructing the encoding operation C.

Decoding a symbol from state xᵢ gives as an output both sᵢ and the previous state, x_{i−1}, as described in Algorithm 2.

As with C, the decoding operation D is also based on p(s).

To conclude, this section presented the rANS coding and decoding algorithms. The rANS coding scheme achieves close to optimal compression rates, and compression speeds that are slightly lower than those of Huffman coding. Moreover, the exact manner in which C and D perform coding and decoding, while interesting, is not required for understanding the construction of BB-ANS, and is therefore not covered in this blog. Please refer to [Duda 2013], [Tatwawadi 2017] for more details.
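That said, for readers who would like something concrete to experiment with, below is a minimal sketch of one standard rANS formulation, using integer symbol frequencies that sum to M = 2¹². The frequency table, the function names C and D, and the toy message are our own illustration; none of this is required for what follows.

```python
# A minimal sketch of one standard rANS formulation, with integer symbol
# frequencies summing to M = 2**12. Names and numbers are our own illustration.
PRECISION = 12
M = 1 << PRECISION                        # total frequency count

freqs = {'a': 2048, 'b': 1536, 'c': 512}  # f(s), proportional to p(s)
cumul = {'a': 0,    'b': 2048, 'c': 3584} # running sum of the frequencies

def C(x, s):
    """Encode symbol s onto state x (the role of Algorithm 1's update)."""
    f, c = freqs[s], cumul[s]
    return (x // f) * M + c + (x % f)

def D(x):
    """Decode the last symbol from state x, returning (s, previous state)."""
    slot = x % M
    s = next(k for k in freqs if cumul[k] <= slot < cumul[k] + freqs[k])
    return s, freqs[s] * (x // M) + slot - cumul[s]

msg = "abacab"
x = 0                                     # initial state x0
for s in msg:
    x = C(x, s)

decoded = []
for _ in msg:
    s, x = D(x)
    decoded.append(s)
print(''.join(reversed(decoded)) == msg)  # True: LIFO, so reverse to recover
```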

More about rANS

Note also that:

(i) Though p(s) is typically known to both sender and receiver, it is written explicitly as an input argument in Algorithms 1 and 2. This is because different probabilities will be used for encoding and decoding as part of the BB-ANS algorithm, as will be apparent soon.

(ii) The encoded message length in this case is given by:

L = len(binary(x))

For individual symbols, the expected length is:

L/N ≈ −E[log(p(s))]

This means that the rANS coding scheme is indeed close to optimal.

(iii) Symbols are encoded and then decoded in a last-in-first-out (LIFO) manner. This LIFO processing makes rANS compatible with BB-ANS, as demonstrated in a later section. This is in contrast to arithmetic coding, a related entropy-coding method, which, while achieving compression rates similar to rANS (yet at slower compression speeds), is not compatible with BB-ANS, as it employs a first-in-first-out (FIFO) coding scheme [Witten 1987].

Compressed paper. Image by JJ Ying on Unsplash

Though rANS coding achieves close to optimal compression rates, it is not compatible with finite-bit digital representations: for long sequences of symbols, the state eventually becomes too large to be represented with a finite number of bits, which leads to overflow. To handle this issue, streaming rANS coding [Duda 2013], [Tatwawadi 2017] has been proposed, and is presented in the next section.

4. Streaming rANS coding

Streaming rANS avoids the problem of overflow by maintaining the rANS state within a predefined “decodable” range of values. This is done by systematically transferring bits from the binary representation of the state to a stream of bits, such that the new state representation can facilitate coding or decoding of additional symbols without overflow. In particular, the state xᵢ of standard rANS coding (see Algorithm 1) is now represented by a bit stack, Bᵢ, with B₀ initialized as an empty bit stack.

To encode a new symbol, sᵢ, the state is initially extracted from B_{i−1} and then modified iteratively, so that encoding sᵢ will not lead to an overflow in the modified state representation. Algorithm 3 presents the encoding of the i’th symbol, sᵢ, onto the bit stack at the previous discrete index i−1, B_{i−1}, using the probability distribution p(s), which results in an updated bit stack, Bᵢ.

The decodable range is chosen as I = [2³¹, 2³²−1] for a 32-bit integer representation, for consistency with the original BB-ANS paper [Townsend 2019a].

As seen in Algorithm 3, x_{i−1} is initially extracted in line 3, and then represented as an integer in line 4. The state is then modified in the while loop in lines 5–8, until encoding sᵢ would not lead to an overflow. In each iteration, a bit is pushed from the state onto the bit stack (line 6). Hence, the number of bits added to the bit stack during encoding is equal to the number of iterations of the while loop. After exiting the while loop, the new symbol is encoded onto the modified state (without overflow!) in line 9, resulting in a new state, which is then pushed onto the bit stack in line 10.

Decoding a symbol from a stack Bᵢ is described in Algorithm 4. It extracts symbol sᵢ from the bit stack at the discrete index i, Bᵢ, and returns sᵢ and B_{i−1}.

As seen in Algorithm 4, xᵢ is initially popped off the bit stack in line 3, and then transformed to an integer representation in line 4. Symbol sᵢ is then decoded in line 5 given the state, resulting in the previous state x_{i−1}. x_{i−1} is then modified in the while loop in lines 6–9 to lie within the decodable range, in order to enable further encoding or decoding. In each iteration, a single bit is popped from the stack. Hence, the overall number of bits removed during decoding is equal to the number of iterations of the while loop. The encoding and decoding of two subsequent symbols in a sequence using the streaming rANS algorithms is presented in Figure 3.

Figure 3: Illustration of coding and decoding of two symbols using streaming rANS coding [Duda, 2013].

Figure 3 demonstrates that the encoding and decoding operations are performed in the opposite order, revealing the LIFO nature of this algorithm. Since this is a near-optimal entropy coding scheme, the approximate change in the length of the bit-stream B after encoding/decoding symbol sᵢ is:

ΔL ≈ −log(p(sᵢ))

As in the case of standard rANS compression, streaming rANS achieves close to optimal compression rates. Moreover, it facilitates encoding while avoiding overflow in finite-bit representations. However, the additional operations required to prevent overflow result in slightly lower compression speeds compared to standard rANS. Finally, there are also variants of streaming rANS coding, in which multiple bits are added or removed from the stack within the while loops of the respective algorithms [Tatwawadi 2017], which are beyond the scope of this blog. Since one of its main assumptions is that all symbols are i.i.d., streaming rANS coding achieves sub-optimal compression rates in scenarios that contain dependencies between consecutive encodings. Such dependencies are especially important in scenarios which involve latent variable models (to be discussed in the next section) and will require further improvements to the coding algorithms if we wish to utilize them.
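Before moving on, here is a minimal sketch of streaming rANS in the spirit of Algorithms 3 and 4, with a 32-bit state kept in the range [2³¹, 2³²). The renormalization condition, the frequency tables and the function names are our own assumptions; for clarity, the state is held as a separate integer instead of being pushed onto the bit stack between symbols, as Algorithms 3 and 4 do.

```python
# A minimal sketch of streaming rANS: the 32-bit state stays inside
# I = [2**31, 2**32), and single bits are pushed to / popped from a stack.
PRECISION = 12
M = 1 << PRECISION
STATE_LO = 1 << 31                                # lower end of the range I

freqs = {'a': 2048, 'b': 1536, 'c': 512}
cumul = {'a': 0,    'b': 2048, 'c': 3584}

def streaming_rans_encode(bits, x, s):
    """Encode s onto state x, spilling low bits onto the stack `bits`."""
    f, c = freqs[s], cumul[s]
    while x >= (f << (32 - PRECISION)):           # renormalize: keep C(x, s) < 2**32
        bits.append(x & 1)
        x >>= 1
    return bits, (x // f) * M + c + (x % f)

def streaming_rans_decode(bits, x):
    """Decode the last symbol from x, refilling x from the stack `bits`."""
    slot = x % M
    s = next(k for k in freqs if cumul[k] <= slot < cumul[k] + freqs[k])
    x = freqs[s] * (x // M) + slot - cumul[s]
    while x < STATE_LO:                           # renormalize: bring x back into I
        x = (x << 1) | bits.pop()
    return bits, x, s

bits, x = [], STATE_LO                            # a valid (non-zero) initial state
msg = "abacabbacb"
for s in msg:
    bits, x = streaming_rans_encode(bits, x, s)

out = []
for _ in msg:
    bits, x, s = streaming_rans_decode(bits, x)
    out.append(s)
# LIFO recovery; the state and stack return exactly to their initial values.
print(''.join(reversed(out)) == msg, x == STATE_LO, bits == [])
```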

5. BB-ANS = Chaining Bits-back coding of an rANS stream

Figure 4: BB ANS coding as compared to other lossless compression methods. Image from Townsend 2019a

In this section, chaining bits-back coding is applied to an rANS stream, giving the ultimate BB-ANS algorithm [Townsend 2019a]. First, the bits-back argument is briefly discussed, as it is the basis of chaining bits-back.

The bits-back argument [Wallace 1990, Frey and Hinton 1997] was, for us, counter-intuitive at first. It proposes to employ a one-to-many source-coding scheme, arguing that the stochastic complexity can be used to improve compression rates [Rissanen 1989], as such a scheme enables the transmission of side information for “free”.

The argument is focused on lossless compression using latent variable models, which enable a symbol s to be encoded more effectively when conditioned on a corresponding latent variable y. It proposes a method for encoding s using a conditional probability (p(s|y) rather than just p(s)) and sending y along with it, in a way that barely increases the message length! For every symbol s, a corresponding latent variable y is chosen from a posterior distribution q(y|s), using some “side information” which we can think of as an initial bit stack. This “sampling” process removes information from the stack and reduces its length. It means that when we start encoding s and y, we are starting with a shorter bit stack, which reduces the effective message length for encoding the symbol s. This decrease in length tends to (almost) cancel out the increase caused by encoding the latent variable y, enabling us to effectively encode it for free! To recover this “lost” information, the decoder is instructed to encode the missing information back onto the stack, by performing the inverse of the operation used by the encoder during the “sampling” process, which is how we get our bits back!

Image by Tomas Sobek on Unsplash

Chaining bits-back [Frey 1998] was proposed later; it presents a method to exploit the bits-back argument for a single stream of information (i.e., without additional side information). This is achieved using an elegant daisy-chain manipulation of a bit stream, as described later in this section. Chaining bits-back applied to an rANS stream gives the BB-ANS algorithm (finally!). As in the case of streaming rANS coding, symbols are encoded to a bit stream. However, as opposed to streaming rANS, B₀ is populated with an initial bit stream, which will be addressed in more detail at the end of this section.

Algorithm 5 presents the encoding of a new symbol sᵢ onto B_{i−1} using BB-ANS coding, resulting in an updated bit stack Bᵢ. Though the operations on the bit stack are different from those in Algorithm 3, the inputs and outputs are the same.

The probability distributions q(y|sᵢ) and p(s|yᵢ), which are fetched in lines 2 and 4, respectively, are typically learned using latent variable models such as variational autoencoders (VAE). In the BB-ANS setting, both sender and receiver have copies of the same pre-trained models, and thus can compute the same probability distributions needed for encoding/decoding. In addition, for compatibility with finite-length dictionaries (of the symbols), quantization is applied to the latent spaces of these models [Townsend 2019a].

Figure 5 : Illustration of encoding a symbol (an image of a flower) and its corresponding latent variable (the “random” pixels) to a bit-stack. Image from HiLLoC

After q(y|sᵢ) is computed, it is fed as an input to streaming_rANS_decode in line 3, along with B_{i−1}. This is where the magic happens: applying streaming_rANS_decode “peels off” bits from B_{i−1}, which are used to choose the latent variable yᵢ from q(y|sᵢ)! These are the bits that we will get back, as will be demonstrated in the BB_ANS_decode algorithm and the schematic diagrams soon to be presented. After yᵢ is chosen according to the “peeled-off” bits, sᵢ conditioned on yᵢ is encoded onto the stack in line 5, using the learned probability p(s|yᵢ), which is fetched in line 4. Then yᵢ is encoded onto the stack in line 7, using a prior probability distribution p(y), known to both sender and receiver, which gives the updated Bᵢ in line 8. p(y) is typically chosen as a normal distribution, which maximizes the entropy among all y distributions with fixed variance. Note that the initial decoding in line 3 of Algorithm 5 is the reason that B₀ is initialized as a non-empty bit stack (compared to the empty B₀ in the case of an rANS stream). The number of bits required typically depends on the entropy of q(y|s) and the architecture of the latent variable model [Townsend 2019a]. As these initial bits are non-informative, variants of BB-ANS have been proposed to decrease the number of these bits in scenarios involving hierarchical latent variable models [Kingma 2019, Townsend 2019b, Ho 2019].

Decoding symbol sᵢ from Bᵢ using the BB-ANS decoding is presented in Algorithm 6. It performs the inverse operations to those performed in Algorithm 5, returning the extracted symbol sᵢ and B_{i−1}.

As seen in the algorithm, initially the latent variable yᵢ is decoded from the stack in line 3, using the probability p(y) fetched in line 2. Then, sᵢ is decoded from the stack in line 5, using the probability p(s|yᵢ) fetched in line 4. Finally, the magic happens again: after both yᵢ and sᵢ have been successfully decoded, it is possible to restore the bits that were initially removed by the encoder, by performing the inverse of the operation that was used. Since the bits were removed by decoding yᵢ using q(y|sᵢ) (this was the “sampling” process), to restore those bits we need only encode yᵢ back onto the stack using q(y|sᵢ). This returns the bit stack to the exact state it was in before sᵢ was encoded, B_{i−1}, which means:

We get our bits back!

In Algorithm 6, q(y|sᵢ) is fetched in line 6 and is then used to encode yᵢ back onto the stack in line 7.
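To tie Algorithms 5 and 6 together, here is a hedged, self-contained sketch of BB-ANS built on top of the streaming rANS operations sketched earlier. The discrete tables for q(y|s), p(s|y) and p(y) are toy stand-ins for the quantized distributions of a learned latent variable model (e.g., a VAE) shared by sender and receiver; every name and number below is our own illustration.

```python
# A hedged, self-contained sketch in the spirit of Algorithms 5 and 6.
PRECISION = 12
M = 1 << PRECISION
STATE_LO = 1 << 31

def table(probs):
    """Turn {symbol: prob} into integer (freqs, cumulative) tables summing to M."""
    syms = list(probs)
    freqs = {k: max(1, round(probs[k] * M)) for k in syms[:-1]}
    freqs[syms[-1]] = M - sum(freqs.values())     # last symbol absorbs rounding
    cumul, acc = {}, 0
    for k in syms:
        cumul[k] = acc
        acc += freqs[k]
    return freqs, cumul

def enc(bits, x, s, dist):
    """Streaming rANS encode of s under the distribution `dist`."""
    freqs, cumul = dist
    f = freqs[s]
    while x >= (f << (32 - PRECISION)):           # renormalize before encoding
        bits.append(x & 1)
        x >>= 1
    return bits, (x // f) * M + cumul[s] + (x % f)

def dec(bits, x, dist):
    """Streaming rANS decode of one symbol under the distribution `dist`."""
    freqs, cumul = dist
    slot = x % M
    s = next(k for k in freqs if cumul[k] <= slot < cumul[k] + freqs[k])
    x = freqs[s] * (x // M) + slot - cumul[s]
    while x < STATE_LO:                           # renormalize after decoding
        x = (x << 1) | bits.pop()
    return bits, x, s

# Toy model: symbols 'a'/'b', binary latent y in {0, 1}.
p_y   = table({0: 0.5, 1: 0.5})                               # prior p(y)
p_s_y = {0: table({'a': 0.9, 'b': 0.1}), 1: table({'a': 0.2, 'b': 0.8})}
q_y_s = {'a': table({0: 0.8, 1: 0.2}), 'b': table({0: 0.1, 1: 0.9})}

def bb_ans_encode(bits, x, s):
    bits, x, y = dec(bits, x, q_y_s[s])   # "sample" y with bits taken from the stack
    bits, x = enc(bits, x, s, p_s_y[y])   # encode s conditioned on y
    bits, x = enc(bits, x, y, p_y)        # encode y under the prior
    return bits, x

def bb_ans_decode(bits, x):
    bits, x, y = dec(bits, x, p_y)        # decode y under the prior
    bits, x, s = dec(bits, x, p_s_y[y])   # decode s conditioned on y
    bits, x = enc(bits, x, y, q_y_s[s])   # put the "sampled" bits back on the stack
    return bits, x, s

initial_bits = [1, 0, 1, 1] * 16          # B0 needs some bits for the first "sampling"
bits, x = list(initial_bits), STATE_LO
msg = "abba"
for s in msg:
    bits, x = bb_ans_encode(bits, x, s)

out = []
for _ in msg:
    bits, x, s = bb_ans_decode(bits, x)
    out.append(s)
# The message is recovered and the stack returns exactly to its initial
# contents: we got our bits back.
print(''.join(reversed(out)) == msg, bits == initial_bits, x == STATE_LO)
```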

The encoding and decoding of two subsequent symbols in a sequence using BB-ANS is presented in Figure 6.

Figure 6: Illustration of coding and decoding of two symbols using BB-ANS coding [Townsend, 2019a].

In Figure 6, it is evident that three steps are required for encoding or decoding each symbol, as opposed to streaming rANS in Figure 3, where only one operation is required for each. Though the encoding operation is more complex, it is nonetheless uniquely decodable, since the operations are fully reversible. As with streaming rANS coding, the LIFO nature of BB-ANS is also apparent in the figure.

To conclude, this section presented BB-ANS coding and decoding. The average length of a message using this method is given by the average code length −E[log(p(s|y))] (with E[·] being the expectation operator), plus the average code length −E[log(p(y))], minus the bits that were initially “peeled off”, whose average length is −E[log(q(y|s))]. Putting it all together, the average increase in message length associated with sending a single symbol is:

ΔL = −E[log(p(s|y))] − E[log(p(y))] + E[log(q(y|s))]
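As a quick sanity check (our own toy calculation, not taken from the papers), this expected length can be evaluated directly for the small model used in the BB-ANS sketch above, together with an assumed source distribution over symbols; the result lands slightly above the source entropy, as it should.

```python
# Hedged numeric check of the formula above, using toy distributions and an
# assumed source distribution p_data. Bits are measured with log base 2.
import math

p_data = {'a': 0.6, 'b': 0.4}                               # assumed source
p_y    = {0: 0.5, 1: 0.5}
p_s_y  = {0: {'a': 0.9, 'b': 0.1}, 1: {'a': 0.2, 'b': 0.8}}
q_y_s  = {'a': {0: 0.8, 1: 0.2}, 'b': {0: 0.1, 1: 0.9}}

delta_L = sum(
    p_data[s] * q_y_s[s][y] *
    (-math.log2(p_s_y[y][s]) - math.log2(p_y[y]) + math.log2(q_y_s[s][y]))
    for s in p_data for y in p_y
)
entropy = -sum(p * math.log2(p) for p in p_data.values())
print(f"expected bits/symbol: {delta_L:.3f}  (source entropy: {entropy:.3f})")
```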

This effective average code-word length (the ΔL above) is referred to as the “free energy” in [Frey and Hinton 1996]. Note that since y is “sampled” using q(y|s) and encoded using p(y), the effective increase in message length associated with sending y is:

ΔL = E[log(q(y|s))] − E[log(p(y))] ≥ 0

which is the KL divergence between the two distributions, and is typically small compared to −E[log(p(s|y))]. BB-ANS-based methods achieve the best compression rates compared to other lossless compression schemes. However, the compression rates depend on the quality of the learned probabilistic models (q(y|sᵢ) and p(s|yᵢ)), which is why recent efforts in the field have focused on improving them by employing advanced neural network architectures to efficiently approximate the probability distributions of the data. Furthermore, their compression speeds are typically lower than those of regular rANS coding, as BB-ANS-based methods require more operations (as is apparent in Figure 6). The performance of several recent BB-ANS-based methods, as compared to generic methods, is displayed in Figure 7.

Figure 7: Performance of different compression methods, measured in average bits per pixel (8 is uncompressed), as evaluated on the test sets of several computer vision benchmarks. (IDF, LBB, Bit-Swap and HiLLoC are BB-ANS-based methods.) Image taken from HiLLoC

As seen in Figure 7, BB-ANS-based methods outperform generic compression methods such as PNG, WebP and FLIF across all benchmarks. On the CIFAR-10 benchmark, LBB (a variant of BB-ANS) achieves compression rates 46.8% better than PNG and 32.3% better than WebP! These results demonstrate the significance of this breakthrough and highlight the differences between compression methods.

Image by Clay Banks on Unsplash

6. Conclusion

This blog has presented BB-ANS coding and decoding, including schematic diagrams for demonstration.

The remarkable compatibility of neural network-based latent variable models with BB-ANS compression enables these methods to achieve state-of-the-art compression rates. However, from our own experience, applying these methods in real-world applications requires an “under the hood” understanding of their inner workings. We hope this blog will help others apply these methods and keep advancing this fascinating new field of research!
