Overcoming the compression limit of the individual sequence using the Set Shaping Theory

Aida Koch
Published in CodeX
Dec 9, 2023


Introduction

The parameter 𝑁𝐻0(𝑆) is considered the average coding limit for the symbols of a sequence S of N i.i.d. random variables, when the symbols are encoded with a uniquely decipherable and instantaneous code. In [1] we presented a program that performs a data compression experiment representing a counterexample to this limit.
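For reference, the zero-order empirical entropy used throughout this article has the standard definition (this formula does not appear in the original post, but it is the usual one):

```latex
% Zero-order empirical entropy of a sequence S of length N over alphabet A,
% where n_a is the number of occurrences of symbol a in S:
H_0(S) = \sum_{a \in A} \frac{n_a}{N} \log_2 \frac{N}{n_a}
% The coding limit discussed in this article is then N H_0(S) bits.
```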

Regarding this limit, it is important to specify that when we encode an individual sequence (the source is not known), the compressed message must also include the coding scheme, which can either be attached to the message or, as happens with many universal codes, be recovered from the encoded message itself. Consequently, an inefficiency is created which goes by various names; the most common are "price of universality" and "inefficiency of entropic coding". Therefore, exceeding this limit means reducing this inefficiency. This clarification serves to make clear that exceeding this limit does not lead to absurd results, such as being able to compress a random sequence; it simply means reducing an inefficiency that occurs when compressing an individual sequence whose source is not known.
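To make the "price of universality" concrete, here is a standard back-of-the-envelope accounting (our illustration, not taken from the original article):

```latex
% Total cost of transmitting an individual sequence S with an explicit scheme:
L_{\text{total}}(S) =
  \underbrace{L_{\text{scheme}}}_{\text{e.g.\ the Huffman table}}
  + \underbrace{L_{\text{code}}(S)}_{\ge\, N H_0(S)\ \text{for a code fitted to } S}
% The "price of universality" is the gap
% L_total(S) - N H_0(S) >= L_scheme > 0.
```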

The value of 𝑁𝐻0(𝑆) is considered the coding limit of the sequence S because 𝐻0(𝑆) is considered the minimum value of self-information per symbol. Travis Gagie, in the article [2], reports a theorem that he uses to support this statement. We will analyze this theorem and its proof, and highlight a fundamental critical point that may be the cause of the wrong interpretation of this theorem.

Counterexample

The counterexample can be described as follows:

Given a sequence S of length N consisting of i.i.d. random variables over an alphabet A, the parameter 𝐻0(𝑆) is considered the minimum value of self-information per symbol. Therefore, 𝑁𝐻0(𝑆) is considered the coding limit when the symbols belonging to the alphabet A are replaced by a uniquely decipherable and instantaneous code.

Consequently, this limit has two fundamental constraints:

The first concerns the type of code used to encode the symbols of the alphabet A, which must be uniquely decipherable and instantaneous.

The second concerns the alphabet, which must remain unchanged; in fact, 𝐻0(𝑆) is defined on the symbols belonging to the alphabet A.

Now, let’s build our counterexample starting from these two constraints.

As the coding method, we take Huffman coding, which generates uniquely decipherable and instantaneous codes.

We impose the condition that, in our counterexample, the alphabet never changes.

Based on these considerations, we developed the following counterexample (a Python sketch of the procedure is given after the list of steps):

1. Generate a random sequence S of length N over the alphabet A, with uniform distribution (symbol emission probability 1/|A|).

2. Calculate the parameter 𝑁𝐻0(𝑆) as defined previously.

3. Apply a transform to the sequence S, generating a new sequence f(S) whose symbols still belong to A (the alphabet is not changed).

4. Encode the sequence f(S) using Huffman coding (a uniquely decipherable and instantaneous code).

5. Compare the coding limit 𝑁𝐻0(𝑆) of the generated sequence with the length of the encoded transformed sequence f(S).

6. Repeat these steps a statistically significant number of times and take the average value of 𝑁𝐻0(𝑆) and the average length of the encoded transformed sequence f(S).
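As a rough, self-contained illustration of these steps, here is a minimal Python sketch. The article's actual program is the MATLAB script linked below; the transform f used here (a brute-force ranking of all sequences of length N+K by empirical entropy, feasible only for tiny N and |A|) is our assumption of a typical set-shaping construction, not necessarily the exact transform used by the authors.

```python
import heapq
import itertools
import math
import random
from collections import Counter

def h0_bits(seq):
    """Zero-order empirical entropy H0(seq) in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def huffman_lengths(counts):
    """Codeword length per symbol for a Huffman code built from counts."""
    if len(counts) == 1:                    # degenerate one-symbol alphabet
        return {next(iter(counts)): 1}
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)                         # tie-breaker for equal weights
    while len(heap) > 1:
        c1, _, group1 = heapq.heappop(heap)
        c2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:           # every merge adds 1 bit of depth
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, tie, group1 + group2))
        tie += 1
    return lengths

def huffman_encoded_bits(seq):
    """Length in bits of seq encoded with a Huffman code fitted to seq."""
    counts = Counter(seq)
    lengths = huffman_lengths(counts)
    return sum(lengths[s] * c for s, c in counts.items())

def sst_table(alphabet, n, k=1):
    """Toy set-shaping transform (an assumption, not the authors' code):
    map the i-th sequence of length n (lexicographic order) to the i-th
    lowest-entropy sequence of length n+k over the same alphabet."""
    longer = sorted(itertools.product(alphabet, repeat=n + k), key=h0_bits)
    return {s: longer[i]
            for i, s in enumerate(itertools.product(alphabet, repeat=n))}

def experiment(alphabet, n, k=1, trials=10_000):
    f = sst_table(alphabet, n, k)
    avg_nh0 = avg_coded = 0.0
    for _ in range(trials):
        s = tuple(random.choices(alphabet, k=n))      # step 1: uniform i.i.d.
        avg_nh0 += n * h0_bits(s) / trials            # step 2: NH0(S)
        avg_coded += huffman_encoded_bits(f[s]) / trials  # steps 3-4
    print(f"|A|={len(alphabet)} N={n}: avg NH0(S)={avg_nh0:.2f} bits, "
          f"avg |Huffman(f(S))|={avg_coded:.2f} bits")

if __name__ == "__main__":
    experiment(alphabet=(0, 1, 2), n=8)  # tiny sizes so the table fits in memory
```

Note that at these toy sizes the per-symbol rounding overhead of Huffman coding dominates, so the printed numbers will not reproduce Table 1; the sketch is only meant to make the five steps concrete.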

The table shows data obtained with the algorithm described for different values of |A| and N=60.

The first column reports the parameter |A|, which indicates the number of symbols of the generated random sequences. The second column reports the average value of the zero-order empirical entropy multiplied by the length of the generated sequence, 𝑁𝐻0(𝑆). The third column reports the average number of bits of the transformed sequence f(S) encoded using a code obtained by applying Huffman coding.

All values reported are average values calculated by generating one million sequences.

Table 1: Results obtained for different settings of the parameter |A|, the average values are calculated on one million generated sequences.

As can be seen from the reported data, the length of the encoded sequence f(S) is, on average, smaller than the value of 𝑁𝐻0(𝑆).
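How can a transformed sequence compress below 𝑁𝐻0(𝑆)? The intuition behind set shaping, as we understand it (our summary, not a quotation from the article), is that the transform trades length for skewness:

```latex
% f maps A^N injectively into A^{N+K}, targeting the lowest-entropy
% region of the larger space: the transformed sequence is longer, but
% its empirical distribution is more skewed, so on average
(N+K)\,H_0\!\left(f(S)\right) \;<\; N\,H_0(S)
% which is what the experiment measures (up to Huffman's rounding overhead).
```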

The program is public, and anyone can use it and test it.

https://it.mathworks.com/matlabcentral/fileexchange/115590-test-sst-huffman-coding

Analysis of the theorem regarding the 𝑁𝐻0(𝑆) coding limit

In this section, we analyze Theorem 1 of the article "Large alphabets and incompressibility" by Travis Gagie. In the introduction of the section "Upper bounds", the author states that, when ℓ = 0, 𝐻0(𝑆) represents the minimum self-information per character. Consequently, 𝑁𝐻0(𝑆) represents the coding limit of the sequence S using a uniquely decipherable and instantaneous code.

This important statement is supported by Theorem 1, whose proof we now examine.

The proof begins by taking log(1/Pr[Q emits S]) to be min{ log(1/Pr[Q emits S]) : Q is an ℓth-order Markov process }.
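That is, the proof starts from the self-information of S under the best ℓth-order Markov model (Q* and the class symbol below are our notation, not the original's):

```latex
% M_l = the class of l-th order Markov processes; Q* is the minimizer.
\log \frac{1}{\Pr[Q^{*} \text{ emits } S]}
  \;=\; \min_{Q \in \mathcal{M}_{\ell}} \log \frac{1}{\Pr[Q \text{ emits } S]}
```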

This equivalence implies that no transform can, on average, reduce the value of 𝑁𝐻0(𝑆). This hypothesis is not supported by any proof, and we believe this point represents a critical weakness in Travis Gagie's theorem.

Since the sequences are made up of random variables, the author probably considered it unnecessary to prove the non-existence of a function capable of reducing, on average, the value of 𝑁𝐻0(𝑆).

By applying the set shaping theory, as demonstrated in our counterexample, it is possible to reduce the value of 𝑁𝐻0(𝑆) on average. Consequently, the hypothesis underlying this theorem turns out to be incorrect.

Based on this consideration, the theorem only proves that 𝐻0(𝑆) represents the minimum self-information per character under the condition that the frequencies of the symbols in S are kept unchanged, that is, when no transform is applied.

In conclusion, to define the minimum self-information per character we must consider all possible transforms of S; this important step is not present in the reported theorem.
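One way to formalize this objection (our notation, not the article's) is that the relevant minimum should range over transforms as well as codes:

```latex
% What the theorem bounds (code fitted to S, symbol frequencies fixed):
\min_{C\ \text{u.d.\ instantaneous}} |C(S)| \;\approx\; N H_0(S)
% What, according to the article, should be minimized (f ranges over
% invertible transforms that keep the alphabet A unchanged):
\min_{f}\; \min_{C\ \text{u.d.\ instantaneous}} \big|C\!\left(f(S)\right)\big|
```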

Conclusions

The purpose of this article is not to give answers but to pose a problem to the scientific community regarding the meaning given to the parameter 𝑁𝐻0(𝑆), which is considered the coding limit of an individual sequence. The counterexample presented is incompatible with this limit. We also analyzed a theorem that proves that 𝐻0(𝑆) represents the minimum self-information per character, showing that it rests on a fundamental hypothesis that is never proven: the author takes as true the hypothesis that no transform is capable of reducing the value of 𝑁𝐻0(𝑆) on average. Set shaping theory shows that this hypothesis is wrong: there is a transform that manages to reduce the value of 𝑁𝐻0(𝑆) on average.

Given the enormous importance that this limit has in information theory, we believe that the time has come for the scientific community to address this problem.

References

1. C. Schmidt, A. Vdberg, A. Petit, "Practical applications of Set Shaping Theory in Huffman coding", arXiv preprint arXiv:2208.13020, 2022.

2. T. Gagie, "Large alphabets and incompressibility", Information Processing Letters, Elsevier, 2006.
