Image Compression From Scratch in Python

Maxwell Conradt
11 min read · Aug 23, 2022


In this article I walk through an image compression and decompression pipeline in Python. It uses linear algebra operations to transform an image into a sparse matrix and then uses the inverse operations to re-create the original image. I’ve published the source code from the examples to GitHub, so feel free to follow along and do some hacking of your own.

Mathematical Background

Before we get into the code, we’ll explore the math behind it. The first step is modeling an image as a two-dimensional matrix. This allows us to use linear algebra to transform the image. The values of the matrix represent the brightness of each pixel and are integers between 0 (black) and 255 (white). If we used color images, we would represent them as three-dimensional matrices, the extra dimension representing the intensities of different components, i.e. red, green, and blue.

The algorithm we’ll use requires the matrix to have the same number of rows and columns. Additionally, each stage of the encoding combines pairs of inputs, so the dimensions of the matrix need to be powers of two. Both of these requirements can be satisfied by padding with zeros: for example, a 300x500 image would be padded to 512x512.

The pipeline uses a Haar wavelet transform to iteratively encode the image as a set of “averages” and “details”. Basically, it summarizes large areas or patterns in the image (averages) along with the values needed to reconstruct the original from the averages (details). Some averages will be of details from an earlier step, which allows summarizing patterns in the differences in an image. Conversely, some details will be differences in the averages from earlier steps.

To the extent pixel values in any quadrant of the image (recursively defined) are either close to the average of the pixel values in their parent quadrant or have similar patterns to other quadrants with the same parent, the transformation will encode the image into a sparse matrix (one with a lot of zeros). The base case, the input image, is modeled as the averages of single pixel values.

Alfréd Haar (Wikipedia)

We’ll walk through the algorithm using this 4x4 example “image”, labelled with the pixel values:
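Here it is as a matrix (the same values we’ll assign to A in the code later in the article):

[[  0  17  34  51]
 [ 68  85 102 119]
 [136 153 170 187]
 [204 221 238 255]]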

The first encoding step needs the following properties:

  1. The first output column is the average of the first and second input columns, and the second output column is the average of the third and fourth. We can do this with B[i][0] = (A[i][0] + A[i][1]) / 2 and B[i][1] = (A[i][2] + A[i][3]) / 2.
  2. The third and fourth output columns are the difference between the first input column and the first output column, and the difference between the third input column and the second output column, respectively. This simplifies to B[i][2] = (A[i][0] - A[i][1]) / 2 and B[i][3] = (A[i][2] - A[i][3]) / 2.

We’ll use a matrix multiplication for the transformation. This allows us to efficiently compose and invert transformations. The matrix looks like this:

T1 = np.array([[ 0.5,  0. ,  0.5,  0. ],
               [ 0.5,  0. , -0.5,  0. ],
               [ 0. ,  0.5,  0. ,  0.5],
               [ 0. ,  0.5,  0. , -0.5]])

After applying the transformation (multiplying the input image and transformation matrix), we get this matrix:
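With our example image, that works out to:

[[  8.5  42.5  -8.5  -8.5]
 [ 76.5 110.5  -8.5  -8.5]
 [144.5 178.5  -8.5  -8.5]
 [212.5 246.5  -8.5  -8.5]]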

Next, we’ll multiply this output by the transpose of the last transformation, on the left this time, which produces the following:
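For our example:

[[ 42.5  76.5  -8.5  -8.5]
 [178.5 212.5  -8.5  -8.5]
 [-34.  -34.    0.    0. ]
 [-34.  -34.    0.    0. ]]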

What’s going on here?

  1. The top-left quadrant contains the averages of each quadrant, maintaining the relative positions of the quadrants.
  2. The top-right quadrant contains details. It represents the difference between the average of the left half of each quadrant and the average of the entire quadrant, which is always -8.5. This makes sense, because the original pixel values were increasing by 17 along the x-dimension.
  3. The bottom-left quadrant also contains details. It represents the difference between the average of the top half of each quadrant and the average of the entire quadrant, which is always -34. This also makes sense, because the values were increasing by 68 along the y-dimension.
  4. The bottom-right quadrant tells us the difference between the top left value in each quadrant and the average plus the x-detail and the y-detail (the values mentioned in #2 and #3, respectively). This is always zero, because the values follow a common pattern: A[i][j] = 68 * i + 17 * j. If the values weren’t a linear combination of the row and column, i.e. i * j or i ** 2 , this quadrant would not be empty.
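If you’d like to check these claims yourself, here’s a minimal NumPy sketch that carries the example through both multiplications:

import numpy as np

# The example image: A[i][j] = 68 * i + 17 * j
A = np.array([[68.0 * i + 17.0 * j for j in range(4)] for i in range(4)])
T1 = np.array([[0.5, 0.0, 0.5, 0.0],
               [0.5, 0.0, -0.5, 0.0],
               [0.0, 0.5, 0.0, 0.5],
               [0.0, 0.5, 0.0, -0.5]])
# Multiply by T1 on the right, then by its transpose on the left
print(T1.T @ A @ T1)
# [[ 42.5  76.5  -8.5  -8.5]
#  [178.5 212.5  -8.5  -8.5]
#  [-34.  -34.    0.    0. ]
#  [-34.  -34.    0.    0. ]]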

The next and final row transformation needs the following properties:

  1. The first column of each output row is the average of the first two columns of each input row. This is equivalent to B[i][0] = (A[i][0] + A[i][1]) / 2.
  2. The second column of each output row is the difference between the first column and the average of the first and second column. This is equivalent to B[i][1] = (A[i][0] - A[i][1]) / 2.
  3. Columns three and four of the output have the same values as columns three and four of the input.

This is the transformation that satisfies all three properties:

T2 = np.array([[0.5,  0.5, 0.0, 0.0],
               [0.5, -0.5, 0.0, 0.0],
               [0.0,  0.0, 1.0, 0.0],
               [0.0,  0.0, 0.0, 1.0]])

This is similar to the previous transformation, but it only operates on the first two columns of each row. The rest of the information needs to be preserved in order to reconstruct the original image. After applying the transpose of this on the left, only the bottom-right quadrant will be left unchanged. That quadrant contained only zeros in our example, so any linear combination of its values would still be zero.

Applying this transformation to the output of the previous one, we get the following:
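For our example:

[[ 59.5 -17.   -8.5  -8.5]
 [195.5 -17.   -8.5  -8.5]
 [-34.    0.    0.    0. ]
 [-34.    0.    0.    0. ]]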

Next, we apply the transpose of this transformation on the left, to get the final encoded image:
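For our example (this matches the output of the code we’ll write later):

[[127.5 -17.   -8.5  -8.5]
 [-68.    0.    0.    0. ]
 [-34.    0.    0.    0. ]
 [-34.    0.    0.    0. ]]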

What’s going on here?

  1. The top-left value is the average of all the input values.
  2. The value in the first row and second column is the difference between the average of the values in the two leftmost quadrants and the average of the entire image’s values.
  3. The value in the second row and first column is the difference between the average of the values in the two topmost quadrants and the average of the entire image’s values.
  4. The values not in the first row or column are zero. This is because the values are a linear combination of the row and column. Even in a much larger matrix, if this is the case only the values in the first row and column will be nonzero.
  5. In the original, 15 values were nonzero, and now only 7 are nonzero. This would reduce the storage footprint of the image, or allow effectively using SciPy sparse matrices.
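As a quick illustration of that last point, here’s a minimal sketch of storing the encoded example with SciPy’s sparse matrices (SciPy isn’t used elsewhere in this pipeline):

import numpy as np
from scipy import sparse

E = np.array([[127.5, -17.0, -8.5, -8.5],
              [-68.0, 0.0, 0.0, 0.0],
              [-34.0, 0.0, 0.0, 0.0],
              [-34.0, 0.0, 0.0, 0.0]])
E_sparse = sparse.csr_matrix(E)
print(E_sparse.nnz)  # 7 stored values instead of 16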

Before moving on to decoding, let’s look at a couple of other examples, which step through the transformations in the same order as above:

Try to grok those examples before moving on. If you were paying close attention, you may have noticed that the third input and output are the sums of the first and second inputs and outputs, respectively. This is because matrix multiplication distributes over addition.
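Here’s a quick check of that, with two arbitrary 4x4 matrices and the first transformation from earlier (a sketch; the random matrices are just stand-ins for images):

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 256, (4, 4)).astype(float)
B = rng.integers(0, 256, (4, 4)).astype(float)
T = np.array([[0.5, 0.0, 0.5, 0.0],
              [0.5, 0.0, -0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0, -0.5]])
# Encoding the sum equals the sum of the encodings
assert np.allclose(T.T @ (A + B) @ T, T.T @ A @ T + T.T @ B @ T)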

We’ll apply the inverses of the encoding matrices in reverse order to get the original matrix. The inverses of linear transformations cancel out the transformations, leaving us with the original values.

This is the inverse of the second encoding transformation:

array([[ 1.,  1.,  0.,  0.],
       [ 1., -1., -0., -0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

This is the inverse of the first encoding transformation:

array([[ 1.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  1.],
       [ 1., -1., -0., -0.],
       [-0., -0.,  1., -1.]])

For brevity, we’ll just use these inverses directly, but if we hand-crafted transformations that undo the encoding steps, recombining the averages and details to recover the inputs, we would arrive at exactly these matrices.

Comparing the decoding transformations with the encoding transformations, you can see that they invert the original combinations of the inputs. If that weren’t possible, the matrices wouldn’t be invertible. It’s no coincidence that the Haar wavelet transform is a lossless encoding and the transformations we’ve used are invertible.

The intuition behind this is that instead of storing brightness values in each cell, we store linear combinations of the brightness values. These linear combinations are crafted to be invertible (they’re not much use otherwise) and likely to be zero if the image has certain properties. This is analogous to how we can add and subtract equations from one another to find a unique solution for a variable if the system of equations has certain properties. I won’t go on, or else this post would become “Everything Max Knows About Linear Algebra”, but I highly recommend checking out Brilliant.org’s courses on the topic if you’re interested in learning more.

Here’s a step-by-step decoding of the encoded image:
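In code, decoding our example works out to the following (a sketch using T1 and T2 from earlier, with np.linalg.inv computing the inverses):

import numpy as np

E = np.array([[127.5, -17.0, -8.5, -8.5],
              [-68.0, 0.0, 0.0, 0.0],
              [-34.0, 0.0, 0.0, 0.0],
              [-34.0, 0.0, 0.0, 0.0]])
T1 = np.array([[0.5, 0.0, 0.5, 0.0],
               [0.5, 0.0, -0.5, 0.0],
               [0.0, 0.5, 0.0, 0.5],
               [0.0, 0.5, 0.0, -0.5]])
T2 = np.array([[0.5, 0.5, 0.0, 0.0],
               [0.5, -0.5, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0]])
step = np.linalg.inv(T2).T @ E @ np.linalg.inv(T2)  # undo the second transformation
print(step)  # the matrix we had after the first encoding stage
A = np.linalg.inv(T1).T @ step @ np.linalg.inv(T1)  # undo the first
print(A)     # the original 4x4 example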

And the step-by-step process of decoding the other two examples:

Algorithm

Hello, welcome to the code part of this article. In this section, we’ll create a general-purpose algorithm for encoding and decoding any image (but especially cat images without copyrights).

We’ll use Pillow for image I/O, padding the images, and converting them to grayscale. We’ll also use NumPy for our linear algebra routines.

First, we create a function that pre-processes the image:

import math

from PIL import Image, ImageOps


def preprocess_image(image: Image.Image) -> Image.Image:
    """
    Create a square image with a side length that's a power of 2.
    """
    # Convert to grayscale first so padding is less expensive; this might be a large image
    image = ImageOps.grayscale(image)
    # Find the largest dimension
    dim = max(image.size)
    # Round it up to the next power of 2
    new_dim = 2 ** int(math.ceil(math.log(dim, 2)))
    return ImageOps.pad(image, (new_dim, new_dim))
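For example (cat.jpg is a hypothetical filename; any image works):

im = Image.open("cat.jpg")  # hypothetical input file
im = preprocess_image(im)
print(im.size)  # a power-of-2 square, e.g. (512, 512)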

Next, we’ll write a function that generates each stage of the Haar wavelet transformation. This generalizes the logic we used to create the example transformations:

import numpy as np


def get_haar_step(i: int, k: int) -> np.ndarray:
    transform = np.zeros((2 ** k, 2 ** k))
    # Averages (1)
    for j in range(2 ** (k - i - 1)):
        transform[2 * j, j] = 0.5
        transform[2 * j + 1, j] = 0.5
    # Details (2)
    offset = 2 ** (k - i - 1)
    for j in range(offset):
        transform[2 * j, offset + j] = 0.5
        transform[2 * j + 1, offset + j] = -0.5
    # Identity (3)
    for j in range(2 ** (k - i), 2 ** k):
        transform[j, j] = 1
    return transform

  1. Step through the columns that will contain “averages” in the output of the transformation. Each output column is the average of two consecutive input columns, so the row index advances by 2 for each output column, and both weights are 1/2.
  2. Step through the columns that will contain the details of the averages in the input of this step (remember the input image is all averages). We’re subtracting 1/2 of the second input average from 1/2 of the first input average.
  3. Step through the columns to leave unchanged. To do this, we set this part of the matrix’s diagonal to one.
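As a quick sanity check, get_haar_step reproduces the example transformations from the math section:

>>> get_haar_step(0, 2)  # T1 from the example
array([[ 0.5,  0. ,  0.5,  0. ],
       [ 0.5,  0. , -0.5,  0. ],
       [ 0. ,  0.5,  0. ,  0.5],
       [ 0. ,  0.5,  0. , -0.5]])
>>> get_haar_step(1, 2)  # T2 from the example
array([[ 0.5,  0.5,  0. ,  0. ],
       [ 0.5, -0.5,  0. ,  0. ],
       [ 0. ,  0. ,  1. ,  0. ],
       [ 0. ,  0. ,  0. ,  1. ]])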

It’s interesting to note that each individual transformation is sparse; it’s only once we compose them that the matrix becomes dense.

Now, we compose the transformations:

def get_haar_transform(k: int) -> np.ndarray:
    transform = np.eye(2 ** k)
    for i in range(k):
        transform = transform @ get_haar_step(i, k)
    return transform

This composes the row transformations from each step in the order we want (most detailed to least detailed). Conveniently, applying the transpose of the row transformation on the left performs the same combinations on the columns, leaving us with this function:

def haar_encode(a: np.ndarray) -> np.ndarray:
    k = int(np.ceil(np.log2(len(a))))
    assert a.shape == (2 ** k, 2 ** k)
    row_encoder = get_haar_transform(k)
    return row_encoder.T @ a @ row_encoder

Applying this to our earlier example, we see it does the same thing:

# Code
A = np.array([[0., 17., 34., 51.],
              [68., 85., 102., 119.],
              [136., 153., 170., 187.],
              [204., 221., 238., 255.]])
E = haar_encode(A)
print(E)

# Output
[[127.5 -17.   -8.5  -8.5]
 [-68.    0.    0.    0. ]
 [-34.    0.    0.    0. ]
 [-34.    0.    0.    0. ]]

To decode the image, we use np.linalg.inv to invert the encoding transformation:

def haar_decode(a: np.ndarray) -> np.ndarray:
    k = int(np.ceil(np.log2(len(a))))
    assert a.shape == (2 ** k, 2 ** k)
    row_decoder = np.linalg.inv(get_haar_transform(k))
    return row_decoder.T @ a @ row_decoder

Let’s try it out:

>>> A = np.array([[0., 17., 34., 51.],
...               [68., 85., 102., 119.],
...               [136., 153., 170., 187.],
...               [204., 221., 238., 255.]])
>>> E = haar_encode(A)
>>> D = haar_decode(E)
>>> print(D)
[[  0.  17.  34.  51.]
 [ 68.  85. 102. 119.]
 [136. 153. 170. 187.]
 [204. 221. 238. 255.]]
>>> print(D == A)
[[ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]

The code works! Notice that the pipeline’s output is exactly equal to the input. We’re going to change that to get even better compression. First, let’s test it out on a cat picture:

Jan Kopřiva (Unsplash)

To compress the matrix better, we’ll use this function, which truncates small nonzero values in the encoded matrix to zero:

def truncate_values(a: np.ndarray, threshold: float) -> np.ndarray:
    return np.where(np.abs(a) < threshold, 0, a)

Thresholds in the range of 1–8 (the units of the threshold are brightness values) seem to give reasonable combinations of compression and image quality.

Here is the final pipeline:

im = Image.open("cat.jpg")  # load the source image (hypothetical filename)
im = preprocess_image(im)
A = np.array(im)
E = haar_encode(A)
threshold = 8
E = truncate_values(E, threshold)
D = haar_decode(E)
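To save or display the result, the decoded array can be converted back to a Pillow image. A minimal sketch (the clip guards against values pushed slightly outside 0 to 255 by the truncation; the output filename is hypothetical):

result = Image.fromarray(np.clip(D, 0, 255).astype(np.uint8))
result.save("cat_decoded.png")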

Here are some examples of images after they’ve been encoded, lossily truncated, and decoded using different thresholds for truncation:

A threshold of 12 is the highest at which the result is recognizable as a cat, though not necessarily the cat in the picture. It achieves a 52:1 compression ratio. Here’s the decoded image:

Conclusion

In this post we did a deep dive into the math behind the Haar wavelet transform and created a respectable image compression and decompression suite. Thanks for reading!
