VQGAN. From image reconstruction to new image generation step by step.

Olga Mindlina
11 min read · Dec 15, 2023


The subject of this article is VQGAN as a whole system for new image generation. I have already discussed one part of VQGAN, the autoencoder (VQVAE: Vector Quantized Variational Auto Encoder), here. The idea of VQVAE is the simultaneous training of the Encoder, the Decoder and a codebook that is shared across all possible images. The codebook is a set of embedding vectors, each of dimension 256. The latent space of any image with an input resolution of 256x256 is represented by some subset of the codebook vectors. The VQVAE pipeline is illustrated in the bottom part of the picture below (the picture is from the article):

Fig. 1

The encoder transforms the input image (256x256 pixel resolution) into a latent space arranged as a 16x16 plane of entries, where each entry is a vector of 256 values (in the schema in Fig. 1 the latent space is shown as a 4x4 plane). Then each entry of the latent space is replaced by the codebook vector that is nearest to it in the L2 metric; this process is called vector quantization. Thus the latent space can be represented by a 16x16 plane of codebook indices. Sending this quantized latent space to the decoder, we obtain the reconstructed image.
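To make the quantization step concrete, here is a minimal sketch of the nearest-codebook-vector lookup (my own illustration, not the taming-transformers implementation); it assumes the encoder output has been flattened to z_flat of shape (16*16, 256) and a codebook of shape (1024, 256):

import torch

def quantize(z_flat, codebook):
    # squared L2 distance from every latent entry to every codebook vector
    dist = torch.cdist(z_flat, codebook, p=2)           # shape (256, 1024)
    indices = dist.argmin(dim=1)                        # nearest codebook index per entry
    z_q = codebook[indices]                             # quantized entries, shape (256, 256)
    return z_q.view(16, 16, 256), indices.view(16, 16)  # quantized plane + index plane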

In VQGAN, the autoencoder part is extended with an additional CNN, a patch-based Discriminator (see the picture). The Discriminator has a classifier structure. The picture shows the interaction between VQVAE and the Discriminator: after the image is reconstructed it is sent to the Discriminator, which produces a class value for each image patch. The Discriminator computes such "class per patch" maps for both the input and the reconstructed image and decides, for each patch, whether it looks real or fake. During VQGAN training the Discriminator plays the adversarial role: it tries to maximize its objective, while the total loss (VQVAE loss + the adversarial loss from the Discriminator) is minimized. A good explanation of the VQGAN loss composition is here. The Discriminator is not used when the trained model reconstructs an image; it is only used during training to improve the quality of the VQVAE. The Discriminator also plays an important role in the next stage of training for new image generation.
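For intuition, here is a rough sketch of what a patch-based (PatchGAN-style) discriminator looks like; the layer widths and depth are my assumptions, not the exact taming-transformers architecture:

import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 4, 1, 4, stride=1, padding=1),  # one real/fake logit per patch
        )

    def forward(self, x):
        return self.net(x)  # a grid of patch logits, not a single scalar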

Practical experiments with the latent space

In this section I demonstrate image reconstruction with VQGAN in practice and experiment with the latent space, the codebook and their role in the generation of new images. Many parts of the code below come from this google colab; I run the following code in my own google colab.

Imports:

import copy
import cv2
import sys

import torch

from PIL import Image
from torchvision import transforms

import matplotlib.pyplot as plt
import numpy as np

Google drive mapping:

from google.colab import drive
drive.mount('/content/gdrive')

Cuda device setting:

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

Installation of VQGAN and downloading of the model (the code is from this google colab):

%pip install omegaconf>=2.0.0 pytorch-lightning>=1.0.8 einops>=0.3.0
sys.path.append(".")

!git clone https://github.com/CompVis/taming-transformers
%cd taming-transformers

# download a VQGAN with f=16 (16x compression per spatial dimension) and with a codebook with 1024 entries
!mkdir -p logs/vqgan_imagenet_f16_1024/checkpoints
!mkdir -p logs/vqgan_imagenet_f16_1024/configs
!wget 'https://heibox.uni-heidelberg.de/f/140747ba53464f49b476/?dl=1' -O 'logs/vqgan_imagenet_f16_1024/checkpoints/last.ckpt'
!wget 'https://heibox.uni-heidelberg.de/f/6ecf2af6c658432c8298/?dl=1' -O 'logs/vqgan_imagenet_f16_1024/configs/model.yaml'
# also disable grad to save memory
torch.set_grad_enabled(False)

The code above downloads the smallest model, with 1024 codebook entries.

Two utility functions: one reads an image from a file and transforms it into a torch tensor, the other shows the input and output images side by side:

def get_img_tensor(name,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Resize((256, 256)),
                   ])):
    img = Image.open(name)
    img = transform(img)
    img = img.unsqueeze(0)
    return img

def show_results(img, out):
    rec = custom_to_pil(out[0])
    _, ax = plt.subplots(1, 2, figsize=(12, 5))
    if img is not None:
        ax[0].imshow(img[0].permute(1, 2, 0))
        ax[0].axis("off")
    ax[1].imshow(rec)
    ax[1].axis("off")
    plt.show()

The code below contains functions for using VQGAN for image reconstruction (the code is from this google colab):

from omegaconf import OmegaConf
from taming.models.vqgan import VQModel

def load_config(config_path):
    config = OmegaConf.load(config_path)
    return config

def load_vqgan(config, ckpt_path=None):
    model = VQModel(**config.model.params)
    if ckpt_path is not None:
        sd = torch.load(ckpt_path, map_location="cpu")["state_dict"]
        missing, unexpected = model.load_state_dict(sd, strict=False)
    return model.eval()

def preprocess_vqgan(x):
    x = 2.*x - 1.
    return x

def custom_to_pil(x):
    x = x.detach().cpu()
    x = torch.clamp(x, -1., 1.)
    x = (x + 1.)/2.
    x = x.permute(1, 2, 0).numpy()
    x = (255*x).astype(np.uint8)
    x = Image.fromarray(x)
    if not x.mode == "RGB":
        x = x.convert("RGB")
    return x

def reconstruct_with_vqgan(x, model):
    # could also use model(x) for reconstruction but use explicit encoding and decoding here
    z, _, [_, _, indices] = model.encode(x)
    print(f"VQGAN --- {model.__class__.__name__}: latent shape: {z.shape[2:]}")
    xrec = model.decode(z)
    return xrec

load_config(), load_vqgan() — the functions for pre-trained model loading.

preprocess_vqgan() — the function for input image-tensor preprocessing before sending it to VQGAN Encoder.

custom_to_pil() — the function for reconstructed image-tensor post-processing after VQGAN Decoder.

reconstruct_with_vqgan() — the function for image reconstruction: it calls Encoder, gets image latent space, and then calls Decoder to obtain the reconstructed image.

Note: a known issue: the line "from taming.models.vqgan import VQModel" may raise an import error. A quick fix is here.

Now everything is ready for the reconstruction. Load the pre-trained model:

config1024 = load_config("logs/vqgan_imagenet_f16_1024/configs/model.yaml")
model1024 = load_vqgan(config1024, ckpt_path="logs/vqgan_imagenet_f16_1024/checkpoints/last.ckpt").to(device)

And use the model for image reconstruction:

img = get_img_tensor("image path")
out = reconstruct_with_vqgan(preprocess_vqgan(img.to(device)), model1024)
show_results(img, out)

An example of an image reconstruction result:

Fig. 2

reconstruct_with_vqgan() calls the Encoder and the Decoder consecutively. Let's look at the encoding step separately:

z, _, [_, _, indices] = model1024.encode(preprocess_vqgan(img.to(device)))
indices = indices.detach().cpu().numpy()

The pre-trained encoder returns the latent space z with shape (1, 256, 16, 16) and the codebook-vector indices with shape (256,). The codebook vectors with these indices compose the latent space when they are placed in raster order in the 16x16 plane. In other words, if I had the 256 indices in the proper order, I would be able to build the latent space from the codebook and reconstruct the picture by calling the Decoder. In the code below I try this.
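A quick sanity check of these shapes (the values shown are for the f16 model with a 1024-entry codebook):

print(z.shape)        # torch.Size([1, 256, 16, 16])
print(indices.shape)  # (256,): one codebook index per latent entry, in raster order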

First, obtain codebook vectors:

ind = torch.arange(1024).to(device)
cb = model1024.quantize.get_codebook_entry(ind, None)
print(cb.shape)

In the code above I get vectors with indices 0, … 1023 from the codebook, i.e. the whole codebook (I use the small VQGAN model). The shape of the codebook is (1024, 256).

The function below shows how to build a latent space from the codebook plus a numpy array of 256 indices, and how to get the output picture using the Decoder:

def cb_construct(cb, indices, img):
    emb = [cb[i] for i in indices]
    zn = torch.stack(emb)

    zn = torch.reshape(zn, (16, 16, 256))
    zn = torch.unsqueeze(zn, 0)
    zn = zn.permute(0, 3, 1, 2)

    xrec = model1024.decode(zn.to(device))
    show_results(img, xrec)

If we call this function with cb and indices obtained in previous code blocks:

cb_construct(cb, indices, img)

we obtain exactly the same reconstruction result as in Fig. 2.
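To confirm this numerically rather than visually, the latent space rebuilt from the codebook can be compared directly with the encoder output z (this rebuilds zn the same way cb_construct does):

zn = torch.stack([cb[i] for i in indices])
zn = zn.reshape(16, 16, 256).unsqueeze(0).permute(0, 3, 1, 2).to(device)
print(torch.allclose(z, zn))  # should print True: the quantized latent space consists exactly of these codebook vectors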

If we try to shuffle the indices and decode the latent space with shuffled vectors:

indices1 = copy.deepcopy(indices)
np.random.shuffle(indices1)
cb_construct(cb, indices1, img)

we obtain a new image with some abstraction:

Some other examples of “abstract art” based on latent spaces from other pictures:

Thus, we’ve tried experimentally that for creation of any image we need the codebook, the set of codebook-vectors indices in the defined order and Decoder. Intuitively it is clear that we need some system which can define the sub-set of codebook vectors and the order of their indices to generate some kinds of realistic images.

Note: the latent space of a high-resolution image is a concatenation of the latent spaces of the 256x256 patches that the image consists of.
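As an illustration of this note, here is a rough sketch (my own assumption of one possible implementation, not code from the repository) that encodes a high-resolution image patch by patch and tiles the resulting 16x16 index planes:

def encode_highres_indices(img, model, patch=256):
    # img: (1, 3, H, W) with H and W divisible by 256
    _, _, H, W = img.shape
    rows = []
    for y in range(0, H, patch):
        cols = []
        for x in range(0, W, patch):
            tile = img[:, :, y:y + patch, x:x + patch]
            _, _, [_, _, idx] = model.encode(preprocess_vqgan(tile.to(device)))
            cols.append(torch.as_tensor(idx).reshape(16, 16))
        rows.append(torch.cat(cols, dim=1))
    return torch.cat(rows, dim=0)  # index plane of shape (H/16, W/16)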

Taming Transformer

The Taming Transformer model is the second stage of the image generator. It is trained to generate the sequence of indices for the latent space of a new image. Generation starts from processing an initial condition. Here are the types of conditional images the model can process (pictures from the article):

The <type of the input + input code> is sent to the model as the initial parameter. The input code is the set of codebook-vector indices of the conditional image's latent space. The model is trained to predict the current index from the previously predicted ones; the first index is predicted from the input code alone. At each step the transformer predicts a distribution over the possible next indices (Fig. 1). If the input image resolution is 256x256, all previously predicted indices are used to predict the current one. For a high-resolution image, each position uses only previously predicted indices from a sliding window of neighboring patches, as shown in the picture below (the picture is from the article):

The Taming Transformer uses the VQGAN trained in the first stage (Encoder, Decoder and codebook) as a frozen backbone. Training works on the codebook indices directly: the training image is encoded into its sequence of codebook indices, the transformer predicts a distribution over the next index at every position (given the conditioning code and the previous indices), and a cross-entropy loss between the predicted distributions and the actual indices is minimized. The Decoder is only needed afterwards, to turn a generated index sequence back into an image.
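To make the autoregressive generation concrete, here is a schematic sketch of index sampling (illustrative only; the transformer here is assumed to map a sequence of conditioning + already-generated indices to logits over the 1024 codebook entries, which is not the exact taming-transformers API):

def sample_indices(transformer, cond_indices, num_steps=256, temperature=1.0):
    seq = cond_indices.clone()                              # start from the conditioning code
    for _ in range(num_steps):
        logits = transformer(seq)[:, -1, :] / temperature   # logits for the next codebook index
        probs = torch.softmax(logits, dim=-1)
        next_idx = torch.multinomial(probs, num_samples=1)  # sample one index
        seq = torch.cat([seq, next_idx], dim=1)
    return seq[:, cond_indices.shape[1]:]                   # the 256 generated indices (a 16x16 plane)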

I experimented with this google colab to generate new images from a segmentation mask, using the input data proposed in the colab. Here are the results of 3 different runs for the same segmentation mask with the pre-trained Taming Transformer model:

The segmentation mask may contain up to 182 classes of objects (mask values from 1 to 182).

Conclusions about the Taming Transformer + VQGAN system:

1. The system is able to generate high-quality, realistic images from an image-based conditional input.

2. The system might be used for image expansion: for example, the conditional image might be the top half of an image, and the output image will contain this top part plus a generated bottom part.

3. The conditional image needs special pre-processing depending on the input type, for example a proper configuration of the segmentation mask.

4. The system generates new images similar to the conditional image, but it cannot be used to change the conditional image's style.

CLIP + VQGAN system for new image generation

First, I continue with the code from the section on practical experiments with the latent space. In the code below I change the image latent space after encoding by multiplying it by 0.7:

img = get_img_tensor("image path")
z, _, [_, _, indices] = model1024.encode(preprocess_vqgan(img.to(device)))
out = model1024.decode(0.7 * z)
show_results(img, out)

As a result, I obtain a winter landscape in another style:

I can change the latent space in another way, for example by multiplying the 70th element of each vector of the latent space by 50:

ind = 70
z, _, [_, _, indices] = model1024.encode(preprocess_vqgan(img.to(device)))

z = z.permute(1, 0, 2, 3)
z = [z[i] for i in range(256)]
z[ind] = z[ind] * 50
z = torch.stack(z)
z = z.permute(1, 0, 2, 3)

out = model1024.decode(z)
show_results(img, out)

The landscape style is changed in another way:

In these two experiments I “forget” about codebook indices and change the latent space as a whole.

The idea of the CLIP + VQGAN system is similar: to change the image in a desirable way by changing the latent space as a whole. CLIP plays the role of a discriminator which understands a text description of the desired image and produces a loss value. CLIP itself is a system trained to measure image vs. text similarity.

The generation process looks like training in which the latent-space weights are the parameters being changed: CLIP produces an embedding vector of the text description, VQGAN decodes the latent space into an image, then CLIP produces the embedding vector of that image, and the CLIP + VQGAN system calculates its cosine similarity with the embedding vector of the input text description. The goal of the system is to maximize this similarity (cosine similarity lies in the interval [-1, 1]). To reach this goal, the system changes the weights of the latent space in a backward-propagation step. The main challenge is to pass gradients back to the latent space correctly during backward propagation, because VQGAN and CLIP are not trained as a backbone of the system; they are just loaded pre-trained models.

I tried the CLIP + VQGAN realization from this google colab. I've changed input images according to input text descriptions to obtain video effects in the image. The pictures below show some results:
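Here is a simplified sketch of the core optimization loop (my own illustration under assumptions: the OpenAI clip package, the ViT-B/32 model, no augmentations or cutouts, and CLIP's image normalization omitted; real notebooks such as the one linked above add many more tricks):

# pip install git+https://github.com/openai/CLIP.git
import clip
import torch.nn.functional as F

torch.set_grad_enabled(True)  # gradients were disabled earlier for reconstruction

clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval().requires_grad_(False)   # CLIP and VQGAN stay frozen
model1024.requires_grad_(False)

with torch.no_grad():
    text_emb = clip_model.encode_text(clip.tokenize(["snow-covered spruces"]).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

z_opt = z.detach().clone().requires_grad_(True)   # the latent space is the optimized "parameter"
optimizer = torch.optim.Adam([z_opt], lr=0.05)

for step in range(200):
    out = model1024.decode(z_opt)                                            # (1, 3, 256, 256), values in [-1, 1]
    img_for_clip = F.interpolate((out + 1) / 2, size=224, mode="bilinear")   # CLIP expects 224x224 input
    img_emb = clip_model.encode_image(img_for_clip)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = -(img_emb * text_emb).sum()                                       # maximize cosine similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

show_results(img, model1024.decode(z_opt))  # compare the input image with the optimized result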

Text prompt: “snow-covered spruces”.

Image transformation:

Text prompt 1: “white and red dots on a black silhouette of an elephant”.

Text prompt 2: “white and red flowers on a black silhouette of an elephant”.

Image transformation:

The source of the input picture featuring the house is https://pt.pinterest.com/pin/448108231687867193/ :

Examples of changing art styles:

You can find more examples of style transformations here.

And an example of New Year art generated from noise:

Text prompt: “A new year spruce watercolor detailed”.

The conclusion about the CLIP + VQGAN system:

In contrast to the Taming Transformer + VQGAN system, the CLIP + VQGAN system is a solution for art rather than for photorealistic image generation. It is able to generate images in different styles while accepting user input in the most convenient form: a text description.

Conclusion

The goal of both my previous post (about autoencoders) and the present one is to trace, step by step, the development of the concept of new image generation with VQGAN:

- Autoencoders for image compression -> a relatively small image latent space, image reconstruction for specific datasets.

- Vector Quantized autoencoders -> an image latent space based on a codebook and vector-quantization techniques, high-quality reconstruction of any image.

- Taming Transformers + VQGAN -> new image generation from codebook vectors based on predicting their indices.

- CLIP + VQGAN -> new image generation based on changing the image latent space according to a text description.
