Generate Stunning Artworks with CLIP Guided Diffusion

Create beautiful artworks by fine-tuning diffusion models on custom datasets, performing CLIP-guided text-conditional sampling, and upscaling the results with Swin transformer-based super-resolution

Sreevishnu Damodaran
14 min read · Mar 9, 2022
Generated samples from different prompts (Image by Author)

Human creativity is arguably the most indispensable ingredient behind every great feat we have ever accomplished. Deep generative models have long been used to mimic this skill, and they keep getting better thanks to steady progress in research. Let’s explore the creative capabilities of these models and take a deep dive into how we can combine deep generative models with generalized vision-language models to create beautiful artworks of various styles from natural language text prompts.

We will look at how to fine-tune diffusion probabilistic models on a custom dataset created from artworks in the public domain. During sampling, we will use a vision-language CLIP model to steer or guide this fine-tuned model with natural language prompts, without any extra training or supervision. The generated images will then be enlarged with a Swin transformer-based super-resolution model, which turns the low-resolution output into a high-resolution image by adding finer, realistic details and enhancing visual quality. We will also briefly cover the concepts behind the inner workings of each of these models, and the details of integrating them, in a bit.

Here is a general block diagram showing the various components.

General block diagram (Image by Author)

Here are some examples of the artwork generation process from text prompts, using the final fine-tuned model with CLIP guidance:

“vibrant watercolor painting of a flower, artstation HQ” (Image by Author)
“artstation HQ, photorealistic depiction of an alien city” (Image by Author)

To see more generated artworks, check out this report.


Throughout this article, we will be using a code base I have put together: https://github.com/sreevishnu-damodaran/clip-diffusion-art

Dataset created for this project with public domain artworks:

Diffusion Models

Over the years, deep generative models have evolved to model complex, high-dimensional probability distributions across a range of perceptive and predictive tasks, enabled by well-formulated neural network architectures and parametrization techniques. For some time, Generative Adversarial Networks (GANs), Variational Auto-Encoders (VAEs) and flow-based models were the front-runners in this area. Despite the many milestones achieved with these models, they suffer from a range of shortcomings in terms of training stability, lack of diversity, and high sensitivity to changes in hyper-parameters.

Diffusion probabilistic models, a new family of models inspired by non-equilibrium thermodynamics, were introduced by Sohl-Dickstein et al. in 2015 to overcome these weaknesses, or rather to explore other ways of solving generative tasks. Several papers and improvements later, they have achieved competitive log-likelihoods and state-of-the-art results across a wide variety of tasks, while maintaining better characteristics than their counterparts in terms of training stability and diversity in image synthesis.

A graphical representation of the diffusion model (Image source: Ho et al. 2020)

The key idea behind diffusion models is a parameterized Markov chain trained to produce samples from a data distribution by reversing a gradual, multi-step noising process: starting from pure noise x_T, the model denoises step by step to produce less noisy samples x_{T−1}, x_{T−2}, … until it reaches the final synthesized sample x_0. Contrary to earlier formulations, it was later found that parameterizing the model as a noise predictor ε_θ(x_t, t), which estimates the noise component of a noisy sample x_t, works better than directly predicting the denoised image (Ho et al.). To train these models, each sample in a mini-batch is produced by randomly drawing a data sample x_0, a timestep t, and a noise sample ε, which together are used to produce a noisy sample x_t. The training objective is then:
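L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\,\right]

(the simplified objective from Ho et al., 2020)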

That is, a simple mean-squared error between the true noise and the predicted noise. The noise prediction is approximated by a neural network, since the exact reverse process depends on the entire data distribution, which is unknown. The latent information of the training data distribution is therefore stored in the neural network part of the model.

We will be using diffusion model architectures and training procedures from the papers Improved Denoising Diffusion Probabilistic Models and Diffusion Models Beat GANs, both by Dhariwal and Nichol (OpenAI, 2021), where the authors improved the log-likelihood, to better capture all modes of the data distribution, as well as other generative metrics like FID (Fréchet Inception Distance) and IS (Inception Score), to enhance the fidelity of generated images. The model we will use has a neural network architecture based on the backbone of PixelCNN++, which is a U-Net based on a Wide ResNet with group normalization instead of weight normalization, to make the implementation simpler. These models have two convolutional residual blocks per resolution level, and use multi-head self-attention blocks at the 16×16 and 8×8 resolutions between the convolutional blocks. Diffusion time t is specified by adding a transformer sinusoidal position embedding into each residual block.
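To make the timestep conditioning concrete, here is a minimal sketch of such a sinusoidal timestep embedding, closely following the guided-diffusion implementation (the function name and tensors here are illustrative):

import math
import torch

def timestep_embedding(timesteps, dim, max_period=10000):
    """Transformer-style sinusoidal embedding of the diffusion timestep t,
    which is injected (after a small MLP) into each residual block."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 250, 999]), dim=128)
print(emb.shape)  # torch.Size([3, 128])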

There are several other intricacies to understanding diffusion models, with many improvements in recent literature, all of which would be hard to summarize in a short article. For a better theoretical understanding and implementation details, I recommend going through the papers on diffusion models. At the time of writing, the number of papers on diffusion models is not as overwhelming as the number of GAN papers.

A Faster Way of Sampling with DDIMs

DDPMs inherently suffer from the need to sample hundreds to thousands of steps to generate a high-fidelity sample, making them prohibitively expensive and impractical in real-world applications, where the data tends to be high-dimensional.

A solution to this problem is to use non-Markovian diffusion processes during sampling instead of the Markovian processes used in DDPMs. This new class of models, called DDIMs (Denoising Diffusion Implicit Models), follows the same training procedure as DDPMs for an arbitrary number of forward steps, but the reverse process uses new generative processes that sample from only a subset of those forward steps during generation. The authors showed that DDIMs can produce high-quality samples 10× to 50× faster than DDPMs.
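As a rough illustration of how a respaced DDIM schedule can be built, here is a minimal sketch that picks an evenly spaced subset of the 1000 training timesteps (the actual schedule construction in the codebase may differ slightly):

import numpy as np

def respace_timesteps(num_train_steps=1000, num_sample_steps=50):
    """Pick an evenly spaced subset of the training timesteps for DDIM sampling.

    Visiting only num_sample_steps of the original num_train_steps noise
    levels (e.g. 50 instead of 1000) is where the 10x-50x speed-up comes from.
    """
    stride = num_train_steps / num_sample_steps
    timesteps = np.round(np.arange(num_sample_steps) * stride).astype(int)
    # Sample in descending order: x_T -> ... -> x_0
    return list(reversed(timesteps.tolist()))

print(respace_timesteps(1000, 50)[:5])  # [980, 960, 940, 920, 900]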

Steering Gradients with CLIP

CLIP (Contrastive Language–Image Pre-training) has set a benchmark in zero-shot transfer, natural language supervision, and multi-modal learning by training on a wide variety of images with language supervision. These models are not trained to directly optimize the benchmark of a single task, making them far less short-sighted about the visual and language concepts they learn. This leads to better performance than several supervised ImageNet-trained models, with zero-shot CLIP matching the original ResNet-50 without being trained explicitly on any of its 1.28M labeled samples. CLIP has been used in a wide variety of tasks since it was introduced in January 2021.

Comparison of CLIP with other models (Image source: Radford et al.)

The authors trained on a large dataset of around 400 million image-text pairs. In every iteration, a batch of N text-image pairs is forwarded through an image encoder and a text encoder, which are trained jointly to maximize the cosine similarity of the text and image embeddings of the N real pairs (the diagonal elements of the multi-modal embedding space in the figure below), while minimizing the similarity scores of the other N²−N elements (the off-diagonal positions), forming a contrastive training objective. A symmetric cross-entropy loss over these similarity scores is used to optimize the model.

CLIP training process (Image source: Radford et al.)
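Here is a minimal PyTorch sketch of this symmetric contrastive objective; the real CLIP implementation additionally uses a learned temperature and very large distributed batches:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    image_emb and text_emb have shape (N, D). The N matching pairs sit on
    the diagonal of the N x N similarity matrix; every other entry acts
    as a negative."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) cosine similarities
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_images = F.cross_entropy(logits, labels)     # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_images + loss_texts) / 2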

We will use CLIP to steer the denoising process of the diffusion model during sampling, so that it produces samples matching the text prompt provided as a condition. This technique has been used in works like DALL-E and GLIDE, and also to guide other generative models such as VQGAN, StyleGAN2 and SIREN (Sinusoidal Representation Networks), to name a few. The guidance procedure works by encoding the intermediate output image of the diffusion model during the iterative sampling process with the CLIP image encoder, while the text prompt is converted to an embedding with the text encoder. The resulting image and text embeddings are then used to compute a perceptual loss, which measures the similarity between the two. The gradients of this loss with respect to the intermediate denoised image are used to condition, or guide, the diffusion model into producing the next intermediate denoised image. This process is repeated until the sampling steps are complete. We also use auxiliary losses such as total variation loss (which controls spatial smoothing) and range loss, as well as image augmentations, to improve quality. In addition, multiple cutouts of the image are taken in batches when minimizing the loss objective, which improves synthesis quality and optimizes memory usage when sampling on smaller GPUs.
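To make this concrete, here is a simplified sketch of what such a guidance function can look like, using the spherical distance loss popularized by Katherine Crowson’s CLIP-guided notebooks. It is not the exact code from the repository: the real sampler adds random cutouts, augmentations, CLIP input normalization, and the total variation/range losses mentioned above.

import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/16", device=device)
clip_model.eval().requires_grad_(False)

prompt = "vibrant watercolor painting of a flower, artstation HQ"
text_emb = clip_model.encode_text(clip.tokenize([prompt]).to(device)).float()

def spherical_dist(x, y):
    # Squared great-circle distance between L2-normalized embeddings.
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    return (x - y).norm(dim=-1).div(2).arcsin().pow(2).mul(2)

def cond_fn(x_in, clip_guidance_scale=5000):
    """Gradient of the CLIP loss w.r.t. the intermediate denoised image x_in.

    In the real sampler, x_in (the model's current denoised estimate) is cut
    into random crops and normalized with CLIP's statistics before encoding."""
    with torch.enable_grad():
        x_in = x_in.detach().requires_grad_()
        crops = F.interpolate(x_in, size=224, mode="bilinear")  # stand-in for random cutouts
        image_emb = clip_model.encode_image(crops).float()
        loss = spherical_dist(image_emb, text_emb).mean() * clip_guidance_scale
        # The negative gradient nudges the next denoising step toward the prompt.
        return -torch.autograd.grad(loss, x_in)[0]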

Upscaling Generated Images with Super-resolution

Large deep generative models need to be trained on large GPU clusters for days or even weeks. On single, smaller GPUs, we are limited to training 256×256 diffusion models, which can only output images with limited visual detail. We will work around this by training a smaller 256×256 output model and upscaling its predictions 4× to obtain final images of size 1024×1024. Conventional upscaling with interpolation techniques such as bilinear or Lanczos degrades image quality and introduces blurring artifacts, since no new visual detail gets added. An easy remedy is a super-resolution model trained to recover the finer details through a generative process, producing enlarged images with high perceptual quality and peak signal-to-noise ratio (PSNR).

Swin transformers are a class of vision transformer-based neural network architectures aimed at improving the adaptation of transformers to vision tasks, similar to ViT/DeiT. They have achieved state-of-the-art results across various tasks such as image classification, instance segmentation, and semantic segmentation. They take a hierarchical approach, building feature maps by progressively merging patches from one layer to the next, which lets the architecture scale to different image sizes. Self-attention is computed only within each local window (which contains a fixed number of patches), reducing computation to linear complexity in image size, compared to the quadratic complexity of ViTs, where self-attention is computed globally. Local self-attention lacks connections across windows, which limits modelling power; this is solved by cyclically shifting the window partition, effectively enabling cross-window connections. The partitioning configuration alternates between non-shifted and shifted blocks in consecutive layers, enhancing the overall modelling power.

(a) Building hierarchical feature maps by merging image patches in Swin transformers; (b) global computation of self-attention in ViT. (Image source: Liu et al.)
(a) The architecture of a Swin Transformer (Swin-T); (b) two successive Swin Transformer Blocks. W-MSA and SW-MSA are multi-head self-attention modules with regular and shifted windowing configurations, respectively. (Image source: Liu et al.)
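A toy sketch of the window partitioning and the cyclic shift that consecutive Swin blocks alternate between (simplified from the reference implementation; shapes and names are illustrative):

import torch

def window_partition(x, window_size):
    """Split a feature map of shape (B, H, W, C) into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 8, 8, 96)                           # toy feature map
windows = window_partition(x, window_size=4)           # W-MSA: attention runs inside each 4x4 window
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))  # cyclic shift before the SW-MSA block
shifted_windows = window_partition(shifted, window_size=4)
print(windows.shape, shifted_windows.shape)            # torch.Size([4, 16, 96]) twice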

We will make use of an image-restoration model proposed in the paper SwinIR: Image Restoration Using Swin Transformer, which is built upon Swin transformer blocks. The image generated after N CLIP-conditioned diffusion denoising steps is fed as input to this model. The architecture of SwinIR consists of modules for shallow feature extraction, deep feature extraction, and high-quality (HQ) image reconstruction. The shallow feature extraction module uses a convolution layer to extract shallow features carrying the low-frequency information, which are transmitted directly to the final reconstruction module. The deep feature extraction module consists of several Residual Swin Transformer Blocks (RSTB), each containing several Swin transformer layers that capture local attention and cross-window interactions. The authors also add another convolution layer at the end of each block for feature enhancement, with a residual connection providing a shortcut for feature aggregation. The shallow and deep features are fused in the final reconstruction module, producing the restored, enlarged image.

SwinIR architecture (Image source: Liang et al.)
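The overall flow of the architecture can be sketched as below. This is only a schematic stand-in: plain convolutions replace the Residual Swin Transformer Blocks, and the module names are made up for illustration, but the shallow/deep feature fusion and the 4× pixel-shuffle reconstruction mirror the structure described above.

import torch
import torch.nn as nn

class TinySwinIRSketch(nn.Module):
    """Schematic of the SwinIR layout: shallow conv features plus deep
    features (stand-in for the RSTB stack), fused by a residual connection
    and upsampled by the reconstruction head."""
    def __init__(self, channels=60, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, channels, 3, padding=1)
        self.deep = nn.Sequential(*[nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4)])
        self.reconstruct = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),  # 256x256 -> 1024x1024 for scale=4
        )

    def forward(self, x):
        shallow = self.shallow(x)
        deep = self.deep(shallow) + shallow  # fuse shallow and deep features
        return self.reconstruct(deep)

out = TinySwinIRSketch()(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 3, 1024, 1024])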

For running the complete code interactively with more control and settings, take a look at this Kaggle Notebook.

Installations

git clone https://github.com/sreevishnu-damodaran/clip-diffusion-art.git -q
cd clip-diffusion-art
pip install -e . -q
git clone https://github.com/JingyunLiang/SwinIR.git -q
git clone https://github.com/crowsonkb/guided-diffusion -q
pip install -e guided-diffusion -q
git clone https://github.com/openai/CLIP -q
pip install -e ./CLIP -q

Create the Dataset

I downloaded public-domain artworks from WikiArt and rawpixel.com to create the dataset used for this project, then resized everything to 256×256. The dataset contains around 29.3k images. We will use this dataset to fine-tune our model.

Download it from here.

To use custom datasets for training, download/scrape the necessary images, and then resize them (and preferably center crop to avoid aspect ratio change) to the input size of the diffusion model of choice.

Note: Make sure all the images have 3 channels (RGB). In case of grayscale images, convert them to RGB.
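A minimal sketch of this preprocessing with Pillow, assuming hypothetical raw_images/ and dataset/ folders:

from pathlib import Path
from PIL import Image

SIZE = 256  # input resolution of the diffusion model we fine-tune

def prepare_image(src, dst, size=SIZE):
    """Center-crop to a square (avoiding aspect-ratio distortion),
    resize to the model input size, and force 3-channel RGB."""
    img = Image.open(src).convert("RGB")  # grayscale/paletted -> RGB
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img = img.resize((size, size), Image.LANCZOS)
    img.save(dst, quality=95)

Path("dataset").mkdir(exist_ok=True)
for path in Path("raw_images").glob("*.jpg"):
    prepare_image(path, Path("dataset") / path.name)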

Download Pre-trained Weights

wget https://openaipublic.blob.core.windows.net/diffusion/march-2021/lsun_uncond_100M_1200K_bs128.pt -P ./pretrained_models -q

Training Config & Hyperparameters

We will now select the hyper-parameters and other training configurations for fine-tuning with the custom dataset. We have selected reasonable defaults which allow us to fine-tune a model on custom datasets with the 16GB GPUs on Colab or Kaggle.

I have integrated Weights & Biases into the repository we use, for better logging of metrics and images. Just give a project name like --wandb_project diffusion-art-train to enable wandb logging.

MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 2 --num_heads 1 --attention_resolutions 16"DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear --learn_sigma True --rescale_learned_sigmas True --rescale_timesteps True --use_scale_shift_norm False"TRAIN_FLAGS="--lr 5e-6 --save_interval 500 --batch_size 16 --use_fp16 True --wandb_project diffusion-art-train --resume_checkpoint pretrained_models/lsun_uncond_100M_1200K_bs128.pt"

Run the training job as follows:

python clip_diffusion_art/train.py --data_dir path/to/images $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

Refer to OpenAI’s improved-diffusion repository for more details on choosing hyper-parameters and on selecting other pre-trained weights.

Generate Samples

Let’s download and use a checkpoint that was trained earlier for 5000 iterations on the same artworks-in-public-domain dataset to generate samples. Do note that we are using a fine-tuned checkpoint trained for a small number of iterations on a single 16GB GPU for demonstration purposes. Practical applications may need more hyper-parameter tuning, longer training, and larger pre-trained models.

wget https://api.wandb.ai/files/sreevishnu-damodaran/clip_diffusion_art/29bag3br/256x256_clip_diffusion_art.pt -q

Give the prompt of your choice below:

python clip_diffusion_art/sample.py \
"beautiful matte painting of dystopian city, Behance HD" \
--checkpoint 256x256_clip_diffusion_art.pt \
--model_config "clip_diffusion_art/configs/256x256_clip_diffusion_art.yaml" \
--sampling "ddim50" \
--cutn 60 \
--cut_batches 4 \
--sr_model_path pretrained_models/003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth \
--large_sr \
--output_dir "outputs"

Super-resolution is enabled by default, and the SwinIR pre-trained weights will be downloaded automatically. Pass the --large_sr flag to use the large model.

Some examples to try:

“beautiful matte painting of dystopian city, Behance HD”
“vibrant watercolor painting of a flower, artstation HQ”
“a photo realistic apple in HD”
“beach with glowing neon lights, trending on artstation”
“beautiful abstract painting of the horizon in ultrafine detail, HD”
“vibrant digital illustration of a waterfall in the woods, HD”
“beautiful matte painting of ship at sea, Behance HD”
“hyper realism oil painting of beautiful skies, HD”

Options:

--images - image prompts (default=None)
--checkpoint - diffusion model checkpoint to use for sampling
--model_config - diffusion model config yaml
--wandb_project - enable wandb logging and use this project name
--wandb_name - optional run name to use for wandb logging
--wandb_entity - optional entity to use for wandb logging
--num_samples - number of samples to generate (default=1)
--batch_size - batch size for the diffusion model (default=1)
--sampling - timestep respacing sampling methods to use (default="ddim50", choices=[25, 50, 100, 150, 250, 500, 1000, ddim25, ddim50, ddim100, ddim150, ddim250, ddim500, ddim1000])
--diffusion_steps - number of diffusion timesteps (default=1000)
--skip_timesteps - diffusion timesteps to skip (default=5)
--clip_denoised - enable to filter out noise from generation (default=False)
--randomize_class_disable - disables changing imagenet class randomly in each iteration (default=False)
--eta - the amount of noise to add during sampling (default=0)
--clip_model - CLIP pre-trained model to use (default="ViT-B/16", choices=["RN50","RN101","RN50x4","RN50x16","RN50x64","ViT-B/32","ViT-B/16","ViT-L/14"])
--skip_augs - enable to skip torchvision augmentations (default=False)
--cutn - the number of random crops to use (default=16)
--cutn_batches - number of batches of random crops to accumulate (default=4)
--init_image - init image to use while sampling (default=None)
--loss_fn - loss fn to use for CLIP guidance (default="spherical", choices=["spherical", "cos_spherical"])
--clip_guidance_scale - CLIP guidance scale (default=5000)
--tv_scale - controls smoothing in samples (default=100)
--range_scale - controls the range of RGB values in samples (default=150)
--saturation_scale - controls the saturation in samples (default=0)
--init_scale - controls the adherence to the init image (default=1000)
--scale_multiplier - scales clip_guidance_scale tv_scale and range_scale (default=50)
--disable_grad_clamp - disable gradient clamping (default=False)
--sr_model_path - SwinIR super-resolution model checkpoint (default=None)
--large_sr - enable to use large SwinIR super-resolution model (default=False)
--output_dir - output images directory (default="output_dir")
--seed - the random seed (default=47)
--device - the device to use

Generated Results

📌 View more generated artworks here

“vibrant watercolor painting of a flower, artstation HQ” (Image by Author)
“beautiful matte painting of a dystopian city, Behance HD” (Image by Author)

Super-resolution results:

“vibrant matte painting of a house in an enchanted forest, artstation HQ” (Image by Author)
(Image by Author)

Credits

Developed using techniques and architectures borrowed from original work by the authors below:

Huge thanks to all of them for their great work! I highly recommend checking these out.

Code base & Dataset

Next Steps

GLIDE by OpenAI achieved remarkable results in this same task of text-conditional image synthesis with diffusion models. The authors also compare different guidance strategies, such as CLIP guidance and classifier-free guidance, as well as image editing with text-guided diffusion models. For the public CLIP models we used, the noisy intermediate images are out-of-distribution, since these models are not trained on noisy images, and this hurts the sample quality of the generations. Training CLIP on noisy images would therefore be a great way to improve this project. The GLIDE paper is also a good place to continue reading on these topics.

One thing we can be certain of is that we will get to see some extraordinary accomplishments, and even more interesting things being done with deep generative models in the future.

Thanks for reading!
