Note: If you missed the first article in this series, please follow the link below to read it.
Diffusion Models for Image-to-Image Translation
Palette [15] develops a unified framework for image-to-image translation based on conditional diffusion models, focusing on four tasks: colorization, inpainting, uncropping, and JPEG restoration. Notably, this simple implementation of image-to-image diffusion models outperforms previous GAN-based methods on all four tasks. Palette also studies how the choice of L2 versus L1 loss in the denoising diffusion objective affects sample diversity and finds that L2 loss is preferable because it leads to higher sample diversity. It further demonstrates the importance of self-attention in the neural architecture through empirical studies. The results of Palette are shown in Fig 23, where the proposed method visibly outperforms the other methods in quality.
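To make the L2 versus L1 comparison concrete, here is a minimal sketch that computes both variants of a denoising loss on a flat list of noise values. The helper name `denoising_loss` and the toy inputs are illustrative assumptions; they are not taken from the Palette implementation.

```python
def denoising_loss(pred_noise, true_noise, p=2):
    """Mean of |pred - true|**p over all elements.

    p=2 gives the L2 (mean squared error) objective and p=1 gives the
    L1 (mean absolute error) objective. Hypothetical helper for illustration.
    """
    if p not in (1, 2):
        raise ValueError("p must be 1 or 2")
    if len(pred_noise) != len(true_noise):
        raise ValueError("inputs must have the same length")
    total = sum(abs(a - b) ** p for a, b in zip(pred_noise, true_noise))
    return total / len(true_noise)


true_noise = [1.0, 3.0]  # noise that was added to the image (toy values)
pred_noise = [0.0, 0.0]  # noise predicted by the network (toy values)
l1 = denoising_loss(pred_noise, true_noise, p=1)
l2 = denoising_loss(pred_noise, true_noise, p=2)
```

On these toy values the squared error weights the larger residual far more heavily than the absolute error does, which is the only difference between the two objectives in this sketch.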
Score-based diffusion models have achieved state-of-the-art results in unpaired image-to-image translation (I2I). However, previous methods completely ignored the training data in the source domain, leading to suboptimal solutions. To address this issue, [16] proposed energy-guided stochastic differential equations (EGSDE), which use an energy function pretrained on both the source and target domains to guide the inference process of a pretrained SDE for realistic and faithful unpaired I2I. EGSDE builds on two feature extractors and carefully designs the energy function to encourage the transferred image to preserve domain-independent features and discard domain-specific ones. Additionally, [16] provides an alternative explanation of EGSDE as a product of experts, in which each of the three experts (corresponding to the SDE and the two feature extractors) contributes to faithfulness or realism. The top of Fig 24 illustrates how EGSDE incorporates the realism expert and the faithfulness expert to preserve domain-independent features and discard domain-specific ones, and the bottom of Fig 24 shows visualization results of EGSDE.
While diffusion models can generate high-quality and diverse images, current conditional diffusion models still struggle to maintain high similarity with the condition image in image-to-image translation tasks because of the Gaussian noise added in the reverse process. To address this issue, Li et al. [16] introduced VQBB, a diffusion model for image-to-image translation based on Brownian bridges and GANs. The process begins by encoding the image with a VQ-GAN. In the resulting quantized latent space, the diffusion process, formulated as a Brownian bridge, maps between the latent representations of the source and target domains. The process is completed by decoding the quantized vectors with another VQ-GAN to synthesize the image in the new domain. The two GAN models are trained independently on their respective domains. By confining the diffusion process to the quantized latent space, the proposed method improves learning efficiency and translation accuracy.
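The Brownian bridge idea can be sketched as an interpolation between two fixed endpoints plus noise whose variance vanishes at both ends. The snippet below is a minimal illustration using the textbook Brownian bridge variance t(T - t)/T. The function name, the `scale` parameter, and the plain-list latents are assumptions made for this example, and the variance schedule used in the paper may differ.

```python
import math
import random


def brownian_bridge_sample(z_start, z_end, t, T, rng, scale=1.0):
    """Sample an intermediate latent on a bridge pinned at z_start (t=0) and z_end (t=T).

    Mean: linear interpolation between the endpoints.
    Variance: scale * t * (T - t) / T, which is zero at t=0 and t=T.
    """
    if not 0 <= t <= T:
        raise ValueError("t must lie in [0, T]")
    m = t / T
    std = math.sqrt(scale * t * (T - t) / T)
    return [
        (1.0 - m) * a + m * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(z_start, z_end)
    ]


rng = random.Random(0)
z_a = [0.0, 1.0, -2.0]  # toy latent of one domain (assumed values)
z_b = [4.0, 1.0, 2.0]   # toy latent of the other domain (assumed values)
at_start = brownian_bridge_sample(z_a, z_b, t=0, T=10, rng=rng)
midway = brownian_bridge_sample(z_a, z_b, t=5, T=10, rng=rng)
at_end = brownian_bridge_sample(z_a, z_b, t=10, T=10, rng=rng)
```

Because the noise vanishes at both ends, the process starts exactly at one latent and ends exactly at the other, in contrast to a standard diffusion process that ends in pure Gaussian noise.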
Previous image-to-image translation methods often require specialized architectural designs and the training of individual translation models from scratch, which makes high-quality generation of complex scenes difficult, especially when paired training data is scarce. In [17], PITI addresses this issue by treating each image-to-image translation problem as a downstream task and introducing a simple, generic framework that adapts a pretrained diffusion model to a variety of image-to-image translation tasks. To improve generation quality, PITI uses adversarial training to enhance texture synthesis during diffusion model training and adopts normalized guidance sampling.
PITI’s framework, shown in Fig 25, consists of two steps. The first step involves pretraining on various image-to-image translation tasks using a diffusion model, and the second step involves fine-tuning on downstream tasks.
In [18], the authors extend the diffusion model by replacing the classifier with a task-specific model. The image-to-image translation uses denoising diffusion implicit models and relies on either a regression problem or a segmentation problem to guide the image generation towards the desired output. At every step of the sampling process, the gradient of the task-specific network is infused. The method is demonstrated using a regressor (based on an encoder) or a segmentation model (using the U-Net architecture). The advantage of this approach is that only the task-specific model has to be trained: the diffusion model itself does not need to be retrained for different tasks on the same dataset.
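The gradient infusion described above can be sketched in the style of classifier guidance: the mean predicted by the frozen diffusion model is shifted against the gradient of a task loss. This is a minimal sketch under stated assumptions. The toy regressor f(x) = mean(x), the squared-error loss, and the names `task_loss_grad` and `guided_mean` are all hypothetical, and the exact update used in the cited work may differ.

```python
def task_loss_grad(x, target):
    """Gradient of (f(x) - target)**2 for the toy regressor f(x) = mean(x)."""
    n = len(x)
    f = sum(x) / n
    return [2.0 * (f - target) / n for _ in x]


def guided_mean(mean, grad, variance, scale=1.0):
    """Shift the predicted mean against the task-loss gradient.

    The diffusion model that produced `mean` stays frozen; only the
    task-specific model (here the toy regressor) supplies the gradient.
    """
    return [m - scale * variance * g for m, g in zip(mean, grad)]


mean = [0.0, 0.0]  # mean predicted by the frozen diffusion model (toy values)
grad = task_loss_grad(mean, target=1.0)
shifted = guided_mean(mean, grad, variance=0.5, scale=2.0)
```

Swapping the regressor for a segmentation network would change only `task_loss_grad`, which mirrors the point that the diffusion model can be reused across tasks.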
In [19], the authors propose a novel method for unpaired image-to-image translation using denoising diffusion probabilistic models without adversarial training. Their method, UNpaired Image Translation with Denoising Diffusion Probabilistic Models (UNIT-DDPM), trains a generative model to infer the joint distribution of images in both domains as a Markov chain by minimizing a denoising score matching objective conditioned on the other domain. Specifically, the authors update both domain translation models simultaneously and generate target-domain images with a denoising Markov chain Monte Carlo approach based on Langevin dynamics, conditioned on the input source-domain images. This approach provides stable model training for image-to-image translation and produces high-quality image outputs.
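For readers unfamiliar with Langevin dynamics, the following one-dimensional sketch shows the basic update: move along the score (the gradient of the log density) and add Gaussian noise. It is only an illustration. In UNIT-DDPM the score would come from the learned denoising models conditioned on the source image, whereas here an analytic Gaussian score stands in, and the names `langevin_step` and `gaussian_score` are assumptions.

```python
import math
import random


def langevin_step(x, score, step, rng):
    """One unadjusted Langevin update: x + (step / 2) * score(x) + sqrt(step) * z."""
    return x + 0.5 * step * score(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)


def gaussian_score(x, mu=3.0):
    """Score of a unit-variance Gaussian centered at mu (stand-in for a learned score)."""
    return mu - x


rng = random.Random(0)
x = 0.0
samples = []
for _ in range(20000):
    x = langevin_step(x, gaussian_score, step=0.1, rng=rng)
    samples.append(x)

# Discard the first half as burn-in and average the rest.
mean_estimate = sum(samples[10000:]) / 10000
```

Starting far from the target, the chain drifts towards the high-density region around `mu` and then fluctuates around it, which is the behavior a sampler of this kind relies on.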
Diffusion Models for Image Segmentation
Image segmentation is the process of dividing an image into multiple segments or regions, each of which corresponds to a different object or part of the scene depicted in the image. The goal of image segmentation is to simplify or change the representation of an image into something that is more meaningful and easier to analyze. Recently, several methods have applied diffusion models to segmentation tasks.
[20] proposes SegDiff, which applies a diffusion probabilistic model, a generative model that learns to reverse a gradual noising process, to the problem of image segmentation. A denoising network takes in both the input image and the current estimate of the binary segmentation map, and its output is used to update the estimate. The authors report state-of-the-art results on multiple image segmentation benchmarks, including Cityscapes, building segmentation, and nuclei segmentation. They also introduce the idea of running the model multiple times and averaging the outputs to improve performance and calibration. The work is presented as the first application of diffusion models to image segmentation.
As shown in Fig 28, the input image and the current estimate of the binary segmentation map are passed through two different encoders, and the sum of the resulting multi-channel tensors is passed through a U-Net to produce the next estimate. This refinement is repeated at every step of the reverse diffusion process until the final segmentation map is obtained. The authors show that averaging the outputs of multiple runs improves overall accuracy. Conditioning the model on the input image by summing the encoder outputs is itself a novel design, and the model is trained end-to-end without relying on a pre-trained backbone network.
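Averaging multiple runs can be sketched as follows. Each run of a stochastic sampler yields a per-pixel foreground probability map; the maps are averaged and then thresholded into a binary mask. The function name `average_runs`, the 0.5 threshold, and the nested-list image format are assumptions for illustration, not details from the paper.

```python
def average_runs(prob_maps, threshold=0.5):
    """Average per-pixel probabilities over several runs, then threshold.

    prob_maps: list of runs, each a 2D list (rows x columns) of values in [0, 1].
    Returns the averaged map and the resulting binary mask.
    """
    n_runs = len(prob_maps)
    height = len(prob_maps[0])
    width = len(prob_maps[0][0])
    mean_map = [
        [sum(run[i][j] for run in prob_maps) / n_runs for j in range(width)]
        for i in range(height)
    ]
    mask = [[1 if p >= threshold else 0 for p in row] for row in mean_map]
    return mean_map, mask


# Two toy runs over a 1 x 2 image (assumed values).
runs = [
    [[1.0, 0.0]],
    [[0.5, 0.25]],
]
mean_map, mask = average_runs(runs)
```

A pixel on which the runs disagree ends up with an intermediate averaged probability, so the averaged map also gives a rough indication of how confident the ensemble is.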
Unlike SegDiff, which directly uses the structure of the diffusion model to perform segmentation, the authors of [21] investigate the use of denoising diffusion probabilistic models (DDPM) as a source of effective image representations for discriminative computer vision tasks, specifically semantic segmentation. They show that the intermediate activations of the U-Net used in DDPM capture high-level semantic information that is valuable for downstream vision tasks, and they use these activations to design a simple semantic segmentation approach that outperforms existing baselines when only a few labeled images are provided. The authors also compare the DDPM-based representations with those produced by generative adversarial networks (GANs) and demonstrate the advantages of DDPM in the context of semantic segmentation.
The model is shown in Fig 29. In this approach, the authors use a diffusion model trained on a large number of unlabeled images to extract pixel-level representations from a smaller number of labeled images. The extracted representations are concatenated and used to train an ensemble of multi-layer perceptrons (MLPs) to predict a semantic label for each pixel in the labeled images. To segment a test image, the authors extract its pixel-wise representations with the diffusion model and use the trained MLPs to predict the pixel labels, with the final prediction obtained through majority voting. This approach is designed to exploit the discriminability of the representations produced by the diffusion model. The authors extract the representations from specific blocks and time steps of the diffusion process but do not tune these choices for each dataset. They also fix the noise for all time steps.
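The majority-voting step can be sketched in a few lines. Each member of the ensemble predicts one label per pixel, and the most frequent label wins at each pixel. The function name `majority_vote` and the flat per-pixel label lists are assumptions for this illustration; the feature extraction and MLP training are not shown.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-pixel labels from several classifiers by majority vote.

    predictions: list over ensemble members, each a list of per-pixel labels.
    Ties are resolved in favor of the label seen first among the members.
    """
    n_pixels = len(predictions[0])
    if any(len(p) != n_pixels for p in predictions):
        raise ValueError("all members must label the same number of pixels")
    return [
        Counter(member[i] for member in predictions).most_common(1)[0][0]
        for i in range(n_pixels)
    ]


# Three toy ensemble members labeling three pixels (assumed values).
member_labels = [
    [0, 1, 2],
    [0, 1, 1],
    [1, 1, 2],
]
final_labels = majority_vote(member_labels)
```

Using an odd number of ensemble members avoids ties in the two-class case.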
In [22], the authors propose a diffusion probabilistic model (DPM)-based method for medical image segmentation. They use dynamic conditional encoding to incorporate image prior information into the model, and a feature frequency parser to filter high-frequency components in the Fourier space. The proposed method, called MedSegDiff, is evaluated on three different medical image segmentation tasks with different image modalities: optic-cup segmentation, brain tumor segmentation, and thyroid nodule segmentation. It outperforms the state-of-the-art on all three tasks, demonstrating the effectiveness and generalization of the proposed method. To the authors’ knowledge, this is the first time that a DPM-based model has been proposed for general medical image segmentation.
The framework of MedSegDiff is illustrated in Fig 29. The model is based on diffusion models, which are generative models composed of a forward diffusion stage and a reverse diffusion stage. In the forward process, Gaussian noise is gradually added to the segmentation label over a series of steps. In the reverse process, a neural network is trained to recover the original segmentation map by reversing the noising process. The authors use a UNet for this learning process and condition the step estimation function on the input image by adding the feature embeddings of the input image and of the segmentation map at the current step; the result is then passed through a UNet decoder for reconstruction. The step index is also integrated into the embeddings and decoder features using a shared learned look-up table.
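The forward process described above can be sketched with the standard DDPM closed form, in which the noised map at step t is sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. This is a minimal sketch: the two-step `betas` schedule, the flat list standing in for a segmentation map, and the helper names are assumptions, not values from MedSegDiff.

```python
import math
import random


def alpha_bar(t, betas):
    """Cumulative product of (1 - beta_s) for s = 1..t; equals 1.0 at t = 0."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def q_sample(x0, t, betas, rng):
    """Draw the noised map x_t directly from the clean map x0."""
    a = alpha_bar(t, betas)
    return [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for v in x0
    ]


betas = [0.1, 0.2]      # toy noise schedule (assumed values)
mask = [1.0, 0.0, 1.0]  # toy segmentation label, flattened
rng = random.Random(0)
clean = q_sample(mask, 0, betas, rng)  # no noise at t = 0
noisy = q_sample(mask, 2, betas, rng)  # noised label after both steps
```

The reverse process trains a network to undo this corruption step by step, with the input image supplied as the condition.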
(TO BE CONTINUED)
References:
[14] Lafite: Towards Language-Free Training for Text-to-Image Generation
[15] Palette: Image-to-Image Diffusion Models
[16] EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations
[16] VQBB: Image-to-Image Translation with Vector Quantized Brownian Bridge
[17] Pretraining is All You Need for Image-to-Image Translation
[18] The Swiss Army Knife for Image-to-Image Translation: Multi-Task Diffusion Models
[19] UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models
[20] SegDiff: Image Segmentation with Diffusion Probabilistic Models
[21] Label-Efficient Semantic Segmentation with Diffusion Models
[22] MedSegDiff: Medical Image Segmentation with Diffusion Probabilistic Model