Controlled Image Editing with Diffusion Models

Juneta Tao
6 min read · Sep 3, 2024

Diffusion models make image editing far more accessible: there is no need for specialised software or intensive manual work. The editing can be guided by a text prompt, a mask, or a reference image.

Image with Mask

Paint by Example [1] takes as input the original image, an irregular mask, and a reference image. The final image is generated by combining the high-level semantics of the reference object with the original image.
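As a quick illustration, here is a minimal inference sketch using the PaintByExamplePipeline shipped with Hugging Face diffusers; the checkpoint name, image paths, and sampling parameters are assumptions for the example rather than the authors' exact setup.

```python
import torch
from PIL import Image
from diffusers import PaintByExamplePipeline

# Load a Paint-by-Example checkpoint (assumed: Fantasy-Studio/Paint-by-Example).
pipe = PaintByExamplePipeline.from_pretrained(
    "Fantasy-Studio/Paint-by-Example",
    torch_dtype=torch.float16,
).to("cuda")

# Inputs: the original image, an irregular mask (white = region to edit),
# and a reference image providing the object to insert.
init_image = Image.open("scene.png").convert("RGB").resize((512, 512))
mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))
example_image = Image.open("reference.png").convert("RGB").resize((512, 512))

result = pipe(
    image=init_image,
    mask_image=mask_image,
    example_image=example_image,
    guidance_scale=5.0,
    num_inference_steps=50,
).images[0]
result.save("edited.png")
```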

The model is trained in a self-supervised way: a region is randomly cropped from the original image, masked out, and used as the reference. Two tricks are applied. First, to avoid copy-and-paste artefacts, the reference is not represented by the full set of CLIP image embeddings (256 patch tokens plus 1 class token); only the class token embedding (1×1024) is used. A few fully connected layers decode this feature, which is fed to the diffusion model through cross-attention (a sketch of this is shown below). Second, to bridge the gap between training and testing, image augmentation (e.g. flip, rotation, blur, and elastic transform) is applied to the reference crop during training, and the masks are also augmented to reduce the reliance on the exact shape of the provided mask.
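To make the conditioning trick concrete, here is a rough sketch of class-token-only conditioning; the module names, layer sizes, and the choice of CLIP ViT-L/14 as the image encoder are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class ReferenceEncoder(nn.Module):
    """Encode the reference image into a single conditioning token.

    Only the CLIP class token (1 x 1024 for ViT-L/14) is kept, so the
    diffusion model receives high-level semantics of the reference rather
    than patch-level detail, which discourages direct copy-and-paste.
    """

    def __init__(self, cross_attention_dim: int = 768):
        super().__init__()
        self.clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        self.clip.requires_grad_(False)  # keep the CLIP encoder frozen
        # A few fully connected layers decode the class token into the
        # cross-attention context consumed by the diffusion UNet.
        self.proj = nn.Sequential(
            nn.Linear(1024, 1024), nn.GELU(),
            nn.Linear(1024, 1024), nn.GELU(),
            nn.Linear(1024, cross_attention_dim),
        )

    def forward(self, reference_pixels: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (B, 257, 1024) = 1 class token + 256 patch tokens
        tokens = self.clip(pixel_values=reference_pixels).last_hidden_state
        class_token = tokens[:, :1, :]   # keep only the class token
        return self.proj(class_token)    # (B, 1, cross_attention_dim)
```

The reference augmentations mentioned above (flip, rotation, blur, elastic transform) would typically be applied to the cropped reference before it reaches this encoder, using standard transforms from a library such as torchvision.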
