Building Pinterest Canvas, a text-to-image foundation model
Eric Tzeng; ML Research Scientist, ATG | Raymond Shiau; ML Research Scientist, ATG
In this engineering note, we share some of our latest progress on Pinterest Canvas, a text-to-image foundation model for enhancing existing images and products on the platform. Building image foundation models has been a core part of Pinterest’s ML strategy for the past decade, but these models have focused on representation learning tasks (e.g. our Unified Visual Embedding v2, v3, etc.). More recently, we have begun to explore generative models, specifically those that can be conditioned on existing Pinterest images, to create new backgrounds for products.
Pinterest Canvas is built as a text-to-image model that can support arbitrary conditioning information in the form of product masks and conditioning images for stylistic guidance. In this post, we will discuss first the training of the base text-to-image model, then the fine-tuning process to generate photorealistic backgrounds conditioned on masks, and finally an in-context learning process for conditioning on image styles.
Because we’re primarily interested in image generation as a way of visualizing existing products in new contexts, rather than generating completely new content from scratch, we don’t have a direct product use for the typical image generation model that takes a text caption and tries to generate an image based on that caption. Nevertheless, this text-to-image task ends up being a useful way to teach a model about the visual world, so that it can then learn how to generate cohesive and compelling objects and scenes.
To that end, we train our own image generation foundation model, named Pinterest Canvas, to serve as the backbone model that can be fine-tuned for all downstream product applications. Pinterest Canvas is a latent diffusion model trained exclusively in-house at Pinterest that adheres closely to standard latent diffusion model designs. For efficiency, the diffusion model itself operates in the latent space learned by a variational autoencoder (VAE). The final latent representation generated by the diffusion model is then decoded into image pixels by this VAE’s decoder. Text captions are encoded using both CLIP-ViT/L and OpenCLIP-ViT/G, and are fed to a convolutional UNet via cross-attention in order to incorporate text conditioning information during the generation process.
During training, we sample random caption-image pairs from our dataset. We then encode each image into its latent representation using our VAE, embed each text caption using the two CLIP text encoders, and sample a random diffusion timestep for each pair. Noise is added to each image latent according to its sampled diffusion timestep, and the UNet is tasked with denoising the latent given the text embedding and timestep index.
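As a rough illustration of the noising step, here is a minimal sketch in plain Python. The post does not state the noise schedule or the prediction target, so the linear beta schedule, the 1000 timesteps, and the noise-prediction target below are assumptions borrowed from the standard DDPM setup, and all function names are hypothetical.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, t, alpha_bars, rng):
    """Noise a flat latent vector at timestep t; returns (noisy_latent, noise)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [signal * x + sigma * e for x, e in zip(latent, noise)]
    return noisy, noise


def training_example(latent, alpha_bars, rng):
    """Sample a random timestep and build one denoising training example."""
    t = rng.randrange(len(alpha_bars))
    noisy, noise = add_noise(latent, t, alpha_bars, rng)
    # Assumption: the network is trained to predict the added noise.
    return {"noisy_latent": noisy, "timestep": t, "target_noise": noise}
```

In a real training loop the latent would be a tensor produced by the VAE encoder, and the UNet would receive the noisy latent, the timestep, and the text embeddings.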
We filter our training data aggressively, in an attempt to make sure images are of high quality, adhere to trust and safety standards, and have relevant associated text data. Text for each image is collected from a variety of sources to ensure diversity, including public Pin titles and descriptions, as well as generated alt text for SEO and accessibility use cases. Even after this stringent filtering, we are still left with over 1.5 billion high-quality text-image pairs, which ensures that after a long and carefully managed training schedule, Pinterest Canvas converges to generate high-quality images that capture an inspiring and appealing aesthetic.
There are many more improvements we could layer on top of this training protocol to further improve the performance of the base model. Notably, we’ve explored using reinforcement learning to encourage Canvas to generate more diverse and visually appealing images, which we’ve written about in an ECCV publication: Large-scale Reinforcement Learning for Diffusion Models. However, in this post we’d like to instead explore how we go beyond this base model to train image generation models that can perform specific visualization tasks.
Fine-tuning Pinterest Canvas for background generation
Training Pinterest Canvas gives us a strong base model that understands what objects look like, what their names are, and how they are typically composed into scenes. However, as previously stated, our goal is training models that can visualize or reimagine real ideas or products in new contexts. We’ll use the base model as a starting point, but modify the training task. Now, instead of training it to create images from scratch, we’ll ask it to fill in missing parts of images, a task commonly referred to as inpainting.
Note that we’re discussing just one possible specialization of Pinterest Canvas — in this case, one that performs inpainting. However, in practice we have ideas for a lot of other tasks to help perform other kinds of visualizations!
To get our model to inpaint images properly, we need to supply some additional information. In addition to the text caption and the partially noisy latent, we pass:
- A target image with missing portions
- A binary mask, indicating whether pixels in the target image are valid or missing
Together, these inputs define the inpainting problem: the end goal is to generate an image that matches the provided target image, but with the missing portions filled in.
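The sketch below shows one way these inputs could be assembled. The post does not say how the mask and target image are fed to the UNet, so the channel-wise concatenation here is an assumption (it is a common choice for inpainting diffusion models), as is the convention that 1 marks a valid pixel and 0 a missing one. For simplicity it also ignores the difference between pixel and latent resolution, and the function names are hypothetical.

```python
def mask_image(image, mask):
    """Zero out missing pixels.

    image: H x W x C nested lists of floats.
    mask:  H x W nested lists of 0/1, where 1 = valid and 0 = missing (assumed).
    """
    return [
        [[channel * m for channel in pixel] for pixel, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def build_unet_input(noisy, masked, mask):
    """Stack the noisy input, masked target, and mask along the channel axis."""
    return [
        [n + p + [float(m)] for n, p, m in zip(noisy_row, masked_row, mask_row)]
        for noisy_row, masked_row, mask_row in zip(noisy, masked, mask)
    ]
```

With this layout, every spatial position carries the noisy values, the known target values (zeroed where missing), and a flag telling the model whether that position must be filled in.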
This model is trained in two stages. In the first stage, we use the same dataset as we did for the base Pinterest Canvas, and we additionally generate random masks for the model to inpaint during training. This stage teaches the model to fill in missing image regions, but because the masks are not directly related to the image in any way, we find that after first stage training, the model often extends or changes the shapes of objects.
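A minimal sketch of first-stage mask generation follows. The post only says the masks are random, so the single random rectangle and the size bounds used here are assumptions chosen for illustration; real pipelines often mix rectangles, brush strokes, and other shapes. As before, 1 marks a valid pixel and 0 a missing one.

```python
import random


def random_rect_mask(height, width, rng, min_frac=0.2, max_frac=0.8):
    """Return an H x W mask with one random rectangular hole (0 = missing)."""
    hole_h = max(1, int(height * rng.uniform(min_frac, max_frac)))
    hole_w = max(1, int(width * rng.uniform(min_frac, max_frac)))
    top = rng.randrange(height - hole_h + 1)
    left = rng.randrange(width - hole_w + 1)
    return [
        [
            0 if top <= y < top + hole_h and left <= x < left + hole_w else 1
            for x in range(width)
        ]
        for y in range(height)
    ]
```

Because the hole is placed without looking at the image, it will often cut through objects, which is consistent with the behavior described above: the model learns to fill regions but not to respect object boundaries.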
Thus, in the second stage, we focus specifically on product images, and use a segmentation model to generate product masks by separating the foreground from the background. Existing text captions typically describe only the product and neglect the background, yet a description of the background is critical for guiding the inpainting process, so we incorporate more complete and detailed captions from a visual LLM. In this stage, we train a LoRA on all UNet layers to enable rapid, parameter-efficient fine-tuning. Finally, we briefly fine-tune on a curated set of highly engaged promoted product images, to steer the model toward aesthetics that resonate with Pinners.
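For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update it applies to a frozen weight matrix: W' = W + (alpha / r) * B * A, where only the small matrices A and B are trained. The rank and alpha values used in Pinterest Canvas are not given in the post, so the numbers in the usage example are purely illustrative and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen base weight, shape (out, in)
    a: trainable down-projection, shape (r, in)
    b: trainable up-projection, shape (out, r)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# B is conventionally initialized to zero, so training starts from the base model.
base = [[1.0, 2.0], [3.0, 4.0]]
unchanged = lora_effective_weight(base, a=[[1.0, 0.0]], b=[[0.0], [0.0]], alpha=2.0)
```

The parameter savings come from the shapes: for an (out, in) weight, LoRA trains r * (in + out) values instead of in * out.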
Separating the training into these two stages allows us to ease the model into the new inpainting task: the first stage keeps the same training data but introduces the additional mask input, and the second stage teaches the model to preserve object boundaries and focus exclusively on generating background content. After convergence, we end up with a model that can take a product Pin and generate a background according to a text prompt.
In practice we also found that our VAE struggled with reconstructing fine details in images. Simply compositing the original image and generated image together as a post-processing step produced visible blending artifacts. We found that it helped to retrain our VAE to accept these additional conditioning inputs as well, so that during the decoding process it seamlessly blends the original and generated image content, while ensuring pixel-perfect reconstructions of products.
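For reference, the naive post-processing composite mentioned above amounts to a per-pixel hard switch between the two images. The sketch below is an illustration of that baseline only, not of the retrained VAE decoder, and it assumes the convention that 1 marks product pixels to preserve; the function name is hypothetical.

```python
def composite(original, generated, mask):
    """Keep original pixels where mask == 1 and generated pixels where mask == 0.

    original, generated: H x W x C nested lists; mask: H x W of 0/1.
    """
    return [
        [
            [m * o + (1 - m) * g for o, g in zip(orig_px, gen_px)]
            for orig_px, gen_px, m in zip(orig_row, gen_row, mask_row)
        ]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]
```

Because the switch happens abruptly at the mask boundary, any mismatch in color or detail between the decoded background and the original product shows up as a seam, which is the kind of artifact the conditioned decoder is meant to avoid.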
Like other diffusion models, Pinterest Canvas is capable of generating multiple variations, which often differ in quality. We leverage this to boost quality during inference, by generating multiple backgrounds for a product, and selecting the top k with a reward model trained on human judgments spanning defects, fidelity, and aesthetics.
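This selection step can be sketched as a simple best-of-n loop. Both `generate_fn` and `reward_fn` are hypothetical stand-ins for the diffusion sampler and the learned reward model; the post does not describe their interfaces or how many candidates are generated.

```python
import heapq


def select_top_k(candidates, reward_fn, k):
    """Return the k candidates with the highest reward, best first."""
    return heapq.nlargest(k, candidates, key=reward_fn)


def generate_and_select(generate_fn, reward_fn, num_candidates, k):
    """Generate one candidate per seed, then keep the top k by reward."""
    candidates = [generate_fn(seed) for seed in range(num_candidates)]
    return select_top_k(candidates, reward_fn, k)
```

The trade-off is inference cost: generating n candidates multiplies sampling time by n, in exchange for filtering out low-scoring generations.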
Personalizing results
Although we’re pretty happy with the quality of results generated by our backdrop inpainting model, in practice it is still quite limiting to try to describe the desired background solely in words. Sometimes it’s easier to simply provide examples of the style you’re after! To this end, we further augment our model with the ability to condition on other images, using their style to guide the generation process.
To enable this additional functionality, we build off of IP-Adapter, a method for training an adapter network that processes additional image prompts. Within the diffusion UNet, these additional image prompts are encoded into embeddings and then passed alongside the text embeddings to new image-specific cross attention layers, thereby allowing the diffusion network to attend to both image and text prompts. We follow the IP-Adapter training setup and condition directly on the target image. In order to preserve backdrop generation capability, we found it was important to jointly fine-tune on the second stage backdrop inpainting task, and so we reuse the same products-focused dataset.
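The decoupled cross-attention at the heart of IP-Adapter can be written as follows. This follows the formulation in the IP-Adapter paper as we understand it, not a Canvas-specific derivation: Z denotes the UNet query features, c_t the text embeddings, c_i the image prompt embeddings, d the key dimension, and λ a weight on the image branch that can be adjusted at inference time. The primed projection matrices are the newly added, trainable ones.

```latex
\begin{aligned}
Q &= Z W_q, \qquad K = c_t W_k, \qquad V = c_t W_v, \qquad K' = c_i W'_k, \qquad V' = c_i W'_v \\
Z^{\text{new}} &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
  \;+\; \lambda \, \operatorname{softmax}\!\left(\frac{Q K'^{\top}}{\sqrt{d}}\right) V'
\end{aligned}
```

Setting λ to 0 recovers the original text-only cross-attention, which is why the image branch can be added without disturbing the text conditioning of the base model.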
We are experimenting with different ways of collecting conditioning images, including using boards with strong styles as well as automatically mining style clusters, though simply conditioning on the ground truth image itself was surprisingly effective as well. We also found that using our internally developed Unified Visual Embedding (UVE) to embed the conditioning images generally led to a much stronger effect on the resulting generations, as compared to only using other embeddings like CLIP. UVE is our core visual signal at Pinterest used for visual search and recommendations, and by providing it as a conditioning input to Pinterest Canvas, we’re able to tap into that rich visual understanding to more strongly influence the resulting outputs. We’re excited to start gathering customer input on these approaches through the recently announced Pinterest Ad Labs.
Future
The next set of improvements to the Pinterest Canvas model falls into three categories:
- The underlying diffusion backbone model is being upgraded to a more modern Transformer diffusion architecture (DiT). Our training results already indicate that this model is able to generate product backgrounds at a higher resolution and fidelity, particularly when we simultaneously upgrade to a more performant fine-tuned text encoder.
- One active area of research for the team is rethinking the binary-masking approach to model conditioning. The binary pixel masking constraint (also referred to as hard-masking) prevents the model from modifying any pixel inside the product mask, which is how we fulfill our promise to merchants that their products will never be altered or misrepresented. However, there are some scenarios where this constraint prevents us from generating immersive and useful background visualizations for our users. For example, the model cannot introduce dynamic lighting into the scene if the pixels or alpha channel cannot be modified in any way, which makes it difficult to work with scenes involving multiple products or more complex backgrounds. A soft-masking approach would also allow the model to clean up mistakes from the segmentation model when it has high confidence that too much or too little of the product border was clipped.
- Since we found that the use of our Pinterest-optimized visual embeddings (UVE) for image conditioning led to much stronger results compared to the CLIP-like baselines, we will continue incorporating UVE modeling improvements into Pinterest Canvas. Following this insight, we’re exploring using CLIP-like multimodal embeddings trained from Pinterest data and specifically tuning them to improve the text conditioning component of the model.
We are excited to share more about this work in a future post!