BigDL Tutorial: Generate your own images from text with Stable Diffusion

Intel · Published in Intel Tech · 7 min read · Mar 29, 2023

Create original art with your laptop in just minutes.

From the GitHub* implementation

Authors: Ezequiel Lanza, Ruonan Wang

If every technology has a season, artificial intelligence has hit “summer.” A crop of advancements in AI has led to the current flourishing of the discipline, along with high expectations for the future.

Computer vision is a prime example. Despite the heavy computational demands, huge gains have been made in image synthesis (Huang et al., 2018), the process of artificially generating images that contain specific content. The field started with the machine-learning framework known as generative adversarial networks (GANs) and has arrived at today’s diffusion models. This evolution offers data scientists models that are easy to train, fast to converge, and able to reliably generate high-quality images.

This, in turn, plays an important role in generative AI (also known as AI-generated content, or AIGC), which can produce all kinds of data, including audio, code, images, text, simulations, 3D objects, videos, and more. It works by training an algorithm to generate new information based on its training data. Among the many uses are text generation (GPT, Bidirectional Encoder Representations from Transformers (BERT) or, more recently, ChatGPT*), audio generation, text-to-image creation (DALL-E* or Stable Diffusion*), and others.

In this post, we’ll demonstrate text-to-image generation with optimized Stable Diffusion models that, thanks to BigDL (and the optimizations in BigDL Nano), can run on an Intel laptop.

Two Ways to Use Stable Diffusion

Stable Diffusion can be used to generate images in two ways: unconditional and conditional.

Unconditional image generation. The model generates a new image from pure noise, without any conditioning input (such as a text prompt or another image). Once trained, it produces random images drawn from the distribution of its training data. For details, check out this example of training a model with butterfly images.

Training set (L); generated images (R)
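If you want to try unconditional generation yourself, a minimal sketch with the Hugging Face diffusers library might look like the following; the checkpoint name here is just an example of a publicly available unconditional DDPM, so swap in the butterfly model you trained if you followed the tutorial above.

from diffusers import DDPMPipeline

# Unconditional generation: no prompt or conditioning input. The pipeline
# starts from pure Gaussian noise and denoises it into a random image drawn
# from whatever distribution the model was trained on.
pipe = DDPMPipeline.from_pretrained("google/ddpm-cat-256")  # example checkpoint
image = pipe().images[0]
image.save("unconditional_sample.png")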

Conditional image generation. This model generates a new image from inputs. These include: text-to-image, image-to-image, semantic, inpainting and outpainting. Let’s take a closer look:

  • Text-to-image (txt2img): Generates an image based on the input text (see the code sketch after this list). Input: Text -> Output: Image

Here’s an example of input text: A dog wearing glasses.

  • Image-to-image. Super-resolution: Generates a super-resolution image from a low-resolution input. Here’s an implementation of an upscaler diffusion model. Input: Image -> Output: Image
From “High-Resolution Image Synthesis with Latent Diffusion Models”
  • Semantic (img2img): Allows you to generate a new image based on an input image plus text. You can try it out with this image-to-image tutorial. In the example below, we asked it to generate a beautiful beach. Not content with our ideal vacation scene, we then asked it to add a golf course. The model took the generated beach as input and added a golf course to it.
Beach (L) Beach with golf course (R)
  • Inpainting: This fills masked regions of an image with new content, either because parts of the image are corrupted or to replace existing but undesired content. Using this inpainting model, the wall clock below is swapped out for a Batman-style mask. (For the replacement image, use whatever your imagination dreams up.)
Home office (original photo from unsplash); home office with generated caped crusader mask
  • Outpainting: Here the painting occurs in areas outside the original image; the model artificially “fills in” the image to the desired size. In the example below, we asked the model to generate a “car in the street” and then asked outpainting to fill in the lower-left part of the image.
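As a concrete example of conditional generation, here is a minimal txt2img sketch using the Hugging Face diffusers library. The checkpoint name and prompt are only illustrative, and plain CPU execution without the optimizations described later in this post will be slow.

import torch
from diffusers import StableDiffusionPipeline

# Text-to-image: the prompt is the condition that steers the denoising.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",  # example checkpoint
    torch_dtype=torch.float32,                # plain FP32 on CPU
)
pipe = pipe.to("cpu")

image = pipe("a dog wearing glasses",
             num_inference_steps=25,
             guidance_scale=7.5).images[0]
image.save("dog_with_glasses.png")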

How Stable Diffusion Works: An Overview

Stable Diffusion is a model used for high-resolution image generation. To understand how diffusion models work without going deep into the complex mathematics, we’ll break a txt2img Stable Diffusion model down into three main parts:

  1. Text encoder: A Transformer-based model called CLIPText (similar in architecture to GPT), trained on image-caption pairs. Transformers have demonstrated a strong understanding of language, so the encoder can tokenize your text prompt and convert it into embeddings that capture its intent.
  2. Image information creator (text-conditioned U-Net): This is where diffusion happens. The U-Net (a ResNet-based CNN architecture) used in this part is pretrained. Diffusion theory can be explained through two main processes: forward and reverse diffusion. The working principle is to destroy the training data by gradually adding Gaussian noise, and then learn how to recover the data by reversing the noising process.

Preprocessing stage (forward diffusion process): the training data is destroyed by continuously adding Gaussian noise, which generates the training samples.

Image from (Ho et al., 2020)

Training/inference stage (reverse diffusion process): the model learns to recover the data from the noise; once trained, it can generate new images by running this denoising process.

  3. Image decoder (VAE decoder): It receives the latent representation produced by the image information creator and decodes it into the final image in the desired format.

Architecture overview. Image: Ezequiel Lanza
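To make those three parts concrete, here is a hedged sketch of a bare-bones txt2img loop built from the individual diffusers components. The checkpoint, step count, and latent size are illustrative, and classifier-free guidance is omitted for brevity, so the real pipeline produces better results.

import torch
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler

repo = "stabilityai/stable-diffusion-2-1-base"  # example checkpoint
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")

# 1. Text encoder: turn the prompt into token embeddings.
tokens = tokenizer("a dog wearing glasses", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids)[0]

# 2. Image information creator: start from Gaussian noise in latent space and
#    let the text-conditioned U-Net reverse the diffusion step by step.
latents = torch.randn(1, unet.config.in_channels, 64, 64)
latents = latents * scheduler.init_noise_sigma
scheduler.set_timesteps(25)
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_emb).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 3. Image decoder: the VAE decoder maps the final latents back to pixel space.
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample  # 0.18215 is the SD latent scale
image = (image / 2 + 0.5).clamp(0, 1)  # tensor in [0, 1], ready to convert to PIL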

Where BigDL Comes In

You can use the architecture described above and get your image, but did you notice how long the process takes — sometimes minutes?

That happens because the models we’re using are large, but you can optimize them to reduce processing time. Without going too far into detail, several parts of the pipeline can be optimized to produce the same results in less time. These optimizations are already baked into BigDL, which draws on Intel® Optimization for TensorFlow*, Intel® Extension for PyTorch*, Intel® Distribution of OpenVINO*, Intel® AVX-512, and others.

BigDL Architecture. Image: Ruonan Wang
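To give a feel for what that optimization looks like in code, here is a simplified, hypothetical sketch that uses BigDL-Nano’s InferenceOptimizer to trace the pipeline’s UNet for OpenVINO execution. The actual “optimize model” step in the demo handles this (plus the FP16 and iGPU variants) for you.

import torch
from diffusers import StableDiffusionPipeline
from bigdl.nano.pytorch import InferenceOptimizer

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")

# Example inputs matching the UNet's forward signature: noisy latents,
# a timestep, and the text embeddings from the text encoder.
latents = torch.randn(1, 4, 64, 64)
timestep = torch.tensor(1)
text_emb = torch.randn(1, 77, 1024)

# Trace the UNet into an OpenVINO-accelerated module (FP32 on CPU here).
optimized_unet = InferenceOptimizer.trace(
    pipe.unet,
    accelerator="openvino",
    input_sample=(latents, timestep, text_emb),
)
# In practice a thin wrapper is needed so the traced module returns the same
# output type the pipeline expects before swapping it in for pipe.unet.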

Generating Images Using BigDL

Now it’s your turn. We’ll walk you through the steps here or you can follow the implementation on GitHub*.

Installation

We recommend an Intel laptop/desktop with a minimum of 16GB of RAM and at least 15GB of free disk space.

To run the implementation, we recommend creating a new virtual environment for the demo and installing the prerequisites so you’re ready to go.

conda create -n sd python=3.8 

conda activate sd

pip install -r requirements.txt

Open the folder where you’ve downloaded the files to run the installation script:

python launch.py 

Launch the Web UI

After the installation finishes, the application will be available on your device. Open http://127.0.0.1:7860/ in your browser to access the web UI.

Optimize Your Model

Before generating your image, you’ll need to create the optimized model. Go to the “optimize model” tab in the web UI to run the optimization.

Now there are two options available:

  • CPU-FP32 will produce an optimized fp32 model for CPU, and a “CPU FP32” option (e.g. “v2.1-base CPU FP32”) will appear later in “Switch Option.”
  • CPU/iGPU FP16 will produce optimized fp16 models for both CPU and iGPU, and two “FP16” options (e.g. “v2.1-base CPU FP16”, “v2.1-base CPU+iGPU FP16”) will appear later in “Switch Option.”

Note: This step can take some time because the application will download the original model and optimize it for you in real time.

Once the model is optimized, you can type in any text to generate an original image.

Note: Since the model we’re working with is hosted on Hugging Face*, you’ll need to add an access token, as pictured above.
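As an alternative to pasting the token into the UI, you can also log in once with the Hugging Face CLI, which stores the token locally (you can create a token in your Hugging Face account settings). Whether the demo picks it up automatically depends on how it loads the model, so treat this as an optional convenience.

pip install huggingface_hub

huggingface-cli login   # paste your access token when prompted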

Now that your model is ready, you can start generating images from the “txt2img” tab. The application provides further options to explore as well.

Conclusion

Stable Diffusion is a powerful tool with the potential to revolutionize many real-world applications. The models shown in this blog and their learning processes require a high amount of computation; optimizations such as those provided by Intel can shorten processing times.

For more open source content from Intel, check out open.intel or follow us on Twitter.

References

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models (arXiv:2006.11239). arXiv. http://arxiv.org/abs/2006.11239

Huang, H., Yu, P. S., & Wang, C. (2018). An Introduction to Image Synthesis with Generative Adversarial Nets (arXiv:1803.04469). arXiv. http://arxiv.org/abs/1803.04469

About the Authors

Ezequiel Lanza is an open source evangelist on Intel’s Open Ecosystem Team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools like TensorFlow* and Hugging Face*. Find him on Twitter at @eze_lanza

Ruonan Wang is an AI Frameworks Engineer at Intel AIA, currently focused on developing BigDL-Nano, a Python* package to transparently accelerate PyTorch* and TensorFlow* applications on Intel hardware.
