GenAI Essentials (Part 2)

Text-to-Image Stable Diffusion with Stability AI and CompVis on the Latest Intel GPU

Benjamin Consolvo
Intel Analytics Software
6 min read · Oct 15, 2023


Figure 1. Examples generated by Stable Diffusion XL Base 1.0 (image source: Podell et al., 2023).

Stable Diffusion (SD) models have been making headway in the generative AI space with their ability to generate photorealistic images from text prompts (Figure 1). These models are interesting not only to AI developers, but also to authors, artists, and teachers. There are numerous open-source SD models out there. I decided to test three of them on the powerful new Intel Data Center GPU Max Series:

  • Stability AI Stable Diffusion v2–1
  • Stability AI Stable Diffusion XL Base 1.0
  • CompVis Stable Diffusion v1–4

Figures 2 and 3 show test images that I generated.

Figure 2. Prompted the stable diffusion model with “Horse eating a carrot on the Empire State Building” (image by author)
Figure 3. Prompted the stable diffusion model with “Pecan tree growing on the moon” (image by author)

Stability AI Stable Diffusion v2–1 Model

This model was trained on a cluster of 256 Nvidia A100 GPUs and was fine-tuned from the Stable Diffusion v2 model. The original training data was a subset of LAION-5B, created by the DeepFloyd team at Stability AI. LAION-5B, with over 5.85 billion text-image pairs, is the largest text-image pair dataset known at the time of writing (Figure 4). It is composed of:

  • laion2B-en: 2.32 billion text-image pairs in English
  • laion2B-multi: 2.26 billion text-image pairs from 100+ other languages
  • laion1B-nolang: 1.27 billion text-image pairs with an undetectable language
Figure 4. Examples of cats from the LAION-5B dataset (image source: https://laion.ai/blog/laion-5b/)
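
The metadata for these subsets is published on the Hugging Face Hub, so you can stream a few rows to get a feel for the data. The snippet below is a minimal sketch: the repository name (laion/laion2B-en) and the URL/TEXT column names are assumptions based on how LAION released the metadata, so check the current Hub listing before relying on them.

from datasets import load_dataset

# Hypothetical sketch: stream a handful of text-image metadata rows from the
# English LAION subset without downloading the whole dataset. The repo name
# and column names are assumptions; verify them on the Hugging Face Hub.
laion = load_dataset("laion/laion2B-en", split="train", streaming=True)
for i, row in enumerate(laion):
    print(row.get("URL"), "|", row.get("TEXT"))
    if i >= 4:  # just peek at the first five pairs
        break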

The model’s training path is a bit convoluted; the full details can be found on the Stability AI Stable Diffusion v2–1 Hugging Face model card.
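
If you want to try this model outside of the notebook described later, it can be loaded directly with the Hugging Face diffusers library and run on an Intel GPU through the "xpu" device. This is a minimal sketch, assuming the XPU build of Intel Extension for PyTorch is installed; the dtype and step count are illustrative choices, not the notebook's exact settings.

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the "xpu" device)
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion v2-1 checkpoint from the Hugging Face Hub and
# move it to the Intel Data Center GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
)
pipe = pipe.to("xpu")

# Reproduce the prompt from Figure 2.
image = pipe(
    "Horse eating a carrot on the Empire State Building",
    num_inference_steps=25,
).images[0]
image.save("horse_carrot.png")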

Stability AI Stable Diffusion XL Base 1.0 Model

Stability AI Stable Diffusion XL Base 1.0 (SDXL 1.0) is another text-to-image SD model, but with some improvements under the hood. It leverages a 3x larger UNet backbone architecture, as well as a second text encoder. Figure 1 shows a sample of the image outputs from SDXL 1.0.

To create SDXL 1.0, a base model was trained for 600K steps on 256 x 256 resolution images. Training then continued for 200K steps on 512 x 512 pixel images. Interestingly, the final stage of fine-tuning uses a variety of rectangular aspect ratios, all covering approximately a 1024 x 1024 pixel area.

A study of user preference showed that SDXL 1.0 outperformed the Stable Diffusion v2–1 model (Podell et al., 2023) (Figure 5). The wide variety of image sizes used in the final training stage, together with the model's architectural changes, is key to its adoption and success.

Figure 5. Comparison of images generated from the same prompt to previous Stable Diffusion models and the SDXL 1.0 model (image source: Podell et al., 2023)
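
To try SDXL 1.0 yourself, it can be loaded the same way as the v2–1 model and run at its native 1024 x 1024 resolution. Again, this is a minimal sketch with assumed dtype and settings, not the notebook's exact code.

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the "xpu" device)
from diffusers import DiffusionPipeline

# Load SDXL Base 1.0 and move it to the Intel GPU.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe = pipe.to("xpu")

# Generate at the resolution targeted by the final fine-tuning stage.
image = pipe(
    "Pecan tree growing on the moon",
    height=1024,
    width=1024,
).images[0]
image.save("pecan_moon_xl.png")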

CompVis Stable Diffusion v1–4 Model

This model picks up training from Stable-Diffusion-v1–2 and was fine-tuned on 512 x 512 pixel images for 225K steps on LAION-Aesthetics v2 5+ data. The dataset is a subset of the previously mentioned LAION-5B dataset selected for high visual quality (Figure 6).

Figure 6. Samples from the LAION-Aesthetics high visual quality dataset (image source: https://laion.ai/blog/laion-aesthetics/)

The Intel Data Center GPU Max 1100

I used this GPU for inference tests. It has 48 GB of memory, 56 Xe-cores, and a 300 W thermal design power (TDP). On the command line, I can first verify that I do indeed have the expected GPUs:

clinfo -l

The output of this command shows that I have four of these GPUs in my host system:

Platform #0: Intel(R) OpenCL Graphics
+-- Device #0: Intel(R) Data Center GPU Max 1100
+-- Device #1: Intel(R) Data Center GPU Max 1100
+-- Device #2: Intel(R) Data Center GPU Max 1100
`-- Device #3: Intel(R) Data Center GPU Max 1100
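
The same check can be done from Python once the Intel Extension for PyTorch (used later in the notebook) is installed, since its XPU build registers a torch.xpu namespace. A quick sketch:

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers torch.xpu)

# Enumerate the Intel GPUs visible to PyTorch through the XPU backend.
print("XPU available:", torch.xpu.is_available())
for i in range(torch.xpu.device_count()):
    print(f"Device #{i}: {torch.xpu.get_device_name(i)}")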

Similar to the nvidia-smi utility, you can run xpu-smi on the command line to get GPU usage statistics. Here, the metric IDs 0, 5, and 18 select GPU utilization, GPU memory utilization, and GPU memory used:

xpu-smi dump -d 0 -m 0,5,18

The result is a printout of utilization statistics for GPU device 0, updated every second:

getpwuid error: Success
Timestamp, DeviceId, GPU Utilization (%), GPU Memory Utilization (%), GPU Memory Used (MiB)
13:34:51.000, 0, 0.02, 0.05, 28.75
13:34:52.000, 0, 0.00, 0.05, 28.75
13:34:53.000, 0, 0.00, 0.05, 28.75
13:34:54.000, 0, 0.00, 0.05, 28.75

Run the Stable Diffusion Example

A Jupyter notebook for SD text-to-image experimentation is hosted on the Intel Developer Cloud. Once you register as a Standard user, you can access it by going to the Training and Workshops section and clicking “Launch” under “Text-to-Image with Stable Diffusion” (Figure 7).

Figure 7. Try out the GenAI Essentials under the Training and Workshops section of the Intel Developer Cloud. Image by author.

The notebook uses the Intel Extension for PyTorch (IPEX) to speed up inference. One of the key functions is _optimize_pipeline, where ipex.optimize is called to optimize each component of the DiffusionPipeline object:

    # Requires: import torch.nn as nn, import intel_extension_for_pytorch as ipex,
    # and from diffusers import DiffusionPipeline.
    def _optimize_pipeline(self, pipeline: DiffusionPipeline) -> DiffusionPipeline:
        """
        Optimizes the model for inference using ipex.

        Parameters:
        - pipeline: The model pipeline to be optimized.

        Returns:
        - pipeline: The optimized model pipeline.
        """
        # Walk through the pipeline's attributes and pass every PyTorch module
        # it finds (text encoder, UNet, VAE, ...) through ipex.optimize.
        for attr in dir(pipeline):
            if isinstance(getattr(pipeline, attr), nn.Module):
                setattr(
                    pipeline,
                    attr,
                    ipex.optimize(
                        getattr(pipeline, attr).eval(),  # switch to inference mode
                        dtype=pipeline.text_encoder.dtype,
                        inplace=True,
                    ),
                )
        return pipeline
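
The same idea can be applied directly to a pipeline outside the notebook's helper class. Below is a minimal sketch, assuming the SD v2–1 checkpoint and bfloat16 weights; the notebook's own model choice and dtype may differ.

import torch
import intel_extension_for_pytorch as ipex
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.bfloat16
).to("xpu")

# Optimize the two heaviest submodules explicitly instead of iterating dir(),
# mirroring what _optimize_pipeline does for every nn.Module it finds.
pipe.unet = ipex.optimize(pipe.unet.eval(), dtype=torch.bfloat16, inplace=True)
pipe.vae = ipex.optimize(pipe.vae.eval(), dtype=torch.bfloat16, inplace=True)

image = pipe("Pecan tree growing on the moon").images[0]
image.save("pecan_moon.png")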

The notebook should only take a minute or so to run. Once the model is loaded into memory, each image should only take a few seconds to generate. Just be sure that the pytorch-gpu environment is selected when you open the Jupyter kernel so that you don’t have to install any packages. The notebook also includes a mini user interface (Figure 8) built with the ipywidgets package: select the desired model, enter a prompt, and choose the number of images to output.

Figure 8. Mini user interface for prompt-to-image within the Jupyter notebook
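
For reference, a UI like the one in Figure 8 only takes a few lines of ipywidgets. The sketch below is simplified and is not the notebook's actual code; generate_images is a hypothetical stand-in for whatever function runs the selected pipeline.

import ipywidgets as widgets
from IPython.display import display

def generate_images(model_id: str, prompt: str, n: int) -> None:
    # Hypothetical placeholder: the notebook wires this to the diffusers pipeline.
    print(f"Would generate {n} image(s) with {model_id} for prompt: {prompt!r}")

model_picker = widgets.Dropdown(
    options=[
        "stabilityai/stable-diffusion-2-1",
        "stabilityai/stable-diffusion-xl-base-1.0",
        "CompVis/stable-diffusion-v1-4",
    ],
    description="Model:",
)
prompt_box = widgets.Text(description="Prompt:")
num_images = widgets.IntSlider(value=1, min=1, max=4, description="Images:")
run_button = widgets.Button(description="Generate")

run_button.on_click(
    lambda _: generate_images(model_picker.value, prompt_box.value, num_images.value)
)
display(model_picker, prompt_box, num_images, run_button)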

The latest Intel GPU Max performed well and will definitely be a contender in the generative AI space. Please let me know if you have any questions or would like help getting started with Stable Diffusion. You can reach me on the Intel DevHub Discord server (username bconsolvo), on LinkedIn, or on Twitter. Thank you for reading. Happy coding!

Disclaimer for Using Stable Diffusion Models

The stable diffusion models provided here are powerful tools for high-resolution image synthesis, including text-to-image and image-to-image transformations. While they are designed to produce high-quality results, users should be aware of potential limitations:

  • Quality Variation: The quality of generated images may vary based on the complexity of the input text or image, and the alignment with the model’s training data.
  • Licensing and Usage Constraints: Carefully review the licensing information associated with each model to ensure compliance with all terms and conditions.
  • Ethical Considerations: Consider the ethical implications of the generated content, especially in contexts that may involve sensitive or controversial subjects.

For detailed information on each model’s capabilities, limitations, and best practices, please refer to the respective model cards.
