Create an Image Generation App with Gradio

Muhammad Ihsan · Published in CodeX · 5 min read · Jul 11, 2024
Photo by Steve Johnson on Unsplash

Technology is an integral part of our daily lives, and the convergence of technical advances with individual creativity often produces remarkable results. One of the latest breakthroughs attracting attention in computer science is the ability of computers to generate images from nothing more than a text description. Imagine describing a unique landscape or a futuristic object in a few sentences and having an AI model produce an image to match. This technology, known as image generation, relies on advanced generative models such as Stable Diffusion. In this article, we will explore how the Stable Diffusion model works and how to use it to generate images from text with the Hugging Face Inference API and an interactive Gradio interface.

Image Generation

Image generation is the process of creating new images from text descriptions, or prompts. It relies on machine learning models trained on large datasets of images paired with text; from these pairs, the models learn the relationships between language and the visual elements of images.

Some applications of image generation include:

  1. Generative Art: Creating new digital artwork based on artistic descriptions.
  2. Graphic Design: Assisting designers with visual ideas based on project descriptions.
  3. Game Development: Generating visual assets such as characters, backgrounds, and objects based on narrative descriptions.

Stable Diffusion

Stable Diffusion is one of the widely used models for image generation tasks. Essentially, diffusion models work by reversing the diffusion or spreading process. Imagine how ink spreads on a sheet of paper when a drop of ink falls: this diffusion process spreads the ink from a concentrated state to a more dispersed and blurred state. Diffusion models, including Stable Diffusion, do the opposite — they start with a very blurred (noisy) image and gradually “clean” or sharpen it based on patterns learned during training. The stages in a diffusion model include:

  1. Creating Noise: Initially, the model starts with a noisy image, which is essentially a random image without meaningful visual information.
  2. Learning Patterns: The model then learns to recognize patterns and structures in the image that relate to the given text. During training, the model learns from many image and text pairs, enabling it to identify objects and concepts from text descriptions.
  3. Iterative Process: Through an iterative process, the model gradually reduces the noise, producing an increasingly clear image that matches the text description at each step of the iteration.
  4. Guidance: Users can control how strongly the text influences the final result using the guidance scale. This allows for greater control over the desired outcome.
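
To make the iterative denoising and guidance steps concrete, here is a heavily simplified Python sketch of a denoising loop with classifier-free guidance, written in the style of the diffusers library. It is an illustration of the idea, not the actual Stable Diffusion sampler: the unet, scheduler, and text embeddings are assumed to come from an already-loaded pipeline.

import torch

def denoise(unet, scheduler, text_emb, uncond_emb, guidance_scale=7.5, steps=25):
    # Step 1: start from pure Gaussian noise in latent space
    latents = torch.randn(1, 4, 64, 64)
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        # Steps 2-3: predict the noise with and without the text condition
        noise_text = unet(latents, t, encoder_hidden_states=text_emb).sample
        noise_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
        # Step 4: guidance pushes the prediction toward the text-conditioned one
        noise = noise_uncond + guidance_scale * (noise_text - noise_uncond)
        # Remove a fraction of the predicted noise; the image sharpens each step
        latents = scheduler.step(noise, t, latents).prev_sample
    return latents  # a VAE decoder turns these latents into pixels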

Under the hood, Stable Diffusion is a latent diffusion model: a U-Net performs the denoising in a compressed latent space, guided through cross-attention by a Transformer-based text encoder (CLIP). This design lets the model capture long-range dependencies in the text and relate complex descriptions to visual structure while keeping computation manageable. Advantages of Stable Diffusion include:

  1. High-Quality Images: Stable Diffusion can produce high-resolution images with sharp details, making it suitable for creative and design applications.
  2. Flexibility and Control: Users can control various aspects of the generation process, such as the number of inference steps and the guidance scale. This allows for customization of the final output according to specific needs.
  3. Speed and Efficiency: Compared to some other generative models, Stable Diffusion offers better speed and efficiency in generating high-quality images.
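
These knobs are easiest to see if you run the model locally with the diffusers library rather than through an API. A minimal sketch, assuming diffusers is installed and a CUDA GPU is available (the rest of this article uses the hosted API instead):

import torch
from diffusers import StableDiffusionPipeline

# Load the same checkpoint this article queries through the API
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Both generation knobs are exposed as keyword arguments
image = pipe(
    "a watercolor painting of a mountain lake",
    num_inference_steps=25,
    guidance_scale=7,
).images[0]
image.save("lake.png")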

Implementation

In this article, we will create an image generation application using the Stable Diffusion model available on Hugging Face. We will use the Hugging Face Inference API and Gradio to build an interactive interface that lets users generate images from text descriptions.

Setting Up the API Key

First, we need to set up and load the API key from Hugging Face to access their inference services. This API key is usually stored in a .env file for security.

import io
import os

import requests
import gradio as gr
from PIL import Image
from dotenv import load_dotenv, find_dotenv

# Read the local .env file and load its variables into the environment
_ = load_dotenv(find_dotenv())
hf_api_key = os.environ['HF_API_KEY']

In the code above, we use the dotenv library to load the API key from the .env file. Ensure that you have added your API key to the .env file in the format HF_API_KEY=your_hugging_face_api_key.

Creating the Function to Access the API

Next, we will create the query function that sends the text description to the Hugging Face API and receives the generated image.

API_URL = "https://api-inference.huggingface.co/models/runwayml/stable-diffusion-v1-5"
headers = {"Authorization": f"Bearer {hf_api_key}"}

def query(payload):
    # Send the text prompt to the Inference API and return the raw response bytes
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()  # fail loudly on API errors instead of returning error JSON
    return response.content

The query function uses the requests module to send the text description to the Hugging Face API and returns the generated image as raw bytes; raise_for_status ensures that a failed request raises an error instead of quietly handing back an error message as if it were image data.
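
Before wiring up the interface, you can sanity-check the function directly; the prompt below is just an example, and the first request may take a while if the model is cold-starting on Hugging Face's side:

image_bytes = query({"inputs": "an astronaut riding a horse on the moon"})
Image.open(io.BytesIO(image_bytes)).save("test.png")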

Creating the Generate Function

We then create the generate function that calls the query function and returns the generated image from the text description.

def generate(prompt, steps, guidance, width, height):
    # Forward the user's settings to the API; the serverless Inference API
    # expects the key "num_inference_steps" for the step count
    image_bytes = query({
        "inputs": prompt,
        "parameters": {
            "num_inference_steps": steps,
            "guidance_scale": guidance,
            "width": width,
            "height": height
        }
    })
    # Decode the returned bytes into a PIL image that Gradio can display
    image = Image.open(io.BytesIO(image_bytes))
    return image

This function takes parameters such as the text description (prompt), the number of inference steps (steps), the guidance scale (guidance), the width (width), and the height (height). The API response is then converted into an image format that can be displayed.
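
The function also works on its own, outside Gradio, which is handy for scripted generation; for example:

image = generate("a cozy cabin in a snowy forest at dusk", steps=25, guidance=7, width=512, height=512)
image.save("cabin.png")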

Creating the Gradio Interface

Finally, we will create an interface using Gradio.

demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label="Your prompt"),
        gr.Slider(label="Inference Steps", minimum=1, maximum=100, value=25, info="In how many steps will the denoiser denoise the image?"),
        gr.Slider(label="Guidance Scale", minimum=1, maximum=20, value=7, info="Controls how much the text prompt influences the result"),
        gr.Slider(label="Width", minimum=64, maximum=512, step=64, value=512),
        gr.Slider(label="Height", minimum=64, maximum=512, step=64, value=512),
    ],
    outputs=[gr.Image(label="Result")],
    title="Image Generation with Stable Diffusion",
    description="Generate any image with Stable Diffusion",
    allow_flagging="never"
)

demo.launch()
# When running in a notebook, call demo.close() afterwards to free the port

This interface allows users to input text descriptions and adjust the image generation parameters. With Gradio, we can easily display this application in a web browser.
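
By default, launch() serves the app only on your machine. If you want to let others try it without deploying anything, Gradio can create a temporary public link (the URL expires after a while):

demo.launch(share=True)  # generates a temporary public URL that tunnels to your machine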

Conclusion

In this article, we have learned how to use the Stable Diffusion model from Hugging Face for image generation. We also created an interactive interface using Gradio that allows users to generate images from text descriptions easily. This technology opens up significant opportunities in various creative and design applications. Thank you for reading this article, and happy learning!
