Create an Image Captioning App with Gradio

Muhammad Ihsan · Published in The Deep Hub

One of the most widely adopted applications of generative AI today is image captioning: the automatic generation of a text description for an input image. In this article, we will use the BLIP (Bootstrapping Language-Image Pre-training) model available on Hugging Face to create an image captioning application. We will also use Hugging Face's Inference API and Gradio to build an interactive interface.

Image Captioning

The process of image captioning combines two main fields of artificial intelligence: Computer Vision (CV) and Natural Language Processing (NLP). Several steps occur in this process (a minimal code sketch of the pipeline follows the list):

  1. Feature Extraction: At this stage, a computer vision model is used to extract important features from the image. The model is usually one pretrained on image classification tasks, such as a convolutional neural network (CNN) like ResNet or EfficientNet.
  2. Sequence Modeling: Features extracted from the image are then input into a model such as Long Short-Term Memory (LSTM) or Transformer to generate a sequence of words based on these features. This model learns to produce text descriptions that match the content of the image.
  3. Caption Generation: In the final stage, the model generates a coherent text description that matches the given image.
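
To make these stages concrete, here is a minimal local sketch of the same encode-then-decode pipeline using the transformers library and the same Salesforce/blip-image-captioning-base checkpoint we will later call remotely. This is for illustration only and assumes transformers, torch, and Pillow are installed; the rest of the article uses the hosted Inference API instead of downloading the model.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the BLIP captioning checkpoint locally (illustrative alternative to the API)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("./example/dog.jpg").convert("RGB")     # any local image
inputs = processor(images=image, return_tensors="pt")      # preprocess the image for the vision encoder
output_ids = model.generate(**inputs, max_new_tokens=30)   # encode features and decode a word sequence
print(processor.decode(output_ids[0], skip_special_tokens=True))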

BLIP (Bootstrapping Language-Image Pre-training)

BLIP is a model designed for tasks involving the relationship between images and text, including Image Captioning. BLIP uses a pre-training approach on large-scale data that includes relevant images and text. Here are some advantages of the BLIP model:

  1. Pre-training on Large-Scale Data: BLIP is trained on large-scale datasets that include various types of images and relevant text descriptions. This allows the model to capture various patterns and relationships between images and text.
  2. Bootstrapping Technique: BLIP bootstraps its training data by generating synthetic captions for web images and filtering out noisy ones, iteratively improving the quality of the image-text pairs it learns from. This lets the model correct errors in the data and improve accuracy during training.
  3. Versatility: The BLIP model can be used for various tasks involving images and text, such as image-to-text retrieval, text-to-image retrieval, and Image Captioning.

Implementation

Setting Up the API Key

First, we need to set up and load the API key from Hugging Face to access their inference services. This API key is usually stored in a .env file for security.

import io
import base64
import json
import requests
import gradio as gr
from dotenv import dotenv_values

config = dotenv_values(".env")
hf_api_key = config['HF_API_KEY']

In the code above, we use the dotenv library to load the API key from the .env file. Make sure to add your API key in the .env file with the format HF_API_KEY=your_hugging_face_api_key.

Creating the Function to Access the API

Next, we will create the get_completion function that will send the image to the Hugging Face API and receive the description results from the submitted image.

API_URL = "https://api-inference.huggingface.co/models/Salesforce/blip-image-captioning-base"

def get_completion(inputs, parameters=None, ENDPOINT_URL=API_URL):
    headers = {
        "Authorization": f"Bearer {hf_api_key}",
        "Content-Type": "application/json"
    }
    data = { "inputs": inputs }
    if parameters is not None:
        data.update({"parameters": parameters})
    response = requests.request("POST",
                                ENDPOINT_URL,
                                headers=headers,
                                data=json.dumps(data))
    return json.loads(response.content.decode("utf-8"))

The get_completion function uses the requests module to send the base64-encoded image to the Hugging Face API and returns the response parsed from JSON.
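
For reference, a successful call returns a list with one object per input, and the caption is stored under the generated_text field (the field the captioner function reads later). The caption text shown here is only a hypothetical example:

# Hypothetical illustration of the response shape returned by get_completion
sample_response = [{"generated_text": "a brown dog running across a grassy field"}]
print(sample_response[0]["generated_text"])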

Converting Image to Base64

We need to convert the uploaded image to base64 format before sending it to the API. The image_to_base64_str function is used for this purpose.

def image_to_base64_str(pil_image):
    byte_arr = io.BytesIO()
    pil_image.save(byte_arr, format='PNG')
    byte_arr = byte_arr.getvalue()
    return str(base64.b64encode(byte_arr).decode('utf-8'))

This function receives an image and converts it to a base64 string. This step is necessary because the Hugging Face API accepts image input in base64 format.
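
As a quick sanity check, you can encode a tiny in-memory image and confirm that the string round-trips back to pixels. This is a small illustrative test, not part of the app itself:

from PIL import Image

# Encode a tiny 2x2 image and decode it again to verify the round trip
tiny = Image.new("RGB", (2, 2), color="red")
encoded = image_to_base64_str(tiny)
decoded = Image.open(io.BytesIO(base64.b64decode(encoded)))
print(encoded[:32] + "...", decoded.size)  # first characters of the string and (2, 2)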

Creating the Captioner Function

The captioner function is used to combine all the previous steps and return the description from the uploaded image.

def captioner(image):
    base64_image = image_to_base64_str(image)
    result = get_completion(base64_image)
    return result[0]['generated_text']

This function takes the image, converts it to a base64 string, sends it to the Hugging Face API using the get_completion function, and returns the text description of the image.
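
Before wiring up the interface, you can try the whole chain on one of the example images used in the demo below. This assumes the file exists locally and your API key is valid:

from PIL import Image

# End-to-end check: open an image, send it through captioner, print the caption
test_image = Image.open("./example/cat.jpg")
print(captioner(test_image))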

Creating the Gradio Interface

Finally, we will create an interface using Gradio.

gr.close_all()
demo = gr.Interface(fn=captioner,
                    inputs=[gr.Image(label="Upload image", type="pil")],
                    outputs=[gr.Textbox(label="Caption")],
                    title="Image Captioning with BLIP",
                    description="Caption any image using the BLIP model",
                    allow_flagging="never",
                    examples=["./example/dog.jpg", "./example/fish.jpg", "./example/cat.jpg"])

demo.launch()

This interface allows users to upload images and display descriptions generated by the BLIP model. With Gradio, we can easily display this application in a browser.
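
By default, launch() serves the app on a local URL. If you want to share a temporary public link or pin the port, launch() accepts a few standard options, for example:

# Optional: create a temporary public link and serve on a fixed port
demo.launch(share=True, server_port=7860)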

Conclusion

In this article, we have learned how to use the BLIP model from Hugging Face for Image Captioning. We have also created an interactive interface using Gradio that allows users to get automatic descriptions from uploaded images. Thank you for reading this article, happy learning!
