Automatic Image Captioning using Streamlit and Hugging Face Transformers

Vom Siri Krrishna Jorige
3 min read · Jul 29, 2023


https://www.analyticsinsight.net/wp-content/uploads/2021/08/AI-9.jpg

Introduction:

In this blog, we will build a simple web application that automatically generates captions for uploaded images. It uses Streamlit, an open-source Python framework for turning plain scripts into web apps, and Hugging Face Transformers, a library of pre-trained models for text, vision, and multimodal tasks. The application relies on a pre-trained deep learning model to caption each image, which is useful for tasks such as image indexing, accessibility (for example, generating alt text), and content understanding.

Prerequisites:

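Before diving into the code, make sure you have Python installed, along with the necessary libraries: Streamlit, Transformers, and Pillow (imported as PIL). Transformers also needs a deep learning backend such as PyTorch (pip install torch) to run the model. You can install the libraries using pip: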

pip install streamlit

pip install transformers

pip install Pillow

Importing the necessary libraries:

First, we import the required libraries, including Streamlit, the transformers pipeline for image-to-text, and PIL’s Image module for image handling.

Load the Image Captioning Model:

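We use the Hugging Face Transformers library to load a pre-trained image-to-text model. In this example, we’ll use the ydshieh/vit-gpt2-coco-en model, which, as its name suggests, pairs a ViT image encoder with a GPT-2 text decoder trained on COCO captions.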

Create the Streamlit Web Application:

We define the Streamlit web application by using the st.file_uploader function to allow users to upload an image. Once the image is uploaded, it is displayed on the web page.

Generate Image Caption:

To generate a caption for the uploaded image, we use the pre-trained model and the st.button function to add a button that triggers the caption generation.

Conclusion:

In this blog, we demonstrated how to create a simple image captioning web application using Streamlit and Hugging Face Transformers. The application lets users upload an image and generates a descriptive caption for it using a pre-trained image-to-text model. Image captioning is just one of many applications that combine natural language processing and computer vision. With Streamlit’s simplicity and the pre-trained models available through Hugging Face Transformers, you can build a wide range of interactive AI-powered applications. You can extend this one by trying different image captioning models and exploring other use cases. Happy coding!

References:

https://huggingface.co/docs/transformers/main/tasks/image_captioning
