Godzilla — Image Captioning App

Hainan Xiong

This article was produced as part of the final project for Harvard’s AC215 Fall 2021 course.

Team Godzilla: Diwei Zhang, Kexin Yang, Haoming Chen, Yiting Han, and Hainan Xiong

Date: 12/13/2021

Introduction

Image captioning aims to generate a description of an image. It is a challenging yet important problem in Computer Vision: it requires the machine to process an image, understand its content, and express it in human-readable language.

The rapid development of Computer Vision and Natural Language Processing enables us to employ learning-based methods that leverage huge amounts of data to tackle this problem. Image captioning has a wide range of real-world applications, such as content-based retrieval systems (CBRS) and aids for the blind. In this project, we use state-of-the-art models to combine image processing with language processing, leveraging the power of supervised learning to develop an application that captions images.

Problem Statement

We would first explore the benchmark datasets and newer evaluation metrics in image captioning. Then, we would explore state-of-the-art models for image feature extraction and language sequence generation. The general idea is to use CNNs to extract features from images and then feed them into language models that generate the corresponding captions. We would fine-tune pre-trained models and test different evaluation metrics to achieve better performance.
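
As a concrete example of caption evaluation, BLEU measures n-gram overlap between a generated caption and reference captions. The snippet below is an illustrative sketch using NLTK; the tokenized captions are made-up examples, not taken from our evaluation runs.

```python
# Illustrative only: BLEU-n scores for a single (hypothetical) caption pair using NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "tabby", "cat", "crouches", "behind", "a", "pair", "of", "shoes"]]
candidate = ["a", "cat", "looking", "at", "a", "shoe"]

smooth = SmoothingFunction().method1  # smoothing helps with short captions
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu(reference, candidate, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```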

After building and training the model, we would integrate it with a user interface into a scalable application that is accessible to the public, which should be fun! :)

Model

To build the model for our image captioning task, we first did an extensive literature search on papers related to caption generation. We decided to adopt 2 of the models from these published papers as our baselines, because they both use a relatively simple encoder-decoder structure that is easy to replicate. The structures of these models are described below.

Baseline model 1

The first baseline adopts the strategy described in Show and Tell: A Neural Image Caption Generator (2015). https://arxiv.org/abs/1411.4555

  • Feature extraction: pre-trained VGG
  • Text generation: LSTM
  • Data: Flickr 8k

We feed the (4096,) feature vector from VGG, together with the tokenized caption, into the following model.
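
To make the structure concrete, here is a minimal Keras sketch of this merge-style baseline, assuming pre-extracted VGG16 fc2 features; the vocabulary size, caption length, and layer widths are illustrative placeholders rather than the values from our training runs.

```python
# A minimal Keras sketch of the merge-style baseline; vocab_size and max_length
# are illustrative placeholders, not the values from our training run.
from tensorflow.keras import Model, layers

vocab_size = 8000   # assumed Flickr8k caption vocabulary size
max_length = 34     # assumed maximum tokenized caption length

# Image branch: pre-extracted (4096,) VGG16 fc2 features
image_input = layers.Input(shape=(4096,))
image_features = layers.Dense(256, activation="relu")(layers.Dropout(0.5)(image_input))

# Text branch: the partial caption generated so far
caption_input = layers.Input(shape=(max_length,))
caption_embedding = layers.Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_features = layers.LSTM(256)(layers.Dropout(0.5)(caption_embedding))

# Merge both branches and predict the next word of the caption
merged = layers.add([image_features, caption_features])
hidden = layers.Dense(256, activation="relu")(merged)
outputs = layers.Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[image_input, caption_input], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```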

Baseline model 2

Our second baseline model adopts the strategy described in Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015). https://arxiv.org/abs/1502.03044

  • Feature extraction: Inception V3
  • Model: encoder-decoder model + attention
  • Encoder: CNN with a single fully connected layer to pass in image feature vectors
  • Decoder: GRU with attention to generate captions
  • Data: the Microsoft COCO dataset
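
The sketch below follows the familiar TensorFlow recipe for this architecture: a Bahdanau attention module, a one-layer encoder over the InceptionV3 feature maps, and a GRU decoder. Class names and layer sizes are illustrative assumptions, not our exact code.

```python
# Illustrative TensorFlow sketch of baseline 2; layer sizes are assumptions.
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, embedding_dim) from InceptionV3's 8x8 spatial grid
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(scores, axis=1)   # one weight per image region
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights

class CNNEncoder(tf.keras.Model):
    # A single fully connected layer over the pre-extracted InceptionV3 features
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        return tf.nn.relu(self.fc(x))

class GRUDecoder(tf.keras.Model):
    # GRU decoder that attends over the image features before emitting each word
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden):
        context_vector, attention_weights = self.attention(features, hidden)
        x = self.embedding(x)                                   # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))                     # (batch, units)
        return self.fc2(x), state, attention_weights
```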

Our model

For our final model, we tried to improve the structure of baseline model 2. Specifically, we added 2 more dense layers to the encoder so that it can better process the extracted image features. For the decoder, we switched from GRU to LSTM because we wanted to see whether the additional training parameters in an LSTM could improve the semantic structure and accuracy of the generated captions.

  • Feature extraction: Inception V3
  • Model: encoder-decoder model + attention
  • Encoder: CNN with 3 fully connected layers to pass in image feature vectors
  • Decoder: LSTM with attention to generate captions
  • Data: the Microsoft COCO dataset
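
The sketch below highlights only the two changes relative to baseline 2: a three-layer encoder and an LSTM decoder, which carries both a hidden state and a cell state between steps. It reuses the BahdanauAttention module from the previous sketch, and the layer widths are again illustrative assumptions.

```python
# Illustrative sketch of our two changes to baseline 2; layer widths are assumptions,
# and BahdanauAttention refers to the module defined in the baseline 2 sketch above.
import tensorflow as tf

class DeepCNNEncoder(tf.keras.Model):
    # Three fully connected layers over the InceptionV3 feature maps
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc1 = tf.keras.layers.Dense(512, activation="relu")
        self.fc2 = tf.keras.layers.Dense(512, activation="relu")
        self.fc3 = tf.keras.layers.Dense(embedding_dim, activation="relu")

    def call(self, x):
        return self.fc3(self.fc2(self.fc1(x)))

class LSTMDecoder(tf.keras.Model):
    # Same attention mechanism as baseline 2, but the recurrent cell is an LSTM,
    # so the decoder carries both a hidden state and a cell state between steps.
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden, cell):
        # Attend with the previous hidden state, then feed [context; word embedding]
        context_vector, attention_weights = self.attention(features, hidden)
        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state_h, state_c = self.lstm(x, initial_state=[hidden, cell])
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        return self.fc2(x), state_h, state_c, attention_weights
```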

Visualization

To visualize the final caption generated by our model, we printed out the real and predicted captions, the input image, and an attention plot that shows, for each word in the generated caption, the attention weights the model placed on the image when producing that word.
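
A minimal sketch of this visualization step is shown below. It assumes each word's attention weights cover the 8x8 InceptionV3 spatial grid and overlays them on the input image with matplotlib; the function name and signature are illustrative, not our exact plotting code.

```python
# Illustrative sketch of the attention visualization; the function name and signature
# are assumptions. Each word's weights over the 8x8 InceptionV3 grid are overlaid
# on the input image.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def plot_attention(image_path, result, attention_plot):
    """result: predicted tokens; attention_plot: (len(result), 64) attention weights."""
    image = np.array(Image.open(image_path))
    fig = plt.figure(figsize=(10, 10))
    for i, word in enumerate(result):
        weights = np.resize(attention_plot[i], (8, 8))     # back to the 8x8 feature grid
        ax = fig.add_subplot((len(result) + 1) // 2, 2, i + 1)
        ax.set_title(word)
        img = ax.imshow(image)
        ax.imshow(weights, cmap="gray", alpha=0.6, extent=img.get_extent())
    plt.tight_layout()
    plt.show()
```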

Below are some examples of our image captioning model's output.

Example 1

Image:

Real Caption: <start> a tabby cat crouches behind a pair of shoes <end>

Prediction Caption: a cat looking at a shoe <end>

Attention plot:

Example 2

Image:

Real Caption: <start> there is a yellow and red transit bus making it’s way down the road <end>

Prediction Caption: an old red and white bus near a street near a city street <end>

Attention plot:

Example 3

Image:

Real Caption: <start> a person is riding waves on a surfboard near a bird flying <end>

Prediction Caption: a person riding his skateboarding board in mid like the beach <end>

Attention plot:

Example 4

Image:

Real Caption: <start> a woman standing on a tennis court holding a tennis racquet <end>

Prediction Caption: a professional tennis player in white shirt holding a ball gloves and is playing tennis <end>

Attention plot:

Example 5

Image:

Real Caption: <start> a person kite surfing on a <unk> day <end>

Prediction Caption: a man flying into the air near a surfboard <end>

Attention plot:

Based on these visualizations, we can see that our model is able to capture the key figures in an image and produce relatively accurate keywords to describe them, for instance "cat", "shoes", "person", and "tennis player". However, when it comes to describing the context of the image, the captions produced by our model still lack good semantic structure, which results in somewhat strange, un-human-like sentences such as "flying into the air". Another problem we noticed is that our model sometimes produces repetitive phrases such as "near a street near a city street". Solving these problems would probably require further improvements to the encoder-decoder and attention structure, as well as more training.

Frontend

[Screenshot 1. Home page]

[Screenshot 2. Upload an image]

[Screenshot 3. Show prediction]

For the frontend, we built a very straightforward and cute website to interact with users. Users can upload an image (.jpg) by clicking the dropzone; our backend then receives the image and predicts a caption for it. Finally, the predicted caption appears above the image.

We mainly used React, the open-source, component-based front-end JavaScript library, to build our website. React is extremely intuitive to work with and makes any UI layout interactive. Its reusable components greatly boost productivity and let developers maintain the website more efficiently, since we do not have to change every usage when one small part needs to change.

We used FastAPI to create an API that lets the application interact with our model code: we store the model behind the API and expose it via the front-end user interface. Within the API, we added an endpoint that predicts the caption and returns it as a response. When the user uploads a photo, the backend server receives a request from the user's web browser, loads the relevant models from the model directory, and makes a prediction. Finally, the backend returns the prediction to the user interface, which displays the auto-generated caption. The frontend and backend were wrapped into two isolated Docker containers and deployed on Google Cloud Platform, where we created a Kubernetes cluster for the deployment. Users can now access our app in their browsers, upload their images, and view the predicted captions.
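
Below is a rough sketch of what such a prediction endpoint can look like with FastAPI. The route name, the CORS settings, and the evaluate() helper are illustrative placeholders rather than our exact backend code.

```python
# Rough sketch of the prediction endpoint; the route name, CORS settings, and the
# evaluate() placeholder are illustrative assumptions, not our exact backend code.
import io

from fastapi import FastAPI, File, UploadFile
from fastapi.middleware.cors import CORSMiddleware
from PIL import Image

app = FastAPI(title="Godzilla Image Captioning API")

# Allow the React frontend (served from its own container) to call this API
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

def evaluate(image):
    # Placeholder for the real inference loop: InceptionV3 features -> encoder ->
    # LSTM decoder with attention, producing one token at a time.
    return ["<start>", "a", "placeholder", "caption", "<end>"]

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Decode the uploaded .jpg, run the captioning model, and return the caption
    contents = await file.read()
    image = Image.open(io.BytesIO(contents)).convert("RGB")
    tokens = evaluate(image)
    caption = " ".join(t for t in tokens if t not in ("<start>", "<end>"))
    return {"caption": caption}
```

Inside its Docker container, an API like this would typically be served with an ASGI server such as uvicorn (for example, something like uvicorn main:app), and the React app would call the prediction route whenever an image is dropped into the dropzone.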

Conclusion

As part of this project, we were able to achieve 2 main goals, as described below.

Firstly, we adopted two baseline models from published papers: the first uses a pre-trained VGG to extract features and an LSTM for text generation; the second uses Inception V3 to extract features and a GRU for text generation. We then improved the structure of the second baseline by adding 2 more dense layers to the encoder and switching the decoder from GRU to LSTM.

Secondly, we built a minimalist yet beautiful frontend that lets users upload pictures, and we successfully deployed it.

Overall, to extend our project, we aim to train our models on larger and more diverse datasets so that our model can generate accurate captions for a wider range of images.

References

Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3156–3164).

Wang, Q., & Chan, A. B. (2019). Describing like humans: on diversity in image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4195–4203).
