Caption This — A Hosted Deep Learning-based Image Captioning Service for Increased Accessibility

Shih-Yi Tseng
12 min read · Dec 13, 2021


This article was produced as part of the final project for Harvard’s AC215 Fall 2021 Course.

“A picture is worth a thousand words” — Someone famous.

Authors: Shih-Yi Tseng, Matthew Stewart, Stephen Knapp, Al-Muataz Khalil, and Ed Bayes

Group: BKKST

Project Motivation

The world is moving increasingly online, with approximately half of the world’s population now having access to the Internet (Dennis and Kahn, 2021). The internet has created an unparalleled opportunity to spread information, knowledge, and learning, from information retrieval to photo sharing and video transmission. However, the 314 million people living with blindness or visual impairment often struggle to access and utilize these resources (Ono et al., 2010; AFB, 2009). Broadening access to these resources would provide more equal opportunities for visually impaired individuals, as well as an enhanced quality of life.

Screen readers help to broaden accessibility by providing audio descriptions of web pages. Such tools often rely on image captioning software, with many baseline applications built on recurrent neural network (RNN)-based models (Yesilada et al., 2004). The development of cutting-edge transformer-based computer vision and natural language processing (NLP) models presents an opportunity to improve the accuracy of image captioning in screen readers, thereby increasing their reliability.

Project Goals

In “Caption This”, we utilize state-of-the-art transformer models based on OpenAI’s Contrastive Language-Image Pre-training (CLIP) architecture to perform image captioning, and compare these models to an RNN model with attention and a CNN feature extractor (InceptionV3). These models are then deployed at scale as a web-based application that allows users to upload an image of an everyday scene or activity and generates captions for that image.

To achieve this, our goals are to:

  • Use RNN-based methods to create a baseline model.
  • Use transformer-based methods to create two state-of-the-art models.
  • Deploy all models in an app with a visualization component to compare model performance.
  • Evaluate the performance of our customized models compared to a baseline model.

Model Data

We used two publicly available datasets to train our models: MS-COCO (Microsoft) and Flickr 8K, as they are industry benchmarks and are released under a CC0 license (public domain).

Microsoft COCO is the benchmark for object recognition. We used the first version of the dataset, which consists of 164,000 images split into training (83,000), validation (41,000), and test (41,000) sets, with a total of 616,000 labels (Lin et al., 2014). Each image contains bounding boxes and per-instance segmentation masks as well as natural language descriptions. Figure 1 contains example image-text pairs from MS-COCO.

Figure 1. Example image-text pairs from MS-COCO.

Flickr 8K is the benchmark for sentence-based image description and search. It consists of 8,091 images chosen from six different Flickr groups. Each image is paired with five different captions that provide descriptions. Figure 2 shows examples of image-text pairs from the Flickr 8K dataset.

Figure 2. Example image-text pairs from Flickr 8K.

A combination of the two datasets was used for training the models, using an 80–10–10 train-validation-test split. Together, these datasets provided a total of approximately 172,000 images and 650,000 image captions.
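
As a rough illustration, the split could be produced as in the minimal sketch below; the (image_path, caption) pair format and the input lists are assumptions rather than the project’s actual preprocessing code.

```python
import random

def combine_and_split(coco_pairs, flickr_pairs, seed=42):
    """Combine (image_path, caption) pairs from MS-COCO and Flickr 8K and
    split them 80/10/10 into train, validation, and test sets.

    `coco_pairs` and `flickr_pairs` are hypothetical lists of
    (image_path, caption) tuples prepared during preprocessing.
    """
    pairs = coco_pairs + flickr_pairs
    random.Random(seed).shuffle(pairs)          # reproducible shuffle

    n = len(pairs)
    n_train, n_val = int(0.8 * n), int(0.1 * n)

    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test
```

In practice, splitting at the image level rather than the caption level keeps captions of the same image from leaking across splits.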

Process Flow

For our team to work together to develop the application, we created a data pipeline process flow (see Figure 3) using the following resources and platforms for individual components:

  • Python (v3.8) with the TensorFlow (v2.7) package for deep learning models.
  • TensorFlow Hub and OpenAI for obtaining pre-trained models (used for transfer learning).
  • Google Colaboratory with GPU resources for exploratory data analysis (EDA), model training, and evaluation.
  • Google Cloud Platform: bucket storage, container registry, virtual machines, and Kubernetes clusters.
  • FastAPI for writing the application programming interface (API) server.
  • React.js for writing the application front end.
  • Visual Studio Code as the integrated development environment.
  • Docker for running containers of API service, front end, and deployment.
  • Kubernetes (K8s) for scalable deployment.
  • Ansible for automated deployment (Infrastructure as Code).

Figure 3. Overview of the process flow.

To structure our work, the MS-COCO and Flickr 8K datasets were first downloaded and organized into a Google Cloud Storage bucket. EDA and model training were then performed in Colaboratory; the models are described in subsequent sections. The notebook code was then reformatted into Python scripts and integrated into an API written with FastAPI, running in a Docker container. A separate container was built for the application front end, written in React (JavaScript). We then deployed the application on a Kubernetes cluster on GCP. Once deployed, users can access the application on the internet and upload images; the application automatically generates captions with the different models and returns the results to the user.

Models

Four models were used in this work: a baseline RNN model, and three customized models based on transformers: (1) an encoder-decoder transformer model, (2) a prefix transformer model, and (3) a distilled prefix transformer model. The user interface is able to output results from each model (see Figure 4).

Figure 4. Screenshot of the deployed user interface.

Baseline Model

Our baseline model was inspired by the TensorFlow tutorial titled “Image Captioning with Visual Attention”, adapted from Xu et al.’s (2015) paper “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.”

This model consists of a CNN encoder (InceptionV3-based) that converts input images into image features, and an RNN decoder (GRU) with attention that attends to the image features to predict the next word. Specifically, Bahdanau’s additive attention is used to combine the RNN hidden states and image features into context vectors that the RNN uses to generate the caption.
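
For reference, below is a minimal sketch of an additive (Bahdanau) attention layer in TensorFlow, closely following the tutorial the baseline is based on; it is illustrative rather than the project’s exact implementation.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention over image features for the baseline RNN decoder."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects image features
        self.W2 = tf.keras.layers.Dense(units)   # projects the GRU hidden state
        self.V = tf.keras.layers.Dense(1)        # scores each feature location

    def call(self, features, hidden):
        # features: (batch, num_locations, feature_dim) from InceptionV3
        # hidden:   (batch, units) current GRU hidden state
        hidden_with_time = tf.expand_dims(hidden, 1)

        # Additive (Bahdanau) score for each spatial location.
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(score, axis=1)

        # Weighted sum of image features -> context vector for the next word.
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights
```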

Customized Models

Both customized transformer models contain an encoder that converts images into features, and a decoder that decodes the image features into captions (the third customized model is a distilled, smaller version of the second). The first, the “standard encoder-decoder transformer model”, follows the classic encoder-decoder transformer architecture, while the second, the “prefix transformer model”, is a prefix language model. The details of these models are described in the following sections.

Image Feature Extraction

For both models, instead of a convolutional neural network, we chose CLIP (Contrastive Language-Image Pre-training) from OpenAI (2021) to extract features from input images.

CLIP was trained on a large dataset of 400 million image-text pairs collected from the internet, and learned latent visual and language representations that match each other through contrastive representation learning. The CLIP embedding, illustrated in Figure 5, can be used in a variety of zero-shot learning tasks and achieves performance comparable to other state-of-the-art (SOTA) models. Compared to conventional convolutional neural networks trained on classification tasks, the CLIP embedding captures finer details of images that are associated with language descriptions, making it an efficient feature extractor for image captioning tasks.

Figure 5. Schematic for contrastive pre-training for CLIP model (OpenAI, 2021).

There are several variants of image encoders for CLIP. We chose the ViT-B/16 model as our image feature extractor, which embeds an input image into a 512 x 1 vector in the latent visual space.
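
For illustration, the sketch below shows one way to extract such embeddings in TensorFlow using the Hugging Face transformers port of CLIP; this loading path is an assumption (the project lists TensorFlow Hub and OpenAI as model sources), so treat it as a sketch rather than the project’s actual code.

```python
import tensorflow as tf
from PIL import Image
from transformers import CLIPProcessor, TFCLIPModel

# Loading CLIP ViT-B/16 via the Hugging Face port (an assumed route; the
# official OpenAI release could be used instead).
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
clip_model = TFCLIPModel.from_pretrained("openai/clip-vit-base-patch16")

def extract_clip_features(image_path: str) -> tf.Tensor:
    """Embed an image into CLIP's 512-dimensional latent visual space."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="tf")
    features = clip_model.get_image_features(pixel_values=inputs["pixel_values"])
    return features  # shape (1, 512)
```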

Model Architecture

Model 1: standard encoder-decoder transformer model

Figure 6. Model architecture for encoder-decoder transformer model.

We adopted a standard encoder-decoder transformer architecture, shown in Figure 6, for our first model, inspired by a Keras tutorial. After embedding the input image with CLIP, the 512 x 1 feature vector is first projected into a 2D array (16 x 512) before being passed into the transformer encoder. The latent dimension of the transformer encoder is 512, and we set the “image length” to 16 (effectively treating the image as a sequence of length 16). The transformer encoder consists of only 2 encoder blocks, since the CLIP image encoder we chose is already a vision transformer. In each encoder block, we perform self-attention on the “image sequence” followed by a feed-forward network (a 2-layer multilayer perceptron).

The text input is first tokenized and encoded with positional information, and then passed into the transformer decoder, which consists of 6 decoder blocks. In each decoder block, the encoded text first goes through a masked self-attention block, followed by a cross-attention block that performs key/value/query (K/V/Q) attention between the encoded text and the encoded image features. Lastly, the results are passed through a feed-forward network (a 2-layer multilayer perceptron) before entering the next decoder block. The final output layer then maps the decoder output to a probability distribution over the vocabulary to predict the next token.
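
A minimal TensorFlow sketch of these building blocks is shown below. The projection scheme, head count, and feed-forward width are assumptions made for illustration; the article states only the latent dimension (512), image length (16), and block counts.

```python
import tensorflow as tf
from tensorflow.keras import layers

EMBED_DIM = 512     # latent dimension of the transformer
IMAGE_LENGTH = 16   # length of the "image sequence"
NUM_HEADS = 8       # an assumption; the head count is not stated in the article
FF_DIM = 2048       # feed-forward width, also an assumption


def project_clip_features(clip_vector):
    """Expand a (batch, 512) CLIP embedding into a (batch, 16, 512) sequence.
    The exact projection used in the project is not shown; a dense layer
    followed by a reshape is one straightforward choice."""
    x = layers.Dense(IMAGE_LENGTH * EMBED_DIM, activation="relu")(clip_vector)
    return layers.Reshape((IMAGE_LENGTH, EMBED_DIM))(x)


class EncoderBlock(layers.Layer):
    """Self-attention over the image sequence plus a 2-layer feed-forward net."""

    def __init__(self):
        super().__init__()
        self.attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)
        self.ffn = tf.keras.Sequential(
            [layers.Dense(FF_DIM, activation="relu"), layers.Dense(EMBED_DIM)]
        )
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()

    def call(self, x):
        x = self.norm1(x + self.attn(x, x))     # self-attention + residual
        return self.norm2(x + self.ffn(x))      # feed-forward + residual


class DecoderBlock(layers.Layer):
    """Masked self-attention over the text, cross-attention onto the encoded
    image, then a feed-forward net."""

    def __init__(self):
        super().__init__()
        self.self_attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)
        self.cross_attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)
        self.ffn = tf.keras.Sequential(
            [layers.Dense(FF_DIM, activation="relu"), layers.Dense(EMBED_DIM)]
        )
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.norm3 = layers.LayerNormalization()

    def call(self, text, image_encoding):
        seq_len = tf.shape(text)[1]
        # Causal mask so each text position attends only to earlier positions.
        causal = tf.cast(
            tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0), tf.bool
        )[tf.newaxis, ...]
        x = self.norm1(text + self.self_attn(text, text, attention_mask=causal))
        # Queries come from the text; keys/values come from the encoded image.
        x = self.norm2(x + self.cross_attn(query=x, value=image_encoding,
                                           key=image_encoding))
        return self.norm3(x + self.ffn(x))
```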

Model 2: prefix transformer model

Figure 7. Model architecture for prefix transformer model.

Our second model is a prefix language model, inspired by Mokady et al.’s paper “ClipCap: CLIP Prefix for Image Captioning” (2021). The architecture, shown in Figure 7, is very similar to Model 1, but instead of feeding the encoder output into each decoder block for cross-attention with the encoded text, we place the encoder output (16 x 512) in front of the encoded text input as a “prefix” of length 16. The 6-block transformer decoder then generates captions conditioned on the prefix, similar to using a prompt to generate sentences in a causal language model. In fact, the transformer decoder has a GPT-2-like architecture: it consists only of blocks of masked self-attention and feed-forward networks.
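
The key wiring change relative to Model 1 is just a concatenation, sketched below; the tensor names are illustrative.

```python
import tensorflow as tf

def build_decoder_input(image_prefix, token_embeddings):
    """Prepend the projected CLIP features to the embedded caption tokens.

    image_prefix:     (batch, 16, 512) image sequence from the encoder
    token_embeddings: (batch, seq_len, 512) embedded + position-encoded text

    The decoder then treats the image as a 16-token "prompt" and generates
    the caption tokens after it, GPT-2 style.
    """
    return tf.concat([image_prefix, token_embeddings], axis=1)
```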

One critical feature of the prefix model is the use of a prefix causal attention mask, which is described in the “T5” paper by Raffel et al. (2019) and the “UniLM” paper by Dong et al. (2019). As shown in Figure 8, the self-attention mask for the prefix model allows full attention within the “prefix” part but remains causal (masked) for the later sequence.

Figure 8. Schematics for different attention masks (Raffel et al., 2019).
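
Below is a minimal sketch of how such a mask can be constructed in TensorFlow; it reflects the masking pattern in Figure 8 rather than the project’s exact implementation.

```python
import tensorflow as tf

def prefix_causal_mask(prefix_len, text_len):
    """Boolean attention mask for a prefix LM (Raffel et al., 2019).

    Positions in the prefix attend to the whole prefix (full attention);
    positions in the text attend to the prefix and to earlier text tokens
    only (causal attention). Shape: (total_len, total_len), True = attend.
    """
    total_len = prefix_len + text_len
    # Start from a standard causal (lower-triangular) mask ...
    causal = tf.linalg.band_part(tf.ones((total_len, total_len)), -1, 0)
    # ... then unmask the first `prefix_len` columns for every query position,
    # so the image prefix is always fully visible.
    prefix_cols = tf.concat(
        [tf.ones((total_len, prefix_len)), tf.zeros((total_len, text_len))], axis=1
    )
    return tf.cast(tf.maximum(causal, prefix_cols), tf.bool)
```

The resulting boolean mask can be passed, with a leading batch dimension added, as the attention_mask argument of Keras’s MultiHeadAttention layer.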

Training and Evaluation

We built the two models in TensorFlow and trained them on the combined Flickr 8K and MS-COCO dataset with a cross-entropy loss for 10 epochs, using early stopping. Figure 9 shows the training progress in terms of the loss.

Figure 9. Training progress for our two models.
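
A minimal sketch of this training setup is shown below; caption_model, train_ds, and val_ds are hypothetical names for the Keras model and tf.data pipelines, and the optimizer settings and early-stopping patience are assumptions.

```python
import tensorflow as tf

def train_caption_model(caption_model, train_ds, val_ds):
    """Train a captioning model with cross entropy and early stopping.

    `caption_model`, `train_ds`, and `val_ds` are hypothetical objects built
    from the combined MS-COCO + Flickr 8K split. In practice the loss would
    also mask padded caption positions.
    """
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True  # patience is an assumption
    )
    caption_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # assumed learning rate
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    return caption_model.fit(
        train_ds, validation_data=val_ds, epochs=10, callbacks=[early_stopping]
    )
```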

The two models reached very similar performance end points. We evaluated the BLEU (BiLingual Evaluation Understudy) scores of the prefix model on the test images. BLEU is a metric between zero and one for evaluating machine-generated text that measures its similarity to a set of high-quality references (Google Cloud, 2021), with values closer to 1 indicating that the candidate text is more similar to the reference text. The modified BLEU-n score applies the same metric to n-grams, where higher scores on longer n-grams indicate better fluency. The results on 1,000 randomly selected test images are listed below, followed by a short sketch of how such scores can be computed:

  • BLEU-1 = 0.75
  • BLEU-2 = 0.58
  • BLEU-3 = 0.51
  • BLEU-4 = 0.52
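
The sketch below shows one way to compute corpus-level BLEU-n with NLTK; the exact evaluation code and tokenization used in the project are not shown in this article, so this is illustrative only.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_n(references, hypotheses, n):
    """Corpus-level BLEU-n.

    references: list of lists of tokenized reference captions per image
    hypotheses: list of tokenized generated captions
    n:          maximum n-gram order (1-4)
    """
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights over 1..n-grams
    smoothing = SmoothingFunction().method1      # avoids zero scores on short captions
    return corpus_bleu(references, hypotheses, weights=weights,
                       smoothing_function=smoothing)

# Example: BLEU-2 over two test images with one generated caption each.
refs = [[["a", "dog", "runs", "on", "grass"]], [["two", "people", "ride", "bikes"]]]
hyps = [["a", "dog", "running", "on", "grass"], ["two", "people", "riding", "bikes"]]
print(bleu_n(refs, hyps, n=2))
```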

Additionally, we performed distillation on the prefix model using a smaller architecture (1 encoder block, 3 decoder blocks, 8 attention heads, prefix length = 10), with the results shown in Figure 10. Distilling the prefix model reduced the validation loss by approximately 10%, making it the highest-performing model.

Figure 10. Training progress for the distilled prefix model.
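
The article does not detail the distillation objective; the sketch below assumes a standard soft-target setup (a temperature-scaled distillation term plus cross entropy), which is one common way such a student model is trained.

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Standard soft-target distillation loss (an assumed setup, not
    necessarily the exact objective used for the distilled prefix model).

    Combines a temperature-softened teacher/student cross-entropy term with
    the usual cross entropy against the ground-truth tokens.
    """
    soft_teacher = tf.nn.softmax(teacher_logits / temperature, axis=-1)
    soft_student = tf.nn.log_softmax(student_logits / temperature, axis=-1)
    kd = -tf.reduce_sum(soft_teacher * soft_student, axis=-1)   # per-token distillation term
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True
    )
    return tf.reduce_mean(alpha * kd * temperature ** 2 + (1.0 - alpha) * ce)
```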

The BLEU scores of the distilled prefix model on 1,000 randomly selected test images are similar to, but slightly better than, those of the original model:

  • BLEU-1 = 0.75
  • BLEU-2 = 0.58
  • BLEU-3 = 0.52
  • BLEU-4 = 0.53

We examined example captions generated for test images to qualitatively assess model performance. Both the encoder-decoder and prefix models generate reasonable captions for the images shown in Figures 11 and 12.

Figure 11. Example captions for test images generated by the encoder-decoder transformer model.
Figure 12. Example captions for test images generated by the prefix model.

Application Design and Deployment

Figure 13 illustrates how the deployed resources interact with other resources in the Google Cloud Platform. The Google Container Registry contains the images used for deployment, while the trained models and data are stored in a bucket. The React front end is hosted on an NGINX web server in its own container, which interacts with both the API service container and the container performing the image captioning. The API, front end, and deployment are discussed in subsequent sections.

Figure 13. Schematic for our application design.

Application Programming Interface (API)

We built an API server to serve these models using FastAPI. Upon startup, the API server downloads the saved model weights from a GCP bucket; when it receives an input image, it returns the generated caption for that image. Figure 14 shows a screenshot of the server documentation page for this FastAPI deployment.

Figure 14. A screenshot of the API server documentation page for the site.
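
A minimal sketch of such a service is shown below; the route name, the model-loading helper, and the generate_caption method are hypothetical names used for illustration, not the project’s actual API.

```python
import io

from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI(title="Caption This API")


@app.on_event("startup")
async def load_models():
    # `download_and_build_models` is a hypothetical helper that would pull the
    # saved weights from the GCP bucket and rebuild the captioning models.
    app.state.models = download_and_build_models()


@app.post("/predict")
async def predict(model_type: str = "prefix", file: UploadFile = File(...)):
    """Generate a caption for an uploaded image using the selected model."""
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    model = app.state.models[model_type]
    caption = model.generate_caption(image)  # hypothetical inference method
    return {"model": model_type, "caption": caption}
```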

Front end

We used React.js to build the front end of our app. As shown in Figures 15-18, the front end allows users to upload an image and select a model type for captioning, then makes requests to the API and displays the caption generated by the selected model.

Figure 15. Start-up page of the Image Captioning application.
Figure 16. Dropdown menu for selecting different model types.
Figure 17. User uploading an image enables the “Generate caption” button.
Figure 18. Captions generated by different models for a user-uploaded image.

Deployment

The application is deployed on a Kubernetes cluster within the Google Cloud Platform. We pushed the Docker images for both the API server and the React front end to Google Container Registry, and then created virtual machines that run these containers on the cluster. Automated deployment with Ansible allows for simple versioning changes and alterations to the process workflow when necessary. After deployment, users can access and utilize our application on the internet, which provides real-time prediction serving.

Takeaways

State-of-the-art transformer-based computer vision and NLP models can be successfully deployed to improve the accuracy of image captioning, showing promise for improving the reliability of screen readers used by people with visual impairment. Our models generate captions with high similarity to reference text and also perform well on unseen images.

To develop this work further, models could be trained on larger, more varied datasets (such as Google’s Conceptual Captions), and further models could be explored to see if we could iterate on our accuracy. Uploaded user data could also be used with user feedback from served predictions to improve model performance via online training.

In addition, a usability study could be undertaken in conjunction with a partner such as the American Foundation for the Blind (AFB) to understand how best to present the findings from the models to users. For example, we could examine whether the BLEU score, which is not widely used outside of research circles, or another metric is more effective for evaluating model performance in ways that are relevant to visually impaired users.

References

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M. and Hon, H.W., 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.

Encyclopedia Britannica. 2021. Internet | Description, History, Uses, & Facts. [online] Available at: https://www.britannica.com/technology/Internet [Accessed 13 December 2021].

Google Cloud. 2021. Evaluating models | AutoML Translation Documentation | Google Cloud. [online] Available at: https://cloud.google.com/translate/automl/docs/evaluate [Accessed 13 December 2021].

Hossain, M.Z., Sohel, F., Shiratuddin, M.F. and Laga, H., 2019. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CsUR), 51(6), pp.1–36.

Keras documentation: Image Captioning. [online] Keras.io. Available at: https://keras.io/examples/vision/image_captioning/ [Accessed 13 December 2021].

Mokady, R., Hertz, A. and Bermano, A.H., 2021. ClipCap: CLIP Prefix for Image Captioning. arXiv preprint arXiv:2111.09734.

Ono, K., Hiratsuka, Y., and Murakami, A., 2010. Global inequality in eye health: country-level analysis from the Global Burden of Disease Study. American journal of public health, 100(9), (pp. 1784–1788).

OpenAI. 2021. CLIP: Connecting Text and Images. [online] Available at: https://openai.com/blog/clip/ [Accessed 13 December 2021].

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. and Krueger, G., 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

TensorFlow. 2021. Image captioning with visual attention | TensorFlow Core. [online] Available at: https://www.tensorflow.org/tutorials/text/image_captioning [Accessed 13 December 2021].

The American Foundation for the Blind. 2021. A Study of Factors Affecting Learning to Use a Computer by People Who Are Blind or Have Low Vision. [online] Available at: https://www.afb.org/aw/10/2/16123 [Accessed 13 December 2021].

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R. and Bengio, Y., 2015, June. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048–2057). PMLR.

Yesilada, Y., Harper, S., Goble, C. and Stevens, R., 2004, July. Screen readers cannot see. In International Conference on Web Engineering (pp. 445–458). Springer, Berlin, Heidelberg.
