Automatic Image Captioning for Language Learning

Eric Yang
Institute for Applied Computational Science

Authors: Eric Yang, Taylor Shishido, Daniel Tan, Shijia Zhang

Date: December 14th, 2021

This article was produced as part of the final project for Harvard’s AC215 Fall 2021 course.

Introduction

In recent years, it has been estimated that there are about 1.5 billion English learners worldwide, and this number is expected to grow [1]. For language learners, daily practice is imperative for improvement. Many language learning applications today follow a pre-defined curriculum, which limits how much each learner can customize vocabulary and content around the situations and contexts common in their own life. A language instructor can flexibly tend to a learner's needs, but can be expensive and is not available at all times. And when a learner wants to describe an object or occurrence they are currently viewing, traditional translation applications require them to first phrase the query in their native language.

With recent advances in deep learning, computer vision, and natural language processing, and the growing databases of captioned images, we hypothesized that an image captioning application with translation capabilities would help English language learners describe their own daily occurrences [2]. Our proposed application lets users submit an image (a .jpeg or .png file) as input and returns a short English sentence captioning it. Users can also select a language of interest for the caption to be translated into, so they can view the English caption and their native language side by side.

Data

To develop such a model, we leveraged Microsoft's Common Objects in COntext (COCO) open-source dataset, which contains 123,287 images of complex everyday scenes showing objects in their natural contexts [3]. Briefly, each image was captioned by eight individuals, and five captions per image were retained after quality control. This process yielded about 600,000 image-caption pairs for model development. The images span eighty object categories such as "cat", "person", "apple", and "bicycle". Here are some image-caption examples from COCO:

Image-caption examples from COCO
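
To give a rough sense of how these pairs are assembled for training, the sketch below groups the captions in the public COCO annotation JSON by image. The file paths are placeholders, and the field names simply follow the released COCO annotation format.

```python
import json
from collections import defaultdict

# Placeholder paths to the released COCO captions file and image folder.
ANNOTATION_FILE = "annotations/captions_train2014.json"
IMAGE_DIR = "train2014"

with open(ANNOTATION_FILE, "r") as f:
    coco = json.load(f)

# Group the human-written captions (roughly five per image) under each image id.
captions_by_image = defaultdict(list)
for ann in coco["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

# Build (image path, caption) pairs, wrapping each caption in start/end tokens.
pairs = []
for img in coco["images"]:
    path = f"{IMAGE_DIR}/{img['file_name']}"
    for caption in captions_by_image[img["id"]]:
        pairs.append((path, f"<start> {caption.strip()} <end>"))

print(f"{len(pairs)} image-caption pairs")
```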

Model Schema

During model development, the group leveraged Python, Google Colab's GPU resources, and deep learning libraries such as Keras and TensorFlow for efficient data exploration and model training. We explored two classes of model architectures.

The basic principles of the first model were inspired by an approach authored by the TensorFlow community and researchers at the Universities of Montreal and Toronto [4, 5]. Here, a convolutional neural network (CNN) first extracted meaningful features from each image; in this step, we explored the MobileNet and InceptionV3 networks with various pre-trained weights [6]. The CNN-encoded features, along with the tokenized text, were then fed into a gated recurrent unit (GRU) decoder, which attends over the image features to predict the next word.

Model 1 schema
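
The attention-based decoder at the heart of this schema can be sketched in Keras roughly as follows, in the spirit of the TensorFlow tutorial [4]; the layer sizes and names here are illustrative placeholders rather than our exact configuration.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    """Scores each spatial image feature against the decoder's hidden state."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, feature_dim), hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(scores, axis=1)
        context = tf.reduce_sum(attention_weights * features, axis=1)
        return context, attention_weights

class GRUDecoder(tf.keras.Model):
    """Predicts the next word from the previous word and the attended image context."""
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer="glorot_uniform")
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, word_ids, features, hidden):
        # Attend over the image features using the previous hidden state.
        context, attention_weights = self.attention(features, hidden)
        x = self.embedding(word_ids)                          # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        return self.fc2(x), state, attention_weights          # logits over the vocabulary
```

At inference time, the decoder is run one step at a time, feeding its previous prediction back in until an end-of-sequence token is produced.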

The second architecture we explored was inspired by a transformer-based approach suggested in the deep learning literature [7]. The InceptionV3 network first extracted image features, which were cached as .npy files. The encoder stack of the transformer processed these features; the decoder stack then took the processed features and the true labels as inputs and returned the predicted next word along with the attention weights.

Model 2 schema
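
Both schemas start from the same cached image features; a hedged sketch of this extraction-and-caching step is below, where `image_paths` is assumed to be the list of COCO image files built earlier and the batch size is illustrative.

```python
import numpy as np
import tensorflow as tf

# InceptionV3 without its classification head: each 299x299 image yields an
# (8, 8, 2048) feature map, flattened to 64 regions of 2048 features.
feature_extractor = tf.keras.applications.InceptionV3(include_top=False,
                                                      weights="imagenet")

def load_image(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    return tf.keras.applications.inception_v3.preprocess_input(img), path

# image_paths: list of image file paths (assumed, from the data-loading sketch).
dataset = tf.data.Dataset.from_tensor_slices(image_paths).map(load_image).batch(16)

for batch, paths in dataset:
    features = feature_extractor(batch)                        # (batch, 8, 8, 2048)
    features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))
    for feat, path in zip(features, paths):
        # Cache each image's features so training never re-runs the CNN.
        np.save(path.numpy().decode("utf-8") + ".npy", feat.numpy())
```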

Results

After examining the performance of both model schemas, we proceeded with schema 1 in our final application: although both models performed well, it generated more precise captions. During the model development process, we also implemented activation maps to understand model decisions.

Model 1 predictions and activation map
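
Because the decoder emits attention weights over an 8x8 grid of image regions for each generated word, those weights can be upsampled and overlaid on the input image to produce maps like the one above. A minimal matplotlib sketch, assuming one 64-element weight vector per predicted word, might look like this:

```python
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

def plot_attention(image_path, words, attention_weights):
    """Overlay each predicted word's 8x8 attention map on the original image.

    attention_weights: one array of shape (64,) per generated word (assumed).
    """
    image = np.array(Image.open(image_path))
    fig = plt.figure(figsize=(10, 10))
    for i, (word, weights) in enumerate(zip(words, attention_weights)):
        attn = np.resize(weights, (8, 8))
        ax = fig.add_subplot((len(words) + 1) // 2, 2, i + 1)
        ax.set_title(word)
        ax.imshow(image)
        # Stretch the coarse attention grid over the full image.
        ax.imshow(attn, cmap="gray", alpha=0.6,
                  extent=(0, image.shape[1], image.shape[0], 0))
        ax.axis("off")
    plt.tight_layout()
    plt.show()
```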

Frontend Application

To design and implement the application frontend, we used HTML, CSS, and JavaScript. Upon entering the homepage, users are prompted to upload the image they would like captioned.

UI home page

After uploading the image, users can optionally select a language for the English caption to be translated into.

User language input

With all inputs given, users can click the “Generate caption!” button to view the model's predicted caption along with its translation. To caption another image, users can click the “Generate another caption!” button.

UI of model output and corresponding translation

Architecture and Deployment

After extracting a trained model from Google Colab, we used the following tools for initiating and running our model deployment: Docker, Ansible Playbooks, NGINX, and Google Cloud Compute Engine.

Application deployment architecture

Given the minimal data-storage requirements of the current state of our web app, we did not feel it was necessary to use Kubernetes in the final deployment. This may change if we build more functionality into the web app or need additional data storage capacity; we would then use Kubernetes for container scalability and management.

The architecture of our app consists of two Docker images: a simple frontend and an API service. The frontend serves the website that the user interacts with, while the API service contains the trained captioning model and integrates the Google Translate API for caption translation. Our web application has one main page. We created an additional Dockerfile so that the frontend container communicates properly with the NGINX server when deployed on GCP, so the frontend-simple image has two Dockerfiles and one docker build script. The API service container is a bit more complex, as it also holds the deep learning models for caption generation, saved as checkpoints in a folder called “api”, along with supporting files (e.g., a Pipfile). These two containers are deployed to GCP via Ansible Playbooks using .yml scripts in a deployment-specific folder, and an NGINX web server exposes the web app externally. Both containers run in a single VM instance, and no additional data storage or services are required.
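
To make the split concrete, the API container's role can be illustrated with a short Flask-style sketch. This is a hedged illustration rather than our actual service code: the `/predict` route, the hypothetical `captioner.generate_caption` helper, and the use of the `google-cloud-translate` v2 client are all assumptions.

```python
from flask import Flask, request, jsonify
from google.cloud import translate_v2 as translate  # assumed translation client

# Hypothetical module wrapping the checkpointed captioning model;
# generate_caption(image_bytes) -> str is an assumption, not the project's API.
from captioner import generate_caption

app = Flask(__name__)
translator = translate.Client()

@app.route("/predict", methods=["POST"])
def predict():
    # The frontend posts the uploaded .jpeg/.png file and an optional language code.
    image = request.files["image"].read()
    target = request.form.get("language")

    caption = generate_caption(image)
    response = {"caption": caption}

    if target:
        result = translator.translate(caption, target_language=target)
        response["translation"] = result["translatedText"]

    return jsonify(response)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9000)
```

The frontend then simply posts the uploaded file and the selected language code to a route like this and renders the returned JSON.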

Conclusion

Throughout the project, we developed and demonstrated a scalable application that brings together the key deep learning operations concepts discussed in the course. First, we implemented two broad categories of architecture for generating captions from images, combining concepts from computer vision and natural language processing; activation maps helped us understand the strengths and weaknesses of our approaches at the proof-of-concept stage. Second, leveraging web development resources and publicly available translation APIs, we built an aesthetically pleasing application interface that lets users customize their translation preferences. Third, we followed cloud deployment and architectural design best practices that keep our application modular.

To continue improving this work, we recommend further training our current best-performing model on even more diverse datasets; learning from image-caption pairs drawn from novel sources would increase the model's generalizability. Another way to raise performance is to leverage the computer vision and natural language processing techniques that are advancing every day, and finding the right combination of models would be fruitful for improving accuracy. Finally, since we intend the language learning application to be user-specific, implementing user feedback and continuous learning features would let the models be personalized and fine-tuned to each learner's most common use cases.

Thank you for reading our post! We hope this application proves helpful for English language learners and serves as an inspiration for the data science community to continue improving this work. This article covered the high-level and most interesting aspects of our project; please check out our GitHub repository for the detailed methodology, scripts, and deployment instructions.

References

  1. Beare, K. (2019, November 18). How many people learn English around the world? ThoughtCo. Retrieved December 13, 2021, from https://www.thoughtco.com/how-many-people-learn-english-globally-1210367.
  2. Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6), 1–36. https://arxiv.org/abs/1810.04020
  3. Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., … & Zitnick, C. L. (2014, September). Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (pp. 740–755). Springer, Cham.
  4. Image captioning with visual attention: TensorFlow Core. TensorFlow. (n.d.). Retrieved December 13, 2021, from https://www.tensorflow.org/tutorials/text/image_captioning.
  5. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., … & Bengio, Y. (2015, June). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048–2057). PMLR.
  6. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  7. Gautam, T. (2021, January 20). A guide to use transformers using tensorflow for caption generation. Analytics Vidhya. Retrieved December 14, 2021, from https://www.analyticsvidhya.com/blog/2021/01/implementation-of-attention-mechanism-for-caption-generation-on-transformers-using-tensorflow/
