Convolutional Image Captioning

Aashitadev
4 min read · Mar 30, 2024


Introduction

The image captioning problem is about teaching computers to look at pictures and describe them in words, just like humans do. It’s a mix of teaching computers to understand what’s in the picture and then write sentences that make sense. The goal is to make the computer’s descriptions sound natural and accurate.


The paper presents a convolutional approach to image captioning, an alternative to the traditional Recurrent Neural Network (RNN) methods that use Long Short-Term Memory (LSTM) units. Unlike plain RNNs, LSTMs are designed to overcome the vanishing gradient problem by incorporating a gating mechanism that lets them selectively retain or forget information over long sequences. In the context of image captioning, LSTMs take the visual features extracted by a convolutional neural network (CNN) and generate captions sequentially, word by word. They maintain an internal memory cell that stores information over time, enabling them to capture the context and dependencies between words in the generated caption (a rough sketch of such a decoder follows the list below). But these networks have certain limitations:

· Complex training: Due to their sequential nature, training LSTMs and RNNs can be time-consuming and computationally expensive.

· Vanishing gradients: Despite being designed to mitigate vanishing gradients, LSTMs can still suffer from this issue to some degree, affecting learning and performance.

· Limited Parallelism: The sequential processing limits the parallelism that can be achieved during training, slowing down the process.
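
For reference, here is a minimal sketch (in PyTorch, not the paper’s code) of the kind of LSTM decoder such baselines use: the image features condition the decoder, and words are predicted one step at a time. The dimensions and layer choices are illustrative assumptions.

```python
# A rough sketch of an LSTM caption decoder, assuming pre-extracted CNN image
# features and a fixed vocabulary. Dimensions are illustrative, not the paper's.
import torch
import torch.nn as nn

class LSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=4096, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)       # project CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)     # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)  # per-step word scores

    def forward(self, img_feats, captions):
        # Feed the image as the first "word", then the caption tokens; the LSTM
        # must process the sequence step by step, which limits parallelism.
        img = self.img_proj(img_feats).unsqueeze(1)          # (B, 1, E)
        words = self.embed(captions)                         # (B, T, E)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.classifier(hidden)                       # (B, T+1, vocab_size)
```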

Recent work on machine translation and conditional image generation with convolutional networks has shown that these issues can be mitigated. Inspired by that success, the authors use convolutional networks for image captioning.

Convolutional Approach

Convolutional Model for Image Captioning.

The above image shows the architecture of the convolutional approach. The token “<S>” is added to mark the beginning of the sentence, and the token “<E>” to signal its end. During training, the ground-truth words go through four steps:

(1) They’re transformed into vectors through an embedding layer,
(2) combined with image features,
(3) processed by a CNN, and
(4) used to produce output probabilities for each word.

Here’s a breakdown of each step:

  1. Input Embedding: An embedding layer is trained from scratch on one-hot encoded input words, embedding each word into a 512-dimensional vector.
  2. Image Embedding: Image features come from a VGG16 network pre-trained on ImageNet and are projected to 512 dimensions to match the word embeddings.
  3. CNN Module: The combined word and image vectors pass through three convolutional layers. For better performance, they use gated linear unit (GLU) activations along with weight normalization, residual connections, and dropout.
  4. Output Embedding: After processing by the CNN module, the output embedding layer produces output probability distributions for each word in the caption.
  5. Classification Layer: After convolution, a linear layer further encodes each word’s vector into a 256-dimensional representation. This is then expanded to a |Y|-dimensional (|Y| = 9221) activation through a fully connected layer and passed through a softmax to determine the output word probabilities (a code sketch of these steps follows this list).
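
To make the five steps concrete, here is a simplified PyTorch sketch of such a convolutional decoder. The 512-dimensional embeddings, the 256-dimensional pre-classifier layer, and |Y| = 9221 come from the description above; the kernel size, the way image and word embeddings are combined, and other details are simplifying assumptions rather than the paper’s exact configuration.

```python
# Simplified sketch of a convolutional caption decoder (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvCaptioner(nn.Module):
    def __init__(self, vocab_size=9221, feat_dim=4096, d=512, kernel=5, layers=3):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, d)        # (1) input embedding
        self.img_embed = nn.Linear(feat_dim, d)              # (2) VGG16 features -> 512-d
        self.kernel = kernel
        # (3) stack of weight-normalized convolutions with GLU activations
        self.convs = nn.ModuleList(
            [nn.utils.weight_norm(nn.Conv1d(d, 2 * d, kernel)) for _ in range(layers)]
        )
        self.dropout = nn.Dropout(0.1)
        self.to_256 = nn.Linear(d, 256)                      # (5) 256-d representation
        self.classifier = nn.Linear(256, vocab_size)         # (5) |Y|-dim activations

    def forward(self, img_feats, word_ids):
        # Combine word embeddings with the image embedding (added at every
        # position here; the paper's exact combination may differ).
        x = self.word_embed(word_ids) + self.img_embed(img_feats).unsqueeze(1)  # (B, T, d)
        x = x.transpose(1, 2)                                                   # (B, d, T)
        for conv in self.convs:
            residual = x
            # Left-pad so each position sees only current and past words (causal conv).
            h = conv(F.pad(self.dropout(x), (self.kernel - 1, 0)))
            x = F.glu(h, dim=1) + residual                   # GLU activation + residual
        x = x.transpose(1, 2)                                # (B, T, d)
        logits = self.classifier(self.to_256(x))             # (4)+(5) per-word scores
        # Softmax (or cross-entropy during training) turns these into word probabilities.
        return logits
```

Because the convolutions see the whole ground-truth sequence at once during training (with left-padding so no position looks ahead), all caption positions are processed in parallel, which is the source of the training-speed advantage discussed later.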

The network is trained with a cross-entropy loss over the output word probabilities. The VGG16 network is fine-tuned after 8 epochs, and the model is optimized with RMSProp using a learning rate of 5e−5, decayed by a factor of 0.1 every 15 epochs.
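
A rough sketch of that training schedule is shown below. The loss, optimizer, learning rate, decay, and the 8-epoch fine-tuning point come from the description above, while `vgg16` (an image feature extractor), `model`, and `train_loader` are hypothetical placeholders.

```python
# Sketch of the training schedule described above (not the paper's code).
# `vgg16` is a hypothetical image feature extractor, `model` a caption decoder
# such as the ConvCaptioner sketch above, and `train_loader` a placeholder.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

num_epochs = 30  # illustrative
for epoch in range(num_epochs):
    if epoch == 8:
        # Start fine-tuning the VGG16 encoder after 8 epochs of decoder-only training.
        for p in vgg16.parameters():
            p.requires_grad = True
        optimizer.add_param_group({"params": vgg16.parameters()})
    for images, captions, targets in train_loader:
        feats = vgg16(images)                      # image features (frozen until epoch 8)
        logits = model(feats, captions)            # (B, T, |Y|)
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                               # decay lr by 0.1 every 15 epochs
```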

Dataset

The MS COCO dataset is used to train the model, with the following split:

  • 113,287 training images
  • 5,000 validation images
  • 5,000 testing images

Results

The CNN achieves performance comparable to the LSTM baseline.

Comparison of different methods on standard evaluation metrics
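
The comparison table reports standard captioning metrics such as BLEU, METEOR, and CIDEr. Purely as an illustration of what such metrics measure (the paper uses the standard MS COCO evaluation protocol, not this snippet), sentence-level BLEU can be computed with NLTK; the captions below are made-up examples.

```python
# Illustrative only: BLEU scores n-gram overlap between a generated caption
# and one or more reference captions. The sentences here are made-up examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a polar bear is eating from a white bowl".split(),
    "a white bear eats food out of a bowl".split(),
]
candidate = "a polar bear eating from a white bowl".split()

bleu4 = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```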

The captions generated by the CNN are comparable to both the LSTM’s and the ground-truth captions. In the examples below, the CNN correctly describes details such as a black-and-white photo, a polar bear with a white bowl, the number of bears, and the signs in the donut image, which the LSTM misses; overall, though, the CNN and LSTM captions are of similar quality.

Captions generated by the CNN compared to the LSTM and ground-truth captions.

Because the CNN is not sequential like the LSTM, it achieves a faster per-parameter training time per epoch.

The CNN’s training loss is higher than the LSTM’s, but its word accuracy on the training data is better; on validation data, the difference in accuracy is very small (~1%).

Conclusion

  • Performance: The convolutional image captioning approach performs comparably to LSTM-based methods on the MSCOCO dataset.
  • Analysis: Convolutional networks produce more diverse predictions and have better classification accuracy without suffering from vanishing gradients.
  • Training Efficiency: Convolutional models can be trained faster per parameter than LSTMs due to their non-sequential nature.

The study concludes that convolutional approaches are a viable alternative to LSTM for image captioning tasks, offering comparable performance and certain advantages in training and diversity of predictions.

References

  1. Aneja, J., Deshpande, A., and Schwing, A. G., “Convolutional Image Captioning,” CVPR 2018. https://openaccess.thecvf.com/content_cvpr_2018/papers/Aneja_Convolutional_Image_Captioning_CVPR_2018_paper.pdf
