Learning Day 71: Image captioning

De Jun Huang
dejunhuang
2 min read · Jun 28, 2021

Image captioning

  • Vision → Language
  • The combination of CV and NLP

Traditional method

  1. Image content → keywords/tags
  2. Keywords → sentences

Advantages:

  • Directly use existing knowledge of CV and NLP
  • Able to filter noise
  • Modular

Disadvantage:

  • If the first step (CV) is wrong, the second step (NLP) inherits the error; mistakes propagate through the pipeline
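The two-step pipeline can be sketched as below. This is a toy illustration with made-up tags and a template; a real system would run object detectors in step 1 and a language model in step 2.

```python
# Toy sketch of the traditional two-step pipeline (hypothetical tags/templates).
def detect_tags(image_id):
    # Step 1 (CV): a real system would run detectors/classifiers here.
    # We fake it with a lookup table to illustrate the modular structure.
    fake_detector = {"img_001": ["dog", "ball", "grass"]}
    return fake_detector.get(image_id, [])

def tags_to_sentence(tags):
    # Step 2 (NLP): fill a simple template from the detected keywords.
    if not tags:
        return "No caption available."
    subject, *rest = tags
    if rest:
        return f"A {subject} with {' and '.join(rest)}."
    return f"A {subject}."

print(tags_to_sentence(detect_tags("img_001")))  # A dog with ball and grass.
```

Note how the disadvantage shows up directly: if `detect_tags` misses or mislabels objects, `tags_to_sentence` has no way to recover.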

Deep learning end-to-end method

CNN + RNN

  • An encoder-decoder structure
  • CNN as the encoder: extracts the feature map from the second-to-last layer (just before the softmax)
  • RNN/LSTM as the decoder: takes the feature map as input (the first input only)
An illustration of encoder-decoder structure of image captioning models (ref)
  • Neural Image Captioning (NIC)
An example of image captioning model architecture (ref)
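A minimal PyTorch sketch of the NIC-style encoder-decoder, with assumed toy sizes (the `TinyCaptioner` class, the small conv stack, and all dimensions are illustrative, not the paper's setup). The key idea from the bullets above: the image feature is fed to the LSTM as its first input, before the caption tokens.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Toy CNN encoder + LSTM decoder; a real model would use a
    pretrained CNN (e.g. from torchvision) as the encoder."""
    def __init__(self, vocab_size=100, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for a pretrained CNN
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, embed_dim),             # project feature to embedding size
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feat = self.encoder(images).unsqueeze(1)   # (B, 1, E): image feature
        words = self.embed(captions)               # (B, T, E): caption tokens
        seq = torch.cat([feat, words], dim=1)      # image goes in first, only once
        out, _ = self.lstm(seq)
        return self.head(out)                      # (B, T+1, vocab) logits

model = TinyCaptioner()
logits = model(torch.randn(2, 3, 32, 32), torch.randint(0, 100, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 100])
```

At inference time, decoding would instead feed each predicted word back in as the next input (greedily or with beam search).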

CNN + RNN + Attention mechanism

  • Attention mechanism is only implemented after encoder
  • Output from the CNN (encoder) is divided into multiple cell states (C1, C2, C3) for the LSTM (decoder) to generate outputs (Y1, Y2, Y3) accordingly
  • The cell states are generated from the weighted hidden states hₜ, where the different weights represent the attention given to each part of the image
Encoder + Decoder with attention mechanism
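The weighted-sum step above can be shown with toy NumPy numbers (all sizes here are made up for illustration): at each decoding step, the decoder state scores every encoder feature, the scores are softmaxed into attention weights, and the weighted sum of features becomes the context vector for that step.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: weights are positive and sum to 1.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_feats = rng.normal(size=(4, 8))   # 4 image regions, 8-dim features each
decoder_state = rng.normal(size=8)        # decoder hidden state h_t

scores = encoder_feats @ decoder_state    # one alignment score per region
weights = softmax(scores)                 # attention weights over regions
context = weights @ encoder_feats         # context c_t: attention-weighted mix

print(weights.sum())                      # weights sum to 1
```

This also hints at the answer to the question further below: the weights are learned end-to-end from the caption loss, so no extra region annotations are required.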

Evaluation metrics for image captioning

  • Meteor
  • CIDEr
  • BLEU@N
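As a flavor of how these metrics work, here is a toy BLEU@1 (clipped unigram precision) in plain Python. Real BLEU@N also combines higher-order n-grams and a brevity penalty; in practice one would use a library such as nltk or sacrebleu rather than this sketch.

```python
from collections import Counter

def bleu1(candidate, reference):
    # Clipped unigram precision: each candidate word counts at most
    # as many times as it appears in the reference.
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / max(sum(cand.values()), 1)

score = bleu1("a dog plays with a ball", "a dog is playing with a ball")
print(round(score, 3))  # 0.833  (5 of 6 candidate words matched)
```

METEOR and CIDEr go beyond exact word overlap (stemming/synonyms for METEOR, TF-IDF-weighted n-gram consensus for CIDEr), which is why captioning papers usually report several metrics together.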

Thoughts about this topic

  • RNN was briefly studied in Day 2530
  • I did not understand this topic, image captioning, very well, especially the attention mechanism. (Does the attention mechanism require additional annotation of the data? If not, how does the model determine which parts to focus on?)
  • Missed some details such as beam search
  • Will study this topic again when I focus on NLP learning later on

Reference

Unless otherwise stated, all presented contents are summarised from this course for personal learning purposes
