Learning Day 71: Image captioning

De Jun Huang
dejunhuang
2 min read · Jun 28, 2021

Image captioning

  • Vision → Language
  • The combination of CV and NLP

Traditional method

  1. Image content → keywords/tags
  2. Keywords → sentences

Advantages:

  • Directly use existing knowledge of CV and NLP
  • Able to filter noise
  • Modular

Disadvantage:

  • If the first step (CV) is wrong, the second step (NLP) inherits the error; mistakes propagate through the pipeline
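The two-step pipeline can be sketched as below. This is a toy illustration with made-up tags and a template; a real system would run object detectors in step 1 and a language model in step 2.

```python
# Toy sketch of the traditional two-step pipeline (hypothetical tags/templates).
def detect_tags(image_id):
    # Step 1 (CV): a real system would run detectors/classifiers here.
    # We fake it with a lookup table to illustrate the modular structure.
    fake_detector = {"img_001": ["dog", "ball", "grass"]}
    return fake_detector.get(image_id, [])

def tags_to_sentence(tags):
    # Step 2 (NLP): fill a simple template from the detected keywords.
    if not tags:
        return "No caption available."
    subject, *rest = tags
    if rest:
        return f"A {subject} with {' and '.join(rest)}."
    return f"A {subject}."

print(tags_to_sentence(detect_tags("img_001")))  # A dog with ball and grass.
```

Note how the disadvantage shows up directly: if `detect_tags` misses or mislabels objects, `tags_to_sentence` has no way to recover.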

Deep learning end-to-end method

CNN + RNN

  • An encoder-decoder structure
  • CNN as the encoder: extracts the feature map from the second-to-last layer (just before the softmax)
  • RNN/LSTM as the decoder: takes the feature map as input (the first input only)
An illustration of encoder-decoder structure of image captioning models (ref)
  • Neural Image Captioning (NIC)
An example of image captioning model architecture (ref)
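A minimal PyTorch sketch of the NIC-style encoder-decoder, with assumed toy sizes (the `TinyCaptioner` class, the small conv stack, and all dimensions are illustrative, not the paper's setup). The key idea from the bullets above: the image feature is fed to the LSTM as its first input, before the caption tokens.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Toy CNN encoder + LSTM decoder; a real model would use a
    pretrained CNN (e.g. from torchvision) as the encoder."""
    def __init__(self, vocab_size=100, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for a pretrained CNN
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, embed_dim),             # project feature to embedding size
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feat = self.encoder(images).unsqueeze(1)   # (B, 1, E): image feature
        words = self.embed(captions)               # (B, T, E): caption tokens
        seq = torch.cat([feat, words], dim=1)      # image goes in first, only once
        out, _ = self.lstm(seq)
        return self.head(out)                      # (B, T+1, vocab) logits

model = TinyCaptioner()
logits = model(torch.randn(2, 3, 32, 32), torch.randint(0, 100, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 100])
```

At inference time, decoding would instead feed each predicted word back in as the next input (greedily or with beam search).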

CNN + RNN + Attention mechanism

  • Attention mechanism is only implemented after encoder
  • Output from the CNN (encoder) is divided into multiple cell states (C1, C2, C3) for the LSTM (decoder) to generate outputs (Y1, Y2, Y3) accordingly
  • The cell states are generated from the weighted hidden states hₜ, where the different weights represent the attention given to each part of the image
Encoder + Decoder with attention mechanism
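The weighted-sum step above can be shown with toy NumPy numbers (all sizes here are made up for illustration): at each decoding step, the decoder state scores every encoder feature, the scores are softmaxed into attention weights, and the weighted sum of features becomes the context vector for that step.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: weights are positive and sum to 1.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_feats = rng.normal(size=(4, 8))   # 4 image regions, 8-dim features each
decoder_state = rng.normal(size=8)        # decoder hidden state h_t

scores = encoder_feats @ decoder_state    # one alignment score per region
weights = softmax(scores)                 # attention weights over regions
context = weights @ encoder_feats         # context c_t: attention-weighted mix

print(weights.sum())                      # weights sum to 1
```

This also hints at the answer to the question further below: the weights are learned end-to-end from the caption loss, so no extra region annotations are required.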

Evaluation metrics for image captioning

  • Meteor
  • CIDEr
  • BLEU@N
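As a flavor of how these metrics work, here is a toy BLEU@1 (clipped unigram precision) in plain Python. Real BLEU@N also combines higher-order n-grams and a brevity penalty; in practice one would use a library such as nltk or sacrebleu rather than this sketch.

```python
from collections import Counter

def bleu1(candidate, reference):
    # Clipped unigram precision: each candidate word counts at most
    # as many times as it appears in the reference.
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / max(sum(cand.values()), 1)

score = bleu1("a dog plays with a ball", "a dog is playing with a ball")
print(round(score, 3))  # 0.833  (5 of 6 candidate words matched)
```

METEOR and CIDEr go beyond exact word overlap (stemming/synonyms for METEOR, TF-IDF-weighted n-gram consensus for CIDEr), which is why captioning papers usually report several metrics together.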

Thoughts about this topic

  • RNN was briefly studied in Day 2530
  • I did not understand this topic, image captioning, very well, especially the attention mechanism. (Does the attention mechanism require additional annotation of the data? If not, how does the model determine which parts to focus on?)
  • Missed some details such as beam search
  • Will study this topic again when I focus on NLP learning later on

Reference

Unless otherwise stated, all presented contents are summarised from this course for personal learning purposes
