Learning Day 71: Image captioning
2 min read · Jun 28, 2021
Image captioning
- Vision → Language
- The combination of CV and NLP
Traditional method
- Image content → keywords/tags
- Keywords → sentences
Advantages:
- Directly use existing knowledge of CV and NLP
- Able to filter noise
- Modular
Disadvantage:
- If the first step (CV) produces wrong keywords, the second step (NLP) will generate a wrong sentence; errors propagate through the pipeline
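To make the two-stage pipeline concrete, here is a minimal sketch of the second stage (keywords → sentence) using a fixed template. The function name and template are hypothetical, not from the course; the point is that if the first (CV) stage emits wrong tags, the caption is wrong too:

```python
def tags_to_caption(tags):
    # Hypothetical second stage of the traditional pipeline: turn keywords
    # detected by a CV model into a sentence with a fixed template.
    # A wrong tag from the first (CV) stage yields a wrong caption here.
    if len(tags) == 1:
        return f"A photo of a {tags[0]}."
    return "A photo of a " + ", a ".join(tags[:-1]) + f", and a {tags[-1]}."

print(tags_to_caption(["dog", "ball", "park"]))
# A photo of a dog, a ball, and a park.
```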
Deep learning end-to-end method
CNN + RNN
- An encoder-decoder structure
- CNN as the encoder: extracts the feature map from the second-to-last layer (just before the softmax)
- RNN/LSTM as the decoder: takes the image feature as its input at the first time step only
- Neural Image Captioning (NIC)
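A minimal PyTorch sketch of the NIC-style setup described above, under some assumptions: the "CNN" is stood in for by a linear layer over pre-extracted image features, and the image embedding is fed to the LSTM only as the first input, with word embeddings following. Class and dimension names are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class NICSketch(nn.Module):
    # Toy NIC-style captioner: encoder output is the decoder's first input.
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, embed_dim)  # stand-in for the CNN feature layer
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        img = self.encoder(image_feats).unsqueeze(1)   # (B, 1, E): first "word"
        words = self.embed(captions)                   # (B, T, E)
        inputs = torch.cat([img, words], dim=1)        # (B, T+1, E)
        out, _ = self.lstm(inputs)
        return self.fc(out)                            # (B, T+1, vocab) logits

model = NICSketch(feat_dim=512, embed_dim=64, hidden_dim=128, vocab_size=100)
logits = model(torch.randn(2, 512), torch.randint(0, 100, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 100])
```

Note the sequence length grows by one because the image embedding occupies the first time step.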
CNN + RNN + Attention mechanism
- The attention mechanism is applied only after the encoder
- The output from the CNN (encoder) is turned into multiple context vectors (C1, C2, C3) that the LSTM (decoder) uses to generate its outputs (Y1, Y2, Y3) accordingly
- Each context vector is a weighted sum of the encoder feature vectors hₜ, where the different weights represent how much attention each part of the image receives
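The weighted sum above can be sketched in NumPy as a soft-attention step: score each encoder feature vector against the current decoder hidden state, softmax the scores into attention weights, and take the weighted sum as the context vector. The parameter names (`W_f`, `W_h`, `v`) are illustrative; in a trained model they are learned, which is also how the model learns where to focus without extra annotation:

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    # features: (L, D) spatial feature vectors from the CNN encoder
    # hidden:   (H,)  current LSTM decoder hidden state
    # W_f, W_h, v: parameters (learned in a real model; random here)
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # (L,) one score per region
    scores -= scores.max()                               # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()        # attention weights, sum to 1
    context = alpha @ features                           # (D,) weighted sum of features
    return context, alpha

rng = np.random.default_rng(0)
L, D, H, A = 4, 8, 6, 5
feats, h = rng.normal(size=(L, D)), rng.normal(size=H)
context, alpha = soft_attention(feats, h, rng.normal(size=(D, A)),
                                rng.normal(size=(H, A)), rng.normal(size=A))
print(context.shape, alpha.sum())  # (8,) and weights summing to 1.0
```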
Evaluation metrics for image captioning
- METEOR
- CIDEr
- BLEU@N
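Of these, BLEU@N is the easiest to sketch: its core is clipped (modified) n-gram precision, where each candidate n-gram counts at most as often as it appears in the reference. A minimal single-reference version (real BLEU also combines several n-gram orders and applies a brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    # Clipped n-gram precision, the building block of BLEU@N.
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return clipped / len(cand)

cand = "a dog runs on the grass".split()
ref = "a dog is running on the grass".split()
print(ngram_precision(cand, ref, 1))  # 5 of 6 unigrams match -> 0.8333...
print(ngram_precision(cand, ref, 2))  # 3 of 5 bigrams match  -> 0.6
```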
Thoughts about this topic
- RNNs were briefly studied in Days 25–30
- I did not understand this topic, image captioning, very well, especially the attention mechanism. (Does the attention mechanism require additional annotation of the data? If not, how does the model determine which parts to focus on?)
- I also missed some details, such as beam search
- I will study this topic again when I focus on NLP learning later on
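For future reference, beam search at decoding time keeps only the top-k partial captions by cumulative log-probability instead of greedily taking one word per step. A toy sketch over a fixed table of per-step log-probabilities (a real decoder would condition each step's distribution on the prefix; this only illustrates the pruning):

```python
import math

def beam_search(step_log_probs, beam_size=2):
    # Toy beam search: step_log_probs[t][w] = log P(word w at step t).
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for dist in step_log_probs:
        candidates = [(seq + [w], score + lp)
                      for seq, score in beams
                      for w, lp in enumerate(dist)]
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_size]  # keep only the top-k hypotheses
    return beams

table = [[math.log(0.6), math.log(0.4)],
         [math.log(0.5), math.log(0.5)]]
print(beam_search(table, beam_size=2))  # the two best 2-token sequences
```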
Reference
Unless otherwise stated, all content presented is summarised from this course for personal learning purposes