Papers which proposes deep learning models for video-text or image-text understanding are reviewed in detail.