Generating Captions

Describing Videos with Neural Networks

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. Recent advances are starting to enable machines to describe images with sentences. This experiment uses neural networks to automatically describe the content of videos.


This line of work has been the subject of multiple academic papers over the last year. Several of the proposed approaches have been implemented and released as open source:

NeuralTalk: implements the models proposed by Vinyals et al. from Google and by Karpathy and Fei-Fei from Stanford.

Arctic-Captions: implements the models proposed collaboratively by Université de Montréal & University of Toronto.

Visual-concepts: implements the model proposed by Hao Fang et al.


All experiment results were generated with NeuralTalk. It takes an image and predicts its sentence description with a Recurrent Neural Network. The NeuralTalkAnimator was used to process video files.
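Conceptually, captioning a video with a single-image model comes down to decoding frames and captioning them one by one. The sketch below illustrates that loop; `caption_frame` is a hypothetical stand-in for the real RNN model, and the stride parameter is an assumption to avoid near-duplicate captions on consecutive frames, not part of NeuralTalk itself.

```python
# Sketch of a per-frame captioning loop. caption_frame() is a placeholder
# for the trained RNN captioner; a real pipeline would first decode the
# video into frames (e.g. with ffmpeg) before running the model.

def caption_frame(frame):
    """Placeholder for the RNN caption model; returns a sentence for one frame."""
    return "a person riding a horse"  # stand-in prediction

def caption_video(frames, stride=10):
    """Caption every `stride`-th frame to avoid near-duplicate predictions."""
    return [(i, caption_frame(f)) for i, f in enumerate(frames) if i % stride == 0]

frames = ["frame_%03d" % i for i in range(30)]  # stand-ins for decoded images
captions = caption_video(frames, stride=10)     # captions for frames 0, 10, 20
```

The captioned frames can then be rendered back into a video with the sentence overlaid.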

NeuralTalk is fascinating overall. With the right selection of inputs, it works with astounding accuracy and generates informative sentences. And when it fails, the results are often amusing. The inputs and outputs shown here are cherry-picked, balancing accuracy against comedy.


NeuralTalk's model generates natural language descriptions of images. It leverages large datasets of images and their sentence descriptions to learn about the correspondences between language and visual data.

The model is based on a combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities. For more insights, read this great blog post: Image captioning for mortals.
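The structured objective mentioned above scores how well a sentence aligns with an image: each word embedding is matched to its best-fitting image region, and the pair's score sums those matches. The following toy sketch (hand-made two-dimensional vectors, plain dot products) is illustrative only, not the paper's actual implementation.

```python
# Toy sketch of an image-sentence alignment score in the spirit of
# Karpathy & Fei-Fei's model: each word embedding picks its best-matching
# image region via dot product, and the pair score sums these maxima.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def alignment_score(regions, words):
    """Sum over words of the max dot product with any image region."""
    return sum(max(dot(r, w) for r in regions) for w in words)

regions = [[1.0, 0.0], [0.0, 1.0]]  # two region embeddings
words = [[0.9, 0.1], [0.2, 0.8]]    # two word embeddings
score = alignment_score(regions, words)  # 0.9 + 0.8
```

Training pushes the score of matching image-sentence pairs above the scores of mismatched pairs, which is how the two modalities get aligned.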


The NeuralTalkAnimator is a Python helper that creates captioned videos. It takes a folder with videos and returns a folder with processed videos. It's open source on GitHub. Thanks to @karpathy for releasing NeuralTalk! Send input video requests to @samim (<3min, Youtube 720p).

Final Thoughts

The rate of innovation in the field of machine image captioning is astounding. While results might still be inaccurate at times, they are certainly entertaining. The next generation of networks, trained on even bigger datasets, will undoubtedly operate faster and more precisely.

Emerging approaches like Describing Videos by Exploiting Temporal Structure, Action-Conditional Video Prediction using Deep Networks in Atari Games and Searchable Video are highly fascinating. Exciting times!

Keep up with developments at
