Ushering in the New Age of Video Understanding with PyTorch

A guide to open-source tools for efficient dataset and model development and analysis in video understanding, using FiftyOne, PyTorch Lightning, and PyTorchVideo

Published in PyTorch, Jun 22, 2021

Video understanding, while a widely popular and ever-growing field of computer vision, is often held back by the lack of video support in many tools. Hundreds of tools exist to expedite nearly all aspects of the computer vision lifecycle, but they generally only support image data. In recent months, open-source tools have begun to tackle the tooling issues for video-based computer vision.

Recent collaborations between these open-source tools are making it easier than ever to execute your video workflows.

PyTorchVideo

PyTorchVideo is a deep learning library focused on video understanding. It provides reusable, modular, and efficient components that accelerate video understanding research. PyTorchVideo is built on PyTorch and supports deep learning video components such as video models, video datasets, and video-specific transforms.

PyTorch Lightning Flash

Lightning Flash is a new framework built atop PyTorch Lightning that provides a collection of tasks for fast prototyping, baselining, fine-tuning, and solving business and scientific problems with deep learning. Flash has recently been updated to support video tasks backed by PyTorchVideo.

FiftyOne

Visualizing videos, especially with labels, is significantly more difficult than visualizing images. FiftyOne is an open-source tool, developed by Voxel51, for building high-quality datasets and computer vision models. It provides the building blocks for optimizing your dataset analysis pipeline, allowing you to get hands-on with your data, including visualizing complex labels, evaluating your models, exploring scenarios of interest, identifying failure modes, finding annotation mistakes, curating training datasets, and much more.

A more efficient video understanding workflow

The teams behind Lightning Flash and FiftyOne have joined together to support PyTorchVideo and close the loop on video-based workflows, from exploring datasets and training models to visualizing and evaluating results and running distributed, parallelized inference.

Setup

To follow along with the examples in this post, you will need to install PyTorchVideo, Lightning Flash, and FiftyOne. We will also install Kornia, which is used by Flash video tasks.
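As a minimal sketch of this setup step, the check below reports which of the four packages are missing and builds a pip command for them. The pip distribution names (pytorchvideo, lightning-flash, fiftyone, kornia) are assumptions based on the public PyPI names; depending on the Flash release, you may need the video extra (lightning-flash[video]) instead.

```python
import importlib.util

# Import name -> pip distribution name. These are assumed PyPI names.
REQUIRED_PACKAGES = {
    "pytorchvideo": "pytorchvideo",
    "flash": "lightning-flash",
    "fiftyone": "fiftyone",
    "kornia": "kornia",
}


def missing_packages(required=REQUIRED_PACKAGES):
    """Return pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


def install_command(pip_names):
    """Build a single pip command for the given packages ("" if none)."""
    return "pip install " + " ".join(pip_names) if pip_names else ""


if __name__ == "__main__":
    command = install_command(missing_packages())
    print(command or "All packages are already installed.")
```

Run the printed command in your shell or notebook to install whatever is missing.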

Let’s also download some example data to use later.
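One way to fetch example data is sketched below using only the standard library. The archive URL is an assumption (a small Kinetics subset referenced in Flash's video examples) and may have moved; download_and_extract is a hypothetical helper, not part of any of the libraries above.

```python
import urllib.request
import zipfile
from pathlib import Path

# Assumed location of a small Kinetics subset; verify before relying on it.
EXAMPLE_DATA_URL = "https://pl-flash-data.s3.amazonaws.com/kinetics.zip"


def download_and_extract(url, target_dir):
    """Download a zip archive (skipped if already present) and unpack it."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive = target / url.rsplit("/", 1)[-1]
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zip_file:
        zip_file.extractall(target)
    return target


# Usage (requires network access):
# data_dir = download_and_extract(EXAMPLE_DATA_URL, "data")
```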

Exploring datasets

One of the reasons that FiftyOne was created was to fill the void of open-source dataset visualization and exploration tools. FiftyOne makes it easy to load datasets either in existing formats or in custom formats and visualize them in the FiftyOne App.
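As a rough illustration, loading a folder of videos organized as one subdirectory per class could look like the sketch below. It assumes FiftyOne's VideoClassificationDirectoryTree dataset type fits your layout; list_classes and load_video_dataset are hypothetical helper names, and the data path in the usage comment is a placeholder.

```python
from pathlib import Path


def list_classes(dataset_dir):
    """Class names in a <dataset_dir>/<class>/<video> directory tree."""
    return sorted(p.name for p in Path(dataset_dir).iterdir() if p.is_dir())


def load_video_dataset(dataset_dir, name=None):
    """Load a directory tree of videos into a FiftyOne dataset."""
    import fiftyone as fo  # imported lazily; requires fiftyone to be installed

    return fo.Dataset.from_dir(
        dataset_dir=str(dataset_dir),
        dataset_type=fo.types.VideoClassificationDirectoryTree,
        name=name,
    )


# Usage, assuming videos live under the placeholder path data/kinetics/train:
# import fiftyone as fo
# print(list_classes("data/kinetics/train"))
# dataset = load_video_dataset("data/kinetics/train")
# session = fo.launch_app(dataset)  # opens the FiftyOne App
```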

A video dataset visualized in the FiftyOne App (Image by author)

It also provides the concept of views into your dataset utilizing a powerful query language that lets you dig in and better understand your datasets.
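For instance, a view that keeps a random handful of samples from selected classes might be built as in this sketch. It assumes labels are stored in a ground_truth field; build_view is a hypothetical helper, and the class names in the usage comment are placeholders.

```python
def build_view(dataset, labels, max_samples=10, seed=51):
    """Return a view with up to `max_samples` random samples of the given classes."""
    from fiftyone import ViewField as F  # lazy import; requires fiftyone

    # Keep only samples whose ground-truth label is one of `labels`.
    in_selected_classes = F("ground_truth.label").is_in(list(labels))
    # Then randomly pick at most `max_samples` of them (seeded for repeatability).
    return dataset.match(in_selected_classes).take(max_samples, seed=seed)


# Usage:
# view = build_view(dataset, ["archery", "bowling"])
# print(view.count_values("ground_truth.label"))
# session.view = view  # show only this view in the App
```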

A dataset view visualized in the FiftyOne App (Image by author)

These datasets and views can then be passed directly into the Flash datamodules and used for task finetuning or prediction.

These datamodules are customizable, allowing you to specify batch sizes, transforms, PyTorchVideo clip samplers, and more.
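Passing FiftyOne datasets or views into a Flash datamodule could look roughly like this. The sketch assumes a Flash release whose VideoClassificationData exposes from_fiftyone with these keyword arguments; the option values are illustrative defaults, and both function names are hypothetical.

```python
def datamodule_kwargs(batch_size=8, clip_sampler="uniform", clip_duration=1,
                      decode_audio=False):
    """Collect the datamodule options mentioned above into one dict."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return {
        "batch_size": batch_size,
        "clip_sampler": clip_sampler,    # name of a PyTorchVideo clip sampler
        "clip_duration": clip_duration,  # seconds per sampled clip
        "decode_audio": decode_audio,
    }


def build_datamodule(train_dataset, val_dataset=None, **options):
    """Create a Flash video datamodule from FiftyOne datasets or views."""
    from flash.video import VideoClassificationData  # lazy import; requires Flash

    return VideoClassificationData.from_fiftyone(
        train_dataset=train_dataset,
        val_dataset=val_dataset,
        **datamodule_kwargs(**options),
    )


# Usage:
# datamodule = build_datamodule(train_view, val_view, batch_size=4)
```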

Training models

Lightning Flash is designed to let you hit the ground running and start training models for tasks relevant to you in only a few lines of code. If you are working on video tasks, you can use Flash to load PyTorchVideo models directly:
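A minimal sketch of loading a PyTorchVideo model through Flash follows. It assumes the VideoClassifier task accepts num_classes, backbone, and pretrained keyword arguments, and that x3d_xs is an available backbone name in your Flash release; build_video_classifier is a hypothetical wrapper.

```python
def build_video_classifier(num_classes, backbone="x3d_xs", pretrained=True):
    """Create a Flash video classification task on a PyTorchVideo backbone."""
    from flash.video import VideoClassifier  # lazy import; requires Flash

    return VideoClassifier(
        num_classes=num_classes,
        backbone=backbone,      # name of a PyTorchVideo model
        pretrained=pretrained,  # start from pretrained weights if True
    )


# Usage:
# model = build_video_classifier(num_classes=datamodule.num_classes)
```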

We can then easily use a PyTorch Lightning trainer to finetune the PyTorchVideo model using the datamodule we constructed in the previous section. Training can be scaled to the cloud without any code modifications using a platform such as Grid.ai.

Once training completes, we can save the model for future use.
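The finetuning and saving steps can be sketched as below. The code assumes a flash.Trainer whose finetune method accepts a strategy argument; the checkpoint filename and the trainer settings in the usage comment are placeholders, and finetune_and_save is a hypothetical helper.

```python
def finetune_and_save(trainer, model, datamodule, checkpoint_path,
                      strategy="freeze"):
    """Finetune `model` on `datamodule`, then write a checkpoint to disk."""
    # With "freeze", the backbone stays fixed and only the head is trained.
    trainer.finetune(model, datamodule=datamodule, strategy=strategy)
    trainer.save_checkpoint(checkpoint_path)
    return checkpoint_path


# Usage:
# import flash
# trainer = flash.Trainer(max_epochs=1, limit_train_batches=5)
# finetune_and_save(trainer, model, datamodule, "video_classification.pt")
```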

Visualizing and Evaluating Results

The integration between FiftyOne and Lightning Flash allows you to evaluate the models you train in just a few lines of code. Let’s start by loading a checkpoint like the one saved in the previous section:
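Since Flash tasks are Lightning modules, reloading a saved task can be sketched with load_from_checkpoint. The checkpoint path in the usage comment is a placeholder, and load_video_classifier is a hypothetical wrapper.

```python
def load_video_classifier(checkpoint_path):
    """Restore a finetuned video classification task from a checkpoint."""
    from flash.video import VideoClassifier  # lazy import; requires Flash

    return VideoClassifier.load_from_checkpoint(checkpoint_path)


# Usage:
# model = load_video_classifier("video_classification.pt")
```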

Flash tasks support various prediction serializers that return model results in specific formats. One of these serializers returns labels in the FiftyOne format, allowing them to be added directly to a FiftyOne dataset:
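A hedged sketch of this step is shown below. The helpers are hypothetical and only rely on FiftyOne's count and set_values methods. The Flash side appears in the usage comment and assumes a release that ships a FiftyOneLabels serializer in flash.core.classification; the serializer's name and location have varied between Flash versions.

```python
from itertools import chain


def flatten_batches(batched_predictions):
    """Turn a list of per-batch prediction lists into one flat list."""
    return list(chain.from_iterable(batched_predictions))


def add_predictions(dataset, predictions, field="predictions"):
    """Store one prediction per sample on a FiftyOne dataset or view.

    `predictions` must be in the same order as the samples in `dataset`.
    """
    if len(predictions) != dataset.count():
        raise ValueError("expected exactly one prediction per sample")
    dataset.set_values(field, predictions)
    return dataset


# Usage (Flash calls are assumptions about the installed release):
# from flash.core.classification import FiftyOneLabels
# model.serializer = FiftyOneLabels()
# predictions = model.predict(dataset.values("filepath"))
# add_predictions(dataset, predictions)
# If predictions come back grouped by batch, call flatten_batches first.
```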

PyTorchVideo and Flash model predictions visualized in FiftyOne (Image by author)

FiftyOne provides evaluation capabilities for classification, detection, and segmentation tasks, letting you compute metrics like accuracy and mAP, view interactive confusion matrices, plot precision-recall curves, and more. Let’s run evaluation on the ground truth and the newly added predictions on our dataset and plot a confusion matrix:
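Evaluation along these lines can be sketched with FiftyOne's evaluate_classifications method. The field names predictions and ground_truth and the eval key are assumptions about how the dataset was populated; evaluate_predictions is a hypothetical helper.

```python
def evaluate_predictions(dataset, pred_field="predictions",
                         gt_field="ground_truth", eval_key="eval"):
    """Evaluate classifications and return (results, confusion-matrix plot)."""
    results = dataset.evaluate_classifications(
        pred_field, gt_field=gt_field, eval_key=eval_key
    )
    results.print_report()  # per-class precision, recall, and F1 score
    plot = results.plot_confusion_matrix()
    return results, plot


# Usage:
# results, plot = evaluate_predictions(dataset)
# plot.show()
```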

A confusion matrix visualized in FiftyOne showing Flash predictions (Image by author)

FiftyOne plots are interactive, meaning you can attach them to a session object so that the App automatically updates when you interact with the plot. For example, we can click a cell of the confusion matrix to view all related samples in the App.
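Attaching a plot to a session can be sketched as follows, assuming a session returned by fo.launch_app and a plot returned by plot_confusion_matrix; attach_plot is a hypothetical helper around the session's plots.attach call.

```python
def attach_plot(session, plot):
    """Link an interactive plot to an App session so selections stay in sync."""
    session.plots.attach(plot)
    return session


# Usage:
# import fiftyone as fo
# session = fo.launch_app(dataset)
# attach_plot(session, plot)
# plot.show()
```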

Viewing plots in notebooks requires installing ipywidgets, for example by running pip install ipywidgets.

Interactively visualizing a confusion matrix with FiftyOne (Image by author)

Scalable inference

Lightning Flash is built on top of PyTorch Lightning, which is a thin organizational layer on top of PyTorch. As a result, Flash can scale up across any hardware (GPUs, TPUs) with zero changes to your code. It also has the best practices in AI research embedded into each task so you don’t have to be a deep learning PhD to leverage its power.
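To illustrate, scaling inference is mostly a matter of which arguments the trainer receives. The sketch below assumes a Lightning version that accepts accelerator and devices (releases contemporary with this post used a gpus argument instead) and a datamodule that was built with prediction data; both function names are hypothetical.

```python
def scaling_kwargs(num_devices=0):
    """Trainer keyword arguments for default (0) or multi-GPU execution."""
    if num_devices < 0:
        raise ValueError("num_devices must not be negative")
    if num_devices == 0:
        return {}  # let the trainer fall back to its default device
    return {"accelerator": "gpu", "devices": num_devices}


def run_inference(model, datamodule, num_devices=0):
    """Run prediction with the same code on one or many devices."""
    import flash  # lazy import; requires Flash

    trainer = flash.Trainer(**scaling_kwargs(num_devices))
    return trainer.predict(model, datamodule=datamodule)


# Usage:
# predictions = run_inference(model, datamodule, num_devices=2)
```

Only the trainer arguments change between the single-device and multi-device runs; the model and datamodule code stays the same.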

Lightning Flash and FiftyOne integration

The integration between Lightning Flash and FiftyOne goes beyond just video understanding. In fact, nearly any FiftyOne dataset can now be loaded into Flash to train tasks. Additionally, predictions returned by Flash tasks can now be easily loaded, visualized, and evaluated in FiftyOne, helping you create better models, faster. You can use this integration for the following tasks:

Using FiftyOne to visualize object detection generated by Lightning Flash (Image by author)

Summary

Video data is being collected at massive scales, but up until now it has been difficult to fully utilize due to a lack of video-based tooling. The release of PyTorchVideo and the integrations of PyTorch Lightning Flash and FiftyOne can expedite nearly every aspect of the video understanding workflow, from dataset exploration and model training to analysis, visualization, and scalable inference. The possibilities for your video data are now greater than ever!

Disclosure

This article was a collaborative effort between the PyTorch Lightning Flash and Voxel51 teams.