Tracking video objects in 300 lines of code

Laurent Picard
Google Cloud - Community
5 min read · Jun 25, 2020

⏳ 2021–10–08 update

  • Updated GitHub version with latest library versions + Python 3.7 → 3.9

👋 Hello!

In this article, you’ll see the following:

  • how to track objects present in a video,
  • with an automated processing pipeline,
  • in less than 300 lines of Python code.

Here is an example of an auto-generated object summary for the video <animals.mp4>:

Tracked object summary for animals.mp4

🛠️ Tools

A few tools will do:

  • Storage space for videos and results
  • A serverless solution to run the code
  • A machine learning model to analyze videos
  • A library to extract frames from videos
  • A library to render the objects

🧱 Architecture

Here is a possible architecture using 3 Google Cloud services (Cloud Storage, Cloud Functions, and the Video Intelligence API):

Architecture

The processing pipeline follows these steps:

  1. You upload a video
  2. The upload event automatically triggers the tracking function
  3. The function sends a request to the Video Intelligence API
  4. The Video Intelligence API analyzes the video and uploads the results (annotations)
  5. The upload event triggers the rendering function
  6. The function downloads both annotation and video files
  7. The function renders and uploads the objects
  8. You know which objects are present in your video!

🐍 Python libraries

  • Video Intelligence API, to analyze videos
  • Cloud Storage, to store videos and results
  • OpenCV, to extract video frames
  • Pillow, to render the objects
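
As a minimal sketch, assuming the PyPI packages google-cloud-videointelligence, google-cloud-storage, opencv-python, and Pillow are listed in requirements.txt, the corresponding imports look like this:

# Assumed packages: google-cloud-videointelligence, google-cloud-storage,
# opencv-python, Pillow
from google.cloud import storage, videointelligence  # video analysis + object storage
import cv2  # OpenCV: frame extraction
from PIL import Image, ImageDraw  # Pillow: object rendering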

🧠 Video analysis

Video Intelligence API

The Video Intelligence API is a pre-trained machine learning model that can analyze videos. One of its many features is detecting and tracking objects. For the 1st Cloud Function, here is a possible core function calling annotate_video() with the OBJECT_TRACKING feature:
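
A minimal sketch of such a core function (the name launch_object_tracking and the annotation naming convention are assumptions, not necessarily the exact source):

from google.cloud import videointelligence


def launch_object_tracking(video_uri: str, annot_bucket: str):
    """Send an asynchronous object tracking request for the given video.

    The API writes the resulting annotations as a JSON file in the
    annotation bucket (naming convention assumed here).
    """
    features = [videointelligence.Feature.OBJECT_TRACKING]
    output_uri = f"gs://{annot_bucket}/{video_uri.replace('gs://', '')}.json"
    request = dict(features=features, input_uri=video_uri, output_uri=output_uri)

    video_client = videointelligence.VideoIntelligenceServiceClient()
    video_client.annotate_video(request)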

Cloud Function entry point
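
A sketch of the entry point, assuming a background function triggered by Cloud Storage events and an environment variable named ANNOTATION_BUCKET (the variable name is an assumption; the function name gcf_track_objects is referenced later in this article):

import os


def gcf_track_objects(data, context):
    """Background Cloud Function triggered by a video upload to the trigger bucket."""
    video_uri = f"gs://{data['bucket']}/{data['name']}"
    annot_bucket = os.getenv("ANNOTATION_BUCKET", "")
    assert annot_bucket, "Undefined ANNOTATION_BUCKET environment variable"
    launch_object_tracking(video_uri, annot_bucket)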

Notes:
• This function will be called when a video is uploaded to the bucket defined as a trigger.
• Using an environment variable makes the code more portable and lets you deploy the exact same code with different trigger and output buckets.

🎨 Object rendering

Code structure

It’s interesting to split the code into 2 main classes:

  • StorageHelper for managing local files and cloud storage objects
  • VideoProcessor for all graphical processing

Here is a possible core function for the 2nd Cloud Function:
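
A possible sketch, assuming both classes can be used as context managers and that VideoProcessor exposes a summary rendering method (the exact interfaces live in the full source):

def render_objects(annot_uri: str, output_bucket: str):
    """Download the annotations and source video, render the objects, upload the results."""
    # StorageHelper and VideoProcessor are the 2 classes described above
    with StorageHelper(annot_uri, output_bucket) as storage:
        with VideoProcessor(storage) as video_proc:
            video_proc.render_object_summary()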

Cloud Function entry point
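
And a sketch of the matching entry point (the function name gcf_render_objects and the OUTPUT_BUCKET variable are assumptions):

import os


def gcf_render_objects(data, context):
    """Background Cloud Function triggered by an annotation file upload."""
    annot_uri = f"gs://{data['bucket']}/{data['name']}"
    output_bucket = os.getenv("OUTPUT_BUCKET", "")
    assert output_bucket, "Undefined OUTPUT_BUCKET environment variable"
    render_objects(annot_uri, output_bucket)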

Note: This function will be called when an annotation file is uploaded to the bucket defined as a trigger.

Frame rendering

OpenCV and Pillow easily let you extract video frames and compose over them:
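
For instance, here is a simplified sketch (not the exact source) that grabs one frame with OpenCV and draws a tracked-object bounding box over it with Pillow; box is assumed to hold the normalized coordinates (left, top, right, bottom) returned by the Video Intelligence API:

import cv2
from PIL import Image, ImageDraw


def render_frame(video: cv2.VideoCapture, frame_index: int, box) -> Image.Image:
    """Extract a video frame and compose a bounding box over it."""
    video.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
    ok, bgr_frame = video.read()
    assert ok, f"Could not read frame {frame_index}"
    # OpenCV frames are BGR NumPy arrays; Pillow works in RGB
    image = Image.fromarray(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    w, h = image.size
    draw = ImageDraw.Draw(image)
    draw.rectangle(
        (box.left * w, box.top * h, box.right * w, box.bottom * h),
        outline=(255, 255, 255),
        width=3,
    )
    return image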

Note: It would probably be possible to use OpenCV alone, but I found it more productive to develop with Pillow (the code is more readable and intuitive).

🔎 Results

Here are the main objects found in the video <JaneGoodall.mp4>:

Tracked object summary for JaneGoodall.mp4

Notes:
• The machine learning model has correctly identified different wildlife species: these are “true positives”. It has also incorrectly identified our planet as “packaged goods”: this is a “false positive”. Machine learning models keep learning by being trained with new samples, so their precision keeps increasing over time (resulting in fewer false positives).
• The current code filters out objects detected with a confidence below 70% or tracked in fewer than 10 frames. Lower the thresholds to get more results (a possible sketch follows).
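
A possible sketch of that filter (the threshold names are mine; confidence and frames are fields of the API’s object annotations):

CONFIDENCE_THRESHOLD = 0.7  # ignore objects detected below 70% confidence
MIN_FRAME_COUNT = 10  # ignore objects tracked in fewer than 10 frames


def keep_annotation(annotation) -> bool:
    """Decide whether a tracked object is worth rendering."""
    return (CONFIDENCE_THRESHOLD <= annotation.confidence
            and MIN_FRAME_COUNT <= len(annotation.frames))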

🍒 Cherry on Py 🐍

Now, the icing on the cake (or the “cherry on the pie”, as we say in French): you can enrich the architecture to add new possibilities:

  • Trigger the processing for videos from any bucket (including external public buckets)
  • Generate individual object animations (in parallel to object summaries)

Architecture (v2)

Architecture (v2)
  • A — Video object tracking can also be triggered manually with an HTTP GET request
  • B — The same rendering code is deployed in 2 sibling functions, differentiated with an environment variable
  • C — Object summaries and animations are generated in parallel

Cloud Function HTTP entry point
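
A possible sketch, reusing launch_object_tracking() from the 1st function (the function name gcf_track_objects_http and the video_uri query parameter name are assumptions):

import os

import flask


def gcf_track_objects_http(request: flask.Request):
    """HTTP Cloud Function expecting a ?video_uri=gs://... query parameter."""
    video_uri = request.args.get("video_uri", "")
    if not video_uri:
        return 'Please specify a "video_uri" parameter', 400
    annot_bucket = os.getenv("ANNOTATION_BUCKET", "")
    assert annot_bucket, "Undefined ANNOTATION_BUCKET environment variable"
    launch_object_tracking(video_uri, annot_bucket)
    return f"Launched object tracking for <{video_uri}>", 200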

Note: This is the same code as gcf_track_objects() with the video URI parameter specified by the caller through a GET request.

🎉 Results

Here are some auto-generated trackings for the video <animals.mp4>:

  • The left elephant (a big object ;) is detected:
Elephant on the left
  • The right elephant is perfectly isolated too:
Elephant on the right
  • The veterinarian is correctly identified:
Person on the left
  • So is the animal he’s feeding:
Animal on the right

Moving objects or static objects in moving shots are tracked too, as in <beyond-the-map-rio.mp4>:

  • A building in a moving shot:
Shot with buildings 1
  • Neighbor buildings are tracked too:
Shot with buildings 2
  • Persons in a moving shot:
Moving persons
  • A surfer crossing the shot:
Surfer

Here are some others for the video <JaneGoodall.mp4>:

  • A butterfly (easy?):
Butterfly
  • An insect, in larval stage, climbing a moving twig:
Caterpillar on moving twig
  • An ape in a tree far away (hard?):
Ape catching bugs in tree far away
  • A monkey jumping from the top of a tree (harder?):
Monkey jumping from tree top
  • Now, a trap! If we can be fooled, the current state of the art in machine learning can be too:
A flower or maybe not a flower

🚀 Source code and deployment

Source code

Deployment

🖖 See you

Do you want more or do you have questions? I’d love to read your feedback. You can also follow me on Twitter.

⏳ Updates

  • 2021–10–08: Updated GitHub version with latest library versions + Python 3.7 → 3.9

📜 Also in this series

  1. Summarizing videos
  2. Tracking video objects
  3. Face detection and processing
  4. Processing images
