Tracking video objects in 300 lines of code

Laurent Picard
Google Cloud - Community
5 min read · Jun 25, 2020

⏳ 2021–10–08 update

  • Updated GitHub version with latest library versions + Python 3.7 → 3.9

👋 Hello!

In this article, you’ll see the following:

  • how to track objects present in a video,
  • with an automated processing pipeline,
  • in less than 300 lines of Python code.

Here is an example of an auto-generated object summary for the video <animals.mp4>:

Tracked object summary for animals.mp4

🛠️ Tools

A few tools will do:

  • Storage space for videos and results
  • A serverless solution to run the code
  • A machine learning model to analyze videos
  • A library to extract frames from videos
  • A library to render the objects

🧱 Architecture

Here is a possible architecture using 3 Google Cloud services (Cloud Storage, Cloud Functions, and the Video Intelligence API):

Architecture

The processing pipeline follows these steps:

  1. You upload a video
  2. The upload event automatically triggers the tracking function
  3. The function sends a request to the Video Intelligence API
  4. The Video Intelligence API analyzes the video and uploads the results (annotations)
  5. The upload event triggers the rendering function
  6. The function downloads both annotation and video files
  7. The function renders and uploads the objects
  8. You know which objects are present in your video!

🐍 Python libraries

  • Video Intelligence API, to analyze videos
  • Cloud Storage, to store videos and results
  • OpenCV, to extract video frames
  • Pillow, to render the objects
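
As a minimal sketch, assuming the PyPI packages google-cloud-videointelligence, google-cloud-storage, opencv-python, and Pillow are listed in requirements.txt, the corresponding imports look like this:

# Assumed packages: google-cloud-videointelligence, google-cloud-storage,
# opencv-python, Pillow
from google.cloud import storage, videointelligence  # video analysis + object storage
import cv2  # OpenCV: frame extraction
from PIL import Image, ImageDraw  # Pillow: object rendering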

🧠 Video analysis

Video Intelligence API

The Video Intelligence API is a pre-trained machine learning model that can analyze videos. One of its many features is detecting and tracking objects. For the 1st Cloud Function, here is a possible core function calling annotate_video() with the OBJECT_TRACKING feature:
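
A minimal sketch of such a core function (the name launch_object_tracking and the annotation naming convention are assumptions, not necessarily the exact source):

from google.cloud import videointelligence


def launch_object_tracking(video_uri: str, annot_bucket: str):
    """Send an asynchronous object tracking request for the given video.

    The API writes the resulting annotations as a JSON file in the
    annotation bucket (naming convention assumed here).
    """
    features = [videointelligence.Feature.OBJECT_TRACKING]
    output_uri = f"gs://{annot_bucket}/{video_uri.replace('gs://', '')}.json"
    request = dict(features=features, input_uri=video_uri, output_uri=output_uri)

    video_client = videointelligence.VideoIntelligenceServiceClient()
    video_client.annotate_video(request)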

Cloud Function entry point
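
A sketch of the entry point, assuming a background function triggered by Cloud Storage events and an environment variable named ANNOTATION_BUCKET (the variable name is an assumption; the function name gcf_track_objects is referenced later in this article):

import os


def gcf_track_objects(data, context):
    """Background Cloud Function triggered by a video upload to the trigger bucket."""
    video_uri = f"gs://{data['bucket']}/{data['name']}"
    annot_bucket = os.getenv("ANNOTATION_BUCKET", "")
    assert annot_bucket, "Undefined ANNOTATION_BUCKET environment variable"
    launch_object_tracking(video_uri, annot_bucket)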

Notes:
• This function will be called when a video is uploaded to the bucket defined as a trigger.
• Using an environment variable makes the code more portable and lets you deploy the exact same code with different trigger and output buckets.

🎨 Object rendering

Code structure

It’s interesting to split the code into 2 main classes:

  • StorageHelper for managing local files and cloud storage objects
  • VideoProcessor for all graphical processing

Here is a possible core function for the 2nd Cloud Function:
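
A possible sketch, assuming both classes can be used as context managers and that VideoProcessor exposes a summary rendering method (the exact interfaces live in the full source):

def render_objects(annot_uri: str, output_bucket: str):
    """Download the annotations and source video, render the objects, upload the results."""
    # StorageHelper and VideoProcessor are the 2 classes described above
    with StorageHelper(annot_uri, output_bucket) as storage:
        with VideoProcessor(storage) as video_proc:
            video_proc.render_object_summary()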

Cloud Function entry point
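
And a sketch of the matching entry point (the function name gcf_render_objects and the OUTPUT_BUCKET variable are assumptions):

import os


def gcf_render_objects(data, context):
    """Background Cloud Function triggered by an annotation file upload."""
    annot_uri = f"gs://{data['bucket']}/{data['name']}"
    output_bucket = os.getenv("OUTPUT_BUCKET", "")
    assert output_bucket, "Undefined OUTPUT_BUCKET environment variable"
    render_objects(annot_uri, output_bucket)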

Note: This function will be called when an annotation file is uploaded to the bucket defined as a trigger.

Frame rendering

OpenCV and Pillow easily let you extract video frames and compose over them:
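
For instance, here is a simplified sketch (not the exact source) that grabs one frame with OpenCV and draws a tracked-object bounding box over it with Pillow; box is assumed to hold the normalized coordinates (left, top, right, bottom) returned by the Video Intelligence API:

import cv2
from PIL import Image, ImageDraw


def render_frame(video: cv2.VideoCapture, frame_index: int, box) -> Image.Image:
    """Extract a video frame and compose a bounding box over it."""
    video.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
    ok, bgr_frame = video.read()
    assert ok, f"Could not read frame {frame_index}"
    # OpenCV frames are BGR NumPy arrays; Pillow works in RGB
    image = Image.fromarray(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    w, h = image.size
    draw = ImageDraw.Draw(image)
    draw.rectangle(
        (box.left * w, box.top * h, box.right * w, box.bottom * h),
        outline=(255, 255, 255),
        width=3,
    )
    return image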

Note: It would probably be possible to use OpenCV alone, but I found it more productive to develop with Pillow (the code is more readable and intuitive).

🔎 Results

Here are the main objects found in the video <JaneGoodall.mp4>:

Tracked object summary for JaneGoodall.mp4

Notes:
• The machine learning model has correctly identified different wildlife species: these are “true positives”. It has also incorrectly identified our planet as “packaged goods”: this is a “false positive”. Machine learning models keep learning by being trained with new samples, so their precision keeps increasing over time (resulting in fewer false positives).
• The current code filters out objects detected with a confidence below 70% or tracked in fewer than 10 frames. Lower the thresholds to get more results (a possible sketch follows).
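
A possible sketch of that filter (the threshold names are mine; confidence and frames are fields of the API’s object annotations):

CONFIDENCE_THRESHOLD = 0.7  # ignore objects detected below 70% confidence
MIN_FRAME_COUNT = 10  # ignore objects tracked in fewer than 10 frames


def keep_annotation(annotation) -> bool:
    """Decide whether a tracked object is worth rendering."""
    return (CONFIDENCE_THRESHOLD <= annotation.confidence
            and MIN_FRAME_COUNT <= len(annotation.frames))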

🍒 Cherry on Py 🐍

Now, the icing on the cake (or the “cherry on the pie”, as we say in French): you can enrich the architecture to add new possibilities:

  • Trigger the processing for videos from any bucket (including external public buckets)
  • Generate individual object animations (in parallel to object summaries)

Architecture (v2)

Architecture (v2)
  • A — Video object tracking can also be triggered manually with an HTTP GET request
  • B — The same rendering code is deployed in 2 sibling functions, differentiated with an environment variable
  • C — Object summaries and animations are generated in parallel

Cloud Function HTTP entry point
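
A possible sketch, reusing launch_object_tracking() from the 1st function (the function name gcf_track_objects_http and the video_uri query parameter name are assumptions):

import os

import flask


def gcf_track_objects_http(request: flask.Request):
    """HTTP Cloud Function expecting a ?video_uri=gs://... query parameter."""
    video_uri = request.args.get("video_uri", "")
    if not video_uri:
        return 'Please specify a "video_uri" parameter', 400
    annot_bucket = os.getenv("ANNOTATION_BUCKET", "")
    assert annot_bucket, "Undefined ANNOTATION_BUCKET environment variable"
    launch_object_tracking(video_uri, annot_bucket)
    return f"Launched object tracking for <{video_uri}>", 200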

Note: This is the same code as gcf_track_objects() with the video URI parameter specified by the caller through a GET request.

🎉 Results

Here are some auto-generated trackings for the video <animals.mp4>:

  • The left elephant (a big object ;) is detected:
Elephant on the left
  • The right elephant is perfectly isolated too:
Elephant on the right
  • The veterinarian is correctly identified:
Person on the left
  • So is the animal he’s feeding:
Animal on the right

Moving objects or static objects in moving shots are tracked too, as in <beyond-the-map-rio.mp4>:

  • A building in a moving shot:
Shot with buildings 1
  • Neighbor buildings are tracked too:
Shot with buildings 2
  • Persons in a moving shot:
Moving persons
  • A surfer crossing the shot:
Surfer

Here are some others for the video <JaneGoodall.mp4>:

  • A butterfly (easy?):
Butterfly
  • An insect, in larval stage, climbing a moving twig:
Caterpillar on moving twig
  • An ape in a tree far away (hard?):
Ape catching bugs in tree far away
  • A monkey jumping from the top of a tree (harder?):
Monkey jumping from tree top
  • Now, a trap! If we can be fooled, the current state of the art in machine learning can be too:
A flower or maybe not a flower

🚀 Source code and deployment

Source code

Deployment

🖖 See you

Do you want more or do you have questions? I’d love to read your feedback. You can also follow me on Twitter.

⏳ Updates

  • 2021–10–08: Updated GitHub version with latest library versions + Python 3.7 → 3.9

📜 Also in this series

  1. Summarizing videos
  2. Tracking video objects
  3. Face detection and processing
  4. Processing images
