Distributed Deep Learning Video Analysis using Spark And Kafka

Valerio Zamboni
Published in metaliquid
Apr 10, 2019

In recent years, great efforts have been made to bring the latest state-of-the-art results in deep learning to the community. Numerous public repositories accompany published articles, allowing everyone to study and understand the amazing progress made in the academic world of deep learning.

But bringing those amazing efforts to the real world is not so easy.

How can we transfer all the knowledge acquired from the academic world into a production-ready environment integrated into a customer’s workflow?

One of the main problems we faced was object detection. We approached it by investigating the current literature, and from the various approaches we found we built a knowledge baseline for developing our own custom object detector.

An example of these awesome models for object detection is YOLOv2, which its author built on top of a completely custom deep learning framework: terrific at classifying holiday pictures, but not well suited to real-world problems.

Fig.1 — Terrific results on holiday pictures with YOLOv2

We enjoyed playing with those technologies so much that we wanted to apply them to video content such as movies, TV series and live shows, and see if, for example, we could correctly identify all the animals in movies like Ace Ventura: When Nature Calls.

For this purpose, we need an infrastructure which allows us to extract all the frames of the video content, analyse them with multiple models in order to identify different concepts, and output descriptive metadata about what is happening in each scene. What is described here is an architecture which satisfies multiple requirements:

  • Easily integrated with existing environments
  • Ability to scale up/down depending on customers' needs, providing real-time metadata of the analysed video content
  • Capability to be monitored and to visualize the results
  • Some degree of fault tolerance
  • Ability to interact with different hardware such as GPUs or CPUs

Our previous experience in the big-data world suggested a good way to tackle the problem: a distributed approach to deep-learning-based analysis of video content using top-notch technologies and frameworks such as Apache Spark and Apache Kafka, together with the big-data-oriented database Apache Cassandra, all built on a JVM environment and the Scala programming language.

If none of these names is familiar to you, don't worry: in the following sections I'll explain exactly why we need them.

Apache Kafka

Apache Kafka is a streaming platform which its website presents as having three key capabilities:

  1. Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
  2. Store streams of records in a fault-tolerant durable way.
  3. Process streams of records as they occur.

The key point here is the stream of records, which in our case is a video stream composed of serialized video frames pushed onto Kafka, which acts as a distributed queue.
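To make the idea concrete, here is a minimal sketch of what a serialized frame record could look like before being pushed onto Kafka. The `Frame` schema and the field layout are illustrative assumptions, not the production format:

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Illustrative frame record: which video, which frame, when, and the encoded image.
case class Frame(videoId: String, index: Long, timestampMs: Long, jpeg: Array[Byte])

// Pack a frame into the byte payload of a Kafka record:
// [idLen][videoId][index][timestampMs][jpegLen][jpeg]
def toBytes(f: Frame): Array[Byte] = {
  val id = f.videoId.getBytes(StandardCharsets.UTF_8)
  val buf = ByteBuffer.allocate(4 + id.length + 8 + 8 + 4 + f.jpeg.length)
  buf.putInt(id.length).put(id)
  buf.putLong(f.index).putLong(f.timestampMs)
  buf.putInt(f.jpeg.length).put(f.jpeg)
  buf.array()
}

// Unpack the payload on the consumer side.
def fromBytes(bytes: Array[Byte]): Frame = {
  val buf = ByteBuffer.wrap(bytes)
  val id = new Array[Byte](buf.getInt); buf.get(id)
  val index = buf.getLong
  val ts = buf.getLong
  val jpeg = new Array[Byte](buf.getInt); buf.get(jpeg)
  Frame(new String(id, StandardCharsets.UTF_8), index, ts, jpeg)
}
```

Using the video id as the Kafka record key would additionally guarantee that all frames of the same video land in the same partition, preserving their order.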

This approach to frame extraction allows us to completely decouple the data source (a single video, multiple videos, one or more live video streams from UDP sockets) from the actual analysis pipeline.

With multiple video sources feeding the same pipeline, the core analysis software now reads from a distributed queue holding all the frames of every source, which is a perfect scenario for a distributed and scalable application.
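A producer pushing frames onto this queue needs only a few settings. The sketch below builds a standard Kafka producer configuration using property names from the official client (broker address and durability choices are assumptions for illustration):

```scala
import java.util.Properties

// Build a Kafka producer configuration for pushing serialized frames.
def producerConfig(brokers: String): Properties = {
  val props = new Properties()
  props.put("bootstrap.servers", brokers)
  // Frames travel as opaque byte payloads, so byte-array serializers fit.
  props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
  // Durability: wait for all in-sync replicas to acknowledge each frame.
  props.put("acks", "all")
  props
}
```

These properties would then be passed to a `KafkaProducer[Array[Byte], Array[Byte]]` from the `kafka-clients` library.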

Apache Spark

Spark is described by its creators as “a fast and general-purpose cluster computing system”.

In the literature you can find a huge amount of material on Spark, as it is the most commonly chosen software for big-data analytics.

In a few words, it's a distributed engine which lets the developer forget about the distributed part of the software and focus on the analytics logic. All collections in Spark are distributed collections which can be obtained from numerous data sources and, guess what, Kafka is one of the most commonly used data sources for such computations.

For this purpose we are going to use an extension of the core module of Spark called Spark Streaming.

With Spark Streaming we can literally plug our computation into a potentially unlimited source of data and perform a completely distributed analysis on a cluster, with a dynamic amount of resources depending on the actual data load.
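Under the hood, Spark Streaming works on micro-batches: the incoming stream is chopped into fixed-length intervals, and each interval is processed as one distributed batch. The idea can be sketched in plain Scala without a Spark dependency (function name and interval are illustrative):

```scala
// Group frame arrival timestamps (ms) into micro-batches of `batchMs` length,
// keyed by batch index -- the same slicing Spark Streaming applies to a DStream.
def microBatches(timestampsMs: Seq[Long], batchMs: Long): Map[Long, Seq[Long]] =
  timestampsMs.groupBy(ts => ts / batchMs)
```

The batch interval is the key tuning knob: shorter intervals reduce latency but increase scheduling overhead, which matters for the real-time considerations discussed in the conclusion.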

Here is where the real computation happens: every node in the cluster is equipped with the desired hardware (generally GPUs, for faster computation) and can use the preferred deep learning models with the preferred deep learning framework (TensorFlow, PyTorch, MXNet, Caffe, etc.).

Nodes in the cluster are not specialized in a single task: each one can potentially provide metadata for every classification task in the system (face detection, object detection, brand detection, etc.) on every frame of the video stream.

What is really interesting about this architecture is that you don't need to handle distributing the frames across the cluster yourself; you can just focus on your inference and metadata aggregation logic.
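The per-node inference logic then reduces to a function from frames to metadata rows, which Spark would apply partition by partition (e.g. via `mapPartitions`). Below is a hedged pure-Scala sketch: `Detection` and the stand-in `detect` function are placeholders for what would really be a GPU-backed network call:

```scala
// One row of extracted metadata for a single frame.
case class Detection(videoId: String, frameIndex: Long, label: String, confidence: Double)

// Stand-in model: in production this would invoke TensorFlow/PyTorch/MXNet/etc.
def detect(jpeg: Array[Byte]): Seq[(String, Double)] =
  if (jpeg.nonEmpty) Seq(("elephant", 0.91)) else Seq.empty

// The function an executor runs over its partition of (videoId, index, jpeg) frames:
// decode, run inference, and emit one Detection per recognized concept.
def analysePartition(frames: Iterator[(String, Long, Array[Byte])]): Iterator[Detection] =
  frames.flatMap { case (videoId, idx, jpeg) =>
    detect(jpeg).map { case (label, conf) => Detection(videoId, idx, label, conf) }
  }
```

Because the function takes and returns iterators, loading the model once per partition (rather than once per frame) comes naturally, which is what makes GPU inference inside Spark efficient.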

Apache Cassandra

The purpose of this architecture is to extract descriptive metadata that is as precise as possible about what is happening in a video, and we need a place to store it.

The amount of frame-by-frame metadata extracted is the factor which led us towards a big-data-oriented database such as Cassandra. This database allows us to scale the number of nodes up and down according to the amount of data we expect, and leaves us ready for future models and classification patterns.
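As an illustration, a Cassandra table for frame-level detections could look like the sketch below. Keyspace, table and column names are assumptions, not the production data model; partitioning by video id keeps all metadata for one video on the same set of nodes, while the clustering columns order detections by frame:

```cql
-- Illustrative schema for frame-level detection metadata.
CREATE TABLE IF NOT EXISTS video_analysis.detections (
    video_id   text,
    frame_idx  bigint,
    model      text,
    label      text,
    confidence double,
    PRIMARY KEY ((video_id), frame_idx, model, label)
);
```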

All Together

In Fig.2 you can see the overall architecture of the application proposed. All the pieces described previously fit together allowing a completely distributed video analysis.

The contact point for users will be the Cassandra DB, on which we can build applications able to interact with the metadata extracted from the analysed video content and display it in whatever way we prefer.

The real core of the artificial intelligence is the Apache Spark infrastructure, which lets us completely forget about network- and distribution-related issues and focus only on the actual machine learning algorithms.

Fig.2 — Overall distributed video-analysis architecture

Results

As promised in the introduction (and probably the reason you have read this far), we can now show the results of a video analysed with the presented infrastructure.

Results are visualized through a web interface where the extracted metadata is plotted directly onto the input video.

Fig.3 — Data visualization with Metaliquid player

Now elephants and other jungle animals are recognized as entities in this short video, along with people and vehicles, and the data is persisted in Cassandra, ready to be exploited and analysed.

Conclusion

In this article we did not go through many technical details of the implementation, but we presented a framework intended as a general structure for distributed video analysis.

This approach gave us optimal results for offline analysis; the ability to scale resources up and down gave us the flexibility to reach the analysis speed we wanted.

Spark Streaming, however, is not true stream processing but rather micro-batch processing: even with small batches of frames there is necessarily some latency between data ingestion and the produced results, making the whole approach not really suitable for hard real-time stream analysis.

I hope this gave you an idea of how academic results can be brought to the real world and, of course, for any questions do not hesitate to ask.

DISCLAIMER

“No animals were harmed in the making of this article”
