Feasible video processing on Hivecell

Dasha Korotkykh
Published in Hivecell
3 min read · Jun 1, 2020

Recently, we ran a small demo in which we processed a simple temperature-differential dataset with Confluent Kafka at the edge and saw a sweet 9.3x improvement in efficiency.
This time, Pavlo Lobachov and George Barvinok at Hivecell experimented with an affordable way to process a video feed at the source, one that can also preserve the anonymity of the people being observed. Let's see what numbers we get.

The setup remains simple:

a) a Raspberry Pi camera;
b) a Hivecell unit (remember this when we get to power consumption) running:

* an MQTT Proxy, which collects the images from the camera, and a ZooKeeper server;

* object detection models (we experimented with MobileNet SSD v2 and YOLOv3, both pre-trained on the COCO dataset);

* three Kafka brokers, Confluent Replicator, and Kafka Streams.

c) a com.rlr topic in Confluent Cloud that receives JSON records describing the recognized objects (a minimal producer sketch follows below).
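
To make the last step concrete, here is a minimal sketch of what publishing a detection record could look like with the confluent-kafka Python client. The record fields (label, confidence, bounding box, timestamp) and the connection settings are our illustrative assumptions, not the exact schema used in the demo; only the com.rlr topic name comes from the setup above.

```python
import json
import time

from confluent_kafka import Producer

# Connection settings are placeholders; in the demo the brokers run on the
# Hivecell unit and replicate out to Confluent Cloud via Confluent Replicator.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_detection(label, confidence, box):
    """Send one recognized object as a JSON record to the com.rlr topic.

    The field names are illustrative; the demo only specifies that the topic
    receives JSON objects with the recognized key values.
    """
    record = {
        "label": label,            # e.g. "person"
        "confidence": confidence,  # model score, 0..1
        "box": box,                # [x, y, width, height] in pixels
        "ts": int(time.time() * 1000),
    }
    producer.produce("com.rlr", value=json.dumps(record).encode("utf-8"))

# Example: one detection from a single frame.
publish_detection("person", 0.84, [120, 60, 180, 390])
producer.flush()
```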

Customers might have different objectives in mind, such as processing speed or the quality and quantity of recognition per frame, depending on their business case. Let's compare the outputs we got from the two pre-trained models:

*The brackets on the snapshots below were added manually to highlight the recognized objects.

SSD MobileNet v2 is faster but shows a lower recognition confidence rate.

Still, on a Pi camera shot of a regular office workspace with medium lighting, it detects "person" objects surprisingly well, capturing one person in the foreground and another in the background (we sincerely didn't even know someone was there before running the test).

A dry run with YOLOv3 on a similar image captured six different objects with higher confidence, averaging 0.763.
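
For reference, here is a minimal sketch of how such a per-frame average confidence can be computed with OpenCV's DNN module and a YOLOv3 model pre-trained on COCO. The file names, input size, and threshold are assumptions for illustration; the demo does not publish its exact inference code.

```python
import cv2
import numpy as np

# Placeholders for the standard Darknet YOLOv3 config and weights files.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

def frame_confidences(frame, threshold=0.5):
    """Run YOLOv3 on one frame and return confidences of detections above the threshold."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())
    scores = []
    for output in outputs:
        for detection in output:
            class_scores = detection[5:]          # per-class scores follow the box fields
            confidence = float(np.max(class_scores))
            if confidence >= threshold:
                scores.append(confidence)
    return scores

frame = cv2.imread("office_snapshot.jpg")  # placeholder for a pi-cam frame
scores = frame_confidences(frame)
print(len(scores), "objects, average confidence:",
      round(sum(scores) / len(scores), 3) if scores else "n/a")
```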

For the third test we ran MobileNet SSD again, but this time as a GPU-optimized build.

Average confidence per frame went down, but keep in mind that this time the feed ran at 40 frames per second, roughly 3.5 times faster than before.

On average, 10–12 frames per second are enough for consistent video processing, so one unit could handle the feed from four cameras simultaneously (40 fps ÷ ~10 fps per stream).

When the objective is to record object metadata rather than replicate and store the images themselves, processing frames locally and sending only the object records to the cloud naturally reduces storage, cloud computing costs, bandwidth, and power consumption. But using an average datacenter server for this would be overkill in both capacity and cost.

This is where the Hivecell edge server that manages the SaaS layer produces real business value. With the GPU-optimized build, the whole hardware unit drew only 17 watts over an hour of computing. Take a look at the numbers:

Processing 40 fps works out to 144,000 frames per hour. At a 17-watt draw, that means one Hivecell delivers roughly 8,500 frames per watt-hour, or about 2.4 frames per second per watt. Convert that into money!
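
For completeness, the arithmetic behind those figures, using only the numbers quoted above (a 40 fps feed and a 17 W draw over one hour):

```python
FPS = 40            # GPU-optimized MobileNet SSD feed rate
POWER_W = 17        # measured draw of the whole Hivecell unit
SECONDS_PER_HOUR = 3600

frames_per_hour = FPS * SECONDS_PER_HOUR          # 144,000 frames
frames_per_watt_hour = frames_per_hour / POWER_W  # ~8,470 frames per Wh
fps_per_watt = FPS / POWER_W                      # ~2.35 fps per watt

print(frames_per_hour, round(frames_per_watt_hour), round(fps_per_watt, 2))
```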

If you are a curious engineer like us, comment below with questions about our demo or mention a case study that would be of interest to you — we would love to investigate it.
