Deepstab: A Real-Time Video Object Stabilization Tool Using Deep Learning

Carlos Toxtli
Published in HCI@WVU
May 25, 2018

By Carlos Toxtli, Claudia Saviaga

Introduction

Problem statement

The ability to locate, identify, track, and stabilize objects across different poses and backgrounds is important in many real-time video applications. Object detection, tracking, alignment, and stabilization have long been research areas of great interest in computer vision and pattern recognition, due to the challenging nature of closely similar objects such as faces, where algorithms must be precise enough to identify, track, and focus on one individual among many. An additional challenge is processing video captured by mobile devices, which is often shaky and undirected because these devices lack stabilization equipment. Although there are commercial hardware components that can stabilize the image, they are relatively bulky and impractical for everyday use.

A real-world example in which this technology can be used is a soccer game recorded with non-professional equipment, where a pre-trained neural network model of a soccer ball is used to identify, track, and focus on the area near the ball, avoiding undesired shaking. This novel approach enables recording devices to track specific objects in real time. Previous approaches implemented on embedded devices could only track frontal faces, because the techniques they used do not generalize to other objects.

Current approaches

Visual recognition on embedded devices poses many challenges, as models must run quickly and accurately in a resource-constrained environment, making use of limited computation, power, and space. Object recognition involves challenges of its own: face recognition, for instance, must deal with factors such as head pose, facial expression, image orientation, occlusion of the face, and illumination. Commonly used algorithms such as Viola-Jones have proved efficient at handling most of these challenges, but in recent years Convolutional Neural Network (CNN) algorithms have achieved higher performance on computer vision tasks. CNNs usually require more processing and memory, but novel techniques such as single-pass CNN detectors, model distillation, and proposal reduction have drastically decreased these requirements.

Our approach

In this article we explore a method for face detection and face stabilization that combines a single deep neural network with state-of-the-art stabilization algorithms, designed to run on embedded devices. To achieve this, we implement an approach based on OpenCV: the Single Shot MultiBox Detector (SSD) with MobileNets and L1-optimal camera paths. SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature-map location. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and the subsequent pixel or feature resampling stage, encapsulating all computation in a single network. This makes it easy to train and straightforward to integrate into systems that require a detection component.
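As an illustration of how such a detector can be run per frame with OpenCV's DNN module, here is a minimal sketch. The model file names are hypothetical placeholders for a trained MobileNet-SSD face model, not files shipped with Deepstab:

```python
import cv2
import numpy as np

# Hypothetical paths to a trained MobileNet-SSD face model; substitute
# your own weights (e.g., exported from the TensorFlow Object Detection API).
PROTOTXT = "mobilenet_ssd_face.prototxt"
WEIGHTS = "mobilenet_ssd_face.caffemodel"

net = cv2.dnn.readNetFromCaffe(PROTOTXT, WEIGHTS)

def detect_faces(frame, conf_threshold=0.5):
    """Run one SSD forward pass and return boxes as (x1, y1, x2, y2)."""
    h, w = frame.shape[:2]
    # SSD consumes a fixed-size blob; 300x300 is the standard SSD input size.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 scalefactor=0.007843, size=(300, 300),
                                 mean=(127.5, 127.5, 127.5))
    net.setInput(blob)
    detections = net.forward()  # shape: (1, 1, N, 7)
    boxes = []
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > conf_threshold:
            # Coordinates are normalized to [0, 1]; scale back to pixels.
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            boxes.append(box.astype(int))
    return boxes
```

A single forward pass returns every candidate box at once, which is what makes SSD attractive for the per-frame detection loop described below.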

We also propose the use of a model architecture called MobileNets, based on depth-wise separable convolutions and suitable for embedded devices. MobileNets uses a streamlined architecture of depth-wise separable convolutions to build lightweight deep neural networks. For face stabilization we generate stabilized videos by removing undesired motion using L1-optimal camera paths. This algorithm computes camera paths composed of constant, linear, and parabolic segments, mimicking the camera motions employed by professional cinematographers. To implement the platform we adapted and trained existing MobileNet-SSD face models and integrated existing L1 stabilization algorithms. To evaluate the proposed approach we used the WIDER FACE and Open Images datasets.
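To make the MobileNets building block concrete, the sketch below contrasts a standard convolution with its depth-wise separable factorization in Keras. This is an illustrative example, not the exact Deepstab training code; for a 3x3 kernel the factorized form needs roughly 8 to 9 times fewer multiply-adds:

```python
import tensorflow as tf
from tensorflow.keras import layers

# A standard 3x3 convolution mixes spatial and channel information at once.
standard = layers.Conv2D(64, kernel_size=3, padding="same")

# MobileNets factorize it into a depthwise step (one 3x3 filter per input
# channel) followed by a pointwise 1x1 convolution that mixes channels.
depthwise_separable = tf.keras.Sequential([
    layers.DepthwiseConv2D(kernel_size=3, padding="same"),
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.Conv2D(64, kernel_size=1),  # pointwise channel mixing
    layers.BatchNormalization(),
    layers.ReLU(),
])
```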

Evaluation

Factors to evaluate

In order to understand the implications for designing an accurate real-time object stabilization, we evaluated the following factors and set of technologies that might affect the final output:

  • Image dataset: WIDER FACE and Open Images (only a subset)
  • Neural network architecture: MobileNet and Inception
  • Stabilization algorithm: feature-description-based (SIFT, SURF) and non-feature-description-based (crop, resize)

Datasets

We scraped images, along with their metadata, from WIDER FACE and Open Images (Human Face category). We also used our own images, collected from our Facebook profiles. Each model was trained with approximately 5,000 images.

WiderFace Dataset
OpenImages Dataset

Models training

The first step was to train the model. For that purpose we used images from the datasets above as well as images manually hand-labeled with LabelImg (the annotations are saved as XML files in the PASCAL VOC format). The image metadata was transformed to PASCAL VOC XML. Then we used a script that TensorFlow provides in its Object Detection API to generate the TFRecord files required to feed the neural networks. We obtained the following metrics:
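For reference, here is a simplified sketch of what that conversion step produces, using the TensorFlow 2.x API. The real Object Detection API script also writes class labels and several other fields:

```python
import tensorflow as tf

def make_tf_example(image_path, width, height, boxes):
    """Build one tf.train.Example in the general format the TensorFlow
    Object Detection API expects. `boxes` holds normalized
    (xmin, ymin, xmax, ymax) tuples parsed from PASCAL VOC XML."""
    with tf.io.gfile.GFile(image_path, "rb") as f:
        encoded_jpg = f.read()

    feature = {
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[encoded_jpg])),
        "image/format": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[b"jpeg"])),
        "image/width": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[width])),
        "image/height": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[height])),
        "image/object/bbox/xmin": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[0] for b in boxes])),
        "image/object/bbox/ymin": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[1] for b in boxes])),
        "image/object/bbox/xmax": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[2] for b in boxes])),
        "image/object/bbox/ymax": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[3] for b in boxes])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
```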

Real-time implications

Real-time video is perceived as fluid depending on the number of frames shown per second (FPS). The Blu-ray standard uses 60 FPS to enhance the experience of high-quality video, and YouTube uses up to 30 FPS for HD content, but the human eye only requires about 20 FPS to perceive motion as fluid, which can be achieved with as few as 10 unique frames per second, each shown twice. To produce at least 10 frames per second, we can take at most 100 milliseconds per frame (1,000 ms / 10 frames) to acquire, process, and display the image.

In our tests on a standard MacBook Pro, capturing an HD frame from the webcam and displaying it took 30 milliseconds, which means that capture and display alone already run at about 33 FPS. This leaves a processing budget of at most 70 milliseconds per frame (100 ms − 30 ms).
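As a minimal sketch of how this budget can be checked in practice (an illustrative example, not the Deepstab code itself), the per-frame time of an OpenCV capture-process-display loop can be measured like this:

```python
import time
import cv2

cap = cv2.VideoCapture(0)  # default webcam

while True:
    start = time.perf_counter()
    ok, frame = cap.read()
    if not ok:
        break
    # ... detection + stabilization would go here, within the ~70 ms budget ...
    cv2.imshow("frame", frame)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"frame time: {elapsed_ms:.1f} ms")
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```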

The following chart shows how long, in milliseconds, each model took to detect objects in a frame:

Stabilization

There are different techniques to stabilize videos. Some are feature-based, like L1, and use feature-extraction algorithms such as SIFT or SURF; non-feature-extraction techniques, on the other hand, typically focus on centering an area of interest through edit actions such as crop and resize.

Example of SIFT stabilizer

Example of Crop & Resize

Example of only crop
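To make the non-feature-based approach above concrete, here is a minimal sketch of a crop-based stabilizer. It smooths the detected bounding-box center with a moving average (a simple stand-in for the L1-optimal path, which fits constant, linear, and parabolic segments) and crops a fixed window around it. The class name and window sizes are illustrative, not taken from the Deepstab code:

```python
from collections import deque
import numpy as np

class CropStabilizer:
    """Non-feature-based stabilization: keep the detected object centered
    by cropping a fixed window around a smoothed box center."""

    def __init__(self, out_size=(480, 360), history=15):
        self.out_w, self.out_h = out_size
        self.centers = deque(maxlen=history)  # moving-average window

    def stabilize(self, frame, box):
        """Assumes the frame is larger than the output crop size."""
        x1, y1, x2, y2 = box
        self.centers.append(((x1 + x2) / 2, (y1 + y2) / 2))
        cx, cy = np.mean(self.centers, axis=0)  # smoothed center

        h, w = frame.shape[:2]
        # Clamp the crop window so it stays inside the frame.
        left = int(min(max(cx - self.out_w / 2, 0), w - self.out_w))
        top = int(min(max(cy - self.out_h / 2, 0), h - self.out_h))
        return frame[top:top + self.out_h, left:left + self.out_w]
```

Because the crop only translates a window, the output shakes exactly as much as the smoothed center does, which is why the quality of the per-frame detections matters so much.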

We also evaluated tracking algorithms, which avoid running the object detector on every frame. We used KCF, a popular tracking algorithm that outperformed the other trackers we tried in terms of accuracy. Even so, we decided not to focus on tracking algorithms in this work because their accuracy was still too low for stabilization.
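For reference, this is roughly how KCF is driven from OpenCV (a minimal sketch; the video path and initial box are placeholders, and the tracker requires the opencv-contrib build):

```python
import cv2

# KCF tracker from opencv-contrib-python.
tracker = cv2.TrackerKCF_create()

cap = cv2.VideoCapture("input.mp4")  # placeholder input video
ok, frame = cap.read()
# `initial_box` (x, y, w, h) would come from one SSD detection pass.
initial_box = (200, 150, 120, 120)
tracker.init(frame, initial_box)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)  # cheaper than re-running the detector
    if found:
        x, y, w, h = map(int, box)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```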

These are the results of the time in milliseconds that each approach took:

Overall results

After comparing the different approaches and measuring the total time against the minimum required for real-time performance, we obtained the following results:

Non-feature-based stabilization + SSD + MobileNet was the best approach to achieve real-time stabilization.

Considerations for design

Only the MobileNet approach was suitable for real-time image processing. It is important to mention that even though tracking algorithms are faster than applying object detection to each frame, their precision was too poor for stabilizing a face. Tracking algorithms such as KCF sometimes extended the detected area to the head or the shoulders, causing a shaking effect in the "stabilized" image. What we observe from the results is that machine learning frameworks, together with training images representative of webcam footage, should be used when training the model.

Conclusions

An efficient algorithm alone is not enough to achieve real-time face (or general object) stabilization; correctly annotated data is also mandatory to obtain good accuracy and stabilize the sequences properly. Even though tracker algorithms are time-effective, their accuracy at tracking a face is low, since the head and shoulders are often included in the stabilized area, producing a video that is not truly centered on the stabilized object. It is better to perform object detection on each frame.

The only combination of techniques that performed accurately in real time was MobileNet + WIDER FACE + a non-feature-extraction stabilization algorithm.

Code

You can find the code for Deepstab in this repository: https://github.com/toxtli/deepstab

You can find the training code in this repository: https://github.com/saviaga/TensorFlow-Face_Detector
