Off-line recognition with small battery power? Sure!
Authors: Xperience.ai team | Anastasiya Reshetova, Maxim Zemlyanikin, Alexander Smorkalov
There are a lot of Deep Neural Network (DNN) models for Computer Vision. These models have made great progress in terms of quality, performance, and range of tasks over the years. But these models are often not ready to be deployed on embedded devices.
In the past several years, the research community has turned its attention to this problem: papers and challenges devoted to energy-efficient networks started to emerge as low-power hardware devices appeared on the market. People have also become more interested in on-device solutions than in cloud applications, mostly because of privacy concerns.
Xperience.ai was equipped with both the scientific stack and the actual hardware. We developed a face recognition model and ported it to a battery-powered board with a live QVGA camera and an on-device preview. Importantly, all data is stored on the device and never uploaded anywhere externally.
We used the battery-powered Gapoc A board with the GAP8 SoC, developed by Greenwaves Technologies, a fabless semiconductor company specializing in AI solutions for image, sound, and vibration processing in sensor devices. We collaborated with them to build computer vision algorithms. This blog post walks through the detailed steps we took to build this on-device solution using PyTorch.
Idea of the project
The idea was to create a smart doorbell that could be used in apartment buildings as an entrance controller. Every time somebody wanted to enter your apartment, you would receive a notification. Family members and close friends could be added to a small database stored on the device itself, not uploaded to any kind of cloud. The smart doorbell could then check whether a new guest is one of them and let the person in.
You could also get a notification when, for example, your children enter the home and not worry about them, or use the smart doorbell to check whether somebody is trying to enter your house while you are out.
We worked with the Greenwaves Technologies GAP8 SoC. It provides a general-purpose RISC-V computing unit (FC) on the same die as a cluster of 8 RISC-V cores with an extended RISC-V instruction set architecture.
The chip includes 512 KiB of shared L2 memory accessible to all cores, 64 KiB of L1 tightly coupled data memory for the compute cluster, and 16 KiB of L1 memory owned by the FC. GAP8 can also use external L3 RAM and flash memory, but these cannot be directly addressed: all operands must be loaded into L2 first.
Application pipeline overview
Face detection and recognition pipeline consists of four stages:
- frame capture
- face detection
- face recognition
- user interaction with external event activation
The application starts by capturing a frame from the camera, then detects faces, and then recognizes them. The re-identification (ReID) network assigns a descriptor to each face. This descriptor is compared, using L2 distance, with a database of precomputed descriptors of known users. If it is close to one of them, the person is recognized; if not, the descriptor goes to the strangers’ list. The application also provides functionality to add known users to the trusted list using face descriptors, and strangers can be added to the users list later if needed. The strangers’ list is deduplicated using the same L2 descriptor comparison to save device memory.
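The matching and deduplication logic can be sketched in plain Python. The `identify` helper, the list-based database, and the threshold value here are illustrative assumptions rather than the actual firmware code:

```python
import math

THRESHOLD = 1.0  # assumed L2-distance cutoff, tuned on a validation set

def l2_distance(a, b):
    """Euclidean (L2) distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(descriptor, users, strangers, threshold=THRESHOLD):
    """Return the name of the matched user, or None for a stranger.
    Unknown descriptors are deduplicated into the strangers' list."""
    for name, ref in users.items():
        if l2_distance(descriptor, ref) < threshold:
            return name
    # Not a known user: keep it only if no near-duplicate is stored yet.
    if all(l2_distance(descriptor, s) >= threshold for s in strangers):
        strangers.append(descriptor)
    return None
```

The same `l2_distance` comparison serves both recognition against the trusted list and deduplication of the strangers’ list.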
Face Detection Model
The GAP8 SDK already includes a demo with a cascade-based detector using Haar features, derived from the OpenCV implementation. The existing face detector provides a very good trade-off between performance and quality in our case, so we decided to re-use it in the solution.
Face Recognition Model
There are a few known small-sized architectures that may be suitable for our task in terms of the number of parameters and flops:
- MobileNet (V1, V2)
- SqueezeNet (1.0, 1.1)
- ShuffleNet (V1, V2)
SqueezeNet 1.1 has the fewest parameters and has no residual connections, which strongly affect memory consumption. Also, depthwise convolutions were not implemented on our target device at the time. That’s why we decided to use SqueezeNet 1.1.
We tried different loss functions for face recognition training: standard cross-entropy loss (XE), CosFace, and Lifted Structured loss (LS). We also tried several distance metrics for validation: L1, L2, cosine distance, and L2 with embedding normalization (L2norm). Here are the results of our experiments:
Although the network trained with CosFace loss showed the best quality with cosine distance and L2norm, we decided to use the XE + LS loss function and L2 distance for evaluation. The difference in quality is not that big, and L2 distance is much simpler to implement, especially when it comes to fixed-point computations.
We also tried adding Batch Normalization to the convolutional layers of the network, and since it showed significantly better results, we used it further in our development.
It’s also important to note that Batch Normalization improves quality without affecting inference time, as it is merged into the convolution weights.
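The folding itself is straightforward. Below is a minimal sketch, assuming per-channel BatchNorm parameters and convolution weights flattened to per-channel lists (an illustration of the standard fold, not our actual training code):

```python
import math

def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BatchNorm parameters into convolution weights.

    weight: list of per-output-channel weight lists; bias: per-channel
    conv biases; gamma/beta/mean/var: BatchNorm affine parameters and
    running statistics. Returns (fused_weight, fused_bias).
    """
    fused_w, fused_b = [], []
    for c in range(len(weight)):
        scale = gamma[c] / math.sqrt(var[c] + eps)
        fused_w.append([w * scale for w in weight[c]])
        fused_b.append((bias[c] - mean[c]) * scale + beta[c])
    return fused_w, fused_b
```

After fusing, the network performs exactly the same number of multiply-accumulates as a plain convolution, which is why BatchNorm is free at inference time.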
The results above were measured on aligned faces: a preprocessing step shifted and rotated each face so that the landmark coordinates (eyes, nose, mouth corners) were close to predefined positions. This is common practice both in the research community and in production, but we didn’t have enough resources for an additional landmark detection step. That’s why we ran an experiment to figure out how face alignment affects face recognition quality.
We ran a lightweight face detector from the OpenCV DNN module on the LFW dataset and used the predicted bounding boxes as inputs for the face recognition neural network. The result was 0.9707, which is not as good as the result with aligned faces, so we should consider integrating face alignment into the pipeline in future work.
All weights and activations in the neural network are floating-point numbers, but operations on them are too resource-consuming. The convolution operation in the GAP8 SDK uses 16-bit or 8-bit integer values. It convolves 16-bit signed integer input with the weights, stores the intermediate result in a 32-bit accumulator, shifts it to the right by `norm` bits, saves the least significant 16 bits, and finally adds the 16-bit integer convolution bias to the result.
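As an illustration of this arithmetic, here is the accumulate-shift-bias sequence on a 1-D dot product (a simplified stand-in for the actual SDK kernel; the final truncation to 16 bits is omitted for clarity):

```python
def fixed_point_conv(inputs, weights, bias, norm):
    """Emulate the 16-bit fixed-point convolution step: multiply-
    accumulate into a 32-bit accumulator, arithmetic right shift by
    `norm` bits, then add the 16-bit integer bias."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc += x * w       # 16-bit x 16-bit products, 32-bit accumulator
    acc >>= norm           # drop the `norm` least significant bits
    return acc + bias
```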
To convert floating-point numbers to fixed-point numbers, we need quantization. Common quantization tools were not convenient for our project for several reasons:
- they use a scale parameter to represent tensors
- they handle concatenation differently than our implementation
- they quantize average pooling layers differently
To avoid these issues, we proposed an alternative quantization scheme developed with PyTorch. Unfortunately, PyTorch’s quantization feature had not been implemented at the time; now this step could be done in an easier way. Still, our implementation is quite similar in logic to the PyTorch one, so we were moving in the same direction. We quantized the neural network to both 8 bits and 16 bits and found that the 8-bit version has a huge quality drop, so we decided to use 16-bit quantization in our project. It consists of the following steps:
- Step 1: Estimate the number of bits for the integer part of the input, output, and weights of every convolution.
- Step 2: Calculate the number of bits left for the fraction part of each value:
- Step 3: Quantize convolution weights and biases into fixed-point values:
- Step 4: Choose the norm parameter for the convolution, i.e. the number of least significant bits of the accumulator to be dropped:
It is also possible that the 32-bit accumulator of the output activation overflows. To avoid this situation, we estimated the number of integer bits for the accumulator, ACCint, from the data and checked that:
Otherwise, the fraction part should be reduced:
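The four steps can be sketched as follows. This is a simplified per-tensor scheme with assumed helper names; the real pipeline works layer by layer on PyTorch tensors:

```python
import math

def int_bits(max_abs):
    """Step 1: bits needed for the integer part of values whose
    magnitude is at most max_abs."""
    return max(0, math.ceil(math.log2(max_abs))) if max_abs > 0 else 0

def frac_bits(max_abs, total_bits=16):
    """Step 2: bits left for the fraction part (one bit is the sign)."""
    return total_bits - 1 - int_bits(max_abs)

def quantize(values, frac):
    """Step 3: convert floats to fixed-point integers carrying
    `frac` fraction bits."""
    return [int(round(v * (1 << frac))) for v in values]

def conv_norm(frac_in, frac_w, frac_out):
    """Step 4: the accumulator carries frac_in + frac_w fraction bits;
    shifting right by `norm` bits leaves frac_out of them."""
    return frac_in + frac_w - frac_out
```

For example, activations bounded by 6.0 need 3 integer bits, leaving 12 fraction bits in a signed 16-bit value; if the weights carry 14 fraction bits and the output should carry 12, the convolution `norm` is 14.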
Porting to Target Platform
The GAP8 SDK provides a set of optimized CNN kernels for popular layers, which allows working with CNN applications at the per-layer level. The Autotiler library provides all the compute operations we needed (3x3 and 1x1 convolutions, average and max pooling, ReLU, and concatenation) and also makes it possible to fuse blocks of convolution, ReLU, and pooling for a smaller memory footprint and better performance.
The memory footprint of our model didn’t fit into RAM, so we had to use external (L3) memory as well. At the same time, per-layer inference with the GAP8 SDK requires the layer input, activations, weights, and biases to be stored in L2 during layer inference. With special buffer allocation logic, we were able to re-use the previous layer’s output in RAM (L2) without extra transfers.
There are two major cases of memory re-use during inference:
- A sequence of convolutional layers, where the output of the i-th layer is used as the input of the (i+1)-th layer:
- SqueezeNet Fire module consisting of squeeze convolution and two expand convolutions:
Another way to minimize memory consumption is to re-use memory on a pipeline level:
- We can re-use the original camera buffer for the CNN part as soon as the face area is extracted
- We can discard the intermediate data used by the face cascade before CNN inference and re-use that memory as well
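To illustrate the first re-use case, a ping-pong buffer plan for a chain of layers can be sketched like this (a toy model of the idea; the actual Autotiler allocation is more elaborate):

```python
def plan_buffers(tensor_sizes):
    """Ping-pong buffer plan for a chain of layers: tensor i lives in
    buffer A if i is even and in buffer B if i is odd, so each layer
    reads from one buffer and writes into the other.  Peak L2 usage is
    the largest co-resident pair, not the sum of all activations."""
    buf_a = max(tensor_sizes[0::2])  # even-indexed tensors share buffer A
    buf_b = max(tensor_sizes[1::2])  # odd-indexed tensors share buffer B
    return buf_a + buf_b
```

For activation sizes `[100, 50, 80, 20]` this plan needs 150 units of L2 instead of the 250 a naive one-buffer-per-tensor allocation would take.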
After all these memory optimizations, we were finally able to squeeze the entire application pipeline into 512 KiB of L2 memory and run it on the target battery-powered device in less than 1 second. This was achieved by using a small network architecture, with less than 1.5 MB of weights, and a novel quantization pipeline. At the same time, 16-bit quantization (instead of 8-bit) made it possible to achieve sufficiently high quality: 0.9707 accuracy on the LFW dataset.
As a result of this research, we got a solution that fits the production requirements for both recognition quality and inference speed.