Binary Neural Networks, part 2

Fatou Kiné SOW
Mar 22, 2022

In the first part of this article (see here), we explained what a binary neural network is and how it can be trained and deployed efficiently on low-power devices to save storage and computing time.

This second part presents the implementation of BNN inference on a Jetson Nano using Larq, an open-source Python library for building, training and deploying binary neural networks, enabling efficient inference on mobile devices.

In a few words, the method consists of training the network on a classical GPU machine (recall that the gradient computation is performed with real-valued weights anyway), then converting the model to the TFLite format. A dedicated kernel then exploits binary operators to run this TFLite model efficiently. This part details the procedure with code samples and concludes with experimental results showing the benefits of a binary AlexNet.

As the kernel exploits low-level optimization techniques, it was developed in C++.
No panic if you are not comfortable with this language ☺️: this GitHub repository lets you call the kernel from a Python script while keeping its performance.

1. Larq

Larq allows the training of neural networks with very low precision weights and activations, such as binarized neural networks (BNN). The Larq API is built on top of tf.keras and is designed to provide an easy way to design, train and use 1-bit BNNs and other types of quantized neural networks (QNN).

It provides tools specifically designed to aid in BNN development, such as specialized optimizers, training metrics, and profiling tools.
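
To make this concrete, here is a minimal sketch of what a binarized model definition looks like with Larq (the tiny architecture below is purely illustrative; it only assumes the larq and tensorflow packages are installed):

import tensorflow as tf
import larq as lq

# Standard Larq setup: binarize inputs and weights with the straight-through
# estimator sign function, and clip the latent real-valued weights to [-1, 1].
kwargs = dict(input_quantizer="ste_sign",
              kernel_quantizer="ste_sign",
              kernel_constraint="weight_clip")

model = tf.keras.models.Sequential([
    # The first layer keeps real-valued inputs, so only its kernel is binarized
    lq.layers.QuantConv2D(32, 3, kernel_quantizer="ste_sign",
                          kernel_constraint="weight_clip",
                          use_bias=False, input_shape=(32, 32, 3)),
    tf.keras.layers.BatchNormalization(scale=False),
    tf.keras.layers.MaxPooling2D(2),
    lq.layers.QuantConv2D(64, 3, use_bias=False, **kwargs),
    tf.keras.layers.BatchNormalization(scale=False),
    tf.keras.layers.Flatten(),
    lq.layers.QuantDense(10, **kwargs),
    tf.keras.layers.Activation("softmax"),
])

# Training works like any other tf.keras model: gradients flow through the
# real-valued latent weights, only the forward pass uses binary values.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Larq's profiling summary: per-layer bit-widths, parameter counts and memory
lq.models.summary(model)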

Note that efficient inference using a trained BNN requires an optimized inference engine. To this end, Larq provides Larq Compute Engine (LCE), a highly optimized inference engine for deploying binary neural networks on several platforms.

2. Larq Compute Engine

LCE provides a collection of hand-optimized TensorFlow Lite custom operators for supported instruction sets, developed in inline assembly or in C++ using compiler intrinsics.

It leverages optimization techniques such as tiling to maximize the number of cache hits, vectorization to maximize computational throughput, and multi-threading to take advantage of modern multi-core desktop and mobile CPUs.

The following figure illustrates the LCE workflow, from training to deployment, built on top of the TensorFlow software stack.

Figure 1: Larq Compute Engine workflow, built on top of the TensorFlow software stack (orange), from training (top) to deployment (bottom). Source: https://arxiv.org/pdf/2011.09398.pdf

Learn more about Larq Compute Engine.

Larq also has a larq zoo module that provides pretrained models with extremely low precision weights and activations (binary in the case of BNNs). This module consists of two sub-modules: zoo.literature, which contains reference models intended to provide stable baselines for scientific articles, and zoo.sota, which contains the best-performing models.
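
For instance, loading a pretrained model from either sub-module is a one-liner (a minimal sketch assuming the larq-zoo package is installed; QuickNet is used here purely as an illustrative member of zoo.sota):

import larq_zoo as lqz

# Reference model from the literature, with binary weights and activations
binary_alexnet = lqz.literature.BinaryAlexNet(weights="imagenet")

# A model from the sota sub-module, tuned for the best accuracy/efficiency trade-off
quicknet = lqz.sota.QuickNet(weights="imagenet")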

3. Deploying a BNN with LCE

Next, we will perform inference with the BinaryAlexNet model, a classical AlexNet with binary weights and activations. The chosen model comes with weights pretrained on the ImageNet dataset.

BinaryAlexNet model summary:

This model is loaded from the literature sub-module of larq zoo.
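
A summary like the one above can be reproduced with Larq's model profiling tool (a minimal sketch assuming the larq and larq-zoo packages; lq.models.summary reports per-layer quantization, parameter counts and memory footprint):

import larq as lq
import larq_zoo as lqz

# BinaryAlexNet with weights pretrained on ImageNet, from zoo.literature
model = lqz.literature.BinaryAlexNet(weights="imagenet")

# Print the per-layer summary (bit-widths, parameters, memory)
lq.models.summary(model)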

After choosing the model, we convert it to TensorFlow Lite. Indeed, LCE is built on top of TensorFlow Lite and uses the TensorFlow Lite FlatBuffer format to convert and serialize Larq models for inference. Larq provides an LCE converter with additional optimizations to increase the execution speed of Larq models on the supported target platforms.

Note that the LCE converter currently only supports x86 systems (see the list of wheels here). It is therefore recommended to install the converter on an x86 host machine and convert the model there. The resulting inference artifact (the .tflite model), optimized for use on mobile platforms, can then be run on our ARM architecture.
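
The conversion itself is a few lines of Python on the x86 host (a minimal sketch assuming the larq-compute-engine pip package is installed there; the output filename is arbitrary and simply matches the one used later in the inference command):

import larq_compute_engine as lce
import larq_zoo as lqz

# Load the trained Larq model, here BinaryAlexNet pretrained on ImageNet
model = lqz.literature.BinaryAlexNet(weights="imagenet")

# Convert to a TensorFlow Lite FlatBuffer with LCE's binary-specific optimizations
tflite_bytes = lce.convert_keras_model(model)

# Serialize the FlatBuffer; this is the file the ARM target will run
with open("BinaryAlexNet.tflite", "wb") as f:
    f.write(tflite_bytes)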

To perform inference with Larq compute engine, we will use an LCE-compatible TensorFlow Lite interpreter that uses custom LCE operators instead of built-in TensorFlow Lite operators for each applicable subgraph of the model.

We will first clone the Larq Compute Engine repository (GitHub: larq compute engine) and then use the LCE C++ API to create this interpreter and perform inference with the Larq model converted to TensorFlow Lite.

We will list below the different steps to follow.

1. Include the required headers:

#include "larq_compute_engine/tflite/kernels/lce_ops_register.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/optional_debug_tools.h"
#include "opencv2/opencv.hpp"

2. Load the TFLite model:

// Load model
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(filename);
TFLITE_MINIMAL_CHECK(model != nullptr);

3. Build the BuiltinOpResolver and register the LCE operators:

// Create a built-in OpResolver
tflite::ops::builtin::BuiltinOpResolver resolver;
// Register the LCE custom ops
compute_engine::tflite::RegisterLCECustomOps(&resolver);

4. Build an Interpreter with the custom OpResolver and allocate its tensors:

// Build the interpreter
tflite::InterpreterBuilder builder(*model, resolver);
std::unique_ptr<tflite::Interpreter> interpreter;
builder(&interpreter);
TFLITE_MINIMAL_CHECK(interpreter != nullptr);
// Allocate the input and output tensors before accessing them
TFLITE_MINIMAL_CHECK(interpreter->AllocateTensors() == kTfLiteOk);

5. Set the input tensor values:

We load the input image with the OpenCV library, then normalize it like the training images and resize it to 224 × 224 pixels.

// Read the image file
cv::Mat img = cv::imread(image, cv::IMREAD_COLOR);
// Convert to float; BGR -> RGB
cv::Mat inputImg;
img.convertTo(inputImg, CV_32FC3);
cv::cvtColor(inputImg, inputImg, cv::COLOR_BGR2RGB);
// Normalize pixel values to [-1, 1]
// ('Pixel' and 'normalize' are small helpers defined in the full example source)
Pixel* pixel = inputImg.ptr<Pixel>(0, 0);
const Pixel* endPixel = pixel + inputImg.cols * inputImg.rows;
for (; pixel != endPixel; pixel++)
  normalize(*pixel);
// Resize the image to the model input size (WIDTH x HEIGHT)
cv::resize(inputImg, inputImg, cv::Size(WIDTH, HEIGHT));

We then retrieve a pointer to the input tensor and copy the image into the allocated buffer.

// Fill the input tensor
float* inputLayer = interpreter->typed_input_tensor<float>(0);
// Flatten the RGB image into the input layer
// (WIDTH, HEIGHT and CHANNEL match the model input: 224 x 224 x 3)
float* inputImg_ptr = inputImg.ptr<float>(0);
memcpy(inputLayer, inputImg_ptr, WIDTH * HEIGHT * CHANNEL * sizeof(float));

6. Invoke inference:

Once the input tensor has been filled, inference is run with the interpreter as follows:

// Run inference 
interpreter->Invoke();

7. Read the inference results:

// Read the output buffer
// (for BinaryAlexNet this holds the scores of the 1000 ImageNet classes)
float* outputLayer = interpreter->typed_output_tensor<float>(0);

These steps follow the lce_minimal.cc example of the Larq Compute Engine repository.

Next, we compile LCE natively on the ARM system with the provided Makefile to create the LCE inference binaries.

We first need to install the build toolchain on the target device (see the LCE documentation for installing the toolchain) and then execute the following line to compile LCE.

$ larq_compute_engine/tflite/build_make/build_lce.sh --native

The resulting compiled files will be stored in gen/<TARGET>/, where <TARGET> corresponds to the architecture of the target platform and can be linux_x86_64, rpi_armv7l or linux_aarch64.

Finally, to perform the inference on an image, just execute the following command line:

$ ../compute-engine/gen/linux_aarch64/lce_minimal BinaryAlexNet.tflite testImage.jpg

The full code is available in this GitHub repo.

4. BinaryAlexNet vs AlexNet pretrained on the ImageNet dataset

To compare BinaryAlexNet and AlexNet, inference is performed on a test dataset of 1,000 ImageNet images (10 images per class), downloadable here. These experiments were run on an ARM64 architecture (nvidia-jetpack 4.6-b197).

Figures 2 and 3 show, respectively, the average inference time and the accuracy per class for the 25 classes with the best accuracies.

The inference time in Figure 2 is the per-class average of the inference time over each image.

Figure 2: Inference time per class with AlexNet and BinaryAlexNet
Figure 3: Accuracy per class with AlexNet and BinaryAlexNet

Figure 2 shows that BinaryAlexNet compiled with LCE is more than 10 times faster than AlexNet on the Jetson, with approximately equal average accuracies (Figure 3).

It should also be noted that we naively binarized AlexNet. As shown in [6], better accuracy can be obtained with dedicated binary architectures.

This confirms the theoretical gains that can be obtained on embedded devices such as ARM systems.

For those who want to work with Python, you can clone our GitHub repository (https://github.com/GreenAI-Uppa/larq-compute-engine-from-python) and follow the steps to perform BNN inference directly from a Python script while benefiting from the performance provided by the LCE API. A Python module wrapping the C++ code presented in section 3, allowing inference on the Nvidia Jetson Nano, is available in this repository. The repository also shows how to build your own Python module from this C++ code if you use another ARM system such as a Raspberry Pi.

References

  1. BinaryConnect: https://arxiv.org/pdf/1511.00363.pdf
  2. XNOR-Net: https://arxiv.org/pdf/1603.05279.pdf
  3. Larq: https://docs.larq.dev/larq/
  4. daBNN: https://arxiv.org/pdf/1908.05858v1.pdf
  5. BinaryNet: https://arxiv.org/pdf/1602.02830.pdf
  6. Larq: An Open-Source Library for Training Binarized Neural Networks: https://joss.theoj.org/papers/10.21105/joss.01746.pdf
