How to Implement a Neural Network with Only +1 and -1?

Çağrı Aydoğdu · Published in Analytics Vidhya · Apr 16, 2020

Deep Neural Networks have been a great tool over the last decade in many areas, from image recognition and speech recognition to machine translation and games like AlphaGo. They are quite powerful thanks to millions of parameters (weights) and their immense learning capacity. According to the Universal Approximation Theorem, a neural network with a single hidden layer and a finite number of neurons can approximate continuous functions on compact subsets of R^n [1]. In other words, given sufficient and appropriate parameters, it can represent a wide variety of functions. For this reason, neural networks are called "universal approximators".
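For reference, a standard single-hidden-layer statement of the theorem (our paraphrase; see [1] for a precise treatment) reads: for every continuous function f on a compact set K ⊂ R^n and every ε > 0, there exist a width N and parameters v_i, b_i ∈ R and w_i ∈ R^n such that

$$\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon,$$

where σ is a fixed non-constant, bounded, continuous activation function.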

Neural Network Structure

Although they are quite powerful and often the preferred tool for challenging problems, they come at a huge cost. In the last decade, many models were built for the ImageNet classification challenge, each aiming to outperform its predecessor in accuracy. If we compare AlexNet (2012) with ResNet (2015), we see a huge reduction in error: AlexNet's 16% error dropped to 3.5% with ResNet. However, the cost comes into play when we compare the model structures: while AlexNet consists of only 8 layers, ResNet has 152! Good performance isn't free; on the contrary, it is quite expensive.

As mentioned above, because Deep Neural Networks consist of multiple layers and millions of parameters, several new challenges arise:

1- Speed

As the number of parameters grows, more time is spent on DRAM accesses and addition/multiplication operations. For instance, while ResNet-18 took 2.5 days to train, ResNet-152 took 1.5 weeks.

2- Memory Space

As expected, using millions of parameters leads to extreme memory consumption. With the 60M parameters of ResNet-152, the whole model takes up more than 200 megabytes, assuming each parameter is a 32-bit floating-point number. 200 megabytes might not be a problem for a desktop computer, but it is a big problem for small devices like mobile phones: an application built on such a Deep Neural Network (DNN) model requires hundreds of megabytes, which also makes it challenging to distribute such large models through internet updates.
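A quick back-of-the-envelope check of that figure (our own arithmetic, with the parameter count rounded to 60M):

# Approximate size of ResNet-152: 60M parameters at 4 bytes (32 bits) each
params = 60_000_000
size_mb = params * 4 / 1e6  # bytes -> megabytes
print(f"{size_mb:.0f} MB")  # ~240 MB, i.e. "more than 200 megabytes"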

3- Energy Usage

As with the memory issue, it is difficult to run such applications on low-power devices, since they drain the battery. The more bits used to represent the parameters, the more energy each operation consumes. The table below compares 8-bit integer and 32-bit floating-point operations in terms of energy:

+-----------------------+----------------+----------+
| Data Type             | Multiplication | Addition |
+-----------------------+----------------+----------+
| 8-bit integer         | 0.2 pJ         | 0.03 pJ  |
+-----------------------+----------------+----------+
| 32-bit floating point | 3.7 pJ         | 0.9 pJ   |
+-----------------------+----------------+----------+

As we can observe, using 32-bit floating point instead of 8-bit integers increases the energy required by roughly 18 times for multiplication (3.7 pJ vs. 0.2 pJ) and 30 times for addition (0.9 pJ vs. 0.03 pJ)! So, if we can find a way to represent our weights with smaller words, we can drastically decrease the energy these operations consume.

The famous AlphaGo model was trained with 1,920 CPUs and 280 GPUs, which costs about $3,000 per game!

Solution

To cope with these hardware-related problems, many different methods have been developed. In this post, we will discuss Binarized Neural Networks (BNNs), which are trained with their weights and activations constrained to +1 and -1. Unlike conventional Neural Networks, whose thousands, or even millions, of distinct weights are represented as 32-bit floating-point numbers, each weight of a BNN can be represented by a single bit. In classical binarization the two values are 0 and 1; here, -1 and +1 are used instead, so that no information is lost by multiplying by 0. Such an implementation consumes 32x less memory.
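The 32x figure follows directly from storing one bit per weight instead of 32. A quick sketch of the packing (our own illustration, using NumPy's packbits):

import numpy as np

# 1,024 random {-1, +1} weights stored as 32-bit floats: 1024 * 4 = 4096 bytes
w = np.where(np.random.randn(1024) >= 0, 1.0, -1.0).astype(np.float32)

# The same weights packed one per bit: 1024 / 8 = 128 bytes, i.e. 32x smaller
packed = np.packbits(w > 0)
print(w.nbytes, "bytes ->", packed.nbytes, "bytes")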

Binarized Neural Network

In addition to having binary weights, the activation function is also binarized. There are two options for implementing a binary activation function: deterministic or stochastic. We can use sign(x) as a deterministic activation function:

#Implementation of a sign(x) function on a NumPy vector
import numpy as np

def sign(x):
    # Map non-negative entries to +1 and negative entries to -1, in place
    x[x >= 0] = 1
    x[x < 0] = -1
    return x

A deterministic function is far easier to implement, since it only compares the number with zero. A stochastic activation function would require generating random bits, adding to the very hardware cost we are trying to avoid.
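For completeness, the stochastic option described in [2] binarizes x to +1 with probability σ(x), where σ is the "hard sigmoid" clip((x + 1) / 2, 0, 1). A minimal sketch (the function names are ours):

import numpy as np

def hard_sigmoid(x):
    # Piecewise-linear sigmoid substitute from [2]: clip((x + 1) / 2, 0, 1)
    return np.clip((x + 1) / 2, 0, 1)

def stochastic_sign(x, rng=None):
    # Binarize each entry to +1 with probability hard_sigmoid(x), else -1
    if rng is None:
        rng = np.random.default_rng()
    return np.where(rng.random(x.shape) < hard_sigmoid(x), 1.0, -1.0)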

Besides, binarization of the parameters allows us to perform bit-wise operations: the costly 32-bit floating-point multiply-accumulate is reduced to cheap XNOR and popcount (bit-count) operations in BNNs. Thus, with only two possible weight values, BNNs no longer require millions of 32-bit DRAM accesses, which significantly reduces memory and energy consumption and increases run-time speed.
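To make this concrete, here is a minimal sketch (our own illustration, not code from the paper) of how a dot product between two {-1, +1} vectors reduces to XNOR plus popcount once the vectors are packed into bits (bit 1 standing for +1, bit 0 for -1):

import numpy as np

def encode(v):
    # Pack a {-1, +1} vector into the bits of a Python int (bit 1 means +1)
    bits = 0
    for i, x in enumerate(v):
        if x == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    # XNOR marks the positions where the two vectors agree
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    matches = bin(xnor).count("1")  # popcount
    # dot = (#matches) - (#mismatches) = 2 * matches - n
    return 2 * matches - n

# Sanity check against the ordinary floating-point dot product
rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=16)
b = rng.choice([-1, 1], size=16)
assert binary_dot(encode(a), encode(b), 16) == int(np.dot(a, b))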

What about accuracy? Indeed, we would expect a significant drop in prediction performance. However, according to the paper [2] that proposed constraining the weights to +1 and -1, nearly state-of-the-art results were obtained on well-known datasets like MNIST, CIFAR-10 and SVHN.

Briefly, BNNs pave the way to reduced memory and energy consumption, as well as faster run-time speed, through a new approach in Deep Learning.

This post was co-authored by Merve Turhan.

References

[1] Balázs Csanád Csáji (2001). Approximation with Artificial Neural Networks. Faculty of Sciences, Eötvös Loránd University, Hungary.

[2] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio (2016). "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1." arXiv:1602.02830.
