Towards GPU-accelerated image classification on low-end hardware

lnstadrum · Published in Analytics Vidhya · Apr 20, 2021 · 17 min read

Deep learning is no longer the exclusive privilege of big, powerful desktop GPUs. This is trivially true for neural net inference: mobile GPUs, for example, handle it well. However, when it comes to Raspberry Pi, its GPU is somewhat disregarded: there are few-to-no options to get it working as a hardware accelerator for inference.

Raspberry Pi 4, with its VideoCore VI graphics, is becoming an exception though: it is more powerful and is certified to support Vulkan, a recent standard for computer graphics and general-purpose computing on GPU, and we are starting to see libraries using it as a computational backend for neural net inference, such as ncnn.

But what about previous Pi models with the previous version of this GPU (VideoCore IV)? For technical reasons, a few of which are discussed below, there are no frameworks that make it possible to run inference of general-purpose neural networks on this hardware.

We challenge this in the current article by enabling GPU-accelerated inference of an image classifier on a $10 Raspberry Pi Zero W. We do this using GLSL shaders to program the GPU, and achieve a throughput otherwise unreachable on this hardware without resorting to external accelerators.

  • We first reuse some of the MobileNet, ShuffleNet and ResNeXt design patterns to build a network architecture that complies with the restrictions imposed by the use of shaders on Raspberry Pi.
  • We train the model and, thanks to the wide adoption of GLSL, deploy its inference on various hardware starting with Raspberry Pi, continuing with desktop GPUs and ending up with Android smartphones.

Trained on a challenging 120-class subset of ImageNet containing pictures of dogs and cats, our 225K-parameter model reaches 72% top-1 single-crop validation accuracy and runs on Pi Zero W with a throughput of 670 million multiply-adds per second, thanks to the use of the GPU.

Building the model

Why?

Image classification is likely the most well-researched problem in computer vision. Why, then, do we not take an existing trained classification model and deploy it on Raspberry Pi?

The actual reason why few people try to use the VideoCore IV GPU as a computing device is arguably not related to its performance (it can be much more powerful than the CPU, in particular on Pi Zero W). It is rather caused by its limited programmability: for example, it has no official OpenCL support. Let us take a look at the technicalities, as we need to reason in terms of conformance to OpenGL ES standards, with OpenGL seemingly being the only “official” way to program the GPU on older Raspberries.

OpenGL is a widely adopted computer graphics standard. Its different versions set requirements on what a compliant GPU should be capable of, and define a common interface to access GPU features regardless of the vendor, platform, etc. Today, any GPU is compliant with a decent version of OpenGL.

Namely, our Pi Zero W GPU is only OpenGL ES 2.0-conformant, while its successor VideoCore VI, as well as any modern mobile GPU, is OpenGL ES 3.1-conformant. What does this mean?

The OpenGL ES standard enables GPU programmability by defining shaders, small programs run in parallel. Version 3.1 introduces compute shaders, suitable for general-purpose computing on GPU and allowing floating-point inputs and outputs, whereas OpenGL ES 2.0 is limited to the vertex and fragment shaders common in computer graphics and only supports fixed-point textures. It is then fairly easy to implement the inference of a neural net on a per-layer basis using compute shaders (even though one would need to refine the implementation to get the maximum performance). However, things get much harder when OpenGL ES 2.0 is used.

In this work we follow the guidelines we established in a previous article on how to implement inference of a neural net using GLSL. There, we also discuss in detail the technical constraints imposed by OpenGL ES 2.0 and its Raspberry Pi implementation on what a neural net may consist of in order to be runnable on Raspberry Pi. Those boil down to the following:

  • The activation signals are stored in RGBA textures, with color channels representing feature maps. Textures can only store values in the 0…1 range using 8 bits per value, so the activation values are quantized and limited in range (see the sketch after this list).
  • There is a limit on the number of samples that can be fetched from different inputs to produce a single output value. Practically, we cannot sample more than 25 texels from a 4-channel texture, and we cannot bind more than 8 textures to a single shader. This puts limits on the 2D convolution, the main operator in convolutional neural nets.
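To give a concrete idea of the first constraint, here is a minimal NumPy sketch (our illustration, not code from the project) simulating what happens to an activation tensor when it is written to an 8-bit RGBA texture:

import numpy as np

def quantize_to_texture(x):
    """Simulate storing activations in an 8-bit fixed-point texture:
    values are clipped to [0, 1] and snapped to the nearest 1/255 step."""
    x = np.clip(x, 0.0, 1.0)            # texture storage range is 0...1
    return np.round(x * 255.0) / 255.0  # 8 bits per channel

# values outside 0...1 are lost; the rest collapse onto 256 levels
a = np.array([-0.2, 0.1234, 0.5, 1.7])
print(quantize_to_texture(a))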

None of the mainstream classification architectures comply with these constraints, so we have to design a suitable architecture on our own.

How?

Arguably, one of the nicest things about gradient descent-based optimization is that one may constrain the compute workflow in a model in an arbitrary way, and it will still learn, and hopefully will not perform too badly.

  • We design the main building block of our model using 3x3 and 1x1 (pointwise) group convolutions that require few input samples per output sample. We also use a shuffling operation to ensure the activation values are effectively shared across feature channels. The building block instances are then repeated in the model and interconnected with downsampling layers.
  • To get activation signals into the 0…1 range when storing them to textures, we apply a suitable activation function in hidden layers (bye-bye ReLU!). Also, we use a 16-bit fixed-point representation to encode a single output value in two texture channels in the linear (dense) layer.
  • We do not actually do anything special about the 8-bit quantization of the activation signals, while the right way would likely be to resort to a quantization-aware training technique. We discovered empirically that the validation accuracy does not drop much when we deploy the trained net on the target GPU.

Let us break this down.

The model overview

The model used in a 120-class problem is shaped as follows.

Stage | Layers                | Output shape (H, W, C) | #Params
------+-----------------------+------------------------+--------
1     | 5x5 conv + BN + act   | 191, 191, 32           |    2528
      | Main block            | 191, 191, 32           |    2432
      | Main block            | 191, 191, 32           |    2432
------+-----------------------+------------------------+--------
2     | 3x3 conv + BN + act   | 95, 95, 32             |    2432
      | Main block w/ shuffle | 95, 95, 32             |    2432
      | Main block w/ shuffle | 95, 95, 32             |    2432
      | Main block w/ shuffle | 95, 95, 32             |    2432
------+-----------------------+------------------------+--------
3     | 3x3 conv + BN + act   | 47, 47, 64             |    4864
      | Main block w/ shuffle | 47, 47, 64             |    4864
      | Main block w/ shuffle | 47, 47, 64             |    4864
      | Main block w/ shuffle | 47, 47, 64             |    4864
------+-----------------------+------------------------+--------
4     | 3x3 conv + BN + act   | 23, 23, 128            |    9728
      | Main block w/ shuffle | 23, 23, 128            |    9728
      | Main block w/ shuffle | 23, 23, 128            |    9728
      | Main block w/ shuffle | 23, 23, 128            |    9728
      | Main block w/ shuffle | 23, 23, 128            |    9728
------+-----------------------+------------------------+--------
5     | 3x3 conv + BN + act   | 11, 11, 192            |   14592
      | Main block w/ shuffle | 11, 11, 192            |   14592
      | Main block w/ shuffle | 11, 11, 192            |   14592
      | Main block w/ shuffle | 11, 11, 192            |   14592
      | Main block w/ shuffle | 11, 11, 192            |   14592
      | Main block w/ shuffle | 11, 11, 192            |   14592
------+-----------------------+------------------------+--------
Final | 3x3 conv + BN + act   | 5, 5, 192              |   14592
      | 3x3 conv + BN + act   | 3, 3, 192              |   14592
      | Average pooling       | 1, 1, 192              |       0
      | Dense + softmax       | 120                    |   23160
------+-----------------------+------------------------+--------
Total |                       |                        |  225112

The first five stages consist of a downsampling block followed by several repetitions of the main block detailed below. The downsampling block is a strided 3x3 convolution, followed by batch normalization and an activation function, pretty much as in the original MobileNet paper. The final stage forms the classifier head, adding a couple of convolution/batch norm/activation modules, average pooling, dense and softmax layers.

The main building block is a residual block variant featuring a separable convolution, similarly to the main building block used in the first version of MobileNet. There are a couple of important differences though.

Our model mainly consists of instances of this pattern inspired by MobileNet v1 block.

Group convolutions and channel shuffling

When designing the main block, the sampling constraint comes into play straight away: in wide blocks, i.e. those dealing with many feature maps, we cannot use regular 1x1 convolutions because they require too many input samples per output sample and thus cannot be implemented efficiently using shaders. Generally speaking, the compute backend we are designing is not pointwise convolution-friendly. For this reason, we do not use the inverted residual bottleneck block pattern that performs so well in MobileNet v2.

We go with group convolutions instead: the input tensor is sliced along the channel dimension into a given number of groups, the actual convolution is computed per group, and the results are concatenated back along the channel axis.
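As an illustration, a group convolution can be expressed with standard Keras layers by splitting, convolving and concatenating. This is a conceptual sketch only; the actual GLSL implementation works directly on packs of texture channels:

import tensorflow as tf
from tensorflow.keras import layers

def group_conv2d(x, filters, kernel_size, groups, **kwargs):
    """Group convolution built from standard Keras layers:
    slice the input along the channel axis, convolve each slice
    with its own set of filters, then concatenate the results."""
    in_channels = x.shape[-1]
    assert in_channels % groups == 0 and filters % groups == 0
    group_size = in_channels // groups
    outputs = []
    for g in range(groups):
        chunk = x[..., g * group_size:(g + 1) * group_size]
        outputs.append(layers.Conv2D(filters // groups, kernel_size, **kwargs)(chunk))
    return layers.Concatenate()(outputs)

# Example: a 32 -> 32 channel pointwise convolution computed in 8 groups of 4
inp = layers.Input((95, 95, 32))
out = group_conv2d(inp, filters=32, kernel_size=1, groups=8, padding='same')
model = tf.keras.Model(inp, out)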

Regular 2D convolution (top) in comparison with 2-group convolution (bottom). Image from this paper.

Group convolutions date back to AlexNet. Later on, the ResNeXt paper showed that while AlexNet used group convolutions “as an engineering compromise”, they exhibit a stronger representational power than regular convolutions within a residual block. Finally, the ShuffleNet paper demonstrated that grouping can be complemented with a channel shuffling operation to make sure that information is propagated across different groups in subsequent blocks of the network.

This validates the design of our building block. In particular, with at most 8 four-channel texture samplers usable simultaneously in a shader, we can sample at most 32 feature maps. Therefore, in blocks operating with more than 32 feature maps, we shuffle the feature maps in packs of 4 channels. We get this shuffling literally for free at inference time: there are no memory copies, we simply change the order in which textures are bound to samplers in the shader programs.

The shuffling pattern we use is deterministic and tries to maximize the cross-block interconnections for blocks of 32 channels. Here is an example of how 128-channel feature maps are shuffled at the 4th stage (a short sketch of such a permutation follows the diagram):

INPUT       ->> Shuffling ->>      OUTPUT
CHANNEL (128 to 128) CHANNEL
0 ┐ ───────────────┬────────────── ┌ 0
1 │ Texture │ Texture │ 1
2 │ #0 │ #0 │ 2
3 ┘ │ └ 3 Textures order table
│ for 3 iterations
4 ┐ │ ┌ 32 of 128-chan. shuffle
5 │ Texture │ Texture │ 33 ────────────────────
6 │ #1 │ #8 │ 34 In #1 #2 #3
7 ┘ │ └ 35 0 0 0 0
│ 1 8 2 16
8 ┐ │ ┌ 64 2 16 4 1
9 │ Texture │ Texture │ 65 3 24 6 17
10 │ #2 │ #16 │ 66 4 1 8 2
11 ┘ │ └ 67 5 9 10 18
│ 6 17 12 3
12 ┐ │ ┌ 96 7 25 14 19
13 │ Texture │ Texture │ 97 8 2 16 4
14 │ #3 ───────┴─────── #24 │ 98 9 10 18 20
15 ┘ accessed └ 99 10 18 20 5
in the same 11 26 22 21
16 ┐ 1x1 conv shader ┌ 4 12 3 24 6
17 │ Texture ───────┬─────── Texture│ 5 13 11 26 22
18 │ #4 │ #1 │ 6 14 19 28 7
19 ┘ │ └ 7 15 27 30 23
│ 16 4 1 8
20 ┐ │ ┌ 36 17 12 3 24
21 │ Texture │ Texture │ 37 18 20 5 9
22 │ #5 │ #9 │ 38 19 28 7 25
23 ┘ │ └ 39 20 5 9 10
│ 21 13 11 26
24 ┐ │ ┌ 68 22 21 13 11
25 │ Texture │ Texture │ 69 23 29 15 27
26 │ #6 │ #17 │ 70 24 6 17 12
27 ┘ │ └ 71 25 14 19 28
│ 26 22 21 13
28 ┐ │ ┌100 27 30 23 29
29 │ Texture │ Texture │101 28 7 25 14
30 │ #7 │ #25 │102 29 15 27 30
31 ┘ ───────────────┴────────────── └103 30 23 29 15
31 31 31 31
32 ┐ ┌ 8
33 │ Texture Texture │ 9
34 │ #8 #2 │ 10
35 ┘ └ 11
. . . . . .
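For reference, a small NumPy sketch of a pack-of-4 shuffle of this kind. It reads the 32 texture indices as a 4x8 grid column-wise, which appears to reproduce the first iteration of the table above; composing it with itself gives the subsequent iterations. This is only an illustration that the shuffle is a plain reordering, applicable at texture-binding time with no data movement:

import numpy as np

def shuffle_order(num_textures=32, rows=4):
    """Deterministic shuffle of 4-channel texture packs: view the texture
    indices as a rows x (num_textures // rows) grid and read it column-wise."""
    return np.arange(num_textures).reshape(rows, -1).T.flatten()

order = shuffle_order()
print(order[:8])         # first iteration: [ 0  8 16 24  1  9 17 25]
print(order[order][:8])  # second iteration: the permutation applied twice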

Also, we use groups of 8 channels in the downsampling convolutions and in the convolutions of the last stage (another difference with MobileNet, which uses depthwise convolutions). The rationale is to further improve information sharing across the channel dimension.

Activation function

To use textures to store activation signals, we resort to a suitable activation function “compressing” the values into the 0…1 range. We tested several options.

Some of the tested suitable activation functions having a 0…1 output range.

The simplest one is ReLU clipped to the 0…1 range. This served us well previously, but fails when used in a deep enough classification network: the model just does not learn, getting stuck at a relatively high loss value. The likely reason is that the gradient does not backpropagate well through this activation function, since its gradient is zero outside the 0…1 range.

Another option is the MobileNet v2 non-linearity, ReLU6. Its output range is 0…6, but we can easily squeeze the output into the 0…1 range before writing it out to a texture and scale it back when reading. With this activation, the model learns well, but falls short of reaching 70% validation accuracy in a less extensive experiment.

The best performing option in terms of final validation accuracy was a piecewise-linear approximation of the sigmoid. We tried a few such manually designed approximators and kept the best one, pictured above (a better way would be to parametrize it in a learnable fashion). Compared to the original sigmoid, such an approximation is very efficient to compute in both the forward and backward passes. Also, its gradient is piecewise constant over a relatively large domain, from -6.5 to 6.5, and appears “boosted” towards its boundaries compared to the original sigmoid, which cancels the vanishing gradient issue in this range.
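To make the idea concrete, here is a hypothetical piecewise-linear, sigmoid-like activation in TensorFlow. The knots and slopes below are illustrative assumptions, not the exact approximator we used (that one is shown in the figure above); only the general properties match:

import tensorflow as tf

def piecewise_linear_sigmoid(x):
    """Hypothetical piecewise-linear sigmoid-like activation:
    - output clamped to the 0...1 texture storage range,
    - gradient piecewise constant and non-zero on roughly [-6.5, 6.5],
      larger near the boundaries than the true sigmoid's gradient."""
    y = 0.25 * x + 0.5                               # central segment
    y = tf.where(x >  1.5, 0.875 + 0.025 * (x - 1.5), y)  # upper segment
    y = tf.where(x < -1.5, 0.125 + 0.025 * (x + 1.5), y)  # lower segment
    return tf.clip_by_value(y, 0.0, 1.0)             # saturates at +/-6.5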

The activation is applied every time a tensor is written out to a texture, namely after every convolution operation (with the batch normalization fused into the convolution kernel) and before the optional residual connection is added to the convolution layer output.
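Fusing batch normalization into the preceding convolution is a standard inference-time transformation; a minimal NumPy sketch (ours) of how the kernel and bias can be folded:

import numpy as np

def fuse_batch_norm(kernel, bias, gamma, beta, mean, var, eps=1e-3):
    """Fold a BatchNormalization layer (gamma, beta, moving mean/variance)
    into the preceding convolution, leaving a single kernel/bias pair.
    kernel: (kh, kw, in_channels, out_channels), bias: (out_channels,)"""
    scale = gamma / np.sqrt(var + eps)         # per-output-channel scale
    fused_kernel = kernel * scale              # broadcasts over the last axis
    fused_bias = (bias - mean) * scale + beta  # shift absorbed into the bias
    return fused_kernel, fused_bias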

Dense (linear) layer implementation

Our classifier ends with a linear layer. Very common.

At inference time, a dense layer is nothing but the idiomatic matrix-by-vector product plus a constant bias vector, Ax+b. We implement this in GLSL following a multistage blocked GEMV pattern (a sketch follows the list below):

  • the input feature vector x is split into chunks of 8 values, which are multiplied with 8x8 submatrices of the weight matrix A (one submatrix per thread in a first multiplication-stage shader program),
  • the partial results are then summed up in subsequent reduction-stage shader programs. The last such program also samples the constant vector b and adds it to the output.
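A NumPy sketch of this blocked GEMV decomposition (the shaders operate on texels; here the per-block multiplications and the final reduction are just made explicit):

import numpy as np

def blocked_gemv(A, x, b, block=8):
    """Blocked GEMV mimicking the shader decomposition of y = A @ x + b:
    each 8x8 submatrix of A multiplies its 8-value chunk of x (one partial
    product per shader thread), the partial results are then accumulated,
    and the bias is added in the last reduction pass."""
    n_out, n_in = A.shape
    assert n_out % block == 0 and n_in % block == 0
    y = np.zeros(n_out)
    for r in range(0, n_out, block):
        for c in range(0, n_in, block):
            y[r:r + block] += A[r:r + block, c:c + block] @ x[c:c + block]
    return y + b

# quick check against the direct product, with the final layer's shape
A = np.random.randn(120, 192); x = np.random.randn(192); b = np.random.randn(120)
assert np.allclose(blocked_gemv(A, x, b), A @ x + b)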

This approach would be relatively straightforward to implement if we did not have the fixed-point 0…1-ranged storage, which we unfortunately do.

Using an activation function that squashes the input signal into the 0…1 range is actually not the only way to cope with this issue. Another one is to use multiple color channels to store a single numeric value (this has even been a common practice in other contexts).

To implement the dense layer, we employ a technique using 2 texture channels to represent a single feature map value. Namely, we cover the [-128, 128) range in a fixed-point format, so that one texture channel stores the fractional part and the other one the integer part.
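A NumPy sketch of one possible two-channel encoding consistent with this description (the exact bit layout used by the shaders may differ; this only illustrates the integer/fraction split):

import numpy as np

def pack_fixed16(v):
    """Pack a value from [-128, 128) into two 8-bit texture channels:
    one holds the integer part, the other the fractional part."""
    shifted = np.clip(v + 128.0, 0.0, 256.0 - 1.0 / 256.0)
    integer = np.floor(shifted)                      # 0..255
    fraction = np.round((shifted - integer) * 255.0) # 0..255
    return integer / 255.0, fraction / 255.0         # normalized texel values

def unpack_fixed16(int_channel, frac_channel):
    """Inverse operation: recover the value with ~1/255 precision."""
    return int_channel * 255.0 + frac_channel - 128.0

hi, lo = pack_fixed16(np.array([-1.5, 0.0, 3.14159, 100.25]))
print(unpack_fixed16(hi, lo))   # close to the original values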

The limited range of the fixed-point representation is its major drawback and may cause a failure if an intermediate arithmetic operation ever produces a value falling out of this range. An alternative is a floating-point representation (e.g. bfloat16) covering the entire real axis.

However, fixed-point arithmetic does not introduce any numerical error in the additions/multiplications of the matrix product computation (as long as everything stays within the valid range), while in the floating-point scenario additions are a usual source of error, acting severely during the reduction sums. A test implementation of the matrix product using a bfloat16-like datatype with an 8-bit exponent turned out to be much less precise. Secondly, if the dense layer suffers from out-of-range errors at inference time, we could mitigate this by regularizing its inputs and weights during training. Fortunately, we have no need to do so: we validated empirically that the 16-bit fixed-point representation is just fine, and our shader-based classifier achieves an accuracy very close to a reference implementation using the usual single-precision floating-point calculus on a desktop GPU.

This completes the neural net design taking into consideration the feasibility of a GLSL implementation on the Raspberry Pi GPU. The net consumes an RGB texture containing the input image, passes it through a set of shader programs performing the computations of the different layers, and outputs a 120-length vector of logits. The CPU does not compute anything except the final softmax applied on top of the logits vector (i.e., computing their exponentials, summing them up and normalizing to get the class probabilities). The CPU has some managing work to do though: it is responsible for setting up shader programs, binding textures and starting rasterization passes.
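For completeness, the softmax computed on the CPU is the standard one; a two-line sketch:

import numpy as np

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating,
    then normalize to obtain class probabilities."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()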

Training the model

For the experiments we train the model on a 120-class subset of the ILSVRC 2012 dataset. We do this not because a bigger network would not fit into Raspberry Pi, but simply because of the limited GPU resources available for training and experimentation. A full-scale 1000-class model would most likely be fine to run on Pi: our 120-class model inference takes less than 6 MB of texture memory to store the activation signals (but maybe more to store the compiled shader programs, which is harder to estimate).

We chose the 120 classes containing images of dogs and cats, as a matter of being practically meaningful compared to, for example, a random subset of 120 classes. This choice, however, makes the problem harder, as instances of different classes look very similar: for a human, distinguishing a Siberian husky from a malamute can be much harder than distinguishing a malamute from, say, a bathtub, with all three being object classes in the original ImageNet set.

ILSVRC 2012 validation set samples showing dogs and why the problem is hard. The three leftmost images are not of the same class: Siberian husky, malamute and Canadian eskimo dog. The rightmost image, found accidentally, is wrongly labeled as malamute. The images are property of their respective copyright holders.

The model is trained using Keras in TensorFlow 2.3. Among multiple training recipes, the best performing one was to train with a learning rate of 0.01 for 400 epochs until it plateaus, then halve the learning rate every 50 epochs up to 600 epochs. The Adam optimizer was used. The best performing batch normalization momentum is 0.75 (we worked with batches of 64 images). A Tesla T4 GPU was used for training, taking ~30 min per epoch. To speed up the process, the sigmoid-like activation function has been implemented as a C++ extension for TensorFlow. FastAugment was another essential ingredient for speed and better generalization. The training scripts are available here.
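Sketched in Keras, the schedule described above could look as follows (the actual training code lives in the linked scripts; the exact epoch at which each halving kicks in is our interpretation):

import tensorflow as tf

def learning_rate(epoch):
    """0.01 for the first 400 epochs, then halved every 50 epochs up to 600."""
    if epoch < 400:
        return 0.01
    return 0.01 * 0.5 ** ((epoch - 400) // 50 + 1)

# model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
#               loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_data, epochs=600, validation_data=val_data,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(learning_rate)])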

In this experiment the model achieves a top-1 validation accuracy of at most 71.98%. We found that averaging the model weights from the 3 checkpoints reaching the highest validation accuracy near the global maximum gives a further boost of 0.3%, so that the maximum accuracy obtained by the trained model, before it gets transformed into GLSL shaders, is 72.28%.
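Checkpoint weight averaging is straightforward to do with Keras; a minimal sketch with hypothetical checkpoint file names (custom layers or activations would need to be passed via custom_objects when loading):

import numpy as np
import tensorflow as tf

# hypothetical paths: the three best checkpoints near the accuracy peak
checkpoints = ['ckpt_580.h5', 'ckpt_590.h5', 'ckpt_595.h5']

model = tf.keras.models.load_model(checkpoints[0], compile=False)
weights = [tf.keras.models.load_model(p, compile=False).get_weights()
           for p in checkpoints]

# average every weight tensor across the checkpoints
averaged = [np.mean(w, axis=0) for w in zip(*weights)]
model.set_weights(averaged)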

Testing

The inference implementation is done in C++ and uses OpenGL to run the GLSL compute code. The source code is available here.

Single-board computers and desktop GPUs

We first test our model on Raspberry Pi Zero W, Pi 3 Model B+, Pi 4, Nvidia Jetson Nano and various desktop GPUs. The test is done by running inference on the 120-class validation set (6000 images), taking square center crops of images resized to the target input resolution of 385x385.
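The preprocessing is nothing more than a center crop and a resize; one way to do it with Pillow (a sketch, not the project's test code):

from PIL import Image

def center_crop_and_resize(path, size=385):
    """Take the largest square center crop and resize it to size x size."""
    img = Image.open(path).convert('RGB')
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    crop = img.crop((left, top, left + side, top + side))
    return crop.resize((size, size), Image.BILINEAR)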

Inference time and top-1 validation accuracy on different hardware/OS.

The achieved top-1 validation accuracy matches the reference accuracy to within ~0.2 percent on all the hardware. The small variations in precision may be explained by differences in OpenGL implementations, such as the support of high-precision floating-point computations in fragment shaders. Also, on devices supporting floating-point textures we do not use the 16-bit fixed-point packing in the dense layer; these happen to be all of the tested devices except the two Raspberries.

The runtime varies from 595 ms on Pi Zero W to just a few milliseconds on an NVidia GeForce RTX 2070. It also differs quite a bit between the two Raspberry Pi models tested, even though they have the same GPU: Pi 3 Model B+ takes only 328 ms per image. This is explained by the difference in GPU clock frequency (250 MHz vs 300 MHz) and by the fact that CPU reactivity still matters to keep the GPU efficiently supplied with shader programs: Pi Zero W runs a 1 GHz single-core ARM v6, which has only a fraction of the power of the 1.5 GHz quad-core ARM v8 on board the Pi 3 Model B+.

Android smartphones

We use the same technique to run inference of our model on Android smartphones, although we proceed with a fancier test here: we plug the camera directly into the input of our model and predict the class of the image it captures in real time. The camera image in Android is directly accessible as an OpenGL texture, which makes the use of our implementation straightforward.

We assess the network performance by filming a few images from the ILSVRC 2012 validation set shown on a computer screen and comparing the predicted class labels with the ground-truth labels. We test a few Android devices of different generations this way. A few screenshots showing the camera preview and the predicted class probabilities for the current frame are shared below.

  • Samsung Galaxy M31 with Mali-G72 MP3 GPU classifies ~8 images per second.
  • Huawei P20 Lite with Mali-T830 MP2 GPU runs at ~5 images per second.
  • Asus Fonepad 8 tablet with PowerVR G6430 GPU dating back to 2014 processes ~2.3 frames per second.
Samsung Galaxy M31 classifying ~8 images per second. Unfortunately, the neural net is unable to make a clear distinction between husky, malamute and eskimo dog, but the rightmost picture is likely correctly classified.
Samsung Galaxy M31 classifying ~8 images per second. All the images on these pictures are classified correctly: the predicted class matches the ground truth.
Huawei P20 Lite classifying ~5 images per second. The dataset used in the experiment actually contains 119 classes of dogs and a single “tiger cat” class, simply because a test sample of a real cat was at hand.

Wrapping up

We have described an experimental way to enable GPU-accelerated image classification on low-end devices, which equally applies to more powerful hardware.

To make it work, we cherry-picked a few tricks from mainstream classification neural nets to design an architecture that copes with the identified hardware constraints.

  • This gave us a parameter-efficient architecture reaching 72% top-1 single-crop validation accuracy on a challenging 120-class subset of ILSVRC 2012 with only ~225K trainable parameters.
  • The model performs ~400M multiply-adds to classify a 385x385-pixel image; the GLSL implementation makes ~28M RGBA texel fetches.

The use of group convolutions, shuffling and an activation function allowing for 8-bit fixed-point signal storage reduces the memory bandwidth, making the designed architecture suitable for embedded devices. Also, the described inference implementation technique applies to a vast range of GPUs, including inexpensive single-board computers, mobile devices, and integrated and discrete GPUs from different vendors.

However, GLSL shaders and OpenGL bring a significant overhead, and the resulting speed is not comparable with what an efficient CUDA or OpenCL implementation can achieve on the same devices. For example, a suboptimal tiling limits the performance of the 2D convolutions: to compute every four output channels, the entire corresponding group of input channels is read from the input tensor (so the whole tensor in the case of a non-group convolution). This increases the global memory traffic, making the implementation heavily memory I/O-bound, which is arguably its major drawback. This could be tackled with compute shaders or multiple OpenGL draw buffers to tie more output channels to a single shader, but none of this is supported on Raspberry Pi models prior to the 4 Model B. The resulting throughput is thus rather modest: devices having a decent CPU will likely run a bigger network, e.g. MobileNet, faster on the CPU. This is quite surely not the case for Pi Zero W though, where programming the GPU in some way, or using external accelerators, is necessary to run inference of a neural net.
