Running a super-resolution neural network on Raspberry Pi GPU

lnstadrum · Published in Analytics Vidhya · 16 min read · May 17, 2020
Just a photo of my Raspberry Pi Zero W.
Say hi to the Raspberry Pi Zero W, a $10 computer with a programmable GPU. In the ~3700 words that follow we will harness its power to run inference of a neural network that upscales pictures in a nice way.

Imagine you have an image, say, a 4K one (3840*2160 pixels), that was resampled to a smaller resolution, say, Full HD (1920*1080). Smaller images take less storage and are faster to process and to send over the internet. However, when it comes to displaying this image on your 4K screen, you will likely prefer the original 4K image.

You may indeed interpolate the Full HD picture to 4K resolution using a standard interpolation method (bilinear, bicubic, etc.). This is what happens anyway when the picture gets stretched to fill the screen. And it definitely does not produce the nicest result, although at the Full HD/4K scale you may not perceive the difference unless your screen is really big or you zoom in. Still, if you have a 4K/8K screen, you likely care about visual quality. So an option would be a more elaborate upscaling approach that renders a nicer picture at the cost of increased processing time. This article is about such an approach.

The problem of reconstructing a higher-resolution image (HR) from a low-resolution one (LR) with no other inputs is ill-posed, since the downsampling process producing the LR image from the HR one typically entails a loss of information: for a given LR image and downsampling method there are many HR images that lead to the same LR output. Standard interpolation methods reconstruct one possible HR counterpart, not necessarily the most natural-looking one. To reconstruct a visually better HR image one may need to introduce a prior, and this is what neural networks are good at.

This ill-posedness, together with easily accessible data and the somewhat emotional context of delivering clearer pictures to people purchasing 8K screens, has led to thousands of papers, each claiming to be the best.

This article discusses

  • a small, easily trainable, fully convolutional architecture rendering 2x higher-resolution images, heavily inspired by the ESPCN network from Twitter engineers,
  • a way to implement the inference solely with OpenGL ES 2.0-compliant shaders, with no CPU compute at all. The latter makes it possible to run the model on laptops, Android smartphones and Raspberry Pi, entirely on the GPU.

I do not have an 8K screen, but I do have a Raspberry Pi. In what follows I do not try to beat state-of-the-art results, so there will be no high PSNR numbers. Instead, the focus here is on making things practical. Running the inference on the Raspberry Pi GPU was the goal of this project, so the main outcome is the use of OpenGL for neural network inference on a large spectrum of devices: if the Raspberry Pi GPU can do it, then pretty much any decent GPU can do it as well.

There are images, PSNR and time measurements and some code down there. Let’s get started.

Architecture

The model here is essentially an incarnation of ESPCN. The use of OpenGL ES 2.0 as the inference back-end and its Raspberry Pi implementation put some constraints on the architecture, making the network somewhat uncommon by modern ML practices, but let us accept it as is for the moment and discuss this later.

Main differences with ESPCN are:

  • the use of grouped convolutions followed by pointwise (1x1) convolutions,
  • the activation function is a ReLU bounded to the [0, 1] range (BReLU), applied on top of all the convolutions.
The proposed architecture: a 5x5 convolution, four grouped 3x3 convolutions, a 1x1 convolution, two grouped 3x3 convolutions, and a final 1x1 convolution, with BReLU activations.

The model operates on a 9x9 pixel neighborhood around a given pixel in the LR image and produces 2x2 output pixels, in the way shown below. We deal with grayscale images: the training is done on the luminance (Y) channel, while the inference may be run on the R, G, B channels separately, or on Y only, with the chroma components upscaled by a standard interpolation. This latter trick is quite common: the rationale is that typical image and video encoders already neglect the chroma components, applying an additional downscaling to them, and it all works because our eyes are less sensitive to chroma resolution.
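For concreteness, here is a minimal Keras sketch of an architecture matching this description. It is reconstructed from the figure and the constraints discussed later, not copied from the cookbook, so the exact layer ordering and the grouping helper are my assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def brelu(x):
    # ReLU bounded to [0, 1]; on the GPU this clamping comes for free when writing to gl_FragColor
    return tf.clip_by_value(x, 0.0, 1.0)

def grouped_conv3x3(x, groups, filters_per_group):
    # split channels into groups, run an independent 3x3 convolution per group, concatenate
    chunks = layers.Lambda(lambda t: tf.split(t, groups, axis=-1))(x)
    outs = [layers.Conv2D(filters_per_group, 3, padding='same', activation=brelu)(c)
            for c in chunks]
    return layers.Concatenate()(outs)

def build_model():
    lr = layers.Input(shape=(None, None, 1))                        # LR luminance
    x = layers.Conv2D(48, 5, padding='same', activation=brelu)(lr)  # 48 feature maps
    x = grouped_conv3x3(x, groups=4, filters_per_group=8)           # four grouped 3x3 convs -> 32
    x = layers.Conv2D(24, 1, activation=brelu)(x)                   # pointwise mix -> 24
    x = grouped_conv3x3(x, groups=2, filters_per_group=8)           # two grouped 3x3 convs -> 16
    x = layers.Conv2D(4, 1, activation=brelu)(x)                    # 4 channels = the 2x2 HR pixels
    hr = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)     # pixel shuffle to 2x resolution
    return tf.keras.Model(lr, hr)

model = build_model()
model.summary()  # with these channel counts the summary reports 7,340 trainable parameters
```

The bounded ReLU and the final pixel shuffle mirror what the OpenGL implementation will do later on.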

Training and validation

A cookbook with the entire TensorFlow 2 / Keras recipe is available here.

The model has only 7340 trainable parameters. There is no need for billions of images to train such a tiny thing without risk of overfitting, so the very classic super-resolution dataset DIV2K is used for training. It contains 800 HR images and their LR counterparts. Since the images are of different sizes, I cut them into fixed-size patches, e.g. 128x128 pixels in the LR domain. Using Mean Squared Error as the loss function is a simple way to go, and quite appropriate as long as PSNR is used as the quality measure.
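The patch cutting may look like the following sketch (the file handling and patch geometry are illustrative; the cookbook may organize this differently):

```python
import cv2
import numpy as np

def cut_patches(lr_path, hr_path, size=128, scale=2):
    """Cut aligned fixed-size luminance patches from an LR/HR DIV2K image pair."""
    lr = cv2.cvtColor(cv2.imread(lr_path), cv2.COLOR_BGR2YCrCb)[..., 0].astype(np.float32) / 255.0
    hr = cv2.cvtColor(cv2.imread(hr_path), cv2.COLOR_BGR2YCrCb)[..., 0].astype(np.float32) / 255.0
    lr_patches, hr_patches = [], []
    for y in range(0, lr.shape[0] - size + 1, size):
        for x in range(0, lr.shape[1] - size + 1, size):
            lr_patches.append(lr[y:y + size, x:x + size, None])
            hr_patches.append(hr[scale * y:scale * (y + size), scale * x:scale * (x + size), None])
    return np.stack(lr_patches), np.stack(hr_patches)
```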

The training is then performed in a straightforward way: using the Adam optimizer and running some 200+200 epochs with a learning rate drop from 1e-3 to 1e-4 in the middle. It took about 12 hours on my GeForce RTX 2070.
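With the patches stacked into arrays, the whole recipe fits in a few lines of Keras; the batch size is my guess, and PSNR is tracked as a metric since it is the quality measure used below:

```python
import tensorflow as tf

def psnr_metric(y_true, y_pred):
    # PSNR in dB for images in [0, 1]
    return tf.image.psnr(y_true, y_pred, max_val=1.0)

# x_train/y_train, x_val/y_val: LR and HR patch arrays from the previous step
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse', metrics=[psnr_metric])
model.fit(x_train, y_train, batch_size=32, epochs=200, validation_data=(x_val, y_val))

# drop the learning rate and continue for another 200 epochs
tf.keras.backend.set_value(model.optimizer.learning_rate, 1e-4)
model.fit(x_train, y_train, batch_size=32, epochs=200, validation_data=(x_val, y_val))
```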

The trained model reaches 33.26 dB on the DIV2K validation set. This is not a terribly good score, although for a model with 7K trainable parameters it is something. Below is a figure that positions this result against the state of the art. Another number for context: the EDSR+ network, which reaches 35.12 dB on the same dataset (among the highest scores today), has some 43 million parameters. And we sit halfway between bicubic and EDSR+ with such a modest model capacity!

PSNR/inference-time scoreboard on the same validation set (DIV2K x2 bicubic), taken from the NTIRE 2017 challenge paper. Our model sits somewhere on the blue line, depending on which GPU is used for inference. But wait, that is for the next chapter.

Let us now take a look at images.

Superresolution is much like sharpening, so we can think of our model as something that does upscaling + sharpening (even though, strangely enough, it actually upscales only at the very end). The images rendered by our network appear sharper, even though you have to zoom in a lot to see the difference.

Results on a couple of images from DIV2K. In columns, from top to bottom: (1) LR input, (2) OpenCV’s bicubic upscale (baseline), (3) our result and (4) HR ground truth. Our model reaches 2.83 and 2.92 dB higher PSNR than the baseline on these two images respectively. Not perfect, but not bad either: in most cases our result is much closer to the HR image than the baseline.

Fine tuning

Let us now consider a real-world application where we do not have the ground-truth HR image. Still, to perform an objective (PSNR-based) comparison, we can take any image as HR, produce its LR representative, then upscale it with different methods and compare the results.

Here we do so using two other very classic SR datasets, Set5 and Set14. We take, for example, the butterfly image as HR and scale it down with a bicubic downscaler at hand (there is one in OpenCV’s resize function). This becomes the LR image. Then we scale it back up with the same bicubic interpolation and with our trained model. And the result is surprisingly disappointing: the image is over-sharpened in an ugly way and the PSNR is worse than the bicubic result!
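A minimal sketch of this round trip with OpenCV, luminance only (the file name and the predict call are illustrative):

```python
import cv2
import numpy as np

hr = cv2.cvtColor(cv2.imread('butterfly.png'), cv2.COLOR_BGR2YCrCb)[..., 0] / 255.0
h, w = hr.shape

# LR image produced with OpenCV's bicubic downscaler
lr = cv2.resize(hr, (w // 2, h // 2), interpolation=cv2.INTER_CUBIC)

# baseline: bicubic upscale back to the original size
bicubic = cv2.resize(lr, (w, h), interpolation=cv2.INTER_CUBIC)

# our model's upscale of the Y channel
sr = model.predict(lr[None, ..., None])[0, ..., 0]

def psnr(a, b):
    # PSNR in dB for images in [0, 1]
    return -10.0 * np.log10(np.mean((a - b) ** 2))

print('bicubic: %.2f dB   model: %.2f dB' % (psnr(hr, bicubic), psnr(hr, sr)))
```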

It turns out that the LR and HR images in our training set are related by another bicubic interpolator, likely the one from Matlab, possibly preceded by an antialiasing filter, and the network naturally develops an affection for it while learning to approximate its inverse. The issue is crazily common: OpenCV’s bicubic is indeed different from the Matlab one (people try to reimplement it in Python), and our model is not the only one suffering from this difference: for example, ESRGAN, a star of the superresolution scene, may produce artifacts if your bicubic interpolator differs from the one used in training.

Left to right: original image (HR), bicubic interpolation result, output of our model trained on DIV2K, and output of our model fine-tuned on the same data extended with a bunch of degradation kernels (details follow). The bicubic interpolation reaches 26.90 dB. Without fine-tuning our model fails, producing this badly sharpened image (25.05 dB), but it catches up and outperforms the baseline once fine-tuned (29.92 dB).

It all makes sense from the research standpoint… but is frustratingly impractical. If you do not have the reference HR image and you do not master the entire image formation process, you likely do not have any sort of “degradation kernel”. In a very general setting you only have an image to upscale in a sharp, visually pleasant fashion, and you have no idea how exactly and how many times it has been resampled, nor what else it may have experienced in its obscure past.

To fix this in a simple way, I extended the training set with LR images produced by as many degradation kernels as were immediately available, simply taking what OpenCV’s resize offers: I rendered new LR images with its bilinear, bicubic and Lanczos interpolators and added them to the original ones. The pre-trained model was then re-fitted to the new, larger dataset in the same way. This allowed the model to outperform the bicubic baseline on Set5 and Set14: the gain is slight but still produces a noticeable visual difference (and no ugly over-sharpened images). The price to pay is PSNR on the original DIV2K validation set: the fine-tuned model achieves only 32.57 dB instead of 33.26 dB.
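The augmentation itself boils down to a few lines; these are the three OpenCV interpolators mentioned above, although the exact way the new pairs are mixed into the dataset is up to you:

```python
import cv2

# degradation kernels immediately available in OpenCV's resize
KERNELS = [cv2.INTER_LINEAR, cv2.INTER_CUBIC, cv2.INTER_LANCZOS4]

def degrade(hr, scale=2):
    """Render one LR image per interpolation kernel from a given HR image."""
    h, w = hr.shape[:2]
    return [cv2.resize(hr, (w // scale, h // scale), interpolation=k) for k in KERNELS]

# The resulting LR/HR pairs are appended to the original DIV2K pairs,
# and the pre-trained model is simply re-fitted on the enlarged dataset.
```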

The impact of fine-tuning with multiple degradation kernels on PSNR on Set5 and Set14.
Bicubic baseline (left, 24.99 dB) vs fine-tuned model (right, 26.11 dB) on an image from Set14. There is no antialiasing applied when doing the bicubic downscale, so our network learns to correct aliasing artifacts and produce smooth edges. You can see the difference without zooming in, right?

Although the model and its training can be further explored and improved in many respects, this will be our final model. So let me stop here with the machine learning part and proceed to the inference implementation.

Implementing the inference using OpenGL

Inferring a convolutional neural net on an image typically requires a lot of computation. Fortunately, it is easily parallelizable and thus well suited to GPUs.

GPUs were originally there to render pictures. A long time ago, graphics pipelines were fixed (non-programmable) and capable of a predefined set of standard computer graphics operations. Yet using them for more general-purpose computations kept many of us busy; there even was a specific term for it, “GPGPU”, that you will hardly find anywhere today.

This is because things have changed, and now we use GPUs to compute pretty much anything by means of a specific interface such as CUDA, OpenGL compute shaders or OpenCL. For example, TensorFlow uses CUDA to talk to the GPU. CUDA is a proprietary technology by Nvidia, so if you have a graphics card from another vendor, you will unfortunately not get the most out of your hardware with TensorFlow any time soon. But maybe I am too pessimistic: some time ago TensorFlow Lite introduced support for OpenGL compute shaders for some models and applications. This helps out the CPU when doing things like face detection on Android devices, which are not exactly packed with Nvidia GPUs.

OpenGL is ubiquitous. Any decent GPU from any vendor conforms to some version of OpenGL, generally offering a certain level of programmability.

So does the Raspberry Pi. I am not talking about the most recent Pi model available at the moment of writing, the 4 Model B, whose GPU is OpenGL ES 3.1-conformant, making it capable of almost anything a decent Android smartphone can do. I am talking about all the other Pi models, which are only compliant with the OpenGL ES 2.0 standard. This means: no compute shaders, only vertex and fragment ones; no floating-point input/output; no ability to produce a multidimensional output in a single shader (four 8-bit scalars only)…

Regardless, it is enough to run the inference of the model we have just built.

It is worth noting that if you are in love with the Raspberry Pi, there are more efficient ways to access its GPU computing power without the OpenGL overhead: here, here, or even a Python library for doing GPGPU on the Pi here. This becomes very Pi-specific, but you can likely run faster. I keep going with OpenGL here, since the aim is to run the inference on other devices too.

Overview

To put it simply, we implement the operations performed during inference in the form of small programs (shaders) written in GLSL (OpenGL Shading Language). The shaders will also contain the hardcoded trained network weights. All images and feature maps become textures, all at the input LR image resolution.

GLSL is much like C, with some syntax differences and limitations. Shaders are compiled at runtime by the GPU driver into hardware-specific binary code that the GPU executes, much like a CPU does. But there are differences, mainly due to the SIMD nature of GPU hardware. For example, GLSL is not a Turing-complete language: you cannot recurse in GLSL code the way you do in C++ or Python. Fortunately, we do not need this for the inference of a feedforward convolutional neural network.

Since there are no compute shaders in the OpenGL ES 2.0 standard, we proceed in the traditional way, where a vertex shader and a fragment shader are needed to perform a render pass.

  • Our vertex shaders are trivial: they render a single quadrilateral projecting the entire input onto the entire viewport. I will not detail their code here.
  • Fragment shaders are where the magic happens. They will sample input textures containing the LR input (for the input layer) or feature maps (for hidden and output layers) and compute output feature maps. The GLSL code of fragment shaders is generated by a Python script from the trained model.

To get GLSL shaders running you typically need to write some nasty platform-dependent code setting up an OpenGL context and implementing all the machinery to perform a render pass. I skip the details here; the whole code is available anyway.

We proceed in the way explained above: the Y component of the input gets upscaled by the neural net, while the Cb and Cr chrominance channels are upscaled as a regular texture. OpenGL natively supports bilinear interpolation along with nearest-neighbor (the latter is, by the way, used to sample all the feature maps), so the chroma gets interpolated bilinearly. This is not the only way; one may implement bicubic chroma interpolation in a shader, or apply the neural net to the R, G and B inputs successively.
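For reference, the same luma/chroma split on the CPU with NumPy/OpenCV looks roughly like this (handy for checking the GPU output; the function and variable names are mine):

```python
import cv2
import numpy as np

def upscale_bgr(bgr, predict_luma):
    """predict_luma maps an LR Y channel in [0, 1] to the HR Y channel,
    e.g. a thin wrapper around model.predict."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32) / 255.0
    h, w = ycrcb.shape[:2]
    y_hr = predict_luma(ycrcb[..., 0])                     # neural upscaling of the luma
    crcb_hr = cv2.resize(ycrcb[..., 1:], (2 * w, 2 * h),
                         interpolation=cv2.INTER_LINEAR)   # cheap bilinear chroma upscaling
    out = np.dstack([y_hr, crcb_hr])
    return cv2.cvtColor((out * 255.0).clip(0, 255).astype(np.uint8),
                        cv2.COLOR_YCrCb2BGR)
```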

This is our model, with each brick being a shader: 32 in total. The output of every shader is a texture containing 4 feature maps. They are all at the input LR resolution, except the output image of course.

Constraints

As mentioned above, our model is shaped by constraints coming from the Raspberry Pi OpenGL ES implementation. Let me finally explain them.

  • A fragment shader is a program executed for every pixel. It has a single output, a 4-component pixel color written to the gl_FragColor variable. Therefore we can only compute (up to) four feature channels in a single shader. This multiplies the number of shaders we need but does not constrain the model size, so we can live with it. It also heavily increases the memory bandwidth usage, since the feature map textures get sampled many times… but to my knowledge there is no other way with GL ES 2.0 on the Raspberry Pi.
  • All feature map values are stored as 8-bit fixed-point values in the [0, 1] range. A way to cope with this is to use an activation function whose output range fits into [0, 1]. This is why we use the [0, 1]-bounded ReLU as the activation function everywhere. Actually, the simple fact of writing to gl_FragColor clamps the value to the [0, 1] range, so we do not even need to implement it explicitly: GLSL applies the bounded ReLU anyway. Cool!
  • A fragment shader has a limited number of input textures: at least 8 according to the standard, and exactly 8 on the Raspberry Pi. Since textures are (at most) 4-channel images containing RGBA colors, we end up with at most 8*4=32 feature maps on input. This is an actual constraint: to compute a 2D convolution we need access to all input feature maps in a single shader. Otherwise we would have to split the convolution among several shaders, each sampling at most 32 channels, and then use another shader to put the partial results together… It quickly becomes a mess and may be unfeasible due to the 8-bit shader output constraint. Therefore, all the feature maps can have at most 32 channels.
  • There are two extra conditions limiting the number of input channels. Firstly, there is a limit on the number of texture sampling operations per shader (64 for the Pi). To compute a 3x3 convolution, every texture gets sampled 3*3 = 9 times. With the limit of 64 samples we can bind at most 7 textures, i.e. 28 feature maps. For 1x1 convolutions this is not an issue.
  • Secondly, there is a limit on the total number of instructions per shader. 3x3 convolutions over multiple input feature maps are the greediest in this sense. An implementation with 12 input and 8 output feature maps passes on all the hardware I had at hand (although it may be tweaked further). There might be a way to get more feature channels by going with depthwise convolutions as in MobileNet, but this leads to a model with yet smaller capacity and did not seem to perform well in the few tests I did. Therefore we rely on grouped 3x3 convolutions over 12 feature maps each, putting pointwise convolutions on top of the grouped blocks to mix their feature channels. This is the key design decision shaping the model, giving 48–32–24–16 feature maps at the layer outputs (a quick parameter count for this layout follows below).
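As a sanity check, this channel plan accounts exactly for the parameter count quoted earlier:

```python
def conv_params(k, c_in, c_out, groups=1):
    """Weights + biases of a (possibly grouped) k x k convolution."""
    return k * k * (c_in // groups) * c_out + c_out

total = (conv_params(5, 1, 48)               # 5x5 input convolution
         + conv_params(3, 48, 32, groups=4)  # four grouped 3x3 convs (12 -> 8 each)
         + conv_params(1, 32, 24)            # pointwise mix
         + conv_params(3, 24, 16, groups=2)  # two grouped 3x3 convs (12 -> 8 each)
         + conv_params(1, 16, 4))            # final 1x1 conv -> the 2x2 output pixels
print(total)  # 7340
```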

This is it: we now have a trained model that fits the hardware. There are a few things remaining to get it running!

GLSL implementation

Converting such a fully convolutional model into a bunch of GLSL shaders becomes simple once it respects the hardware constraints: all we need is to take the weights and biases from the trained model and implement the convolutions in fragment shaders. Accessing the trained parameters of a layer in your favorite machine learning framework is generally not a problem (as simple as layer.kernel.numpy() and layer.bias.numpy() to get the Numpy arrays in TensorFlow 2 / Keras), so a Python script would do the job.

As for storing the network parameters in GLSL, there are different options, for example to put them into a separate texture or a uniform variable. However, the model is small, so the simplest option is to expose the weights and biases as hardcoded constants in GLSL code. This is likely the most efficient way too.

Here is what the very last 1x1 convolution shader (the fifth layer of the model) looks like. It is the smallest of the 32 shaders in terms of code size. All it does is sample the 16 input feature maps (4 textures of 4 channels each), convolve them with the learned kernel, add the bias and write the result out to the fragment color variable.
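The actual listing lives in the repository; below is a minimal sketch of how such a shader can be generated from the trained Keras layer. The texture and variable names are illustrative, not the author’s exact ones:

```python
def make_pointwise_shader(kernel, bias):
    """Emit GLSL for a 1x1 convolution over 16 input channels (4 RGBA textures)
    producing 4 output channels; kernel has shape (1, 1, 16, 4), bias has shape (4,)."""
    w = kernel.reshape(16, 4)
    lines = [
        "precision mediump float;",
        "varying vec2 texCoord;",
        "uniform sampler2D f0, f1, f2, f3;",   # 4 input textures = 16 feature maps
        "void main() {",
        "  vec4 s0 = texture2D(f0, texCoord);",
        "  vec4 s1 = texture2D(f1, texCoord);",
        "  vec4 s2 = texture2D(f2, texCoord);",
        "  vec4 s3 = texture2D(f3, texCoord);",
    ]
    for o in range(4):  # one output feature channel per gl_FragColor component
        terms = " + ".join(
            "dot(s%d, vec4(%s))" % (t, ", ".join("%.6f" % v for v in w[4 * t:4 * t + 4, o]))
            for t in range(4))
        lines.append("  float c%d = %s + %.6f;" % (o, terms, bias[o]))
    # writing to gl_FragColor clamps to [0, 1], i.e. the bounded ReLU comes for free
    lines.append("  gl_FragColor = vec4(c0, c1, c2, c3);")
    lines.append("}")
    return "\n".join(lines)

# e.g. for the final 1x1 convolution of the trained Keras model (the layer index is hypothetical):
# src = make_pointwise_shader(model.layers[-2].kernel.numpy(), model.layers[-2].bias.numpy())
```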

I used the GLSL dot function to implement the convolution. It appeared quite efficient in some preliminary trials on the Raspberry Pi, but there are certainly other ways to organize the computations.

Together with the convolutional shaders, one last fragment shader we need is the one merging the nicely upsampled luma with the cheaply upsampled chroma (the salmon-colored one in the scheme above). Indeed, the fifth layer output is a 4-channel texture of the LR input size; at each pixel position, its 4 channels contain the 4 pixel values of the output HR luminance. So in this last fragment shader we demultiplex these values using the gl_FragCoord GLSL variable and add the chrominance from the input image.
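The original shader is in the repository as well; here is a hedged sketch of what it can look like, kept as a Python string next to the generator script. The channel packing order and the BT.601 conversion constants are my assumptions:

```python
# Hypothetical merge/demultiplex shader; the channel-to-position mapping is assumed to be
# (top-left, top-right, bottom-left, bottom-right) and the chroma conversion is BT.601.
MERGE_FRAGMENT_SHADER = """
precision mediump float;
varying vec2 texCoord;          // LR-domain coordinate from the vertex shader
uniform sampler2D convOutput;   // layer 5 output: 4 HR luma values per LR pixel
uniform sampler2D inputImage;   // original LR image, sampled bilinearly
void main() {
    // which of the 2x2 HR pixels we are rendering (gl_FragCoord is in HR pixels)
    vec2 p = mod(floor(gl_FragCoord.xy), 2.0);
    vec4 mask = vec4((1.0 - p.x) * (1.0 - p.y), p.x * (1.0 - p.y),
                     (1.0 - p.x) * p.y,         p.x * p.y);
    float y = dot(texture2D(convOutput, texCoord), mask);
    // chroma taken from the bilinearly interpolated input image
    vec3 rgb = texture2D(inputImage, texCoord).rgb;
    float cb = dot(rgb, vec3(-0.168736, -0.331264, 0.5)) + 0.5;
    float cr = dot(rgb, vec3(0.5, -0.418688, -0.081312)) + 0.5;
    // recombine the new luma with the old chroma and convert back to RGB
    gl_FragColor = vec4(y + 1.402 * (cr - 0.5),
                        y - 0.344136 * (cb - 0.5) - 0.714136 * (cr - 0.5),
                        y + 1.772 * (cb - 0.5), 1.0);
}
"""
```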

The Fun Part: Testing on all the hardware around

Raspberry Pi

The whole thing is designed to run on Pi, so let the tests begin on Pi!

To upscale a 256*256 input image to 512*512 pixels, the Raspberry Pi 3 Model B+ runs the inference in ~130 ms on average (with a 9.5 ms standard deviation over 10 repetitions). The Raspberry Pi Zero W goes a little slower (171 ms): it has the same Broadcom VideoCore IV GPU onboard but runs it at a lower clock frequency. The shaders get compiled in 3.5 to 4 seconds (as a reminder, this is done once, not for every image we may have to process).

Is this slow? Well, rather yes. But we just managed to run the inference of a neural net on the Raspberry Pi GPU, which is still great.

For the sake of visibility, an image from Set5 (top left) gets downscaled here by a factor of four (top right). Then it is upscaled back using bicubic interpolation (bottom left) or by applying our model twice (bottom right). We get 22.82 dB, while the bicubic finishes at 21.41 dB. And yes, the bottom-right image comes straight from the Raspberry Pi.

Android smartphones

Android smartphones have OpenGL ES-compliant GPUs too, so there is no way for them to escape our tests. Here are the figures for the same 256*256 to 512*512 upscaling test on some Android phones.

I do not put any PSNR numbers here: there is no visible difference between images coming from different GPUs. The results do differ slightly on different hardware, but they remain very similar. For instance, among three images rendered from the same source on a Pi, an Android smartphone and a high-end desktop GPU, the two most different ones are at 45.8 dB from each other.

Another test is passing a small-resolution camera preview through the network in real time. Without any additional pixel transfer, the camera image may be accessed as an OpenGL texture through a samplerExternalOES sampler in GLSL, so we can plug it directly into our network.

Huawei P20 Lite manages to run the upsampling of a 352*288 input to 704*576 output at 6 to 7 FPS. My old Asus K016 (Fonepad 8) does the same job at ~13.2 FPS. Huawei P10 runs at 30 FPS!

Running the inference on a 352*288 camera preview image on a Huawei P20 Lite. Left: natively (bilinearly) upscaled camera preview; right: our network’s result. Take a closer look at the lamppost in the middle, the cable to its right, the roof contours, the fine tree branches, the road marking. Admittedly, this app is somewhat ridiculous, since the camera is capable of a much better resolution. Still, passing a camera preview through an SR neural net on an Android GPU in real time is fancy, isn’t it?

Desktops

Desktop GPUs are generally (much) more powerful. So in addition to feeding them the 256*256 butterfly image from Set5, we challenge them with something bigger: a Full HD input to be upscaled to 4K.

Here are some figures:

Among the numbers for desktop GPUs there are also inference times for the Nvidia Jetson Nano. It is rather a mobile GPU than a desktop one, but in this test it easily competes with integrated desktop graphics.

The running time is actually quite variable and depends on more than the GPU model and hardware characteristics: for example, it gets slower when the same GPU is used by the system to render the GUI, or when a laptop runs on battery, etc.

Wrapping up

If your washing machine has a GPU, it is very probable that the neural network we have designed here can be run on it.

That is all I have so far: a tiny cute neural upsampler running on (almost) any GPU you may find around. To get there,

  • we first trained a small (7.3K parameters) neural network for x2 upscaling, inspired by the ESPCN paper. Although its PSNR does not rock compared to the creamy top of the state of the art, the images are rather nice (quite a bit nicer than the bicubic baseline). There is room for improvement in model design, training and fine-tuning.
  • Then we converted it into a bunch of OpenGL shaders. The inference runs on Raspberry Pi, Android smartphones and desktops, and it uses only the GPU for computing. Low-end GPUs are not too fast, but a good desktop GPU can upscale Full HD to 4K in real time (some 100 FPS). You can now imagine watching a movie encoded in Full HD on your 4K screen, nicely upscaled in real time by your GPU.

The TensorFlow/Keras implementation is available here, the inference implementation is here.
