Client-side neural networks: how to get started and our story
The popularity of machine learning and neural networks is growing every day, and faster than we could ever have imagined. Many industry players are contributing, and client-side frameworks are developing rapidly. Many libraries, such as TensorFlow, are constantly releasing new versions. More mobile devices are now receiving hardware support for machine learning (even Google released its NNAPI — Neural Networks API on Android 8.1). The field is receiving a lot of attention at the moment.
With millions of users worldwide, Bumble, the parent company operating the Badoo and Bumble apps, processes a significant amount of information every day (especially photos). With a team full of experts such as data scientists, data analysts and R&D people with in-depth knowledge of machine learning, plus a moderation team to mark up datasets, we believe we provide the best possible environment and user experience.
In this article, I share with you our very first experience using neural networks (NN) on the client-side. In particular, I describe the process of working with the TensorFlow Lite library. I also give an overview of our preparations, a brief description of the neural networks’ world, and, to finish, full implementation details.
This might be useful for those starting out with neural networks or anyone just interested in hearing about our struggle. So, let’s get started!
All our users need to upload a picture for their profile. Users may request self-verification using additional photos. Once verified, the user receives a badge on their profile for other users to see. The process looks like this:
- We ask the user to send us a photo of them showing a requested gesture.
- We process the photo to decide if it’s the same person and if the gesture matches their requested one.
Initially, we had two-step photo processing. In the first step, server-side neural networks filtered out obviously unsuitable photos from further processing. In the second step, our moderators processed photos one by one to check whether they fitted the criteria. Our goal was to increase the percentage of successful photo-verification attempts. A side benefit was that it also reduced the number of manually processed photos.
The updated pipeline of the photo-verification process looks like this:
We developed a set of requirements trying to deliver the best experience possible. There were two key requirements applied to the NN model:
- Speed. NN should be as fast as possible. It should work in real-time. The response should be prompt, so the changes in the input image would be reflected in the output in a reasonable time frame.
- Size. NN should be as small as possible. We decided to deliver the model over the network, so the smaller the model, the faster the user gets it.
So, we don’t bundle the model inside the APK, to keep it small. To further optimise the APK size, we are considering using the Dynamic Delivery feature in the future. Yury has written a great article on this (Part 1, Part 2).
Beginning of development
When I started the investigation, I knew that we were already using TensorFlow on the server side and that it is the most popular machine learning platform at the moment. I found that TensorFlow has 2 options for running on mobile devices: TensorFlow Lite and the older TensorFlow Mobile (not its official name). The R&D department provided me with the first models, which were in the protobuf serialised graph format (“.pb”). Such models were only compatible with the old TensorFlow library. They were good enough to start with and play around with, but we knew we could use TensorFlow Lite for our purposes, that it offered higher performance, and that it would allow us to run the model on the GPU. Remember, one of our key requirements was high speed. The GPU runs many calculations in parallel, which gives a significant speed boost compared to running the same model on the CPU. So, we decided to give it a go and convert the model to the new format.
Overall, TensorFlow Lite provides 3 ways of running a model: on the CPU, on the GPU and using NNAPI. NNAPI was introduced in Android 8.1 and you will benefit from it if the device contains specialised hardware for running neural networks. If the device lacks such hardware, you will not see a significant difference compared to running on the GPU.
Converting an existing model to the TFLite format was a long, iterative process. We started with automatic conversion and tried to run the resulting model on the CPU. There were no problems there: the converted model ran as expected. But when I tried to run it on the GPU, it failed due to the GPU’s limitations, so we had to modify the model to accommodate them.
Currently, only a limited subset of operations is available on the GPU. When an operation is not supported there, TensorFlow falls back to the CPU to execute the remaining operations, and you will see warnings in the debug output when that happens. Performance drops dramatically in this case due to memory copying; in fact, it’s even slower than running the whole model on the CPU.
The whole model relied heavily on the tf.split() operation, so we had to either remove or replace it. Our R&D guys nailed it, and the model was then able to run on the GPU, but that wasn’t the end of the story. The output of the model differed when we ran inference on the same data on the CPU and the GPU: we literally received garbage after processing data on the GPU. It took us a while to figure out what had gone wrong, and it turned out to be a bug in TensorFlow. The original model took 1-channel greyscale images as input. As a workaround, we switched to a 3-channel input and trained a new model. This approach had a few drawbacks (a bigger model and extra conversions when preparing input data), but at least it now works on the GPU at impressively high speed. The average time for processing one frame was 30–60ms on the GPU vs 200–300ms on the CPU, depending on the device.
As an Android developer, you most probably don’t need to convert the model yourself. Better ask your R&D department or data scientists. If you just want to try working with neural networks there are plenty of demos available with ready-to-use models.
If you struggle with detecting inputs and outputs of some model, here’s a great online visualizer. It shows you the structure of the whole neural network with all operations, as well as all inputs and outputs.
A little bit about Neural networks and how they work
An NN model is just a set of transformations (mathematical operations) that should be applied to the input data. It has inputs and outputs; in our case, there was only one input but multiple outputs. This is the typical process:
- We give input data to the framework
- The framework applies the NN’s transformations to it
- The framework returns the outputs
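To make the idea concrete, here is a toy illustration of that process (not our production model, just a hypothetical "set of transformations": one dense layer followed by a ReLU), written in plain Java:

```java
// Toy illustration: inference is just applying fixed transformations to input data.
public class ToyModel {
    // "Learned" parameters of a single dense layer (2 inputs -> 2 outputs).
    private static final float[][] WEIGHTS = {{1f, 2f}, {-1f, 1f}};
    private static final float[] BIAS = {0f, 0.5f};

    // Apply the transformations: matrix multiply, add bias, then ReLU.
    public static float[] infer(float[] input) {
        float[] output = new float[WEIGHTS.length];
        for (int i = 0; i < WEIGHTS.length; i++) {
            float sum = BIAS[i];
            for (int j = 0; j < input.length; j++) {
                sum += WEIGHTS[i][j] * input[j];
            }
            output[i] = Math.max(0f, sum); // ReLU
        }
        return output;
    }
}
```

A real framework does exactly this, only with thousands of such operations chained together and executed on optimised backends.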
There were many parts in the integration process such as model prefetching, working with the camera and working with TensorFlow. I’ll go straight to the last part as it’s the most interesting piece.
I created a reusable component which contains a recognizer and a view. The recognizer receives a stream of images from the camera, assesses the probabilities of different parameters, and then updates the icons in the view.
In the screenshot below you can see the visual representation of the component (that has 3 checkmarks with subtitles).
As you might already know, we use our RIBs framework for creating reusable components. So, I implemented the Gesture Recognition Component as a RIB.
It’s not essential to use RIBs for this task, you can easily use TensorFlow Lite without it and implement integration with it as you wish.
The interface of the RIB looked like this:
To build the component we need to provide its dependencies (config, inputs, and outputs). As an input, we can receive either a bitmap or a flag signalling to stop processing. As an output, we report if the component is ready to process the next frame.
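A hypothetical sketch of such a contract is shown below. The names are illustrative, not our actual RIB interface, and the real code works with an Android Bitmap rather than a raw pixel array:

```java
// Hypothetical sketch of the component's contract. Names are illustrative;
// the real RIB interface is not reproduced here.
public interface GestureRecognitionComponent {

    // Inputs: either a new frame to process or a request to stop processing.
    abstract class Input {
        public static final class Frame extends Input {
            public final int[] argbPixels; // stand-in for an Android Bitmap
            public Frame(int[] argbPixels) { this.argbPixels = argbPixels; }
        }
        public static final class StopProcessing extends Input {}
    }

    // Outputs: the component reports when it is ready for the next frame.
    abstract class Output {
        public static final class ReadyForNextFrame extends Output {}
    }
}
```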
As soon as we receive a new Bitmap, we need to convert it to the format supported by the model. I created the following interface for it.
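The interface looked roughly like this (a sketch: the real version accepts an Android Bitmap, while a raw ARGB pixel array is used here to keep the snippet platform-neutral):

```java
import java.nio.ByteBuffer;

// Sketch of the converter interface. The real implementation takes an Android
// Bitmap; a raw ARGB pixel array keeps this snippet platform-neutral.
public interface InputDataConverter {
    // Converts camera pixels into the byte layout expected by the model.
    ByteBuffer toModelInput(int[] argbPixels, int width, int height);
}
```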
You might notice that this converter returns a ByteBuffer, and there’s a reason for that. In my case, the model accepted a 3-channel 224x224 image with 1 float per component. In the general case, the input format might differ: a different number of channels, a different size, or pixels encoded as integers. Always check the input format of your model before using it; this is very important.
Here’s what’s happening in the current converter:
- Scaling and rotating the bitmap. We apply rotation because camera preview callback may return an image in different orientations. Meanwhile, the model was trained on photos in portrait orientation.
- Drawing it on the grey background. The grey (hex code 0x797979) background is the best choice when working with neural networks because it has minimal impact on the end result. The value is from the middle of the grey range and after normalisation, all its RGB components will be equal to 0.
- Converting pixels to floats
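The last step can be sketched as follows. This assumes the common (channel - 127.5) / 127.5 normalisation and a 224x224 3-channel float input; the exact scheme depends on how your model was trained, so check yours before copying this:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the pixel-to-float step, assuming a 224x224 RGB float input
// normalised as (channel - 127.5) / 127.5. Verify your model's actual format.
public class PixelToFloatConverter {
    public static final int SIZE = 224;

    public static ByteBuffer convert(int[] argbPixels) {
        ByteBuffer buffer = ByteBuffer
                .allocateDirect(SIZE * SIZE * 3 * Float.BYTES)
                .order(ByteOrder.nativeOrder()); // TFLite expects native byte order
        for (int pixel : argbPixels) {
            buffer.putFloat(normalise((pixel >> 16) & 0xFF)); // R
            buffer.putFloat(normalise((pixel >> 8) & 0xFF));  // G
            buffer.putFloat(normalise(pixel & 0xFF));         // B
        }
        buffer.rewind();
        return buffer;
    }

    private static float normalise(int channel) {
        return (channel - 127.5f) / 127.5f;
    }
}
```

Under this normalisation, the grey padding colour 0x797979 maps to values very close to zero in every channel, which is why it disturbs the model so little.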
The recognizer functionality was embedded in an MVICore feature. The feature accepts wishes with data and stores outputs in the state.
This is not really essential. You can use it or not depending on your needs. In my case, it was the best option as it can be easily integrated into the RIB.
Now, let’s take a closer look at the recognizer, the heart of the component.
It accepts several parameters: the source of the model, recognizer options (such as using GpuDelegate) and outputs. The Interpreter class provides the core inference functionality, whereas GpuDelegate allows running inference on the GPU.
I implemented 2 different model sources: file-based and assets-based. The second one is used for testing.
As long as we use memory-mapping, the file containing the model must not be compressed. If you decide to store the model as an asset, don’t forget to add the corresponding flag to the android section of your build.gradle.
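Memory-mapping a model file needs nothing beyond plain java.nio; a minimal sketch of a file-based source looks like this:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: memory-map a model file so TFLite can read it without copying it
// into the Java heap. The resulting MappedByteBuffer can be handed to the
// Interpreter constructor.
public class ModelLoader {
    public static MappedByteBuffer loadModel(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r");
             FileChannel channel = file.getChannel()) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```

For an assets-based source, the uncompressed-asset requirement is typically satisfied with aaptOptions { noCompress "tflite" } in the android block of build.gradle.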
There’s one important thing to remember: the runInference method should always be called on the same thread when using GpuDelegate.
The interpreter is initialised lazily, and its outputs are generated during initialisation based on the outputs description.
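One way to guarantee the same-thread requirement (a sketch, not our exact implementation) is to funnel interpreter creation and every inference call through a single-threaded executor:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: a dedicated thread for all interpreter work, so that lazy
// initialisation and every inference call happen on the same thread.
public class InferenceThread {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Runs a task on the dedicated inference thread and waits for the result.
    public <T> T runOnInferenceThread(Callable<T> task)
            throws ExecutionException, InterruptedException {
        return executor.submit(task).get();
    }

    public void shutdown() {
        executor.shutdown();
    }
}
```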
There are 2 methods of processing data available in the Interpreter class from the TFLite API:
```java
void run(Object input, Object output)

void runForMultipleInputsOutputs(
        @NonNull Object[] inputs,
        @NonNull Map<Integer, Object> outputs
)
```
It’s not very clear how to use it, is it?
The first method can be used when you have just one input and one output. That’s not our case, as we had around 20 outputs. We had so many outputs because our model was assessing different light conditions, facial parameters and the probability of each possible gesture. So, we used the second method (see the TFLiteRecognizer class above).
The input will be the ByteBuffer returned from the InputDataConverter, but since the first parameter is an array of inputs, we have to pass a single-element array containing this buffer.
The outputs were generated during the initialisation stage. To simplify working with outputs I created a description for each of them (output name, size, and type). Here’s how it looked in the code:
The number of outputs was reduced for illustration purposes
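A sketch of how such descriptions can drive output allocation is shown below. The names, indices and sizes are illustrative, not our real model's outputs:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: describe each model output once, then generate the buffers that
// runForMultipleInputsOutputs expects. Names and sizes are illustrative.
public class OutputDescriptions {

    public static final class OutputDescription {
        public final int index;   // output tensor index in the model
        public final String name; // human-readable name
        public final int size;    // number of float values in the output
        public OutputDescription(int index, String name, int size) {
            this.index = index;
            this.name = name;
            this.size = size;
        }
    }

    private static final OutputDescription[] DESCRIPTIONS = {
            new OutputDescription(0, "good_light", 1),
            new OutputDescription(1, "face_visible", 1),
            new OutputDescription(2, "gesture_probabilities", 10),
    };

    // Builds the Map<Integer, Object> that the interpreter fills during inference.
    public static Map<Integer, Object> generateOutputs() {
        Map<Integer, Object> outputs = new HashMap<>();
        for (OutputDescription d : DESCRIPTIONS) {
            outputs.put(d.index, new float[1][d.size]); // batch size of 1
        }
        return outputs;
    }
}
```

The single-element input array plus this map are then passed straight to runForMultipleInputsOutputs.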
At this point, we have descriptions of all outputs. Now we can implement the OutputsData interface and decide which outputs we need.
Using this approach, we can easily generate objects for all outputs and pass them to the interpreter. Later on, the generateTFLiteOutputs function will be used to match the generated data structures with the actual output indices.
Recently, a new experimental feature was added to TensorFlow Lite. You can now add metadata to your model and then generate a wrapper for inputs and outputs. This is going to make working with models much simpler.
Once the interpreter finishes its execution, we can move on to processing the outputs. For this purpose, there’s a postprocessor which reduces all the information extracted by the model to the state of the 3 lights on the screen. These lights correspond to good light conditions, a clearly visible face and the proper gesture. To calculate the final result, we take probabilities from the outputs (OutputsData) and compare them to predefined thresholds. Here’s the simplified code:
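The comparison can be sketched like this (the threshold values here are made up for illustration; the real ones were tuned on our data):

```java
// Sketch of the postprocessor: reduce model probabilities to the three
// on-screen lights. Threshold values are illustrative, not the tuned ones.
public class Postprocessor {
    private static final float LIGHT_THRESHOLD = 0.6f;
    private static final float FACE_THRESHOLD = 0.7f;
    private static final float GESTURE_THRESHOLD = 0.8f;

    public static final class Lights {
        public final boolean goodLight;
        public final boolean faceVisible;
        public final boolean gestureMatches;
        Lights(boolean goodLight, boolean faceVisible, boolean gestureMatches) {
            this.goodLight = goodLight;
            this.faceVisible = faceVisible;
            this.gestureMatches = gestureMatches;
        }
    }

    public static Lights process(float lightProbability,
                                 float faceProbability,
                                 float[] gestureProbabilities,
                                 int expectedGesture) {
        return new Lights(
                lightProbability > LIGHT_THRESHOLD,
                faceProbability > FACE_THRESHOLD,
                gestureProbabilities[expectedGesture] > GESTURE_THRESHOLD);
    }
}
```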
As a result of our work, users now get useful hints during the photo verification process, the percentage of successful verification attempts is increasing, and moderators have less manual work to do. The feature is currently rolled out to a limited set of high-end devices, and we’re planning to roll it out to the rest soon.
I hope the article was useful to you. If you want a deeper understanding of the subject, check out articles by our Lead Data Scientist, Laura Mitchell. Neural networks can suit many other applications on the client-side and there are production-ready tools for it. So, if you’re still not using neural networks, this is a good time to start!