Alex Moiseenko
5 min read · Dec 2, 2020


You may say: “Man, you’re mad, mobile devices are no good for training.” And you’re almost right. It makes no sense to train a network from scratch on a phone. However, personalizing a neural network for a particular customer on a particular device is a great thing. This is what Apple does when you set up a new iPhone: iOS asks you to say “Hey, Siri” a few times so it can update a neural network and recognize your voice better. I have the same task. In my word detector project I need to personalize a neural network for every customer, the same way Apple does.

Before I start talking about libs for training, I would like to say a little about the neural network architecture. In the previous article about GRU inference, I described my net: Conv1d -> BN -> ReLU -> GRU -> BN -> GRU -> TD Dense (time-distributed dense).

I found out that it fails on long words, even with a binary dataset. When I switched to LSTM, it performed much better on the binary dataset, but still not perfectly, since I marked only the end of the phrase. The NN doesn’t know where the phrase begins, so it could trigger on just the last half of the command. When I played with the data, I achieved better results, but it still sometimes triggered on the first part alone. It was hard to get the result I wanted, so I decided to move to a bidirectional LSTM or maybe GRU. BiLSTM should work better, but if BiGRU performs the same, there is no sense in carrying one more gate in the weights :)
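The "one more gate" remark can be made concrete by counting weights. Below is a rough sketch, assuming the textbook formulation with a single bias vector per gate (some frameworks add a second bias for GRU). The function names are illustrative only and do not come from any library.

```c
#include <assert.h>
#include <stddef.h>

/* Weights in one recurrent layer built from `gates` weight blocks. Each block
   has an input matrix (units x inputs), a recurrent matrix (units x units)
   and a bias (units). Assumption: one bias vector per block. */
static size_t recurrent_param_count(size_t gates, size_t inputs, size_t units) {
    assert(gates > 0 && units > 0);
    return gates * (units * inputs + units * units + units);
}

/* LSTM: input, forget and output gates plus the cell candidate = 4 blocks. */
size_t lstm_param_count(size_t inputs, size_t units) {
    return recurrent_param_count(4, inputs, units);
}

/* GRU: update and reset gates plus the candidate state = 3 blocks. */
size_t gru_param_count(size_t inputs, size_t units) {
    return recurrent_param_count(3, inputs, units);
}

/* A bidirectional wrapper runs two independent copies of the layer. */
size_t bidirectional_param_count(size_t per_direction) {
    return 2 * per_direction;
}
```

For example, with 16 inputs and 32 units this gives 6272 weights for an LSTM and 4704 for a GRU, so under these assumptions a BiGRU is a quarter smaller than a BiLSTM of the same width.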

Here is what we have for training on mobile devices at the moment.

Let’s start with iOS since it’s my favorite platform.

  1. Since iOS 11, we have the Metal Performance Shaders training graph. It seems like a great choice, but GRU doesn’t work in MPS. Also, since CPU inference is much faster for recurrent nets and Metal runs on the GPU, I am not sure it would train recurrent nets any faster than the CPU. And it is iOS/macOS only.
  2. Since iOS 14. Owwwhh. Apple did a great job on NN training in this update: they updated BNNS and added MLCompute, as well as Metal Performance Shaders Graph. But all these libs work on Apple devices only, and only starting from iOS 14, which is not very good. Also, all of them ignore my favorite GRU.

Thus, Apple’s solutions were not suitable for my project at all. Let’s move on to cross-platform open-source solutions.


I read articles about successfully building TensorFlow, this monster, for mobile devices. No surprise there, since TensorFlow has a C++ API. But the framework is too huge for a small mobile device. Potentially, I could play with CMake or Bazel to strip the source code down to only the things I need, but I decided to go another way. Also, in the previous article I found that TFLite is 5–6 times slower than my solution on Apple devices. Spoiler: I figured out why, keep reading.

Own solution

I decided to turn my iOS NN inference lib into a cross-platform NN lib for training and inference. It was a long and sometimes annoying process. Let me tell you some interesting things I found.


To port my lib to Android I needed to port the source code for basic operations and FFT. On Apple platforms all of this is handled by the Accelerate framework, so I needed to find an open-source equivalent. It turned out that finding free gold in the city center would be much simpler. Since I was targeting mobile devices, I decided to write the basic ops with ARM NEON intrinsics. I achieved almost the same performance as the Accelerate framework, except for matrix multiplication. Then I tried to find a C lib for matmul, since my source code was written in C. I knew about Eigen, but it is C++ and I wanted to use C only. I did not find any decent C lib for matmul. Maybe I didn’t search well enough, but then I looked at TF and PyTorch and found that both of them use Eigen under the hood. I stopped searching and just used Eigen. Now, because of Eigen, my default_ops file has a .cc extension and contains only around 10 lines of C++ code. Well, who cares if it works fine. But when I compared Eigen with Accelerate matmul, I found that Eigen is 7–9 times slower than Accelerate in release mode. How mad are the Apple engineers who optimized it so hard! This answers the question of why the TFLite model was 5–6 times slower than my solution based on Accelerate.
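To give a feel for what a "basic op" written with NEON intrinsics looks like, here is a minimal sketch of an element-wise vector add with a scalar fallback. The function name `vector_add` is hypothetical and is not taken from the library. The intrinsics themselves (`vld1q_f32`, `vaddq_f32`, `vst1q_f32`) come from `arm_neon.h`.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAS_NEON 1
#else
#define HAS_NEON 0
#endif

/* Element-wise c[i] = a[i] + b[i]. Hypothetical helper for illustration. */
void vector_add(const float *a, const float *b, float *c, size_t n) {
    assert(a && b && c);
    size_t i = 0;
#if HAS_NEON
    /* Process four floats per iteration in 128-bit NEON registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(c + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail on ARM, and the whole loop on other targets. */
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Element-wise ops like this map directly onto SIMD lanes. Matrix multiplication also needs cache blocking and careful register use to be fast, which is one plausible reason it is so much harder to match a tuned library there.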


To make sure that my gradient calculations are right, I compared my results with TensorFlow 2.3.0. At the end of development, I assembled a neural network from all the layers I had and polished everything until it matched TensorFlow. Now it works fine.
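A complementary check that needs no reference framework is to compare the analytical gradient against central finite differences. This is a different technique from the TensorFlow comparison described above. The sketch below does it for a tiny dense layer with a squared-error loss; the sizes, names and loss are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { N_IN = 3, N_OUT = 2 };

/* Loss of a tiny dense layer: 0.5 * sum_i (w_i . x + b_i - t_i)^2,
   with w stored row-major as N_OUT x N_IN. */
static double dense_loss(const double *w, const double *b,
                         const double *x, const double *t) {
    double loss = 0.0;
    for (int i = 0; i < N_OUT; ++i) {
        double y = b[i];
        for (int j = 0; j < N_IN; ++j) y += w[i * N_IN + j] * x[j];
        loss += 0.5 * (y - t[i]) * (y - t[i]);
    }
    return loss;
}

/* Analytical gradient: dL/dw[i][j] = (y_i - t_i) * x_j. */
static void dense_weight_grad(const double *w, const double *b,
                              const double *x, const double *t, double *dw) {
    for (int i = 0; i < N_OUT; ++i) {
        double y = b[i];
        for (int j = 0; j < N_IN; ++j) y += w[i * N_IN + j] * x[j];
        for (int j = 0; j < N_IN; ++j) dw[i * N_IN + j] = (y - t[i]) * x[j];
    }
}

/* Largest |analytical - numerical| over all weights (central differences).
   Each weight is perturbed in place and restored afterwards. */
double max_weight_grad_error(double *w, const double *b,
                             const double *x, const double *t, double eps) {
    assert(eps > 0.0);
    double dw[N_OUT * N_IN];
    dense_weight_grad(w, b, x, t, dw);
    double worst = 0.0;
    for (int k = 0; k < N_OUT * N_IN; ++k) {
        double saved = w[k];
        w[k] = saved + eps;
        double plus = dense_loss(w, b, x, t);
        w[k] = saved - eps;
        double minus = dense_loss(w, b, x, t);
        w[k] = saved;
        double diff = dw[k] - (plus - minus) / (2.0 * eps);
        if (diff < 0.0) diff = -diff;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

If the returned error is not tiny compared with the gradient values themselves, the backward pass most likely has a bug. The same loop works for recurrent layers, only the loss function gets more expensive to evaluate.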


Now NNToolkitCore can run both training and inference on mobile devices. It supports the following NN layers:

  • Conv1d;
  • BatchNorm;
  • RNN;
  • GRU;
  • LSTM;
  • Bidirectional;
  • Activation;
  • Dense;
  • TimeDistributedDense.


As I mentioned before, the library’s API is BNNS-like. Every layer has the following functions:

  • <Layer>CreateForInference, <Layer>CreateForTraininig (the latter allocates memory for the cache and accepts data with batch size > 1);
  • <Layer>ApplyInference, <Layer>ApplyTrainingBatch;
  • <Layer>CreateGradient, <Layer>CalculateGradient;
  • destroy functions.
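To show how such a create / apply / gradient / destroy lifecycle fits together, here is a toy sketch for an invented layer y = scale * x. Everything in it is hypothetical: the `ToyScale` layer, the structs and all signatures are made up for illustration and only mimic the naming pattern above. They are not NNToolkitCore's actual API, and the gradient struct is simply created by the caller instead of through a CreateGradient function.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy layer y = scale * x. All names here are hypothetical. */
typedef struct {
    float  scale;  /* the single trainable weight */
    int    size;   /* elements per sample */
    int    batch;  /* samples per training batch */
    float *cache;  /* inputs saved by the forward pass; NULL for inference */
} ToyScale;

typedef struct { float d_scale; } ToyScaleGradient;

ToyScale *ToyScaleCreateForInference(float scale, int size) {
    ToyScale *l = malloc(sizeof *l);
    if (!l) return NULL;
    l->scale = scale; l->size = size; l->batch = 1; l->cache = NULL;
    return l;
}

/* The training variant also allocates a cache for a whole batch. */
ToyScale *ToyScaleCreateForTraining(float scale, int size, int batch) {
    ToyScale *l = ToyScaleCreateForInference(scale, size);
    if (!l) return NULL;
    l->batch = batch;
    l->cache = malloc(sizeof(float) * (size_t)size * (size_t)batch);
    if (!l->cache) { free(l); return NULL; }
    return l;
}

void ToyScaleApplyInference(const ToyScale *l, const float *in, float *out) {
    for (int i = 0; i < l->size; ++i) out[i] = l->scale * in[i];
}

/* Forward pass over a batch; remembers the inputs for the backward pass. */
void ToyScaleApplyTrainingBatch(ToyScale *l, const float *in, float *out) {
    assert(l->cache != NULL);
    int n = l->size * l->batch;
    for (int i = 0; i < n; ++i) { l->cache[i] = in[i]; out[i] = l->scale * in[i]; }
}

/* d_out is dLoss/dOutput; d_in receives dLoss/dInput. */
void ToyScaleCalculateGradient(const ToyScale *l, ToyScaleGradient *g,
                               const float *d_out, float *d_in) {
    assert(l->cache != NULL);
    int n = l->size * l->batch;
    g->d_scale = 0.0f;
    for (int i = 0; i < n; ++i) {
        g->d_scale += d_out[i] * l->cache[i];
        d_in[i] = d_out[i] * l->scale;
    }
}

void ToyScaleDestroy(ToyScale *l) {
    if (!l) return;
    free(l->cache);
    free(l);
}
```

The point of splitting creation into inference and training variants is that an inference-only layer never pays for the cache memory, which matters on a phone.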


What the library gives you:

  • Cross-platform. It works on any iOS or Android device;
  • Open source. Feel free to contribute and make it better;
  • Tiny. It’s really small since it’s written in C;
  • ARM NEON acceleration. Maybe I will support AVX in the future, or you can help me with this;
  • Great performance on Apple devices thanks to the Accelerate framework.


My library is suitable only for time-series data right now. I don’t have CV layers like conv2d, pooling, etc., but it’s better to run such layers on the GPU anyway. Also, it would not be a problem to add them in the future. Feel free to contribute to my library and share it with friends. If you have any questions, you can drop me an email via GitHub.