Snips Open Sources Tract
A Rust Neural Network library for the Edge
While TensorFlow and, to a lesser extent, PyTorch dominate the ecosystem of neural network training solutions, the landscape for inference engines on tiny devices, such as mobile and IoT, is still pretty much open. TensorFlow teams are pushing TensorFlow Lite, and Microsoft ONNX runtime seems like the perfect complement to PyTorch. Hardware vendors are pushing their own solutions, be it Android NN API, ARM NN SDK or Apple BNNS.
These libraries are not really interoperable, and have limited scopes: most of them, TensorFlow Lite included, only support a small subset of TensorFlow’s operator set. This means there is no clear go-to solution for an engineering team when their machine learning colleagues hand them a neural network to run on-device. On the other hand, machine learning teams will often have to pick a network architecture to satisfy the available runtime of a given target. Mimicking the interoperability efforts done by the software industry over the past decades, we think it is time to aim for a “train once, infer anywhere” approach.
This ambition led us, nearly two years ago, to develop our own solution for embedded inference. It has been used in production for over a year as part of our Wake word engine, and it enables our machine learning team to freely explore new families of networks with the confidence that we would be able to bring these networks to production.
Even in today’s context, reverting to an off-the-shelf inference library would mean sacrificing either modelling creativity, performance, or portability. We believe the embedded neural network inference landscape still needs to progress and converge, which is why we decided to contribute to the collective effort by open sourcing our solution. In this post, we describe the library and its performance, and provide high-level motivation for the implementation choices we made to optimize for an application field that is close to our hearts: voice processing.
One of the major hurdles we encounter at Snips is cross compiling our solution for all of our targets. This includes small single-board computers designed for hobbyists like the Raspberry Pi Zero, 2 and 3, as well as industrial ones like NXP’s i.MX7 and i.MX8, Samsung’s Artik platforms, NVIDIA’s Jetson TX2, etc. The two main mobile platforms, iOS and Android, are also natural targets. This means we support ARMv6, ARMv7 and ARMv8 hardware platforms, on three major operating systems: GNU/Linux, iOS, and Android. We also support the regular non-IoT Intel systems, so that developers can work comfortably.
Snips Flow, our voice platform, includes a Wake word engine. This engine listens continuously, and triggers the speech recognition component when it hears the user say the Wake word (“Hey Snips”, “Alexa”, etc.). The Wake word detector relies on a neural network that our machine learning team trains using TensorFlow. Two years ago, before TensorFlow Lite came into existence, the natural approach we tried was to embed TensorFlow as a library to execute our models on device.
However, TensorFlow is a big, complex framework, and, nearly two years ago, cross-compiling it as a library, to Android specifically, gave us a lot of pain. So much so that we realized we needed a plan B.
At Snips, we chose Rust as the main language to develop the Snips Flow platform. Our team got used to the comfort brought by working with a modern software environment where cross-compiling is not an obscure afterthought but a solid design principle. So we gave it a try. We used the protobuf library to parse the TensorFlow format, ndarray to operate on tensors, and implemented a handful of operators that were part of our first Wake word models. This was the birth of Tract.
While we finally managed to get TensorFlow working on all our target platforms, the sheer size of it motivated us to actually switch to shipping Tract instead of TensorFlow a few months later.
While the origin story may sound anecdotal, the reasons behind it are not. As of today, the continuous integration infrastructure of the Snips platform targets dozens of operating systems and hardware combinations, with more added every month. This leads to some strong guiding principles in the evolution of Tract:
- it is written in Rust
- it is trivial to cross compile
- it is free of external dependencies: instead of linking computing libraries, like BLAS, we integrated the small bits that we actually use (which in turn gave us a few opportunities for optimizing even more)
- at runtime, Tract detects the device it is running on in order to adapt and optimize its performance appropriately. This allows us to ship the same binary for ARMv6 and ARMv7 devices without sacrificing performance on the smarter chips.
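The last principle, picking an implementation at runtime rather than at compile time, can be illustrated with a minimal sketch. This is not Tract's actual code: the kernel names and the `has_wide_simd` helper are hypothetical, and the sketch uses the standard library's x86 and AArch64 feature-detection macros as stand-ins, since it does not attempt the ARMv6/ARMv7 detection described above.

```rust
// Signature shared by all dot-product kernels in this sketch (hypothetical).
type DotKernel = fn(&[f32], &[f32]) -> f32;

// Portable fallback: one multiply-add per iteration.
fn dot_generic(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Variant with four independent accumulators, which a compiler can map
// onto SIMD lanes more easily on chips that have them.
fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let chunks = n / 4;
    let mut acc = [0.0f32; 4];
    for i in 0..chunks {
        for lane in 0..4 {
            acc[lane] += a[4 * i + lane] * b[4 * i + lane];
        }
    }
    let mut total: f32 = acc.iter().sum();
    // Handle the tail that does not fill a group of four.
    for i in 4 * chunks..n {
        total += a[i] * b[i];
    }
    total
}

// Runtime probe, one definition per architecture (assumption: AVX2 and NEON
// are used here only as examples of "wide SIMD is available").
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn has_wide_simd() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

#[cfg(target_arch = "aarch64")]
fn has_wide_simd() -> bool {
    std::arch::is_aarch64_feature_detected!("neon")
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
fn has_wide_simd() -> bool {
    false
}

// Choose the kernel once; callers then use the returned function pointer.
fn select_kernel() -> DotKernel {
    if has_wide_simd() {
        dot_unrolled
    } else {
        dot_generic
    }
}
```

Calling `select_kernel()` once at startup and storing the returned pointer keeps the detection cost out of the inner loop, and the same binary produces the same results on every chip, only at different speeds.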
Today, some of these principles are actually vindicated by TensorFlow Lite. The latter is much easier to cross-compile than TensorFlow, and also takes the form of a static library free of external dependencies.
Snips is focused on voice assistant technology. Neural networks are the natural solution for several tasks involved: Wake word detection, recognition of voice commands, speaker identification (finding out who is speaking) and acoustic modelling (translating sounds to phones or words, as part of a speech-to-text engine).
All of these tasks have to be performed in the context of interactive sessions with a human user: time is a central element of the problem description, and minimizing latency is key in making the interaction comfortable.
Wake word detection, for instance, has to happen “live”. Such a detector is always-on: there is no natural end to the input signal that it should wait for before processing the entire signal. In other words, the detection needs to happen in a streaming fashion, which means the engine needs to make a decision at every step in time based on the signal captured in the immediate past. This puts hard constraints on the neural network architecture, forbidding, for instance, reductions over the entire time axis.
This streaming constraint pushes the inference frameworks to their limits: most of the popular frameworks are designed with image classification in mind. They only work naturally over the entire signal, and become very awkward to work with when streaming is required, if they work at all. Kaldi — probably the most popular open-source speech-to-text framework — is a notable exception: its neural network inference engine is designed around streaming.
In TensorFlow, network graphs are “frozen” at train time, and contain a series of training-related idiosyncrasies. They must be optimized to run inference efficiently. Tract transforms networks after loading them, first by decluttering the network of those idiosyncrasies, then by adapting the network to the runtime environment. While TensorFlow Lite does this during a network translation stage, we chose to do it just before runtime, to be able to perform machine-dependent tweaks.
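As a rough illustration of what a decluttering pass can look like, the sketch below removes no-op `Identity` nodes from a toy graph and rewires their consumers. The `Op`, `Node` and `declutter` names are hypothetical and the graph representation is deliberately simplistic; Tract's own internal structures and passes are not shown here.

```rust
// Toy operator set for the sketch (hypothetical, not Tract's).
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Input,
    Identity, // a no-op of the kind exported graphs often contain
    Conv,
    Relu,
}

#[derive(Debug, Clone, PartialEq)]
struct Node {
    name: String,
    op: Op,
    inputs: Vec<usize>, // indices of the producer nodes
}

// Small constructor to keep graph literals readable.
fn node(name: &str, op: Op, inputs: Vec<usize>) -> Node {
    Node { name: name.to_string(), op, inputs }
}

// Returns a new graph without Identity nodes.
// Assumes nodes are in topological order and every Identity has one input.
fn declutter(nodes: &[Node]) -> Vec<Node> {
    // remap[i] is the index, in the new graph, that stands for old node i.
    let mut remap: Vec<usize> = Vec::with_capacity(nodes.len());
    let mut out: Vec<Node> = Vec::new();
    for n in nodes {
        if n.op == Op::Identity {
            // An Identity simply forwards its input: point at the producer.
            let target = remap[n.inputs[0]];
            remap.push(target);
        } else {
            let inputs: Vec<usize> = n.inputs.iter().map(|&i| remap[i]).collect();
            remap.push(out.len());
            out.push(Node { name: n.name.clone(), op: n.op.clone(), inputs });
        }
    }
    out
}
```

Running such a pass after loading, rather than in an offline conversion step, leaves room for a second, machine-dependent pass on the cleaned graph.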
One of the critical transformations Tract performs is translating convolutional models to a streaming form. Indeed, our preferred architecture for Wake word detection and user identification consists of stacks of convolutional layers, in the same fashion as ResNet and WaveNet. While recurrent networks are relatively natural to run in a streaming fashion, convolutional networks need a bit more work: a lot of operations can be skipped by implementing caching around each convolution operator.
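To make the caching idea concrete, here is a minimal sketch of a stateful, single-channel 1D convolution fed one sample at a time. It is an illustration under simplifying assumptions (stride 1, no padding, no dilation, one channel), not Tract's implementation; `StreamingConv1d` and `conv1d_full` are hypothetical names.

```rust
use std::collections::VecDeque;

// Stateful convolution: keeps only the samples the next output will need.
struct StreamingConv1d {
    kernel: Vec<f32>,
    history: VecDeque<f32>, // cached inputs from previous steps
}

impl StreamingConv1d {
    fn new(kernel: Vec<f32>) -> Self {
        assert!(!kernel.is_empty());
        let history = VecDeque::with_capacity(kernel.len());
        StreamingConv1d { kernel, history }
    }

    // Feed one sample; returns an output once enough history is cached.
    fn push(&mut self, sample: f32) -> Option<f32> {
        self.history.push_back(sample);
        if self.history.len() < self.kernel.len() {
            return None; // still warming up
        }
        let out: f32 = self
            .history
            .iter()
            .zip(&self.kernel)
            .map(|(x, w)| x * w)
            .sum();
        self.history.pop_front(); // drop the sample no later output needs
        Some(out)
    }
}

// Reference: the same convolution computed over the whole signal at once.
fn conv1d_full(signal: &[f32], kernel: &[f32]) -> Vec<f32> {
    signal
        .windows(kernel.len())
        .map(|w| w.iter().zip(kernel).map(|(x, k)| x * k).sum::<f32>())
        .collect()
}
```

With one such cache per layer in a stack, each new input sample triggers at most one new output per layer, instead of a recomputation over the whole receptive field.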
Convolutional networks transformed naively to frame-by-frame streaming stateful networks would suffer from operation and data dispersion: a lot of flow control logic and data access work would have to happen for each incoming audio frame while performing a relatively modest number of useful computing operations. To run efficiently, it is necessary to re-introduce a grouping of frames, which we nicknamed pulses. By processing a reasonable number of frames together, on the order of 8 or 16, we amortize the flow control overhead. Additionally, this provides vectorization opportunities to optimize the execution even further.
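The pulse idea can be sketched in the same simplified setting (single channel, stride 1, no padding). The hypothetical `PulsedConv1d` below accepts a whole pulse of samples per call, carries the tail it needs over to the next call, and yields the same outputs as a convolution over the full signal. It is a sketch of the principle, not Tract's code.

```rust
// Stateful convolution processing a group of frames (a pulse) per call.
struct PulsedConv1d {
    kernel: Vec<f32>,
    carry: Vec<f32>, // tail of earlier input, at most kernel.len() - 1 samples
}

impl PulsedConv1d {
    fn new(kernel: Vec<f32>) -> Self {
        assert!(!kernel.is_empty());
        PulsedConv1d { kernel, carry: Vec::new() }
    }

    // Consume one pulse and return every output that is now computable.
    fn push_pulse(&mut self, pulse: &[f32]) -> Vec<f32> {
        let k = self.kernel.len();
        // Prepend the cached tail so windows can span the pulse boundary.
        let mut buffer = std::mem::take(&mut self.carry);
        buffer.extend_from_slice(pulse);
        // One tight loop over the whole pulse: the per-call overhead is
        // paid once per pulse instead of once per frame.
        let outputs: Vec<f32> = buffer
            .windows(k)
            .map(|w| w.iter().zip(&self.kernel).map(|(x, c)| x * c).sum::<f32>())
            .collect();
        // Keep only the samples that future windows still need.
        let keep = buffer.len().min(k - 1);
        self.carry = buffer[buffer.len() - keep..].to_vec();
        outputs
    }
}
```

Feeding a signal in chunks of 8 with `signal.chunks(8)` gives the same outputs as processing it in one piece, and the inner loop over a pulse is a natural candidate for vectorization.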
Once networks run through these transformations, they can be fed in a streaming fashion and output decisions in real time. While we still consider it experimental, this “streaming and pulsing” transformation is applicable to all convolutional networks. We have successfully used it to run our WaveNet Wake word networks and ResNet speaker identification ones. Acoustic models will be next.
Because they need to efficiently reuse cached data from previous iterations, our main use cases do not fit well in regular frameworks. But rather than solely working on voice-related applications, we felt we could learn a lot from experimenting with more common use cases, in order to get a complementary assessment of the performance of Tract itself. We chose to implement enough operators to make a few popular models run, in order to get a fair comparison of the intrinsic performance of Tract against TensorFlow Lite. Obviously, supporting these operators also widens the field for possible third-party use cases of the library. Today, for instance, Tract can run unmodified pretrained Inception v3, MobileNet, and the acoustic model from DeepSpeech.
We also implemented about 85% of ONNX, initially to take advantage of its extensive test-suite that helps us cover the internal operations it shares with TensorFlow. But now, as a “free” side effect, our machine learning teams can choose to work with TensorFlow or PyTorch on a per-project basis.
On top of ONNX and TensorFlow, we are also considering new importers: we are specifically interested in importing the Kaldi format, as our speech-to-text engine uses it. We are also paying close attention to the NNEF format: despite a large support from hardware and software vendors and a very elegant specification, it has received very little attention from the machine learning community.
How good is Tract?
There are several answers to this question.
The first angle is “will Tract run this network?” Supporting TensorFlow’s immense operator set is not, and will probably never be, a goal for Tract. On the TensorFlow front, we implement missing operators on a per-application basis. As of today, Inception v3, several ARM Keyword spotting networks, Mozilla’s DeepSpeech, MobileNet networks and others run. Adding operations is not difficult, or at least not much more so than implementing the computation itself.
On the ONNX front, things are a bit simpler: the operator set has a reasonable size, and comes with a test-suite. We cover about 85% of version 9 of the operator set. Recurrent operators are coming soon, and the last remaining flow control features are also on the roadmap. 100% coverage of ONNX is an objective for Tract.
Beyond the perimeter of supported operators, comes the question of performance. Tract does not try to compete on the big-hardware side of things: Snips’ main interest lies in running neural networks on small devices. For instance, we have no support for GPU acceleration but we want to be as efficient as we can on ARM CPUs. We pay attention to devices ranging from ARMv5 or v6 with VFP (like a Raspberry Pi Zero) to the bigger ARMv8 with Extended SIMD instruction sets that can be found in most modern smart phones and may become the workhorse of the IoT industry. We also keep an eye on Intel’s CPUs as it is the native environment for most developers.
On the chart below, we compare Tract and TensorFlow Lite’s performance on two neural network architectures that are supported by both libraries. The first one is an early version of the Snips Wake word detector, which we’ll call v1 for the sake of clarity; it relies on a 1D CNN architecture. We have since switched to the convolutional networks that we prefer today, which aren’t supported by TensorFlow Lite. The second architecture we consider here is a 2D CNN network from the “Hello Edge” paper by ARM.
On these two examples, Tract outperforms TensorFlow Lite on all devices considered. The gain is bigger on the two pre-NEON ARM devices: we have found that TensorFlow Lite, like many other players, is neglecting these older and smaller chip variants.
As a matter of fact, Tract also outperforms TensorFlow Lite on these pre-NEON targets for other challenging tasks like image classification. For example, Tract brings a 2.5x speedup on the Inception network on Raspberry Pi Zero. TensorFlow Lite’s performance on this network is harder to match on bigger devices.
In the figure below, we show the performance of Tract on all kinds of voice processing applications (user identification, Wake word, voice commands), including neural network architectures not supported by TensorFlow Lite. This figure shows that a lot can be done on the edge, on tiny devices, with CPU to spare. Gaining in efficiency is like a double-edged sword with only good sides: between the two iterations of our Wake word model, we took advantage of the extra efficiency gained in Tract to run a significantly bigger and more accurate neural network architecture.
Yet another library?
Today, there are comparatively more libraries available off-the-shelf than when we started working on Tract two years ago. Google, ARM, Apple and Microsoft are actively developing their own offerings. While there is overlap between them, each one has its own angle, or operator subset covered, and it does not look like any will be an obvious go-to solution anytime soon. Each one has a bias towards the type of applications and the network architectures each institution is primarily exploring.
Tract brings specific attention to voice processing applications like Wake word detection, voice commands, speaker identification, and soon acoustic modelling, on low- to mid-end CPUs. Anyone working in these areas or other streaming-based applications may be interested in looking at Tract.
As an Artificial Intelligence company, we made the choice to be in control of the software we are deploying in our users’ environment, be it a home, office or factory. Concretely, instead of developing and maintaining Tract, we could have hard-coded every new architecture we adopted, which would probably have discouraged us from exploring new architectures. We could also have made massive contributions to TensorFlow Lite to extend its perimeter. For such a critical piece of the stack, our engineering team feels better owning the neural network engine, controlling how it is built, and keeping its performance and size in check.
In future posts, we’ll be diving deeper into some experiments and optimizations we explored with Tract, what worked, and what didn’t. Follow us to keep informed.
The point of Tract is not to directly challenge TensorFlow or PyTorch as a generic go-to solution, but we did learn a lot in the process. Moreover, Tract’s performance is significantly better than that of TensorFlow Lite for the networks we deploy in production. We now have an angle on the problem of running deep models on device, we have built a tremendous amount of experience, we have found a few issues and a few original solutions. We are also open sourcing Tract to support the global effort of moving inference and in particular voice processing on device. We are hoping that sharing code, experience and thoughts will contribute to the landscape converging to more compatibility, and more consistent choices. Only then will we be able to “train once, run anywhere”.
Lastly, we are now making use of this experience to prepare for the next big wave that is coming to the IoT landscape: Neural Processing Units. Semiconductor manufacturers are now bringing dedicated chips that are specialized to run deep learning operations. These chips will unleash the potential of what computations can be run on the edge. They will significantly contribute to edge computing becoming even more of a reality. They will naturally bring their share of new questions on what paradigms to follow, what operators and architectures to prioritize, how to efficiently serve all application fields, from vision to voice, etc. This is only the beginning.