Adventures in Deploying a Deep Learning Model in the Browser

Kyle McIntyre
The Quiq Blog

--

Last year we embarked on a project to develop a predictive typing feature for conversational business messaging. This article doesn’t focus on the feature or model itself but rather on our experiences deploying it in the browser. We discuss our reasons for attempting a browser-based deployment, our selection of a browser runtime for deep learning models, and some lessons learned about how much of an ML feature should live in the model versus the application.

Background

A predictive typing model (one that helps you complete your words and thoughts via tab-complete) is inherently high-scale. At Quiq, we often think of our scaling factors in rough terms: does it scale with the number of logged-in users, conversations or messages? In this case, the scaling factor is at the sub-message level. We call it typing scale. In addition to being very high scale, the model must also exhibit low latency. If the latency is too high, we won’t be able to get suggestions in front of our users’ eyes before they go ahead and complete their own words and thoughts.

Our web-based software is often used within contact centers, some of which are operated overseas. We’re excited to see where edge computing is headed, especially as it relates to ML, but we take it as a given that some of our users will have significant network latency to our servers. This, coupled with concerns about our cloud computing bill, led us to consider deploying the model in the browser.

We had plenty of concerns about pursuing a browser-based approach as well:

  • Will there be a myriad of browser compatibility issues/unsupported browsers?
  • Will we tank the performance of the app hosting the model?
  • Will the latency saved by avoiding the network be lost due to lower compute power on client machines?
  • How much will we have to weaken the model accuracy to make it viable client-side?
  • How much code will have to be ported to Javascript and dual-maintained?

We were confident, however, that a browser-based approach would be operationally cheaper. Moreover, we weren’t technologically prepared to host the model at scale server-side even if we had been willing to pay the compute costs and accept the network latency. So, with significant work required in either direction, we decided to give the browser-based approach a shot.

Our starting point was a Python/TensorFlow 2 training pipeline that was successfully producing TF SavedModels that did a pretty good job at predicting text.

Finding a Deep Learning Browser Runtime

Our first attempt to serve the model in the browser centered around ONNX (Open Neural Network Exchange). We liked the idea of being able to deploy models trained in various deep learning frameworks onto ONNX as a common runtime based on an open standard. But we encountered a significant problem: at least as of late 2021, the ONNX browser runtime project seemed incomplete and inactive, and there were a number of critical ops it didn’t support. So while we were able to get the TF SavedModel → ONNX converter working just fine, the ONNX browser runtime couldn’t run the result. ONNX seems like a great project, but its browser runtime simply wasn’t viable for us.

Next we turned to what seemed like the only other real possibility: TensorFlowJS (TFJS). We were initially under the impression that TFJS was tightly coupled with TensorFlow but we have since learned that TFJS is a standalone deep learning framework. Even if you train in another framework like TensorFlow or PyTorch, you can generally convert your saved models to TFJS’s model format — similar to what we were hoping to achieve with ONNX. So far so good!

TFJS supports several ‘backends’: runtimes capable of executing your TFJS model, some of which work in web browsers. The browser options are WebGL, WASM (with optional SIMD and SIMD+threads support) and plain CPU. WebGL is currently the recommended runtime and was the top performer in our model benchmarks throughout our development lifecycle. WASM was decent, especially with SIMD and threading support, and CPU was glacially slow, as expected. WebGL also had sufficient browser support for our purposes.
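For reference, here’s a minimal sketch of how selecting a backend can look, assuming the standard @tensorflow/tfjs and @tensorflow/tfjs-backend-wasm packages (our actual initialization code is a bit more involved):

```javascript
import * as tf from '@tensorflow/tfjs';
// Registers the 'wasm' backend; the .wasm binaries must also be served
// (see setWasmPaths in @tensorflow/tfjs-backend-wasm if they live elsewhere).
import '@tensorflow/tfjs-backend-wasm';

// Try the recommended WebGL backend first, then fall back to WASM, then CPU.
async function initBackend() {
  for (const backend of ['webgl', 'wasm', 'cpu']) {
    try {
      if (await tf.setBackend(backend)) {
        await tf.ready();        // wait for the backend to finish initializing
        return tf.getBackend();  // e.g. 'webgl'
      }
    } catch (e) {
      // This backend isn't available on this browser/device; try the next one.
    }
  }
  throw new Error('No usable TFJS backend found');
}
```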

We started with a basic model trained using the Keras Layers API. We were able to convert the model to the TFJS format and run it with minimal effort! This was a great start to our project. However, our full model took more effort to convert.
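As a rough illustration, loading and invoking a converted Layers model looks something like this (the URL and input shape here are placeholders, not our real ones):

```javascript
import * as tf from '@tensorflow/tfjs';

// Load a model converted with tensorflowjs_converter (model.json + weight shards)
// and run a single prediction.
async function demo() {
  const model = await tf.loadLayersModel('/models/predictive-text/model.json');

  const input = tf.zeros([1, 32]);    // hypothetical input shape
  const output = model.predict(input);

  const probs = await output.data();  // pull the result back from the GPU
  console.log('next-token distribution size:', probs.length);

  input.dispose();
  output.dispose();
}
```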

Being Compatible

TFJS doesn’t support everything that can easily wind up in your TF SavedModel. In particular, we realized that TFJS didn’t support any of the text processing we were doing in our TF program. Here’s what we had to reimplement outside the model:

  • Routines from tensorflow_text for normalization & tokenization
  • Preprocessing layers from keras.experimental.preprocessing for things like vocabulary encodings

The realization that we’d be handling all text preprocessing ourselves, outside the framework, made us more cautious about fancier text tokenization schemes, since we’d likely have to port them to Javascript. Our mantra became ‘code it up in Python, and if it significantly helps prediction accuracy we’ll investigate porting it’.
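To give a flavor of that port, here’s a deliberately simplified sketch of the kind of normalization, tokenization and vocabulary encoding we ended up owning in Javascript. The real versions are more involved, and the names and details below are illustrative rather than our production code:

```javascript
const OOV_ID = 1;  // reserved id for out-of-vocabulary tokens

// Mirror the normalization the Python training pipeline performs.
function normalize(text) {
  return text.toLowerCase().normalize('NFKC').replace(/\s+/g, ' ').trim();
}

// Simple whitespace tokenization; a fancier scheme would need porting too.
function tokenize(text) {
  return normalize(text).split(' ').filter((t) => t.length > 0);
}

// Map tokens to integer ids using a vocabulary exported alongside the model.
function encode(tokens, vocab) {
  return tokens.map((t) => vocab.get(t) ?? OOV_ID);
}
```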

It should be noted that TFJS supports custom Ops/Kernels that might facilitate keeping your preprocessing in the model. But we needed to do a port either way and decided to skip learning how to do custom ops. We later realized this was a stroke of luck because of op scheduling overhead, which we’ll discuss later in the post.

The Thrill of Victory, the Agony of Defeat

It was very exciting when we first saw our full model run in the browser. It WORKED! And its outputs matched the Python version exactly. Unfortunately, it was running at the approximate speed of a snail: things were taking 10-20x longer than we had expected based on estimates from our simple Keras/Layers model. Both models were executing on the recommended WebGL backend for TFJS and had roughly similar network complexity.

At this point we learned a bit more about how the WebGL backend works: it actually translates your TFJS model operations into GLSL shader programs and executes them on your graphics device. This allows it to leverage any available hardware acceleration and is what enables it to perform heavy operations faster.

Translating a TFJS model into shader programs is fairly slow: it took 1–2 seconds for our model on our machines. Consequently, TFJS maintains a cache of compiled shader programs keyed on the ‘method signatures’ of your model and its ops. The cause of our big performance problem turned out to be our use of inputs and outputs with dynamic-length dimensions, which continually invalidated that signature-based cache.

To work around this, we converted our dynamic-length inputs to a fixed, sufficiently large length and padded them. At that point our performance was roughly where we had predicted it would be. However, the more we played with the feature in the app on various hardware, the more it seemed like it still wasn’t fast enough. We also realized we needed more control over the model as it executed.
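Here’s roughly what the fixed-length-plus-padding workaround looks like, plus an optional warm-up call so the shader compilation cost is paid at load time rather than on the user’s first keystroke (the fixed length and padding id are hypothetical):

```javascript
import * as tf from '@tensorflow/tfjs';

const MAX_TOKENS = 64;  // hypothetical fixed input length
const PAD_ID = 0;

// Pad (or truncate) token ids to a fixed length so every model call has the same
// input shape and reuses the same compiled WebGL shader programs.
function toFixedLength(tokenIds) {
  const padded = tokenIds.slice(0, MAX_TOKENS);
  while (padded.length < MAX_TOKENS) {
    padded.push(PAD_ID);
  }
  return tf.tensor2d([padded], [1, MAX_TOKENS], 'int32');
}

// Run one throwaway prediction at load time so the 1-2 second shader compilation
// doesn't happen while the user is typing.
async function warmUp(model) {
  const dummy = toFixedLength([]);
  const out = model.predict(dummy);
  await out.data();
  dummy.dispose();
  out.dispose();
}
```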

Deploy a Network, Not a Program

At this point, all of the text preprocessing was done outside the model, but the model took care of everything else. What do we mean by everything else? We modeled the predictive text problem as a sequential word prediction task. The basic steps are as follows:

  1. Tokenize the text typed so far
  2. Run each token sequentially through an autoregressive model to update its hidden state
  3. Now you’re in the ‘present’, i.e. where the typing cursor is
  4. Begin predicting tokens in the future — greedy & beam search variants

Steps 2 and 4 have loops and other control flow in them. Where should those loops live? TFJS supports control flow, so we had a choice: put them inside the model, or outside the model in Javascript. Once again, we tried putting it all in the model to save porting the logic and, hopefully, to run faster. This approach works, but it is unfortunately slow.

As hinted at in the last section, TFJS models are based on a graph representation that can express not only deep learning networks but also general programs. The graph is a graph of ops: operations to execute. It is our understanding that most (if not all) of the TFJS backends, and especially WebGL, incur a latency cost (on the order of milliseconds) to schedule and execute an op, regardless of how heavy or lightweight the op is.

This makes a lot of sense in the context of the WebGL backend: it executes your models’ ops asynchronously as graphics shader programs on your graphics device. This requires OS-level scheduling and coordination. So even though WebGL can execute heavy math stuff pretty fast by leveraging any hardware acceleration you might have, it can be slow if your model is composed of lots of tiny ops. It’s for this reason that we took the loops and other control flow described in steps 2 & 4 out of the model and implemented them in Javascript. Those basic loops and conditions were really slowing us down! That’s one reason we think TFJS should be used to ‘deploy a network, not a program’.

Aside from performance, it’s also advantageous to move control flow outside of the model so that you have greater control over things at runtime. You may want to introspect on time remaining or application state so that you can potentially interrupt and return early. This is especially true in sequential problems. It’s possible to perform such introspection inside of your TFJS model if you pass enough data in, but it will likely be harder to implement and slower to execute.

By the time we released the feature, the only stuff left in the TFJS model was a single time-step/iteration of the purely numeric deep learning network. No text preprocessing or postprocessing. No search, no loops. The sequential nature of the problem was tackled in Javascript.
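As a sketch (with hypothetical input and output node names and shapes), the Javascript-side greedy loop around such a single-step graph model looks roughly like this:

```javascript
import * as tf from '@tensorflow/tfjs';

// The model computes one time-step: given the previous token and hidden state,
// it returns logits over the vocabulary and an updated state.
async function greedyDecode(model, startToken, initialState, maxSteps, deadlineMs) {
  const startTime = performance.now();
  const result = [];
  let token = startToken;
  let state = initialState;

  for (let i = 0; i < maxSteps; i++) {
    // Bail out if we've used up the time budget; a late suggestion is useless.
    if (performance.now() - startTime > deadlineMs) break;

    const tokenTensor = tf.tensor2d([[token]], [1, 1], 'int32');
    const [logits, nextState] = model.execute(
      { token: tokenTensor, state },  // hypothetical input node names
      ['logits', 'next_state']        // hypothetical output node names
    );
    tokenTensor.dispose();

    const best = logits.argMax(-1);
    token = (await best.data())[0];
    best.dispose();
    logits.dispose();

    if (state !== initialState) state.dispose();
    state = nextState;

    result.push(token);
  }

  if (state !== initialState) state.dispose();
  return result;
}
```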

Although the ‘final’ version didn’t look much like what we had originally envisioned, we were happy with the result. In addition to getting our performance to an acceptable level, invoking the model in a less monolithic way provided us with much greater control and ability to optimize. For example, we could now easily add caching and loop-breaking timeouts. By using asynchronous Javascript we were able to ‘throttle’ model execution more easily so as to not impact rendering in the main app. The single iteration of the numeric deep learning network was the building block we needed TFJS to tackle for us; everything else worked better outside TFJS.

You’re in the Browser — Now What?

You’ve got a compatible model that’s a network and not a full-fledged program. You’ve got the surrounding Javascript to make it all work. What’s next? Well, there are other concerns associated with hosting the model in the browser that you need to tackle if you’re going to deliver a production-ready feature.

With any interactive client application it’s important to tune your code with an eye toward performance. After all, even the best predictive model isn’t going to be useful if running it ruins the user experience. One of the first things we knew we had to address was delivering the model assets to the browser promptly, so the model could be up and running in time to be useful. We decided to cache the assets in the browser in order to save the network round-trip time as well as keep bandwidth free for the other network calls our application requires.

Although the model assets are too large for browser local storage, TFJS has a pre-built adapter for the IndexedDB API, whose larger storage limits allow us to save and load the model artifacts on the client side. With a little extra work to store our additional client and vocabulary data alongside the model, we now only incur the download cost once per model, when the user first turns on the feature (assuming the browser doesn’t delete it to reclaim storage space). We also wanted to make sure our users wouldn’t load the hefty TFJS libraries if they didn’t have the feature enabled, a problem solved by Webpack’s dynamic import code-splitting feature.
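Put together, the asset caching and lazy loading can look something like this sketch (the cache key and URL are placeholders, and we assume a converted graph model; the same pattern works for Layers models):

```javascript
const MODEL_KEY = 'indexeddb://predictive-text-model';   // hypothetical cache key
const MODEL_URL = '/models/predictive-text/model.json';  // hypothetical server path

// The dynamic import keeps the hefty TFJS bundle out of the main chunk; Webpack
// code-splits it and only fetches it when the feature is actually enabled.
export async function loadCachedModel() {
  const tf = await import('@tensorflow/tfjs');
  try {
    // Fast path: the model was cached in IndexedDB on a previous visit.
    return await tf.loadGraphModel(MODEL_KEY);
  } catch (e) {
    // First use (or the browser evicted the cache): download once, then persist.
    const model = await tf.loadGraphModel(MODEL_URL);
    await model.save(MODEL_KEY);
    return model;
  }
}
```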

Once the model is loaded and running in the browser, we want to give it enough system resources to deliver timely yet accurate predictions. However, this has to be balanced against the need to avoid blocking the main Javascript thread so that the UI can continue to update and respond to user interactions. We investigated running the model in a Web Worker to stay off the main thread completely, but so far the WebGL backend doesn’t appear to be fully supported in Web Workers. Instead we opted to run on the main thread, using async programming to ensure our prediction loop ceded control back to the browser at regular intervals. This strategy gave us a good balance of model performance and UI responsiveness, and made the feature feel both useful and good to use.
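Concretely, ceding control is just a matter of awaiting a zero-delay timeout between chunks of prediction work; a minimal sketch:

```javascript
// Give the browser a chance to render and handle input between model steps.
function yieldToBrowser() {
  return new Promise((resolve) => setTimeout(resolve, 0));
}

// Hypothetical usage: interleave yields with small chunks of prediction work so
// the main thread never stays busy for long stretches.
async function runResponsively(steps) {
  for (const step of steps) {
    await step();          // one model invocation or a small chunk of work
    await yieldToBrowser();
  }
}
```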

We also took care to prevent running duplicate instances of our models to keep memory usage in check. This was especially important because our application can be iFramed within other sites, potentially running several separate instances on a single page. We set up a message-passing service using a Shared Worker to communicate across the iFrame boundaries, allowing each instance of our app to access the feature while only one of them was in charge of the model. TFJS also requires some manual memory management, which we tied into the model lifecycle to prevent leaks.
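That manual memory management exists because tensors allocated on the WebGL backend live in GPU memory and aren’t garbage collected automatically. A minimal sketch of the pattern:

```javascript
import * as tf from '@tensorflow/tfjs';

// tf.tidy() releases any intermediate tensors created inside the callback as soon
// as it returns (the softmax() result here would otherwise leak GPU memory).
function scoreCandidates(logits) {
  return tf.tidy(() => logits.softmax().dataSync());
}

// When the feature is turned off, or this frame hands the model over to another
// instance, free the model's weights explicitly instead of waiting for unload.
function teardown(model) {
  model.dispose();
}
```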

Results & Conclusion

We deployed the feature to a subset of our users last fall. So far the feature has executed robustly and we haven’t been plagued by browser compatibility issues. We instrumented the feature’s execution time, and the times observed in the wild are similar to those we saw during late-stage testing and are acceptable. We haven’t received any complaints or observed any metrics that would point to general application or machine lag, which we attribute to ceding control back to the browser loop regularly and often.

Perhaps if we were to do it all over again, we would do the project in Javascript/TFJS from the beginning to reduce our concerns about duplicate code maintenance. However, to date we don’t have any experience using TFJS for model training or for authoring non-trivial forward passes.

In conclusion, we were able to successfully deploy a non-trivial, sequential deep learning model for predictive text to the browser using TFJS. The final result didn’t look like what we expected, but we’re happy with it. TFJS is a great piece of open source software that opens up a world of ML possibilities, and we hope our experience helps other folks make great use of it.

Acknowledgements

The following individuals contributed heavily to this project and blog post:
