Compiling TensorFlow for the Browser

Another Approach for Running Machine Learning in JavaScript

tl;dr This post summarizes and discusses part of my undergraduate senior thesis (a research project that took place between September 2016 and March 2017): compiling Google’s TensorFlow library into JavaScript.
Demos at: https://tensorflowjs.github.io/

In case you missed it, Google just announced TensorFlow.js: a largely hand-ported version of TensorFlow to JavaScript. This is super exciting for me for two reasons:

  1. Generally speaking: I think it is a super interesting, and difficult, problem, so it’s really cool to see a big company tackling it and releasing their work.
  2. Closer to home for me: almost exactly a year ago (March 31st, 2017), I handed in my undergraduate senior thesis: Machine Learning in the Browser. It was an overview of what was possible and of the theoretical blockers, and largely focused on my efforts compiling TensorFlow into JavaScript (which I, confusingly, also called TensorFlow.js). Needless to say, this is a space I care about and have thought about a lot.
Running Inception in the Browser. https://tensorflowjs.github.io/

I’ll start off by saying my work is not production-ready and is still very much a research project; however, given Google’s announcement, I wanted to pause and share what I had found, in hopes that it might motivate or inform someone else’s work. I was hoping I would be able to share all of this after I had had better luck compiling XLA output to JS (more on that later); however, some combination of it being a hard problem and life happening (graduating / moving to a new city / starting a new job) prevented me from making much progress, so there is still much work to be done.

Note: Almost all of this work is a year old, meaning the TensorFlow I compiled is pinned at version 1.0.0 (which was the most recent version when I submitted my thesis). If there is enough interest, I can look into bumping the version; when I was actively working on this, I bumped from 0.10 to 0.11, 0.11 to 0.12, and 0.12 to 1.0 all without any major issues.

Why would anyone want this?

When I was doing my thesis there were two questions I would commonly get: how is this research, and not just an engineering problem? (To which I still don’t have a good answer.) And why would anyone want to do machine learning (and specifically inference) in the browser? To me, it seemed like the obvious fusion of two trends: machine learning becoming ubiquitous and web applications becoming more powerful; concretely:

(Reason #1) Privacy Guarantees: This is fairly straightforward: people want the advantages of machine learning, such as image classification, hot-word activation for speech recognition (e.g. “Ok Google…” or “Alexa…”), and smart replies to messages, without needing to hand their raw data over to companies. On the flip side, companies don’t always want user data; in many cases, storing it on their servers poses a liability.

(Reason #2) Offline Mode: A lot of web apps now support some limited, offline version of their application. It would be nice if these offline modes weren’t totally unable to run any ML at all.

(Reason #3) Free / Local Computation: Imagine a world where every ML model open-sourced on GitHub came with a web demo where you could input an image to classify, or upload your own test set to see how it performs. Right now this doesn’t happen because no one wants to pay for the compute. With the capacity to run ML in the browser, no one would have to. More generally, this would enable a class of unhosted web applications to use machine learning.

Note: It is important to call out that: (1) this is not about running TensorFlow in JavaScript for some node-like environment. TensorFlow is a C++ core with bindings to many languages (Python being the most popular), so for that we could simply create a JS binding to the core; (2) I think inference is a much more valuable use case than training (with the exception of transfer learning to fine-tune models, I can’t imagine many use cases for full-scale ML model training in the browser). This is about being able to deliver machine learning models to people’s browsers and run them there.

Approaches

As far as I’m concerned, there are two real ways to get TensorFlow running in the browser: rewriting it (which would take a massive engineering force, something that large companies have at their disposal, but individual developers / students doing an undergraduate thesis do not) or compiling it. MXNet.js is an example of a major machine learning library being compiled into JS; Google’s TensorFlow.js / Keras.js / Tensor Fire are examples of handwritten implementations.

Lack of Google-scale engineering resources aside — and even with a working handwritten port of TensorFlow in JS — compiling TensorFlow to JS seems like an approach worth exploring for 3 major reasons:

(1) TensorFlow is rapidly changing, and having to constantly port it by hand seems like it will take a massive, sustained engineering effort.

(2) The Web is rapidly changing. Since the submission of my thesis and writing this blog post (almost exactly a year apart) the standards for the web are already different. APIs I was excited for have been deprecated, and new technologies have been proposed. Compiling it would abstract away these changes.

(3) Compiled JavaScript has some speed benefits over hand-written JavaScript, which is critical for performance-sensitive work like machine learning (more on this later).

Compiling C++ into JavaScript isn’t actually as crazy as it seems. Emscripten is an open-source project developed to compile C and C++ (via LLVM) into ASM.js. ASM.js is a subset of JavaScript meant to be used as a compile target. Some browsers even fast-track the execution of ASM.js code (by compiling it into machine instructions), allowing it to run within about 2x of native speed, even on large benchmarks.

A primer on TensorFlow

Note: This writing is largely distilled from my research — which is almost a year old — so the following may be a little outdated.

TensorFlow — along with almost every machine learning framework — faces a tradeoff between needing a high-level language and needing speed. Without a high-level language, it is difficult to architect complex models and express abstract concepts in a readable way. Without speed, it is difficult to run the actual neural nets. TensorFlow — along with many other libraries — reconciles these opposing requirements by allowing developers to construct a computation graph in Python and send the graph, along with any inputs, to a C++ core for evaluation. (While Python is the most common frontend to TensorFlow’s core, the SWIG bindings allow for TensorFlow bindings in many languages that all interface back to the same common core.)

A TensorFlow graph consists of Ops (nodes) and Tensors (edges). Consider the following TensorFlow program:

import tensorflow as tf

# (Part 1) create graph
y = tf.add(
    tf.multiply(
        tf.Variable(2, name="m"),  # initial value of 2
        tf.placeholder(tf.int32, shape=(1,), name="x")
    ),
    tf.Variable(1, name="b")  # initial value of 1
)

# (Part 2) create and initialize session
sess = tf.Session()
sess.run(tf.global_variables_initializer())  # sets m and b

# (Part 3) run inputs through the graph
print sess.run("Add:0", feed_dict={"x:0": (5,)})  # returns 11

Part 1 in the code creates a TensorFlow graph with the following shape:

As you can see, it has both ops (the circles) and tensors (the edges). Tensors are simply multi-dimensional arrays (a rank-0 tensor is a scalar, a rank-1 tensor is a vector, a rank-2 tensor is a matrix, etc.). Ops take in 0 or more tensors and produce 0 or more tensors. For example, the add op takes in 2 tensors and produces 1 tensor. Tensors are named op_that_produced_them:index; for example, the result of the add operator is denoted “Add:0”.

Part 2 (in the code) loads this graph into the C++ core, and part 3 says: fetch the value of Add:0, initializing (or “feeding”) the tensor x:0 with a value of 5. The first op that can be evaluated in the graph is the multiply, as all of its inputs are defined (m:0 and x:0). After evaluating it, its output tensor is set to the result of the op. With that tensor set, all of the add’s inputs are defined, so it is evaluated next, which sets Add:0 and returns it to the user.
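To make that evaluation order concrete, here is a minimal sketch in plain Python (no TensorFlow; the graph structure, tensor names, and kernel functions are illustrative) of the dataflow evaluation described above: an op runs once all of its input tensors are defined, and its outputs in turn unblock further ops.

```python
# Graph for y = m * x + b: each produced tensor maps to the op that
# makes it and the names of the op's input tensors.
graph = {
    "Mul:0": ("mul", ["m:0", "x:0"]),
    "Add:0": ("add", ["Mul:0", "b:0"]),
}
kernels = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}

def run(fetch, feed):
    tensors = dict(feed)  # start with fed inputs and variable values

    def evaluate(name):
        if name not in tensors:  # evaluate an op's inputs before the op
            op, inputs = graph[name]
            tensors[name] = kernels[op](*[evaluate(i) for i in inputs])
        return tensors[name]

    return evaluate(fetch)

print(run("Add:0", {"m:0": 2, "b:0": 1, "x:0": 5}))  # prints 11
```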

TensorFlow compute graphs can be serialized and distributed as protobufs (Protocol Buffers, a serialization format developed and used extensively by Google). A common approach for deep learning in TensorFlow is to architect a graph in Python, train it (i.e. set the variables in the graph), export the graph along with the values of its variables to protobufs, and distribute those protobufs to servers or mobile devices.

A primer on Emscripten

TensorFlow is capable of running on many device architectures (e.g. x86 on your laptop and ARM on your smartphone) through the magic of compilers. TensorFlow’s source code (and, specifically, the C++ core) can be compiled into machine-specific instructions for a variety of compile targets.

In broad strokes, compilers work by taking source code, “lowering” it to an intermediate representation (something between source code and machine code), optimizing that intermediate representation, and then lowering it into machine code. For example, the popular compiler Clang compiles C++ by lowering it into LLVM-IR (the intermediate representation), optimizing that, and then lowering it into machine code. The first part of this process (source code to LLVM-IR) is referred to as the front end of the compiler, and the latter (LLVM-IR to machine code) as the back end. Emscripten works by providing an alternative back end for Clang, compiling the LLVM-IR to ASM.js (a subset of JS intended as a compile target).

Recently, web standards have gone even further in defining a good compile target. WebAssembly (WASM) is a binary format supported by every major browser (albeit behind a flag in some of them) that is intended to be the compile target for the web. Because WASM is a binary format defined from scratch, it is smaller and can run faster than ASM.js. Emscripten can optionally generate WASM from C++, simply by passing a flag to the compiler.

Coercing TensorFlow to Compile

Even using Emscripten, there are still challenges to compiling TensorFlow. The first challenge I encountered was literally using Emscripten. Emscripten is invoked through emcc, a drop-in replacement for gcc. It is also distributed with two tools, emmake and emconfigure, to make it easier to integrate with pre-existing makefile build systems. TensorFlow doesn’t use a makefile; rather, it uses Google’s open-sourced build tool, Bazel. I explored updating the Bazel files to use Emscripten, but soon realized that (1) the Bazel crosstool functions are largely undocumented (or poorly documented); and (2) the requisite files would effectively constitute a separate build system requiring significant upkeep.

Fortunately, TensorFlow maintains a makefile for use with its mobile framework. One of the themes of this project is that, in many ways, compiling for JavaScript is similar to compiling for mobile (as resources and computational primitives are limited). By modifying the mobile makefile, we embarked on a journey of debugging C++ compiled into a JS binary. That meant we had few (read: no) debugging tools, and limited insight into what was happening in the binary.

The first hurdle was that TensorFlow’s dependencies must also be compiled into JavaScript. At the time, there were three critical dependencies for TensorFlow’s core: libmath, zlib, and libprotobuf. Because libmath is ubiquitous, it ships with Emscripten. Zlib had already been ported to Emscripten (although we discovered this only after we had ported it ourselves). Libprotobuf had prior work compiling it to JavaScript, so all we had to do was fork the repository and bump the version (dealing with minor bugs along the way). If you want to use the image ops (e.g. DecodeJPEG), you also need to compile the JPEG, PNG, and GIF libraries. I have done this, but exclude it here for the sake of brevity (feel free to contact me for more details if this is interesting).

The next issue was a type error, and to resolve it we would have to make TensorFlow work on 32-bit systems. Because all JavaScript numbers are implemented as IEEE 754 double-precision floating points (Section 6.1.6 of the ECMAScript spec), they only have 53 bits of integer precision. This means that Emscripten can only reliably emulate 32-bit systems (as it cannot address a full 64-bit address space). This is problematic because TensorFlow only supports 64-bit systems. Fortunately, I only had to make minor modifications to correct this.
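The 53-bit limit is easy to demonstrate. Python floats are the same IEEE 754 double-precision format as JavaScript numbers, so this plain-Python snippet shows exactly where exact integer arithmetic breaks down:

```python
# Doubles carry a 53-bit significand, so integers up to 2**53 are exact,
# but 2**53 + 1 cannot be represented and rounds back down to 2**53 --
# which is why JavaScript cannot address a full 64-bit space.
limit = float(2**53)               # 9007199254740992.0, representable
assert limit + 1 == limit          # beyond 53 bits: the +1 is lost
assert float(2**53 - 1) != limit   # below the limit: still exact
```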

TensorFlow uses Eigen as its linear algebra library. (As an aside, this was incredibly lucky for me, because many popular linear algebra libraries are written in Fortran (BLAS, LAPACK, ATLAS) and could not be easily compiled with Emscripten.) Eigen defines scalars to be of type std::ptrdiff_t. The problem is that it is common in the TensorFlow code base to do something akin to the following:

int64 a = 5;
// ...other stuff happens...
Eigen::Index b = a; // narrowing: int64 into a 32-bit index on this target

This will cause the compiler to (rightly) complain about type narrowing, because under Emscripten an Eigen::Index is only an int32. To fix this, we simply need to replace such instances of int64 with the more correct Eigen::Index. This required a surprisingly small number of changes.

At this point, TensorFlow will compile. It will just hang when it is run. This is because JavaScript currently has no way to create POSIX-style threads (although there are experiments to emulate them using WebWorkers with a shared memory array in Firefox Nightly). Vastly oversimplifying some details: TensorFlow processes graphs by creating a threadpool and dispatching op kernels (the implementations of the ops) to other threads. Under Emscripten, no threadpool is created, and the main thread is left waiting for a task that will never complete. To correct this, I had to dive into the concurrency model. TensorFlow’s scheduler effectively takes a function closure and dispatches it to a thread, so the fix was to replace the scheduler with a function that executes the closure immediately, on the current thread. This ensures that by the time any thread barrier is invoked, all prerequisite tasks have already been completed.
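The scheduler change can be sketched in plain Python (class and method names here are illustrative, not TensorFlow’s actual C++ interfaces): swap a dispatch-to-thread scheduler for one that runs each closure inline.

```python
import threading

class ThreadPoolScheduler:
    """What TensorFlow expects: dispatch work to background threads."""
    def schedule(self, closure):
        threading.Thread(target=closure).start()

class InlineScheduler:
    """The Emscripten fix: run work immediately on the current thread."""
    def schedule(self, closure):
        closure()

# With InlineScheduler, schedule() only returns after the task has
# completed, so any later thread barrier trivially finds all
# prerequisite tasks already done.
results = []
scheduler = InlineScheduler()
scheduler.schedule(lambda: results.append("kernel ran"))
assert results == ["kernel ran"]
```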

TensorFlow will now compile and run in JavaScript.

DevX

This creates a JavaScript library that we can link other Emscripten-compiled C/C++ programs against. The problem is that I don’t believe developers want to write their machine learning for the web in C/C++; they tend to avoid it even for native programs. Taking some inspiration from the Python frontend, my goal was to create a library that could take in a graph file and return a class that the developer could pass inputs into and get outputs from.

This library would require a C++ component to actually use the compiled TensorFlow library, and a JavaScript component to wrap the C++ component and pass in inputs. Emscripten allows one to expose and call C++ functions using embind. Embind is incredibly powerful, but only really supports simple argument types. In order to pass graphs and input tensors to the C++ library, I needed some way to serialize them into basic types. Fortunately, TensorFlow defines a protobuf for serializing tensors! This meant that I could serialize a tensor in JavaScript, convert it to a string, pass it into the C++ layer (through embind), and deserialize the tensor there. I wrote a library to create tensors from multidimensional JavaScript arrays.
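The core idea can be sketched in plain Python (this mirrors the concept only, not tensorjs’s actual wire format, which goes through TensorFlow’s tensor protobuf): reduce a nested multidimensional array to a shape plus a flat list of basic values, a form that can cross a simple-types-only boundary like embind and be reassembled on the other side.

```python
def to_wire(nested):
    """Flatten a uniformly nested list into (shape, flat values)."""
    # Walk down the first element at each level to recover the shape.
    shape = []
    node = nested
    while isinstance(node, list):
        shape.append(len(node))
        node = node[0]

    # Flatten all values in row-major order.
    flat = []
    def walk(x):
        if isinstance(x, list):
            for item in x:
                walk(item)
        else:
            flat.append(x)
    walk(nested)
    return shape, flat

shape, flat = to_wire([[1, 2, 3], [4, 5, 6]])
print(shape, flat)  # prints [2, 3] [1, 2, 3, 4, 5, 6]
```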

I then wrote the wrapper that loaded in the compiled C++ core and provided convenience methods to invoke it. For example, to run the simple y=mx+b graph from above:

const TFJS = require('tfjs');
const tensorjs = require('tensorjs');

// load in the graph protobuf, maybe through a string
// const example_graph_pb = ...

TFJS.for_browser('/tensorflowjs/').then(lib => {
  const sess = new lib.Session(example_graph_pb);
  const results = sess.run(
    {
      "x:0": tensorjs.intTensor([5]),
    },
    ["add:0"]
  );
  console.log(results[0]); // prints 11
});

On top of that, I wrote a quick library to make encoding images as tensors easier:

const TFJS = require('tfjs');
const tensorjs = require('tensorjs');

// const mnist_graph_pb = ...

TFJS.for_browser('/tensorflowjs/').then(lib => {
  const sess = new lib.Session(mnist_graph_pb);

  // pull out the handwriting image from a canvas
  const canvas = document.getElementById('handwriting');
  const context = canvas.getContext('2d');
  const img_data = context.getImageData(0, 0, canvas.width, canvas.height);
  const img_array = lib.image_ops.get_array(img_data, true, 0, 255);

  const results = sess.run(
    {
      "Reshape:0": tensorjs.floatTensor(img_array),
      "dropout:0": tensorjs.floatTensor(1.0)
    },
    ["prediction_onehot:0"]
  );
});

Benchmarks

Average runtimes in seconds over 100 trials, running on a 2015 MacBook Pro with a 2.9GHz i5 processor. The number in parentheses represents the ratio when compared to single-threaded C.

Given my naive approach, I did not expect it to be fast; however, our JS execution times were within an order of magnitude of single-threaded, CPU-only TensorFlow. For running full TensorFlow graphs in the web browser, this seems quite fast, and it backs up Emscripten’s claim that we might be able to get native code running in the web browser at near-native speeds (something that always makes me think of The Birth and Death of JavaScript). I expect these metrics will continue to improve as WASM becomes more powerful.

While speed was good, size was not. The library was 30MiB, with another 4.7MiB needed to initialize Emscripten. Compressed, they were better (a 4.2MiB library, and 408KiB to initialize), but still large for web libraries. We can improve this by selectively including op kernels in the compiled code (op kernels are the code necessary to evaluate nodes in the computation graphs); for context, at the time, TF defined 216 unique ops, and Inception v3 only took 11 to run. However, even if we were to improve the size of the runtime, model graphs can still be 100s of MiB (which would be unreasonable to ship to a web browser); while we can make that better with tools like quantization, we still have a way to go before we can use this in production.
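As a rough illustration of why quantization helps with size, here is a minimal plain-Python sketch of linear 8-bit quantization (an illustration of the general technique, not TensorFlow’s actual quantization tooling): each float weight is stored as one byte plus a shared range, shrinking a float32 payload roughly 4x at the cost of some precision.

```python
def quantize(weights):
    """Map floats onto 0..255 using the list's own (min, max) range."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0  # avoid div-by-zero for constants
    return [round((w - lo) / scale) for w in weights], lo, scale

def dequantize(qs, lo, scale):
    """Recover approximate floats from the quantized bytes."""
    return [lo + q * scale for q in qs]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
qs, lo, scale = quantize(weights)
restored = dequantize(qs, lo, scale)

# Every restored weight is within one quantization step of the original.
assert all(abs(w - r) < scale for w, r in zip(weights, restored))
```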

Takeaways

Firstly, one of the advantages of waiting almost a year to publish this is that I have the advantage of hindsight. It’s funny reading the discussion section of my thesis to see a lot of predictions I made either come true or fall flat on their face. It does seem to be the case that people want machine learning libraries in the browser, something that was non-obvious when I was doing this. It also seems to be the case that the web is only continuing to get faster (WASM has continued support, and we have working groups for better use of the GPUs and other forms of hardware acceleration on the web).

Secondly, I’m starting to think that, for the time being, compiling an entire machine learning runtime to the web is infeasible. It is likely that we will have breakthroughs that enable this in the future, but for now we should probably direct our efforts elsewhere.

However, I don’t think this means that we should discount looking further into compilation as an approach; if anything, this work shows that it’s possible. One of the experiments I’ve been most interested to try would be to take Google’s XLA (a compiler from TensorFlow graphs to bytecode) and make it compile graphs to JS. This is possible because XLA uses LLVM-IR as an intermediate step, and, as you may remember, Emscripten compiles LLVM-IR to JS. Like XLA programs in general, these compiled-JS programs might be faster and lighter-weight than the full runtime.

If you’re interested in this work, and want to talk about it or pick up where I left off, feel free to reach out! I would love to chat.

Resources

Website with Working Examples: https://tensorflowjs.github.io/

Repositories:

P.s. If there is anyone from Google reading this, congrats on the project launch!! It’s really cool! Also, if you would like the tensorflowjs github organization name —I would be happy to transfer ownership; you seem to have a better use for it than I do now :)