The Rise of the Model Servers

New Tools for Deploying Machine Learning Models to Production

One of the exciting developments in machine learning recently is the rapid emergence of a new class of model servers. Model servers simplify the task of deploying machine learning at scale, the same way app servers simplify the task of delivering a web app or API to end users. The rise of model servers, coupled with increasingly interoperable models, will likely accelerate the adoption of user-facing machine learning in the wild.

Although there has been an abundance of open source machine learning software, much of the ecosystem has been focused on model-building. The large Internet companies have built their own model serving infrastructure (such as FBLearner Predictor and Michelangelo), but there have been few easy options for the rest of us. When Zendesk deployed their TensorFlow models to production earlier this year, their team had to invest time and effort to figure out the best way to integrate model serving into their pipeline.

When it comes to deploying to production, a common punchline among machine learning scientists is that they should just put the DNN models they have worked long and hard on “behind a Flask app.” While that works to a certain degree, there are now numerous alternatives that scale better and are easier to manage. This class of software is so new that neither StackShare nor Siftery lists it in their otherwise complete stack directories.
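For reference, the “behind a Flask app” pattern amounts to roughly the following. This is only a minimal sketch: it uses Python’s standard-library http.server in place of Flask so that it runs without dependencies, and the predict function, the JSON request shape, and the port are hypothetical stand-ins rather than any framework’s API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


def handle_predict(body: bytes):
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        features = json.loads(body)["features"]
        score = predict([float(x) for x in features])
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        error = {"error": "expected a JSON object with a non-empty 'features' list"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"score": score}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body, run the model, and write the JSON response.
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocks forever: one process, one model, no batching, versioning, or metrics."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Everything a model server adds (scaling, version management, monitoring) is exactly what this sketch leaves out.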

At NIPS 2017, the developers of many of these model servers provided updates on their increasingly mature projects. In general, the model server use case is straightforward: point the model server at one or more trained model files, and it can serve inference queries at scale. The server handles scaling, performance, and some model lifecycle management.

The increasingly mature model servers available for deployment include:

  • TensorFlow Serving
  • Clipper
  • Model Server for Apache MXNet
  • DeepDetect
  • TensorRT

TensorFlow Serving


Although it shares its name with the world’s most popular machine learning framework, TensorFlow Serving is a separate system that can serve TensorFlow models and, in theory, models from other frameworks. It is likely the world’s busiest model server: it has been in production at Google as part of Cloud Machine Learning Engine since fall 2016, and as part of Google’s overall internal machine learning infrastructure, called TFS², since winter 2017.

Unlike the other model servers covered here, TensorFlow Serving uses the gRPC protocol, which makes it more performant but fussier to integrate. It has a robust version manager for loading and rolling back multiple versions of the same model. Google claims TensorFlow Serving can handle “100,000 requests per second per core,” making it more than adequate for handling the load of nearly all applications.
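To make the idea of loading and rolling back model versions concrete, here is a minimal in-memory sketch. It is not TensorFlow Serving’s API or implementation; the VersionedModelStore class and its methods are hypothetical names chosen for illustration, and the “models” are plain Python callables.

```python
class VersionedModelStore:
    """Hypothetical in-memory store: keeps several versions of one model, serves one."""

    def __init__(self):
        self._versions = {}   # version number -> callable model
        self._history = []    # versions that have been made active, oldest first

    def load(self, version, model):
        """Register a new version and make it the active one."""
        if version in self._versions:
            raise ValueError(f"version {version} is already loaded")
        self._versions[version] = model
        self._history.append(version)

    @property
    def active_version(self):
        if not self._history:
            raise LookupError("no model version loaded")
        return self._history[-1]

    def rollback(self):
        """Unload the active version and fall back to the previously active one."""
        if len(self._history) < 2:
            raise LookupError("no earlier version to roll back to")
        retired = self._history.pop()
        del self._versions[retired]
        return self.active_version

    def predict(self, x):
        """Route the query to whichever version is currently active."""
        return self._versions[self.active_version](x)


store = VersionedModelStore()
store.load(1, lambda x: x * 2)       # stand-in for version 1 of a model
store.load(2, lambda x: x * 3)       # stand-in for a newer version
latest = store.predict(10)           # served by version 2
store.rollback()
previous = store.predict(10)         # served by version 1 again
```

A production version manager would also have to deal with loading from disk, memory limits, and in-flight requests during a switch, none of which this sketch attempts.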



Clipper

Clipper is a model server project from Berkeley’s RISELab, sharing the same ancestry as Spark, which is widely used in machine learning. Clipper’s stated intent is to be model-agnostic, and it currently claims to support serving Caffe, TensorFlow, and Scikit-learn models. The project is working towards a 0.3 release alongside early users, so it is less hardened than TensorFlow Serving.

Unlike TensorFlow Serving, Clipper includes a standard REST interface, making integration with other parts of an existing production stack straightforward. The TensorFlow Serving team actively credits and cites Clipper, so the influence of Clipper and its predecessor Velox in model serving is clear.
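From the point of view of an existing app, querying a REST-based model server can be as small as the sketch below. The URL layout, the application name, and the JSON field name are assumptions made for illustration and are not taken from Clipper’s documentation; the request is built with Python’s standard library and is not actually sent.

```python
import json
import urllib.request


def build_predict_request(base_url, app_name, inputs):
    """Build (but do not send) a JSON POST request for an inference query."""
    # Hypothetical request schema: a single "input" field holding the feature vector.
    body = json.dumps({"input": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{app_name}/predict",   # hypothetical URL layout
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request("http://localhost:8080", "example-app", [0.1, 0.2, 0.3])

# To send it against a running server:
#   with urllib.request.urlopen(req, timeout=5) as resp:
#       result = json.load(resp)
```

The appeal of REST here is that nothing beyond an HTTP client and a JSON encoder is needed on the calling side.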

Model Server for Apache MXNet (MMS)

Model Server for Apache MXNet (MMS) is Amazon’s entry in the model serving space. At re:Invent in November, Amazon doubled down on MXNet by introducing Gluon, a simpler API for deep learning that runs on top of MXNet (and Microsoft’s CNTK). MMS comes nicely packaged with Docker images and automated setup, in contrast to TensorFlow Serving’s more complex install and build process.

A strong feature of MMS is the ability to package custom processing code in its model archive. This means that feature engineering can run transparently inside the model server. MMS includes an automatic nginx-based HTTP endpoint out of the box, so existing apps can run inference easily. MMS also provides real-time metrics for monitoring utilization, latencies, and errors of the endpoint and the inference service. Amazon did not include benchmarks with its initial announcement, but a custom version of MMS is under the hood of Amazon’s SageMaker service, the well-received end-to-end machine learning service that also launched at re:Invent.
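The idea of shipping processing code together with the model, and recording metrics around each inference call, can be sketched as follows. This is not the MMS archive format or API; ModelBundle and its fields are hypothetical names, and the “model” is a toy function.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ModelBundle:
    """Hypothetical bundle: a model plus the processing code shipped alongside it."""
    preprocess: Callable
    model: Callable
    postprocess: Callable
    latencies_ms: List[float] = field(default_factory=list)
    errors: int = 0

    def infer(self, raw_request):
        """Run preprocess -> model -> postprocess, recording latency and errors."""
        start = time.perf_counter()
        try:
            features = self.preprocess(raw_request)
            return self.postprocess(self.model(features))
        except Exception:
            self.errors += 1
            raise
        finally:
            # Latency is recorded for failed requests as well as successful ones.
            self.latencies_ms.append((time.perf_counter() - start) * 1000.0)


bundle = ModelBundle(
    # Feature engineering: raw text -> list of word lengths.
    preprocess=lambda text: [len(word) for word in text.split()],
    # Toy "model": average word length.
    model=lambda lengths: sum(lengths) / len(lengths),
    # Turn the raw score into a response payload.
    postprocess=lambda score: {"label": "long" if score > 4 else "short", "score": score},
)
result = bundle.infer("model servers simplify deployment")
```

Because the feature engineering travels with the model, the calling app can send raw input and never needs to know how features are derived.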



DeepDetect

DeepDetect is a machine learning API/server that includes a production model serving component for TensorFlow, XGBoost, and Caffe models. Written and maintained by the team at Jolibrain, DeepDetect is currently deployed at multiple European enterprise customers.

DeepDetect includes a clear prediction REST API that runs against its performant C++ backend. Unlike projects from the larger companies, users of DeepDetect can get professional support from a small but talented team of machine learning veterans.

TensorRT

TensorRT is NVIDIA’s highly optimized model runtime for TensorFlow, Caffe, and other model formats. Coming from NVIDIA, it is designed to be extremely performant when used in conjunction with NVIDIA GPUs. On certain benchmarks, NVIDIA claims it is up to 40x faster than stock TensorFlow software on the same hardware.

TensorRT is not exactly a model server, as it focuses on optimization and performance rather than on model serving management. It is therefore more comparable to an optimizer like TensorFlow’s own XLA. However, NVIDIA provides code and a tutorial for using TensorRT in a REST-based, model-server-like context.

Other Options

Beyond the model servers listed here, there are other existing servers (such as Apache’s PredictionIO, Redis Labs’s Redis-ML, and Skymind’s Java-based model server, a key part of their end-to-end suite) as well as projects that have not yet been announced. Also, fully hosted model serving services (such as Google’s Cloud Machine Learning Engine, backed by TensorFlow Serving) may be a more attractive alternative for some users than deploying a model server themselves.


As more and more organizations use machine learning to serve end users, the ability to easily scale out inference is becoming increasingly important. The rise of this new class of model servers is an exciting development that will accelerate these deployments. As these servers are deployed, we should expect adjacent add-ons (caching, monitoring, load balancing, testing, security, etc.) to enter the market as they did for web and app serving.

Thanks to Christopher Olston, Dan Crankshaw, Emmanuel Benazera, and Hagay Lupesko for reviewing for accuracy.