Deploying TensorFlow Models in Microservices

Kyle McIntyre
Published in The Quiq Blog
Dec 8, 2021

If you’re like us, you’ve got an application that’s built on a mature microservice architecture. Your microservices are packaged as Docker images and primarily expose REST APIs. You’ve also got one or more TensorFlow models that you want to deploy as part of your application, and you’re hoping you can just wrap each model inside a microservice. Why? Because your architecture has already tackled the following problems:

  • Zero downtime upgrades & version management
  • Redundancy, horizontal scalability, auto-scaling
  • Independent release cycles
  • Unified monitoring & logging
  • Multi-tenancy concerns
  • Compliance & regulatory concerns

You view the world as a collection of microservices and you don’t really want to add something special to your environment. Is it possible to wrap your TF model in a microservice, and is it wise?

Background

First off, some background on our architecture:

  • Pure HTTP/REST microservices built on Docker images (currently around 60)
  • HashiCorp Nomad for container orchestration
  • HashiCorp Consul for service discovery
  • Docker containers based on Alpine or Ubuntu Linux

It’s notable that we don’t use Kubernetes for container orchestration. If we did, we might have tried Kubeflow for deploying our models. However, we prefer Nomad for its more focused and lightweight operational nature.

When building a new microservice in our environment, there are some standard HTTP resources you need to implement:

  • A healthcheck
  • A resource for forcible shutdown
  • A resource for monitoring (e.g. Prometheus metrics)
  • Resources related to change tracking
  • A Swagger page

In addition to the standard HTTP resources, there are implicit requirements for your service:

  • Registering yourself with Consul on startup
  • Logging in a common format
  • Negotiating secrets in a standard way
  • Implementing various security measures, such as checking JWTs and enforcing an IP allow-list for all inbound requests

We have implementations of this core ‘contract’ in Scala, Go & Python, and individual microservices extend those implementations with actual application logic. The Python skeleton is based on Flask + Gunicorn.
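To make that concrete, here’s a heavily simplified sketch of the Flask side of that contract. The endpoint paths and the prometheus_client usage are illustrative, not our actual skeleton:

# skeleton.py: an illustrative sketch of the Flask 'contract', not our real skeleton
from flask import Flask, Response, jsonify
from prometheus_client import CONTENT_TYPE_LATEST, Counter, generate_latest

app = Flask(__name__)
REQUESTS = Counter("app_requests_total", "Total HTTP requests handled")

@app.before_request
def count_request():
    REQUESTS.inc()

@app.route("/healthcheck")
def healthcheck():
    # A real implementation checks downstream dependencies before reporting healthy
    return jsonify({"status": "ok"})

@app.route("/metrics")
def metrics():
    # Expose Prometheus metrics for the unified monitoring pipeline
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)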

Given this landscape, it’s really tempting to try running our TensorFlow model inside of a Flask-based microservice because we’ve already blazed that trail.

A Naïve Approach

For our first attempt, we tried the most obvious approach: we created a microservice container with the following contents:

  • Our Flask microservice skeleton
  • TensorFlow & associated dependencies
  • The model, statically built into the image
  • A REST API to invoke the model and perform the necessary preprocessing and postprocessing steps

In this setup, the model was loaded statically during app initialization. The results of this approach were… *drum roll* …abysmal. The service would typically become unresponsive after a single invocation. What was the problem?
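For context, the naive service looked roughly like this (the model path, endpoint, and the use of tf.keras.models.load_model are illustrative, not our exact code):

# naive_service.py: an illustrative sketch of the naive approach
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)

# Loaded once, statically, during app initialization -- the part that
# interacted badly with a forking Gunicorn server in our experience
model = tf.keras.models.load_model("/models/example_model")

@app.route("/infer", methods=["POST"])
def infer():
    features = request.get_json()["features"]
    prediction = model.predict([features])  # pre/post-processing omitted
    return jsonify({"prediction": prediction.tolist()})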

Unbeknownst to us, the Python tensorflow module isn’t set up very well for user-level concurrency. Sure, it can parallelize computation with the best of them, but it’s designed to do one task at a time. We don’t claim to be experts in this area of TensorFlow (or any of it, really), but it’s related to the TF concept of a Session. Receiving multiple web requests and having them all run inference against a statically initialized model resulted in deterministic deadlock, even when the requests didn’t actually run concurrently. The architectural problem seemed to be that TensorFlow isn’t fork-safe, but we needed to use a forking web server.

We were able to work around the deadlock issue by initializing the model once per request (although it still seemed a bit shaky). But of course, this is totally unworkable from a scalability perspective. These are deep learning models after all! Hundreds of megabytes is totally common. We couldn’t afford the load latency or the crazy amounts of RAM it would require to service concurrent requests.

At this point we realized there must be a reason people went to the effort of building the separate TensorFlow Serving project 😂

TensorFlow Serving

There are a lot of articles about TensorFlow Serving, but too many of them just repeat the same blurb from its homepage. Here’s our take on the project:

  • It is the way to deploy TF SavedModels server-side. The main Python tensorflow package is unsuitable for this purpose.
  • The most important architectural aspect of TF Serving is that it’s designed for request concurrency: a model only gets loaded once but can serve many requests simultaneously.
  • Almost as important: TF Serving provides an abstraction around request batching, which is essential if you want to save on inferencing costs by leveraging a GPU/TPU.
  • It’s kind of opinionated and solves other problems you may not care about, e.g. model versioning and upgrades.

After doing some research and experimentation with TensorFlow Serving, we were convinced it was the path forward. But it seemed heavy in the sense that it’s meant to run as a gRPC or HTTP server; we were hoping for something more akin to a library. Short of dissecting the TensorFlow Serving build and/or trying to write Python wrappers around the C++ API, we were stuck running TF Serving as, well, a server. We considered three ways of doing this:

  1. Running TensorFlow Serving as it was (presumably) meant to be run: as a specialized, networked entity within our environment that would be explicitly managed, almost like a database. This would require significant Ops support.
  2. Nomad has a concept of Jobs that encapsulate ‘something you need to do’ via container orchestration. Jobs can have multiple tasks, which can use different containers but run side by side. We considered running TF Serving in one task and our microservice application in another.
  3. Embedding TF Serving within the same container as our microservice application and running it as a child process.

Option #1 just didn’t feel right to us. Our microservice-tainted minds wanted to avoid a centralized database style approach & associated operational maintenance costs.

Option #2 seemed promising, but the more we considered this approach, the more we realized that a big part of the value proposition of a microservice is its atomicity. Creating two bound entities complicates issues like logging and monitoring.

Wrapping TensorFlow Serving in a Microservice

In the end, we were able to get Option #3 working without much trouble. We avoided building TF Serving from source because our containers were binary-compatible with the official TF Serving Docker images, which let us use Docker’s multi-stage build pattern to pluck the TF Serving binary out of their image and stick it in ours:

# Stage 1: grab the prebuilt tensorflow_model_server binary from the official TF Serving devel image
ARG TF_SERVING_VERSION=latest
ARG TF_SERVING_BUILD_IMAGE=tensorflow/serving:${TF_SERVING_VERSION}-devel
FROM ${TF_SERVING_BUILD_IMAGE} as build_image

# Stage 2: copy that binary into our own TensorFlow base image
FROM docker.quiq.sh/quiqml-tf:2.4.4
COPY --from=build_image /usr/local/bin/tensorflow_model_server /usr/bin/tensorflow_model_server

We then start up TF Serving during our web server initialization:

import subprocess

# Launch tensorflow_model_server as a child process of our microservice
tfx_child = subprocess.Popen([
    "/usr/bin/tensorflow_model_server",
    "--model_base_path=/models/ro_inferencer",
    "--model_name=ro_inferencer",
    "--rest_api_port=0",  # 0 disables the REST endpoint
    "--port=0",  # 0 disables the gRPC TCP port
    "--grpc_socket_path=/tmp/grpcsocket",  # serve gRPC over a Unix domain socket instead
])

You might be wondering why we set rest_api_port and port to zero. We run our containers in host network mode, so if we didn’t set these ports to zero (which disables them), TF Serving would bind ports directly on the host, which is messy and doesn’t look very secure. Thankfully, TF Serving supports a socket-only communication mode over a Unix domain socket (although this commits you to making gRPC requests instead of REST).
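For the curious, an inference call over that socket looks roughly like the following, using the tensorflow-serving-api package. The input tensor name, signature name, and example values are placeholders; they depend on your SavedModel’s signature:

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# gRPC over the Unix domain socket we passed to tensorflow_model_server
channel = grpc.insecure_channel("unix:/tmp/grpcsocket")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "ro_inferencer"
request.model_spec.signature_name = "serving_default"  # placeholder signature name
request.inputs["input"].CopyFrom(  # "input" is a placeholder tensor name
    tf.make_tensor_proto([[1.0, 2.0, 3.0]], dtype=tf.float32)
)
response = stub.Predict(request, timeout=5.0)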

For our healthcheck, we simply wrapped the TF Serving model status API.
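Concretely, a helper along these lines can poll the model status API over the same Unix socket and back the /healthcheck resource (the function name and details are illustrative):

import grpc
from tensorflow_serving.apis import get_model_status_pb2, model_service_pb2_grpc

def tf_serving_is_healthy(socket_path="/tmp/grpcsocket", model_name="ro_inferencer"):
    """Return True if TF Serving reports at least one AVAILABLE model version."""
    channel = grpc.insecure_channel("unix:" + socket_path)
    stub = model_service_pb2_grpc.ModelServiceStub(channel)
    request = get_model_status_pb2.GetModelStatusRequest()
    request.model_spec.name = model_name
    try:
        status = stub.GetModelStatus(request, timeout=2.0)
    except grpc.RpcError:
        return False
    return any(
        v.state == get_model_status_pb2.ModelVersionStatus.AVAILABLE
        for v in status.model_version_status
    )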

What about Model Versioning?

TF Serving supports multiple model versions and transitions between them. However, we don’t use this feature at all, because we handle all versioning concerns at our microservice layer. We like this approach better because our models often have associated pre- and post-processing steps that are performed ‘outside’ the model but should still be versioned with it, since they must match the model’s training conditions.

Results

This approach has worked very well for our needs. The resulting microservice is indeed atomic and is indistinguishable from our other services. We can say confidently that the service is stable and has consistent memory use (no leaks). Wrapping another server within your microservice perhaps feels a bit kludgy, but for us it’s incredibly pragmatic and valuable.

It’s important to note that we still haven’t leveraged GPUs for model inferencing, but we have no reason to believe it won’t work; we’d just need to run ML services in a separate auto-scaling group with different hardware specs. Doing that would also let us potentially reduce hardware costs through batch inferencing, but plain old CPU is sufficient for our current use cases, models & load.

Acknowledgement

This article, and the work underpinning it, was done in conjunction with my friend and coworker Talon Daniels.
