How to run many deep learning models locally

Anthony Sarkis
May 14, 2019


Code on GitHub

Quick! Build two networks and have them predict on a webcam! Sounds hard, right?

These models were trained on just 20 images. The images shown are test data from the webcam.¹

Motivation for this type of abstraction

Software is built from building blocks. As the saying goes, “it’s turtles all the way down.” Much of the advancement in deep learning has focused on the hard work of getting models to work well and perform in a research setting. Now it’s time to bring attention to how we use those models in a day-to-day environment.

Starting point — run a single model

When to use local

Running a model locally (instead of through an API call) usually makes sense for:

  • processing a high volume of information
  • time-sensitive data
  • data security

Design overview

Here I’ll walk through one way to run a model locally using TensorFlow.

We will:

  • Perform a one time setup to load the brain
  • Run new images
  • Assume we wish to run multiple brains, and to keep each brain independent.

Code on GitHub


The three parts we set up are:

  • The weights
  • The graph definition
  • A label map

TensorFlow’s saved model format bundles the weights and the graph definition. Collectively, we refer to these three parts as the Brain.
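As a rough sketch, the three parts could be bundled into a single container like this. The class and field names here are illustrative, not the SDK’s actual types:

```python
from dataclasses import dataclass

@dataclass
class Brain:
    # Path to the saved model file, which bundles the weights
    # and the graph definition together
    model_path: str
    # Maps the model's class ids to human-readable label names
    label_map: dict

# Hypothetical example values
brain = Brain(model_path="models/page.pb", label_map={0: "page", 1: "graph"})
```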

The setup process may be thought of as:

  • Getting the brain. While we want to run the brain locally, we assume the latest version is on a remote server.
  • Load brain into memory.

Getting the brain

  • Make a call to an endpoint to get the remote file paths for downloading the brain
  • Make the request to the remote file to get the model
  • Fiddle with the label maps.
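The steps above can be sketched roughly as follows. The endpoint route, JSON field names, and file layout are all assumptions for illustration; the real server API will differ:

```python
import json
import os
import urllib.request

def fetch_brain(api_base, model_name, dest_dir="models", http_get=None):
    """Download a trained model ("brain") from a remote server.

    http_get is injectable for testing; by default it fetches over HTTP.
    """
    if http_get is None:
        http_get = lambda url: urllib.request.urlopen(url).read()
    # 1. Call the endpoint to get the remote file paths (hypothetical route)
    meta = json.loads(http_get(f"{api_base}/models/{model_name}/links"))
    # 2. Request the remote file to get the model itself
    os.makedirs(dest_dir, exist_ok=True)
    model_path = os.path.join(dest_dir, f"{model_name}.pb")
    with open(model_path, "wb") as f:
        f.write(http_get(meta["model_url"]))
    # 3. Hand the label map back for the "fiddling" step
    return model_path, meta.get("label_map", {})
```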

Label map fun

There are at least three label maps:

  • file_id_to_model_id
  • model_id_to_file_id
  • file_id_to_name

The rationale for this is:

  • The name is considered arbitrary, so the real reference point is the file_id
  • The model_id is the sequential id, usually starting at 0, that the model has actually been trained on.

Given that the number of labels is usually < 100 and relatively static, it makes sense to simply keep these dictionaries in memory for fast access, depending on which direction we are converting.
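For illustration, the three maps and a conversion through them might look like this (the ids and names are made up):

```python
# Stable file ids -> sequential ids the model was actually trained on
file_id_to_model_id = {101: 0, 102: 1}
# Reverse direction, derived from the first map
model_id_to_file_id = {v: k for k, v in file_id_to_model_id.items()}
# Names are arbitrary and may change, so they hang off the file_id
file_id_to_name = {101: "page", 102: "graph"}

def model_id_to_name(model_id):
    """Convert a raw class id from the model back to a display name."""
    return file_id_to_name[model_id_to_file_id[model_id]]
```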

Load model into memory

  • Read the file we just downloaded
  • Get a session ready²
# Read the frozen graph file we just downloaded
self.graph = tf.Graph()
with self.graph.as_default(), tf.gfile.GFile(self.model_path, 'rb') as fid:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(fid.read())
    tf.import_graph_def(graph_def, name='')
self.sess = tf.Session(graph=self.graph)

Great! Now we are ready to run the model on demand.


  • Open the image and read the data³
  • Run session
  • Parse high confidence values

The abbreviated version is:

image = open(path, "rb")
image = image.read()
image = tf.compat.as_bytes(image)

We then run the session and parse the high-confidence values:

for i in range(self.boxes.shape[0]):
    if self.scores[i] > self.min_score_thresh:
        ...

Current algorithms usually produce a bunch of low-confidence predictions that must be discarded.

By default we create an Inference() object that contains an Instance() object for each high-confidence prediction.

The goal is to make the output easier to work with through a standardized interface.
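A minimal sketch of what that interface could look like (the class and field names here are assumptions, not the SDK’s real attributes):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Instance:
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (ymin, xmin, ymax, xmax), normalized

@dataclass
class Inference:
    instances: List[Instance] = field(default_factory=list)

def parse_predictions(boxes, scores, labels, min_score_thresh=0.5):
    """Keep only the high-confidence predictions, wrapped in one Inference."""
    inference = Inference()
    for box, score, label in zip(boxes, scores, labels):
        if score > min_score_thresh:  # discard low-confidence predictions
            inference.instances.append(Instance(label, score, tuple(box)))
    return inference
```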

Abstracting it

We abstract the brain setup into:

brain = project.get_model(
    name = None,
    local = True)

Then the run:

inference = brain.predict_from_local(file_path)

Code on GitHub.


  • Clean abstraction for different deep learning methods, local vs online prediction, and file types
  • Designed for changing models and data. The same object you call .train() on can also call .predict()
  • Ground up support for many models. See local_cam for one example.

The goal is to be able to call a similar method, be it for an object detection problem, semantic segmentation, or some future method.

Two brains are better than one

Get two brains:

page_brain = project.get_model(
    name = "page_example_name",
    local = True)
graphs_brain = project.get_model(
    name = "graph_example_name",
    local = True)

We open an image from a local path and run both brains on the same image. Since we only read the image once, you can stack as many brains as you need here (with memory and compute implications as you add more).

image = open(path, "rb")
image = image.read()
# the method name below is illustrative; the exact SDK call may differ
page_inference = page_brain.run(image)
graphs_inference = graphs_brain.run(image)

Why many models?

  • pages all look similar
  • what’s on the page will likely have a lot more variance and require a lot more data.

So the trade off here is:

  • More compute
  • Less annotation
  • More flexibility — smaller, per-task models are easier to update and retrain.

This is the age-old argument of new vs. old, and historically we tend to favor what’s easier.

Thanks for reading!


The SDK is a work in progress. There’s a lot of stuff in the current version I’m not happy with yet! If you see any glaring issues or have feature ideas, please feel free to create an issue here.

1. As is standard, the model was trained automatically. It was fine-tuned from prior data and trained on similar images from the same book. If this sounds like “cheating,” consider it the new age of working with deep learning, where we purpose-build our training data to fit our test distribution as closely as possible. It works!

2. feed_dict is considered a poor approach in some cases, and this whole setup assumes we must construct the graph definition, which appears to be done differently in TF 2.0.

3. Online prediction, at the time of writing, assumes an encoded_string_tensor, which requires the acrobatics of using tf.compat.as_bytes().