How Cimpress Delivers Cloud Inference for its Image Processing Services with MXNet Model Server

Mike O'Brien · Apache MXNet · Jul 18, 2019

This article discusses the problems our team at Cimpress™ had developing cloud inference services and how MXNet Model Server helped solve those problems.

The Problem: Building Reliable and Scalable Cloud Inference Services

Our Image Processing Algorithms (IPA) team at Cimpress builds and owns multiple hosted neural-network-based services. These services provide various image processing capabilities via neural net inference implemented with MXNet.

The services fall into roughly two categories:

  • “Online” services that need minimal latency overhead and high throughput. A client that calls these services needs a real-time response.
  • “Offline” services where high latency is acceptable. Clients may call these like a batch service, e.g. 2–3 calls a second, with an occasional spike of 200–800 requests over a short duration.

As we started building these services, we sought solutions that would be:

  • Reliable, stable, and scalable
  • Able to support GPU operations
  • Able to serve both our online and offline needs
  • Easy to test locally and deploy unchanged to the cloud

Training and building neural networks is still relatively new territory, and a couple of Google searches turned up no quick and easy consensus on how they should be deployed. Further complicating our situation, IPA’s models require powerful GPUs to return a result within an acceptable time frame. This means expensive hardware, and expensive hardware means extra care so that costs do not outweigh the value provided.

The Process: How We Tried (And Failed) To Build Our Own Solution

Over the course of a few months, the team heavily evaluated build versus buy. We tried out some third-party and Amazon Web Services™ (AWS)-provided solutions that worked, but not well enough for our needs or at our desired price point. Here we will focus on our “build” approaches.

Our initial build solution was a simple Gunicorn/Flask app deployed on Elastic Beanstalk (a minimal sketch follows the list). We went this route because it was:

  • Simple
  • Quick and easy to get up and running
  • In Python, the language of our inference code
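For illustration, the shape of that first service looked roughly like the sketch below. This is a minimal, hedged reconstruction, not our production code; the model files, route, and preprocessing are placeholders.

```python
# Minimal sketch of a Flask inference app like the one we started with.
import mxnet as mx
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the trained network once at startup, onto the GPU.
ctx = mx.gpu(0)
net = mx.gluon.nn.SymbolBlock.imports(
    "model-symbol.json", ["data"], "model-0000.params", ctx=ctx)

@app.route("/predict", methods=["POST"])
def predict():
    # Decode the uploaded image and run a single forward pass.
    img = mx.image.imdecode(request.data)
    batch = img.transpose((2, 0, 1)).expand_dims(axis=0).astype("float32")
    out = net(batch.as_in_context(ctx))
    return jsonify(out.asnumpy().tolist())
```

Run under Gunicorn, something like this is trivially easy to stand up, which is exactly why we started here.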

We tested this with some sample inputs from one of our clients and saw that it performed reasonably well, with decent throughput. We happily went ahead with it as our production solution, where it promptly fell over. Under higher loads, and with much larger inputs than our tests used, the service would continually run out of GPU memory. We eventually got it stable, but the stability came at the cost of throughput: we had to over-provision the GPU fleet to process requests without hitting timeouts or 503 errors.

A stripped-down high-level view of our architecture

In parallel with stabilizing the Gunicorn/Flask app, we continued evaluating other solutions for throughput, latency, cost, and stability. This was a long and exhausting process in which more time was spent investigating, prototyping, and testing than developing a solution that could get the team unstuck. When we came across MXNet Model Server (MMS), it felt like yet another solution to throw on the pile, and we were close to committing fully to making the Flask app better. Still, we tried it out as a proof of concept and compared it with our existing Flask server.

The Solution: MMS And Why It Worked

We tested MMS against our Flask server on p3.2xlarge machines. We saw roughly 2x to 2.5x the throughput.

That increase was more than enough of an incentive to pursue MMS.

What is MMS?

MXNet Model Server is an open-source, production-ready model serving framework that supports running inference on models from all major ML/DL engines. It provides an easy-to-use interface to load an ML/DL model, trained using any framework, and exposes a unique endpoint per model to run inference requests against. MMS manages the complete lifecycle of loaded models, i.e., customers can load and unload models at run time. For more details about MMS and its features, please visit the project page.
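As a concrete illustration of that lifecycle, the hedged sketch below registers a model over MMS’s management API, calls the per-model inference endpoint, and then unregisters it. The ports are MMS defaults; the model name and file names are placeholders.

```python
# Sketch of MMS's run-time model lifecycle via its REST APIs.
# MMS defaults: inference on port 8080, management on port 8081.
import requests

MANAGEMENT = "http://localhost:8081"
INFERENCE = "http://localhost:8080"

# Register a model archive (.mar) at run time and start one worker for it.
requests.post(f"{MANAGEMENT}/models",
              params={"url": "my-model.mar", "initial_workers": 1})

# Each loaded model gets its own inference endpoint.
with open("input.jpg", "rb") as f:
    resp = requests.post(f"{INFERENCE}/predictions/my-model", data=f.read())
print(resp.json())

# Unregister the model when it is no longer needed.
requests.delete(f"{MANAGEMENT}/models/my-model")
```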

Why MMS?

After a long time on the lookout, we found that MMS had what we needed:

It was easy to package our inference code and deploy it (see the sketch after this list).

  • Our inference code was already Python, so integrating it as a custom MMS service was quick.
  • We were able to quickly test this deployment setup locally and push the same setup to production on the cloud.
  • We had become familiar with Elastic Beanstalk and were able to quickly deploy MMS to Beanstalk in a Docker™ environment.
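For illustration, a custom MMS service is a Python module exposing a handle(data, context) entry point that MMS calls for each batch of requests; the module and model files are then bundled into a .mar archive with MMS’s model-archiver tool. The sketch below is a hedged reconstruction; the class, file names, and preprocessing are placeholders, not our production handler.

```python
# Sketch of the MMS custom-service pattern: MMS imports this module and
# invokes handle(data, context) per request batch.
import mxnet as mx

class ImageService:
    def __init__(self):
        self.initialized = False

    def initialize(self, context):
        # MMS tells us where the extracted archive lives and which GPU
        # this worker was assigned.
        props = context.system_properties
        gpu_id = props.get("gpu_id")
        self.ctx = mx.gpu(gpu_id) if gpu_id is not None else mx.cpu()
        model_dir = props.get("model_dir")
        self.net = mx.gluon.nn.SymbolBlock.imports(
            f"{model_dir}/model-symbol.json", ["data"],
            f"{model_dir}/model-0000.params", ctx=self.ctx)
        self.initialized = True

    def handle(self, data, context):
        # data is a list of request payloads; return one result each.
        results = []
        for row in data:
            img = mx.image.imdecode(row.get("data") or row.get("body"))
            batch = img.transpose((2, 0, 1)).expand_dims(0).astype("float32")
            out = self.net(batch.as_in_context(self.ctx))
            results.append(out.asnumpy().tolist())
        return results

_service = ImageService()

def handle(data, context):
    # Entry point MMS calls; lazily initialize on the first request.
    if not _service.initialized:
        _service.initialize(context)
    if data is None:
        return None
    return _service.handle(data, context)
```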

We saw 2x to 2.5x throughput over our Flask solution.

  • A high-performance Java server backed by Python request handlers meant we saw significant performance gains without having to port our Python inference code.

It was stable and could queue requests.

  • We had both online and offline use cases and were not able to quickly convert our architecture to use a message queue. MMS’s on-server queue let us meet both sets of needs.

It was easy to configure and monitor.

  • Queue size was easy to modify to handle offline uses (up to a thousand inference requests); see the config sketch after this list.
  • It was easy to configure the number of workers per machine, which is important for models with large memory footprints, although finding the right number is tricky and takes some trial and error.
  • Built-in server logging made tracking errors easier and slimmed down setup time.
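For illustration, both of those knobs live in MMS’s config.properties file. The values below are placeholders, not our production settings.

```
# Illustrative MMS config.properties snippet.
# Request queue depth: lets offline bursts wait in line instead of
# being rejected.
job_queue_size=1000
# Worker processes started per loaded model: tune so the combined GPU
# memory footprint fits the machine.
default_workers_per_model=2
```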

Overview of our architecture

Thanks to MMS, the general architecture of a processor ends up being pretty simple:

  1. Our inference code and models are packaged as an MMS custom service.
  2. The custom service gets copied into a Docker image with an MXNet + CUDA™-ready environment.
  3. The Docker image is deployed to EC2 using Deep Learning AMIs and Elastic Beanstalk for Docker environments.
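To make those steps concrete, a Dockerfile along the lines below would do it. This is a hedged sketch: the base image, package versions, and archive name are placeholders, not our production build.

```
# Illustrative Dockerfile: an MXNet + CUDA environment running MMS.
FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04

# MMS needs Python and a Java runtime.
RUN apt-get update && \
    apt-get install -y python3 python3-pip openjdk-8-jre-headless && \
    pip3 install mxnet-cu100 mxnet-model-server

# Copy in the packaged custom service (.mar) from step 1.
COPY my-model.mar /models/

# Start MMS serving the archived model; --start daemonizes, so tail
# keeps the container alive.
CMD mxnet-model-server --start --model-store /models \
      --models my-model.mar && tail -f /dev/null
```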

About Cimpress and IPA squad

Cimpress N.V. (Nasdaq: CMPR) invests in and builds customer-focused, entrepreneurial, mass-customization businesses for the long term. Mass customization is a competitive strategy which seeks to produce goods and services to meet individual customer needs with near mass production efficiency. Cimpress businesses include BuildASign, Drukwerkdeal, Exaprint, National Pen, Pixartprinting, Printi, Vistaprint and WIRmachenDRUCK.

Cimpress Technology is a central organization that supports Cimpress brands with technology related to mass customization. The IPA Squad focuses on image processing algorithms that use machine learning and vision techniques to improve the ability to apply customers’ artwork to different decoration technologies.

Acknowledgments

Thanks to Ajay Joshi, Brian Hanechak, Phillip Graham and Vamshidhar Dantu for feedback and contributions.

Cimpress and the Cimpress logo are trademarks of Cimpress N.V. or its subsidiaries. All other brand and product names appearing on this announcement may be trademarks or registered trademarks of their respective holders.
