Serve Your ML Models in AWS Using Python

GLAMI AI team · Published in GLAMI Engineering · Aug 18, 2020

Automate your ML model train-deploy cycle, garbage collection, and rollbacks, all from Python with an open-source PyPI package based on Cortex.

What’s to Rule the Endpoints?

It all started with the modernization of a product categorization project. The goal was to replace complex low-level Docker commands with a very simple and user-friendly deployment utility called Cortex. The solution, a Python package, proved to be reusable: we successfully used it as part of our recommendation engine project, and we plan to deploy all of our ML projects this way. Since GLAMI relies heavily on open-source software, we wanted to contribute back and decided to open-source the package, calling it Cortex Serving Client. So now you can use it as well.

Model Manager Architecture

The product categorization project used a low-powered manager instance to publish an external API and to trigger training and prediction in a cluster as needed. The cluster itself was isolated from the external world, but this may not be desired in all cases. This architecture is practical in cases where scaling or temporary deployment on costly instances is needed.

We decided to keep this architecture and adopt it in the recommendation engine project as well. On that project, the deployments needed to receive prediction requests continuously and be hot-swapped whenever a newly retrained model became available, in contrast to the batch operation of the product categorizer.

Model Manager Architecture

Update 2023–04–26: Cortex.dev was bought by Databricks

Consider migrating away from Cortex.dev: it was bought by Databricks and is currently only maintained, not actively developed. There are several alternatives. You can go clusterless, running directly on EC2 with AWS CLI scripts, or on Fargate. If you are willing to maintain a cluster, Terraform is a popular choice, and for deployment you can use e.g. the Kubernetes Python client. The overall ideas in this post still apply.

Cortex Is What?

Cortex is an open-source command-line utility that orchestrates the creation of an Amazon EKS cluster (Kubernetes) and the deployment, scaling, serving, and load balancing of model endpoints in the cluster. The primary abstraction in Cortex is an API. An API has a name, scaling and computation requirements, and a Python module with a PythonPredictor class. Once deployed, each API serves a single HTTP POST endpoint. When you call this endpoint, the JSON payload is passed as a dictionary to an instance of PythonPredictor.
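
For a concrete picture, here is roughly what a minimal predictor module looks like. This is a sketch: the toy keyword rule stands in for a real trained model, and the default_category config key is made up for illustration.

```python
# predictor.py -- a minimal Cortex predictor module (sketch).
# The toy keyword rule below stands in for a real trained model.

class PythonPredictor:
    def __init__(self, config):
        # Called once per replica when the API is deployed.
        # `config` carries key-value pairs from the API's YAML spec;
        # `default_category` is a made-up example key.
        self.default_category = config.get("default_category", "unknown")

    def predict(self, payload):
        # `payload` is the parsed JSON body of the HTTP POST request.
        # The return value is serialized back to the caller as JSON.
        text = payload.get("text", "")
        category = "dress" if "dress" in text.lower() else self.default_category
        return {"category": category}
```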

Cortex architecture diagram (source: Cortex docs)

Cortex Serving Client

That is all nice, but how do you manage the deployed models from Python? Cortex provides only a command-line client implemented in Go! The natural solution was a Python wrapper around the Cortex executable. The Python “with” statement even allowed us to auto-remove deployed endpoints when they were no longer needed.

Code snippet of deployment using Cortex Serving Client
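
The idea behind the snippet, as a minimal sketch (not the package's actual API; see the GitHub README for that): a context manager around the real cortex deploy and cortex delete commands guarantees that the endpoint is removed even if the code inside the block fails.

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def deployed_api(api_name, deploy_dir):
    # `cortex deploy` reads the API spec (cortex.yaml) from the directory.
    subprocess.run(["cortex", "deploy"], cwd=deploy_dir, check=True)
    try:
        yield api_name  # the endpoint is live inside the `with` block
    finally:
        # Runs even when the block raises, so no endpoint is left behind.
        subprocess.run(["cortex", "delete", api_name], check=True)

# Usage sketch:
# with deployed_api("categorizer", "categorizer_deploy_dir"):
#     ...  # POST prediction requests to the endpoint, collect results
```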

The Banana Skin of AWS

But then we remembered the famous banana skin of AWS: the forgotten-instance cost! What if an endpoint gets accidentally stuck alive on an expensive instance? We added a database table that lets Cortex Serving Client track a timeout for each endpoint. A garbage collector then periodically removes all expired or unknown APIs from the cluster.
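
Sketched out, the mechanism might look as follows; the table layout and the function names are illustrative, not the client's actual schema:

```python
import subprocess
from datetime import datetime, timezone

# Illustrative table the client could keep in the database:
#   CREATE TABLE cortex_api (name TEXT PRIMARY KEY, timeout_at TIMESTAMPTZ);

def garbage_collect(tracked_timeouts, deployed_api_names):
    """Remove every deployed API that has expired or is not tracked at all.

    `tracked_timeouts` maps API name -> timeout timestamp from the table;
    `deployed_api_names` would come from listing the cluster's APIs.
    """
    now = datetime.now(timezone.utc)
    for name in deployed_api_names:
        timeout_at = tracked_timeouts.get(name)
        if timeout_at is None or timeout_at < now:
            # Unknown or expired: delete the endpoint from the cluster.
            subprocess.run(["cortex", "delete", name], check=True)
```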

Rollback How?

Just a second, what about model rollback? Rollback is a good use case for our client. In the recommendation engine model manager, we implemented a simple strategy of returning to the previous version of the model: we catch exceptions raised by the client during deployment and re-deploy from a backup by calling the client's methods.
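
In Python, the strategy boils down to a try/except around the deployment call. The sketch below uses a hypothetical client.deploy method as a stand-in for the client's deployment call:

```python
def deploy_with_rollback(client, new_model_spec, backup_model_spec):
    # `client.deploy` is a hypothetical stand-in for the client's
    # deployment method, assumed to raise when the rollout fails.
    try:
        client.deploy(new_model_spec)
    except Exception:
        # The new model failed to deploy -- restore the previous version.
        client.deploy(backup_model_spec)
        raise  # re-raise so the caller can log and alert on the failure
```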

Get the Client

If you are looking for an MLOps solution, we recommend looking at either Cortex or BentoML. Cortex is an opinionated end-to-end solution, while BentoML is more generic in terms of deployment, as it supports more platforms. However, BentoML also seems correspondingly more complex to set up. On the other hand, BentoML is implemented in Python and has a model packaging capability. If you choose Cortex like us, consider using Cortex Serving Client. You can read more about it on GitHub.

Cortex Serving Client logo

What architecture do you use? Have you faced similar challenges?

Subscribe for more blog posts!

Are you interested in working for GLAMI? Check out our job listings!

Update 2021–07–20

Cortex evolves very quickly. In version v0.21.0, Cortex added an official Python client inspired by our project. Our Cortex Serving Client still offers additional features not present in the vanilla client, such as temporary deployment, timeouts, and garbage collection, so we continue to use and maintain it.

By Václav Košař
