Serving Custom Python Machine Learning Models to Production

Or “How to build your own model serving platform when just scikit-learn and statsmodels are not enough”

Flavio Altinier
Pier. Stories
Feb 18, 2022 · 10 min read



Data Science does not scale on its own

Every startup eventually comes across this problem: you have hired a couple of Data Scientists and they have found a tremendous opportunity to solve some big business problem by using Machine Learning. They hide in their cave for a week or two and then come out with a trained model capable of wonders, that “just needs” to be attached to some existing production pipeline.

So they sit with a team of Software Engineers and adapt that specific existing pipeline to be able to run that specific single model. Everything works fine for a while, and business numbers start to show significant improvement.

However, as time goes on, the model starts performing worse and worse. Every time the Data Scientists need to make any change to it, the same Software Engineering team has to be called in, stopping whatever they are doing to help deploy the updated model.

At the same time, other business problems that could be solved with Machine Learning are being found, new models are being built, and new Data Science + Software Engineering teams are being assembled to deploy those specific models to production.

It is not hard to see where this is going. Before you know it, your back-end architecture becomes a Frankenstein’s monster, with models spread all around without any centralized governance or management. Deploying new models gets more expensive every time, and things quickly get out of hand.

So how do you solve that?

Off-the-Shelf Solutions


The architectural solution the market has found is to centralize model inference in a single service (or set of services), transforming what would be a .predict() call into an API call.

That sounds simple, but it is a pretty clever solution: no more worrying about cross-language model invocation or tight coupling between Data Science and Software Engineering teams.

What the market has not done well, however, is cover more complicated use cases. If what you need is to train a simple scikit-learn model and just expose its invocation as an API, then there are multiple decent solutions out there — MLflow Models, AWS SageMaker or Cortex (although this last one is only worth it when you have a really high request volume).

Those solutions fall short when you need something a little more specific, though. Here at Pier, we found that none of the above solutions adapted easily to our use case, and in the end we decided to build our own model serving platform.

That platform turned out to be relatively easy to build and its infrastructure super cheap (much cheaper than the off-the-shelf solutions), so we thought this might be a good opportunity to share our experiences.

Pier’s business requirements

The main reason we wanted to deploy ML models to production was our Pricing pipeline. As an insurance company, our pricing can vary a lot from quote to quote, and we have to make sure everyone is getting the best price possible, as fast as possible.

In the end we came up with a generic solution that can hold any kind of custom model, but everything that is generic is harder to explain. So for the rest of this article we will continue developing the Pricing anecdote, and migrate the narrative to the generic solution once the basics are covered.

Well, our Pricing model’s business requirements were quite extensive:

  1. Serving: models must be highly accessible as a REST API that our back-end can query whenever a new quote is generated. The response must be immediate (a few milliseconds).
  2. Customization: calculating Pricing is complicated, and there are a lot of business rules involved — which makes simply using scikit-learn unfeasible. For example: if a user quotes a vehicle that our pricing model does not know, it has to inform the back-end system that the problem is an “Unknown Vehicle”, so we can broadcast that problem back to our Product team (and sometimes to the user themselves).
  3. Going back in time: say we want to test old models to check what pricing would have looked like a few weeks ago — those must also be accessible through the REST API. So models must be immutable: once a model is sent to the platform, it stays there forever.
  4. AB testing: we may want to test multiple Pricing models at the same time, and verify which does best against some objective function.
  5. Auditability: every inference’s inputs and results must be saved, and must be easily accessible for future analysis.
  6. Input validation: as our Pricing model must return the reason why a price could not be calculated, and that reason will potentially be passed back to the customer, we cannot run the risk of a simple malformed input triggering it. So our Serving platform must validate the request body before calling the model.
  7. Autonomy for Data Science: Data Scientists must be able to tinker with the models as they please, without the need of involving any Software Engineer.
  8. Model Testing: as models will be part of the critical path of our product, they must be thoroughly tested. We cannot afford a buggy model breaking production.

The solution

The solution came in three pillars, in a mixture of process and technology:

  1. A new model creation/maintenance/management process
  2. A model store
  3. A new back-end service that manages the dynamic APIs.

The new model creation process

We are a startup, so many of the steps in our CRISP-DM process are still a Work In Progress. To make the solution work, however, we had to establish a standard for the Modeling phase.

We made it official that only the Data Scientists themselves would be responsible for model management. Deciding which models should go to production, which model version is which, and governing inputs are all the Data Science team’s responsibility. They may use specific tooling for that, or just manage their own Jupyter notebooks. Their call.

What is important is that we created the rule that once versioned, a model should be immutable (just like software package releases). If a model needs an update, a new model version should be created.

Once the Data Scientist believes their model is ready to be officially released, they must serialize it to a file — however, serializing and packaging custom python models is not trivial.

When using just sklearn or statsmodels, it is easy to serialize models simply by using pickle. You can save the model object as a pickle file and load it in any other python environment that has sklearn or statsmodels installed, and things magically work.

When creating custom models it is not so easy to share objects between different environments, though. Imagine you create a custom class that tweaks the way a default sklearn model works, and instantiate an object from it — if you just pickle that object and then try to load it into a different environment, python is going to complain that it does not know the object’s class declaration.

That can be fixed by using dill instead of pickle. The dill format serializes not only the object, but also its class declaration. It creates a file that can be loaded into pretty much any other python environment, which is extremely useful for our Pricing case.
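As a minimal sketch of the difference, here is roughly what creating and loading a release with dill looks like (the CustomPricingModel class, its business rule and the file name are illustrative, not our actual Pricing model):

    import dill
    from sklearn.linear_model import LinearRegression

    # A custom class that wraps a default sklearn estimator with extra business logic.
    class CustomPricingModel:
        def __init__(self):
            self.regressor = LinearRegression()

        def fit(self, X, y):
            self.regressor.fit(X, y)
            return self

        def predict(self, payload):
            # Custom business rule: refuse to price vehicles the model does not know.
            if payload.get("vehicle") is None:
                return {"error": "Unknown Vehicle"}
            return {"price": float(self.regressor.predict([payload["features"]])[0])}

    model = CustomPricingModel().fit([[1.0], [2.0], [3.0]], [10.0, 20.0, 30.0])

    # pickle would only store a reference to CustomPricingModel, so loading the file
    # in an environment without this class definition would fail. dill serializes the
    # class declaration together with the object:
    with open("custom_pricing_model_1.0.0.dill", "wb") as f:
        dill.dump(model, f)

    # In another environment, the release can be loaded without importing the class:
    with open("custom_pricing_model_1.0.0.dill", "rb") as f:
        restored = dill.load(f)

    print(restored.predict({"vehicle": "hatchback", "features": [2.5]}))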

So, to create a model release, the Data Scientist must follow two steps:

  • Submit the model creation code (including input values) for Peer Review
  • Once approved, create the dill file for that version of the model. This immutable dill file is the official model release.

The Model Store

Okay, so now the Data Scientist has a dill file and they want to serve it to production. How do they do that?

First, we need somewhere to place that model, so that our model serving platform can find it.

As most of our infrastructure runs on AWS, the logical and easiest solution was to just place it in a bucket in S3 — and that is exactly what we did. We created a model-store bucket in S3, and that is the repository where we keep all of our serialized models (in the format of dill files). Different model versions are simply distinguished by filename (also, just like software package tagging).
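Uploading a release is then a plain S3 upload, something along these lines with boto3 (the bucket name matches the one above; the file and key names are illustrative):

    import boto3

    s3 = boto3.client("s3")

    # The version lives in the file name, just like a package tag.
    s3.upload_file(
        Filename="custom_pricing_model_1.0.0.dill",
        Bucket="model-store",
        Key="custom_pricing_model/1.0.0.dill",
    )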

pier-models-platform

pier-models-platform is the most important part of this pipeline, a back-end app built in python, using Flask. It has three main responsibilities:

  • Open a dynamic API for every model in the model store
  • Check every request’s input against some set of rules
  • Persist in a database the inputs and result of every inference

As soon as the app is deployed, it searches the Model Store in S3 for new models and downloads them. It then deserializes every model into memory, where they wait for .predict() calls. We keep that deserialized model store in a python dict, where the keys are the models’ names and versions, and the values are the deserialized objects.
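A rough sketch of that start-up step, assuming the bucket and file layout from the previous section (in the real platform, only the versions marked as active in the config described below are loaded):

    import boto3
    import dill

    s3 = boto3.client("s3")

    # In-memory model store: {(model_name, version): deserialized model object}
    models = {}

    # List every serialized model in the Model Store and deserialize it.
    # The key layout "<model_name>/<version>.dill" is illustrative.
    response = s3.list_objects_v2(Bucket="model-store")
    for entry in response.get("Contents", []):
        key = entry["Key"]  # e.g. "custom_pricing_model/1.0.0.dill"
        name, version = key.rsplit(".dill", 1)[0].split("/", 1)
        body = s3.get_object(Bucket="model-store", Key=key)["Body"].read()
        models[(name, version)] = dill.loads(body)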

From then on, every action in pier-models-platform is managed by its JSON config file:
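In broad strokes it follows this shape (the key names are the ones discussed below and model_1 is the same placeholder name; the concrete versions and schema properties are made up for illustration):

    {
      "model_1": {
        "active_model_versions": ["2.0.0", "2.1.0", "2.1.1"],
        "default": "2.1.1",
        "schemas": {
          "2.1.X": {
            "type": "object",
            "properties": {
              "feature_a": {"type": "number"},
              "feature_b": {"type": "string"}
            },
            "required": ["feature_a", "feature_b"]
          }
        }
      }
    }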

There is a lot of info in there, so let’s analyze it in parts:

  • Right at the root, we have a model_1 key. That key (and all its possible siblings) determines the configs of that particular model. In our Pricing use case, this would be something like auto_pricing.
  • active_model_versions tells pier-models-platform which model versions to look for in the Model Store. Every model in this list receives a special dynamic endpoint in the service, and can then easily be called through the REST API.
  • default: if the API caller does not specify the model version they want, this is the version that will be called.
  • schemas: we take advantage of JSON Schemas to document the expected input format of the API request. Every request sent to the API that triggers this specific model_1, for versions 2.1.X, will be tested against this schema before being sent to the model. If the validation fails, the caller gets a 422 response.
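That validation step maps almost directly onto the jsonschema library; a minimal sketch, assuming the config file above has been loaded (the file name and payload values are illustrative):

    import json
    from jsonschema import ValidationError, validate

    with open("config.json") as f:  # the platform's JSON config file
        config = json.load(f)

    schema = config["model_1"]["schemas"]["2.1.X"]
    payload_input = {"feature_a": 1.5, "feature_b": "hatchback"}

    try:
        validate(instance=payload_input, schema=schema)
    except ValidationError as error:
        # This is the point where the platform answers with a 422.
        print("invalid input:", error.message)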

Real-time Inference

Requesting the platform is fairly easy: the requester must send a POST with the expected features as input, the model to be used, and its version. For example, the POST payload for our model_1 example would look something like this*:
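The exact features depend on the model; here, feature_a and feature_b are just illustrative placeholders:

    {
      "model": "model_1",
      "version": "2.1.1",
      "input": {
        "feature_a": 1.5,
        "feature_b": "hatchback"
      }
    }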

When requested, pier-models-platform will follow a series of steps (sketched in code after the list):

  1. Deserialize the JSON payload and save it to a Python dict
  2. Use the payload’s model and version to find the expected JSON Schema, and validate the input against it
  3. If validation passes, find the right model and version in the deserialized model store, and call its .predict() method using the payload’s input object as parameter
  4. From the platform’s perspective, the model then works as a black box. In the end, the model will return whatever it was programmed to (in the case of custom models, such as our Pricing model, this return object can be pretty complex)
  5. Save the inference to a relational database: the timestamp, model and version, and input and output objects
  6. Serialize the model return object and send it back to the requester
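Putting the whole flow together, a minimal Flask sketch could look like this. It reuses the models dict and config from the earlier sketches; the route, helper functions and persistence stub are hypothetical, not the platform’s actual code:

    import json
    from datetime import datetime, timezone

    from flask import Flask, jsonify, request
    from jsonschema import ValidationError, validate

    app = Flask(__name__)

    models = {}  # populated at start-up as in the loading sketch: {(name, version): model}
    config = {}  # the parsed JSON config file, loaded with json.load at start-up

    def schema_for(name, version):
        # Hypothetical lookup: the config keys schemas by version pattern (e.g. "2.1.X").
        pattern = ".".join(version.split(".")[:2]) + ".X"
        return config[name]["schemas"][pattern]

    def save_inference(**record):
        # Stub: the real platform persists timestamp, model, version, input and
        # output to a relational database here.
        pass

    @app.route("/inferences", methods=["POST"])
    def run_inference():
        payload = request.get_json()                                # 1. deserialize the JSON payload
        name = payload["model"]
        version = payload.get("version", config[name]["default"])   # fall back to the default version

        try:                                                        # 2. validate against the JSON Schema
            validate(instance=payload["input"], schema=schema_for(name, version))
        except ValidationError as error:
            return jsonify({"error": error.message}), 422

        model = models[(name, version)]                             # 3. find the deserialized model
        result = model.predict(payload["input"])                    # 4. the model is a black box here

        save_inference(                                             # 5. persist the inference
            timestamp=datetime.now(timezone.utc),
            model=name,
            version=version,
            input=json.dumps(payload["input"]),
            output=json.dumps(result),
        )
        return jsonify(result), 200                                 # 6. send the return object back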

Deploying models to the platform

Solution’s infographic

That system design gives the Data Scientists complete autonomy to deploy models to production, without needing any help from the Software Engineering team.

Let us say a Data Scientist needs to develop a new model version and send it to production. They would follow a pretty concise series of steps:

  1. Develop the new model version in their notebook or python script, following whatever rules the Data Science team has created for their own governance
  2. Once approved in peer review, serialize the model to a dill file
  3. Upload the dill file to the Model Store
  4. Edit pier-models-platform’s config file, adding the new model version and JSON Schema, if needed
  5. Create tests for the model version using pytest (see the sketch after this list)
  6. Then create a Pull Request with the changes
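Step 5 could look something like the test file below (a hypothetical example; the model name, release file and expected behavior come from the illustrative Pricing model used earlier):

    import dill
    import pytest

    MODEL_PATH = "custom_pricing_model_1.0.0.dill"  # illustrative release file

    @pytest.fixture(scope="module")
    def model():
        with open(MODEL_PATH, "rb") as f:
            return dill.load(f)

    def test_known_vehicle_gets_a_price(model):
        result = model.predict({"vehicle": "hatchback", "features": [2.5]})
        assert result["price"] > 0

    def test_unknown_vehicle_is_reported(model):
        result = model.predict({"vehicle": None, "features": [2.5]})
        assert result["error"] == "Unknown Vehicle"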

And that’s it! Once the PR is approved by our MLOps team, our CI/CD pipeline kicks in and deploys the model to production.

Analyzing Results

So how do we know models are working as expected? There are two different kinds of monitoring:

  • IT Monitoring: from our back-end’s perspective, pier-models-platform is just another microservice. So it takes advantage of all our standard IT Monitoring tooling — Rollbar for errors, EFK for application logs, New Relic for response times, etc.
  • Model Performance: as every inference is saved to a relational database, it is fairly easy to send inference data to our DataLake, as we have shown here. Hence, Data Scientists can leverage all the analysis tooling we already have to measure model performance, such as accuracy or recall. This is also how we analyze AB tests’ results.

To sum up

  • That infrastructure took a long time to design — but once the design was settled, it was fairly easy to implement. A senior MLOps developer took around 3 weeks to build and stabilize pier-models-platform, and a senior Data Scientist took around 2 weeks to build the first Pricing model release.
  • Being just a simple Flask application that routes model calls, pier-models-platform is super fast and lightweight. At the time of the first deployment, we just needed a simple Heroku dyno to host it, costing around $25/month. Yup, you read that correctly :)

*The actual payload would be wrapped by a JSON:API serializer, as it is the standard communication format we use at Pier.
