How to Build an ML Model Registry: A Step-by-step Guide from Opendoor Engineering
By: Chongyuan Xiang
With the advance of open-source libraries such as scikit-learn, CatBoost, and PyTorch, training a machine learning model has become much more approachable. You can load your data, call the fit function from one of these libraries, and you have a trained ML model!
Does that mean the job is done? Unfortunately, in practice, getting a model trained is only the start of the journey. You’ll also need to systematize model training and metric collection, safeguard against bad models going into production, and ensure model performance consistency in a distributed system, among other things.
At Opendoor, machine learning is foundational to our pricing model, which allows us to adapt to changing market conditions while also making competitive offers on homes so we can serve more customers. We are constantly improving the accuracy of our pricing and risk models, which is why we built a system called Model Registry that makes it easier to put machine learning models into production.
Model Registry is a service that manages model artifacts and tracks which models are deployed in production. In this blog post, we will introduce the core concepts of Model Registry by walking you through an example related to predicting home prices.
Let’s Start with Something Simple
Assume we have a linear regression model predicting home prices trained with the scikit-learn library. The goal is to create an API that uses the model to serve predictions. We can start with a simple system as follows.
We train the model once and serialize it into a file called model.pkl. Specifically, we use Python's pickle module for the serialization. After that, we store the file in S3, Amazon's cloud storage system. Then we build an API service that reads the file from S3, deserializes the model, and starts making predictions.
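This train-once flow can be sketched in a few lines. The training data, the local file name, and the boto3 call in the comment are all illustrative placeholders, not Opendoor's actual code or paths:

```python
# Sketch of the one-off flow: train, pickle, (upload to S3), reload, predict.
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: square footage -> sale price.
X = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])
y = np.array([300_000.0, 420_000.0, 560_000.0, 690_000.0])

model = LinearRegression().fit(X, y)

# Serialize the trained model with pickle.
with open("model.pkl", "wb") as f:
    f.write(pickle.dumps(model))

# In production the file would then be uploaded to S3, e.g. with boto3:
#   boto3.client("s3").upload_file("model.pkl", "some-bucket", "models/model.pkl")
# and the API service would download and deserialize it at startup:
with open("model.pkl", "rb") as f:
    serving_model = pickle.load(f)

prediction = serving_model.predict(np.array([[1800.0]]))[0]
```

Note that pickle ties the artifact to the library versions it was trained with, which is one more reason the "train once and forget" pattern ages poorly.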
This design pattern is common in the industry, but it has two main drawbacks.
- The model might get stale over time, especially if the data distribution changes over time, which is often true for the housing market.
- Because the model is only trained once, the training code will not be well maintained, which makes the model harder to reproduce later.
With that in mind, we would like to move to a world where we train models regularly. It is easy to set up a cron job to do that. However, it is also crucial to track the history of different runs of model training. The things to track include training dates, features, hyperparameters, and performance metrics. With those, we can compare with historical benchmarks and understand how good a new model is.
This is where Model Registry comes in. It is a service that helps us manage multiple model artifacts. Users interact with the service through a gRPC API to log and retrieve their models. The service has the following core concepts:
- TrainingInfo: All the information related to the training and backtesting of one model artifact. It contains information such as model type, artifact S3 path, features, and creation time.
- Parameter: A key/value pair representing a model training parameter, such as the city for which the model was trained.
- Metric: A key/value pair representing a model evaluation metric, such as MSE (mean squared error).
One TrainingInfo is associated with multiple Parameters and multiple Metrics.
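The three concepts and their one-to-many relationship can be sketched as plain Python dataclasses. The field names below are guesses based on the description above, not the actual gRPC/protobuf schema:

```python
# Illustrative data model for the registry's core concepts.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Parameter:
    key: str    # e.g. "city"
    value: str  # e.g. "phoenix"

@dataclass
class Metric:
    key: str     # e.g. "mse"
    value: float

@dataclass
class TrainingInfo:
    model_type: str
    artifact_s3_path: str
    created_at: datetime
    # One TrainingInfo owns many Parameters and many Metrics.
    parameters: List[Parameter] = field(default_factory=list)
    metrics: List[Metric] = field(default_factory=list)

info = TrainingInfo(
    model_type="linear_regression",
    artifact_s3_path="s3://some-bucket/models/2021-01-01/model.pkl",
    created_at=datetime(2021, 1, 1),
    parameters=[Parameter("city", "phoenix")],
    metrics=[Metric("mse", 1234.5)],
)
```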
Along with the gRPC API, we built a UI to let users search TrainingInfos and compare Metric values. The screenshot below shows a chart comparing the MSE of models over time in different cities.
Redesign the API Service with Model Registry
Now let’s go back to our original problem to design an API service that serves home price predictions. We will design the system again with Model Registry. The diagram is as follows.
For model training, we have a cron job that trains the model periodically. After a model is trained, we serialize its artifact to S3 and create a new TrainingInfo in Model Registry. The S3 path is included as a field of the TrainingInfo. The API service loads the latest TrainingInfo from Model Registry, gets the S3 path, and deserializes the model artifact from S3.
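The two halves of this flow can be sketched as follows, using an in-memory list as a stand-in for the registry. In the real system the cron job and the API service talk to Model Registry over gRPC, and the artifact actually lives in S3; the bucket path here is a placeholder:

```python
# Minimal sketch of the periodic-training flow.
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class TrainingInfo:
    artifact_s3_path: str
    created_at: datetime

REGISTRY: List[TrainingInfo] = []  # stand-in for the Model Registry service

def training_cron_job(run_date: datetime) -> None:
    # 1. Train the model (omitted) and serialize the artifact to S3.
    s3_path = f"s3://some-bucket/models/{run_date:%Y-%m-%d}/model.pkl"
    # 2. Record a new TrainingInfo pointing at the artifact.
    REGISTRY.append(TrainingInfo(artifact_s3_path=s3_path, created_at=run_date))

def api_load_latest() -> str:
    # The API service picks the most recent TrainingInfo, then downloads
    # and unpickles the artifact from its S3 path (download omitted).
    latest = max(REGISTRY, key=lambda t: t.created_at)
    return latest.artifact_s3_path

training_cron_job(datetime(2021, 1, 1))
training_cron_job(datetime(2021, 1, 2))
path = api_load_latest()
```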
Now that our system always loads fresh models, and the history of model training is well tracked, what else do we need? The answer is, a good monitoring system. There will always be days when the trained models are problematic because of either data issues or code bugs. We would like to catch them as soon as possible, ideally before those bad models are in production. This leads us to the subject of model validation.
We will set up model validation as an extra step in our offline process. Each time a model is trained, we compare it with the last trained model. If they look similar, we say the new model passes validation. Some example validation criteria include:
- The two models have similar coefficients.
- By replaying past prediction requests, the predictions from the two models have similar distributions.
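The two criteria above could look something like the sketch below. The tolerance values are illustrative, not the thresholds Opendoor uses, and the distribution check here compares only mean predictions; a real check might compare quantiles or run a statistical test:

```python
# Sketch of the two model-validation checks.
import numpy as np

def similar_coefficients(old_coef, new_coef, rel_tol=0.10):
    # Pass if every coefficient moved by less than rel_tol relative to old.
    old, new = np.asarray(old_coef, float), np.asarray(new_coef, float)
    return bool(np.all(np.abs(new - old) <= rel_tol * np.abs(old)))

def similar_prediction_distributions(old_preds, new_preds, rel_tol=0.05):
    # Replay past prediction requests through both models and compare the
    # resulting predictions (simplified here to a mean comparison).
    old, new = np.asarray(old_preds, float), np.asarray(new_preds, float)
    return bool(abs(new.mean() - old.mean()) <= rel_tol * abs(old.mean()))

deployable = (
    similar_coefficients([250.0, 1.2], [255.0, 1.25])
    and similar_prediction_distributions(
        [500_000.0, 510_000.0, 490_000.0],
        [502_000.0, 508_000.0, 493_000.0],
    )
)
```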
To accommodate this new process of model validation, we add deployable as a field of TrainingInfo in Model Registry. If a model passes validation, its deployable field is set to true; otherwise it is set to false. In production, the API service will always load the latest TrainingInfo with deployable = true.
Beyond that, the Model Registry UI shows all models that failed the validation and the reasons for those failures.
After adding the model validation step, the API service design looks like below.
Model and Code Compatibility
The models we put into production are now validated. That is great! But will they be compatible with the code of the API service? An example of code and model incompatibility: the code might attempt to send 10 features to the model, but the model was trained with 11 features. This leads us to the subject of the continuous deployment (CD) pipeline.
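This failure mode is easy to reproduce with a toy model. The synthetic data below is purely illustrative; the point is that a feature-count mismatch surfaces as a runtime exception, which is exactly what the acceptance tests described next are designed to catch before deployment:

```python
# Toy reproduction of the code/model incompatibility: the model was
# trained on 11 features, but the serving code sends only 10.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
model = LinearRegression().fit(rng.normal(size=(50, 11)), rng.normal(size=50))

try:
    model.predict(rng.normal(size=(1, 10)))  # serving code builds 10 features
    incompatible = False
except ValueError:
    incompatible = True  # the mismatch shows up as an exception at predict time
```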
Continuous deployment (CD) refers to automatic deployment of software when there are new code changes that pass the testing phase. In our case, we will set up a CD pipeline for our API service, and run acceptance tests whenever there is a new code commit. In the acceptance tests, we load the model and call the entry point function of the API service to replay some past requests. If there are no exceptions thrown and the responses look normal, the acceptance tests will pass.
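An acceptance test along these lines can be sketched as below. The entry point, the toy model, and the sanity bound on responses are all placeholders for the API's real entry-point function, the real artifact, and real logged traffic:

```python
# Sketch of the acceptance-test step in the CD pipeline.
def run_acceptance_tests(entry_point, model, past_requests):
    # Replay logged requests through the service entry point; pass only if
    # no exception is thrown and every response looks normal (here: a
    # positive price below an illustrative sanity bound).
    try:
        for request in past_requests:
            price = entry_point(model, request)
            if not (0 < price < 100_000_000):
                return False
    except Exception:
        return False
    return True

# Stand-ins for the real entry point and model artifact.
def toy_entry_point(model, request):
    return model["price_per_sqft"] * request["sqft"]

toy_model = {"price_per_sqft": 260.0}
past_requests = [{"sqft": 1200}, {"sqft": 2400}]
passed = run_acceptance_tests(toy_entry_point, toy_model, past_requests)
```

Only if this step passes does the pipeline proceed to deploy the new code.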
In the next section we will also show how to make sure that the production code and the model are always compatible even after the API service instances crash and restart themselves.
Final Step: Deploy Models in a Distributed System
We are now one step away from our final design. There is one remaining question to answer: if we have a distributed API service with multiple instances in order to do load balancing, how do we make sure that those instances all load the same model?
What is wrong with the current approach? Is it always safe to simply load the latest TrainingInfo which has deployable set to true? Consider the following scenario with two instances named A and B. A and B are deployed at the same time, and both load the same model, called M1. After a while, a new model M2 is trained and validated. Some time later, instance A crashes and restarts itself, therefore loading the new model M2, while instance B continues to serve the original model M1. Inconsistency between the two instances appears.
To mitigate the issue, we create a new concept in Model Registry called Deployment. A Deployment links a git SHA to a model and records which model should be deployed for that commit. In the CD pipeline, a new Deployment is created in Model Registry after acceptance tests pass and before the actual API service deployment. This binds each code commit to a version of the model that we know, via acceptance tests, is compatible. The API always loads the Deployment which passed acceptance tests and is associated with the application code's git SHA.
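The lookup logic reduces to a mapping from git SHA to model version. The dictionary below is a stand-in for Deployment records in Model Registry, and the SHAs and paths are made up:

```python
# Sketch of the Deployment lookup: each code commit is bound to the model
# version it passed acceptance tests with.
from typing import Dict

DEPLOYMENTS: Dict[str, str] = {}  # git SHA -> model artifact S3 path

def create_deployment(git_sha: str, artifact_s3_path: str) -> None:
    # Called by the CD pipeline after acceptance tests pass, before deploy.
    DEPLOYMENTS[git_sha] = artifact_s3_path

def load_model_path(running_git_sha: str) -> str:
    # Called by every API instance on startup or restart: it loads the model
    # bound to the code it is running, never simply "the latest model".
    return DEPLOYMENTS[running_git_sha]

create_deployment("abc123", "s3://some-bucket/models/2021-01-01/model.pkl")
create_deployment("def456", "s3://some-bucket/models/2021-01-02/model.pkl")

# An instance running commit abc123 always gets the same model, even
# after it crashes and restarts.
path = load_model_path("abc123")
```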
Consistency is preserved among different instances because a new model can only be served via a new Deployment in Model Registry, which is always followed by a service deployment to all the instances. Also, the production code and the model remain compatible after any API service instance restarts itself, because the instance will always load the same model as before the restart.
Hopefully, you enjoyed the journey of putting our toy model into production! As you saw, there’s a lot of complexity in serving a model robustly, correctly, and consistently. However, this is just the beginning — there are many other interesting areas of model “productionization”, including:
- Online model monitoring after a model has already started serving in production.
- Ensuring data used for training is consistent with the data used for serving.
- Debugging a model in production that starts making bad predictions.
If you’d like to work on those problems or learn more about machine learning, check out the Opendoor careers page for more info!