Racing through the long and winding road to ML productionization

Francisco Rodriguez Drumond
The Glovo Tech Blog
10 min read · Jan 26, 2021


A few years ago, Data Science was called “the sexiest job of the 21st century”. It still is pretty sexy, but Data Scientists are increasingly challenged to show they can deliver and generate value. What use is a very clever, top-performing model sitting on your laptop if it isn’t improving your business? There are many things we Data Scientists (DS) can do to demonstrate the value we bring: having a clear roadmap from a proof of concept to a product; making sure the relevant data in your organization is gathered, organized and stored with some guarantees of quality; and answering the much-dreaded question: how do I put my notebook model in production?

At Glovo, we employ Machine Learning to improve many different aspects of our business: recommendations, fraud, customer and couriers churn, lifetime value, product tagging, etc. In the Routing team, we work on Machine Learning (ML) models to predict the amount of time between different important events in the lifecycle of an order: the estimated time of arrival we show customers, the food preparation time, the time it takes a courier to move from one point to another, etc. These models are then used for multiple use cases to improve our operations efficiency and our partner and customer user experience: who’s the best courier for each order? when do we notify a partner about a new order? what ETAs do we show our customers and partners?

Glovo operates in different corners of the world and each of these has very different distributions for our ML models. We found that rather than having a global model for all cities, it pays to have models trained and deployed for different geographies. This means that every week we train and deploy hundreds of different models. These models are used every time a person browses the Glovo Catalogue or makes an order, so they must serve predictions in real-time with very low latency and need to scale to serve up to tens of thousands of requests an hour. Because of this, productionizing is a key aspect of our team.

In this article, I will talk about three things they don’t normally teach you when you’re learning Data Science, and which will make your life easier, especially if you don’t have a Software Engineering background.

1. Ditch the notebook and use a working environment as close as possible to your production environment.

There are plenty of DS tools designed to make learning, and the first part of a Data Scientist’s job, easier. You install Anaconda with a set of pre-defined ML libraries, you launch a notebook and you are ready to start processing data, visualizing it and training models. Doing that is great for exactly the first part of your project. But as you productionize your model by shipping it to another machine, there are many reasons why the same model can behave differently.

One of the most common phrases in the Software Engineering world is “it works on my machine”. You may know for sure that the code running on your machine and on the production instance is the same. But your code is not the only code being run: you are also depending on other software that might be slightly different between your machine and where your code will be executed: the operating system, the version of Python, and any libraries you are using. The configuration of your computer might differ too.

To address this, you should move to have a working environment as close as possible to your production environment. This means for instance using Docker containers, which are a safe way to ensure all software dependencies are the same. With Docker, you can write a recipe to create an environment with exactly the same configurations and software in any new machine. You could think of containers as python virtual environments (which Anaconda allows you to use), but they also let you specify any extra configuration and non-python software dependencies you need. Ideally, you would use the same container for developing your model, training remotely and deploying it.
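As a minimal sketch, a Dockerfile for serving a model could look like this (the base image, file names and entry point here are illustrative, not our actual setup):

```dockerfile
# Pin the base image so the Python version is identical everywhere
FROM python:3.9-slim

WORKDIR /app

# Install pinned dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model code and the serving endpoint
COPY src/ ./src/

# Same entry point whether you run this locally or in production
CMD ["python", "-m", "src.serve"]
```

Building this image on any machine yields the same environment, which is exactly the guarantee a notebook cannot give you.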

This leads me to another best practice: being mindful of the versions of the libraries you are using and being explicit about how to install them. You don’t want to code your neural network on Tensorflow 2.4 and then find out that the production environment is using 2.3. Most of the libraries we need use semantic versioning to indicate backward-incompatible changes. A safe practice is specifying exactly what versions you need in your dockerfile or requirements file. This is known as pinning dependencies: if you are explicit about which versions you used to develop your model, you can ensure these are also used when training and deploying. Conservatively pinning to a minor or patch version (e.g. 2.4 or 2.4.1) can save you from headaches.
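For example, a pinned requirements file leaves no ambiguity about what gets installed (the versions below are illustrative):

```
# requirements.txt — exact pins, so training and serving install identical libraries
tensorflow==2.4.1
scikit-learn==0.24.1
pandas==1.2.1
```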

As Data Scientists we need to stop thinking of a model, or a report with some good metrics, as our main deliverable. Sure, creating and tuning a model is the fun part and where we add the most value. But our deliverable should be everything that’s needed to make the model produce predictions in a production environment, be it in an online or offline manner. If, like ours, your model is making predictions in real-time, that means also writing the endpoint that will receive the input data, parse it, feed it to the model and return the prediction. By writing all the software surrounding your model, you can ensure that it works in production as expected.
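Stripped of any particular web framework, the endpoint logic boils down to parse, validate, predict, respond. Here is a sketch (the function names, request fields and dummy ETA model are all invented for illustration):

```python
import json


def handle_prediction_request(body: bytes, model) -> dict:
    """Parse a raw request body, feed it to the model, return a response dict."""
    try:
        payload = json.loads(body)
        features = [payload["distance_km"], payload["num_items"]]
    except (json.JSONDecodeError, KeyError) as exc:
        # Bad input should produce a clear error, not a crashed service
        return {"status": 400, "error": f"bad request: {exc}"}

    prediction = model.predict(features)
    return {"status": 200, "prediction": prediction}


class DummyEtaModel:
    """Toy stand-in for a trained model: ETA grows with distance and basket size."""

    def predict(self, features):
        distance_km, num_items = features
        return 5.0 + 2.0 * distance_km + 0.5 * num_items
```

Writing (and owning) this layer yourself means serialization and parsing bugs surface on your desk, not in production.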

Having a dockerized app as your deliverable takes time but, in the end, it is the best way to make your life (and that of the DevOps/Software Engineers supporting the model) easier.

2. Write tests to make sure your code works as expected

Tests are your insurance policy when your project scales. And it’s an insurance policy worth every cent. They might not be the most glamorous thing to spend your time on, but they are the safeguards against many different types of problems:

  • What happens if you write a pretty cool transformer to generate a new set of features, someone modifies it in a few months, and then it starts making wrong calculations?
  • Or if you are using a library like Scikit-learn or Pandas, decide to upgrade their versions to get a new feature and your code breaks?
  • And more simply, what guarantees do you have that your whole pipeline in production will make the same predictions as when training? (and trust me, there are many reasons why this would happen!)
Unit testing takes time…

Data Scientists don’t usually care about tests because tests are more of a software development thing. Also, if you don’t come from the engineering world you might think they are a complex or strange thing to do. But the models we develop are in the end code, and like any software component, they can break for many different reasons.

The easiest way to understand testing is to think of a test like a contract. Given some pre-conditions, when you execute some function or segment of your code then you get some expected output. As an example, suppose you are writing a StandardScaler transformer; given that you have a set of numeric values, when you execute the transformer, then the transformed values should have mean = 0 and variance = 1. Writing a test is as simple as writing a snippet of code that checks that with some data.
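Here is what that contract looks like as code, using a hand-rolled toy scaler rather than scikit-learn’s StandardScaler, so the whole example is self-contained:

```python
import statistics


def standard_scale(values):
    """Center to mean 0 and scale to variance 1 (population statistics)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def test_standard_scale_contract():
    # Given a set of numeric values...
    values = [2.0, 4.0, 6.0, 8.0]
    # ...when we execute the transformer...
    scaled = standard_scale(values)
    # ...then the transformed values have mean = 0 and variance = 1.
    assert abs(statistics.fmean(scaled)) < 1e-9
    assert abs(statistics.pvariance(scaled) - 1.0) < 1e-9
```

A test runner like pytest would pick up `test_standard_scale_contract` automatically; the given/when/then comments mirror the contract described above.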

You can then think of a test as some code that will be executed automatically whenever you want to validate your code. Continuous integration/continuous delivery (CI/CD) is another concept worth borrowing from the Software Engineering world. You may not be familiar with it, but in a nutshell, the aim is to have tests that run automatically every time we change our code (i.e. when pushing new code to a git repository) or when releasing new versions of our models.
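As one concrete (and illustrative) example, a minimal GitHub Actions workflow that runs the test suite on every push could look like this:

```yaml
# .github/workflows/tests.yml
name: tests
on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.9"
      - run: pip install -r requirements.txt
      - run: pytest
```

Other CI services (GitLab CI, Jenkins, CircleCI, etc.) follow the same idea: every push triggers the tests, so broken code never reaches production unnoticed.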

…but it saves you from a lot of trouble down the road

What types of tests do you want to write as a DS?

a. Unit Tests: we write unit tests as we write individual components. The StandardScaler example mentioned earlier is a good example of a unit test: by having automated tests with some synthetic data you can make sure your component keeps working in the future even if somebody changes its code or the version of a library. The minute something breaks your component, you will know and can notify the person making the changes.

b. Integration tests: integration tests are written to make sure that the different components of your code work together as expected. For instance, if you are deploying a model in real-time through a web service, you might also want to write tests that simulate requests to a local endpoint and ensure that the predictions you make are the same as when you trained the model. To do this we produce at training time what we call deployment verification samples: a subset of our data along with its predictions. Then, when deploying the model, we can check that the steps involved in the endpoint before calling the model (data serialization and parsing) don’t change the behavior of the model. We do this by taking the prediction generated at training and ensuring it’s the same as the one returned by the endpoint. Why wouldn’t they always be the same? Trust us, we’ve seen it happen, and for many different reasons.

c. Deployment tests: we also write tests that are executed when deploying a new model to check things like:

  • Latency (simulate a request with some data and ensure the latency of the service doesn’t increase).
  • Model size (make sure that the memory footprint of your model doesn’t unexpectedly increase).
  • Nullable fields: if your model accepts nullable fields, make sure the model ready to be deployed can handle them.
  • Model quality and predictions validation: you can check if the model error is too high before deployment using some small hold-out dataset or check if the outputs are valid or conform to some distribution.

Writing tests is an investment, but it does pay off when you’re iterating over a model. They act as an early warning that pinpoints something is wrong before it affects your users, saving you from long debugging sessions and telling you where and how to fix it.

3. Observability and planned releases are key.

Observability is another concept you want to borrow from Software Engineers. Once your model is deployed and shipped to production you want to be able to know how it is performing. There are three types of things you might want to know from your production model:

  1. Service performance metrics: Latency, timeouts, resources (memory/CPU/disk), number of requests.
  2. Logs: When you deploy your model remotely you want proper logging in place to understand if there are any critical or potential issues.
  3. Real-time model error/loss (if possible) and data monitoring: knowing if your model inputs, outputs or errors are deviating from their normal values can tell you that there’s something off in the inference pipeline or the features or that there is some drift in your data.
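A very simple form of the input monitoring in point 3 (a sketch only; in practice you would stream these statistics to a tool like Datadog rather than compute them inline) is to compare the live mean of a feature against its training distribution:

```python
import statistics


def check_feature_drift(live_values, train_mean, train_std, max_sigmas=3.0):
    """Flag a feature whose live mean drifts too far from its training mean.

    Returns True if the feature looks healthy, False if it has drifted.
    The 3-sigma threshold is an arbitrary illustrative default.
    """
    live_mean = statistics.fmean(live_values)
    drift_in_sigmas = abs(live_mean - train_mean) / train_std
    return drift_in_sigmas <= max_sigmas
```

A check like this, run over a rolling window of requests, can catch a broken upstream feature or genuine data drift long before it shows up in your business metrics.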

At Glovo, we use Datadog for centralizing that information. At any time, we can quickly look at the performance of our models and be warned of any problems. For instance, if we are serving real-time predictions, we want to be notified in case the latency is increasing. This allows us to react and quickly mitigate any issues: rolling back to a previous version of a model for instance or changing/fixing the scaling policy of our service.

A look at the Dashboard of one of our ML models.

This leads me to another and final point: releases. Every time you release a new version of an ML model you should have a clear strategy for how to do it and stay on top of it. This includes having a fallback strategy: what do you do if your model starts erroring or if it’s timing out? In our case, all of our models have simpler and faster fallback models (e.g. a linear regressor) to be used in case the main model fails. Also, thanks to the recent deployment of our model infrastructure on Kubernetes by our Machine Learning Platform team, we can quickly roll back any new version if necessary.
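The fallback pattern can be sketched as a thin wrapper around the two models (the classes below are toy stand-ins, not our production code): try the main model, and if it fails, answer with the simple fallback instead:

```python
class PredictorWithFallback:
    """Serve predictions from the main model, falling back on any failure."""

    def __init__(self, main_model, fallback_model):
        self.main_model = main_model
        self.fallback_model = fallback_model

    def predict(self, features):
        try:
            return self.main_model.predict(features)
        except Exception:
            # In production you would also log/alert here, so the
            # degradation is visible on your dashboards.
            return self.fallback_model.predict(features)


class LinearFallback:
    """Toy stand-in for a simple linear regressor used as a fallback."""

    def __init__(self, intercept, coefs):
        self.intercept = intercept
        self.coefs = coefs

    def predict(self, features):
        return self.intercept + sum(c * x for c, x in zip(self.coefs, features))
```

A degraded prediction from a linear model is almost always better than no prediction at all, which is the whole point of the fallback.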

Packaging, testing and observability are three key concepts from the software engineering world that will make your productionization journey smoother. They all require a significant investment of time, but they will help you work better, understand the issues your model might have in production, and adapt faster. Plus, they will truly make Data Science the sexiest job of the 21st century for you.