Evolution of ML Platform @ Varo

Ritesh Agrawal
Engineering @Varo
Oct 13, 2021 · 6 min read

Authors: Ritesh Agrawal, Mia F, Vinod Pazhettilkambarath, Viral Parikh

Putting ML models in production requires a robust platform. In an earlier post, we discussed the design and architecture of Feature Store, one of the critical components of ML Platform. In this post, we walk through the journey of building our ML Platform.

ML Platform V1

As a growing startup, we built our initial ML Platform as a minimalist solution for the online deployment of ML models. As shown in the figure below, we leveraged Kubernetes clusters to deploy pre-trained models as services. Each pre-trained model was packaged inside a Docker container along with a web service that exposed the model as a service. Additionally, we built a model service that routed requests from banking applications and Kafka events to the various ML models. Having the model service in the middle allowed us to manage models and endpoints without impacting dependent applications.

Figure 1. Our initial ML platform was an event-driven online system used for both online and batch prediction. With business growth, the above architecture started having scalability challenges, especially those related to batch processing.
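
For concreteness, the V1 serving pattern amounted to roughly the following sketch: a model pickled at build time, wrapped in a small Flask app, and baked into the Docker image. The file names and route here are illustrative, not our actual service.

```python
# app.py -- illustrative sketch of the V1 pattern: a pre-trained model
# shipped inside the Docker image and exposed over HTTP with Flask.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# The serialized model is part of the image, so shipping a new model
# means rebuilding and redeploying the container.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON payload such as {"features": [[0.1, 3, 250.0], ...]}
    features = request.get_json()["features"]
    predictions = model.predict(features)
    return jsonify({"predictions": predictions.tolist()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Because the model binary is baked into the image, every change to the model means a new build and deployment, which is part of what made the reproducibility and auto-refresh problems described below so painful.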

While the above platform helped jumpstart our ML journey, it had many challenges:

  1. Auto-Refresh: Models were trained using various systems ranging from laptops to distributed systems, and the serialized models were deployed using a Flask-based web service in a Docker container. These Docker containers were then deployed on AWS SageMaker. However, this ad-hoc process of building and deploying models was a big problem from a long-term model maintenance perspective. The underlying scripts used to generate the serialized models were inconsistent across projects. As a result, knowledge transfer and reproducibility were significant challenges.
    Related to reproducibility was another problem, namely “auto-refresh.” An effective ML model constantly needs to be retrained on new data. At Varo, we refer to retraining a model as a “Model Refresh.” Since there was no standard way to build and deploy models, the above platform had no provision for automatically refreshing any model. It was entirely up to individual data scientists when and how to refresh their models.
  2. Batch Mode: One of the challenges of building an ML Platform is supporting the two different modes in which ML models are used: batch and online. In batch mode, models tend to be short-lived; they are initialized, run predictions on many data points, and are then released from memory. In contrast, in online mode, the model is always live. The above platform was primarily designed for online usage of ML models. However, as our use cases increased, we had several instances where we needed the same online models to be used in batch mode from our data lake environment. In the above platform, we addressed this by leveraging Kafka: all the data from our data lake was published to Kafka and consumed by the model service, and the predictions were transmitted back to the data lake through another stream of Kafka events. This solution worked only as long as our data volume was small. For large datasets, the approach was neither scalable nor efficient: we were sending millions of data points through Kafka in burst mode, clogging the network, the database, and the services.

ML Platform V2

The first iteration of ML Platform taught us many things:

  1. From a long-term model maintenance perspective, it is essential to standardize how models are built and deployed. Thus, the scope of the ML Platform is not limited to serving models; it also includes training them.
  2. A holistic ML Platform needs to support both batch and online usage of ML models. The two usages are very different and require different tools and technologies; one cannot just be hacked to solve the other.
  3. We wanted to build a platform-agnostic inference architecture, i.e., one where models can easily be moved from third-party clusters to our own managed clusters.

Figure 2. The ML Platform standardizes model training and deployment. Data scientists are required to provide model artifacts. The platform leverages the model artifact to train the model and deploy it either for Batch Inference or for Online Inference.

The component diagram above shows our new machine learning platform. There are six key components of our platform:

  1. Research and Development: Any ML project starts with ad-hoc exploration of the data and testing of various features and models. There is no predefined process, and this stage involves a variety of tools ranging from IPython Notebooks to documents, sheets, and databases. As a platform, we provide several internal libraries to connect and interact with various data sources and our feature store, while still distinguishing research from the actual productionization of a model.
    In order to put a model in production, the platform requires a “model artifact”: a Python library with a prescribed list of classes and modules (a rough sketch of this interface appears after this list).
  2. Training Platform: The training platform is responsible for standardizing how models are trained, managed, and refreshed. Unlike V1, where data scientists uploaded a pre-trained model, the new platform requires data scientists to upload a training script. The training platform defines the input and output parameters for the training script, and one of the expected outputs is a serialized binary trained model file. This binary file, along with additional metadata such as the Python version and installed libraries, is saved in our model registry. Standardizing model training opened up the possibility of automatically refreshing models at scheduled intervals or in response to an external trigger.
  3. Batch Inference Platform: Learning from V1, where we hacked our online inference for batch inference, the new ML platform natively supports batch inference. We leverage the model registry to initialize the trained model on all PySpark executors and expose it as a PySpark UDF. Furthermore, we leverage Pandas UDFs to take advantage of vectorization (see the sketch after this list). As a result, we can now process millions of batch inference tasks within minutes.
  4. Online Inference Platform: Like the Batch Inference platform, we leverage the model registry to identify all the configurations required to create a Docker container; the base Docker image is prebaked with a web service. The Docker container is then deployed as a service on our Kubernetes cluster. Similar to V1, access to these models is centralized through a model service. While using a model service creates a single point of failure, it helps centralize the coordination of network calls between different services (such as the feature store, models, and applications) and centralizes logging. Further, having a model service between applications and models allows us to experiment with different ML models without impacting our applications.
  5. Monitoring Platform: Monitoring is one of the key aspects of ML Platform. We log all inputs and outputs in a central database and further compute various statistical metrics to monitor for drift or breaches of pre-defined thresholds (an example drift metric is sketched after this list).
  6. Feature Store: Orthogonal to the above components is the Feature Store. The Feature Store is used to extract features not only during the research and development stage, but also during training and inference. Our Feature Store leverages a lambda architecture and is discussed in much more detail in our earlier post.
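
To make the “model artifact” from the Research and Development component more concrete, here is a rough sketch of what such a prescribed interface can look like. The class name, method names, and model choice are illustrative assumptions, not our actual internal library; the point is that the training platform calls a known entry point and gets back a serialized model plus metadata for the model registry.

```python
# model_artifact.py -- illustrative sketch of a prescribed "model artifact"
# interface; names and structure are hypothetical, not Varo's internal library.
import json
import pickle

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


class ExampleModelArtifact:
    """Entry point the training platform invokes on a schedule or trigger."""

    def train(self, features: pd.DataFrame, labels: pd.Series, output_dir: str) -> None:
        model = GradientBoostingClassifier()
        model.fit(features, labels)

        # One expected output: the serialized binary model.
        with open(f"{output_dir}/model.pkl", "wb") as f:
            pickle.dump(model, f)

        # The registry also stores metadata (Python version, libraries, etc.);
        # here we record just the framework and feature names as an example.
        with open(f"{output_dir}/metadata.json", "w") as f:
            json.dump({"framework": "scikit-learn",
                       "features": list(features.columns)}, f)

    def predict(self, model, features: pd.DataFrame) -> pd.Series:
        # Shared by batch and online inference once the model is deserialized.
        return pd.Series(model.predict_proba(features)[:, 1], index=features.index)
```

Because every project implements the same entry points, the training platform can refresh any model on a schedule without knowing anything model-specific.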
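
For the Batch Inference Platform, the pattern of loading a registered model and exposing it as a vectorized Pandas UDF looks roughly like the sketch below. `load_model_from_registry`, the model name, and the table and column names are hypothetical placeholders for our registry client and data lake tables.

```python
# batch_inference.py -- illustrative PySpark sketch: score a data lake table
# with a registered model exposed as a vectorized Pandas UDF.
import pickle

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType


def load_model_from_registry(name: str, version: str):
    """Hypothetical registry client; here it just unpickles a downloaded artifact."""
    with open(f"/tmp/{name}-{version}.pkl", "rb") as f:
        return pickle.load(f)


spark = SparkSession.builder.appName("batch-inference").getOrCreate()

# The model is captured in the UDF closure, so executors deserialize it once
# per worker instead of reloading it for every row.
model = load_model_from_registry("example_model", "latest")


@pandas_udf(DoubleType())
def score(amount: pd.Series, merchant_risk: pd.Series, account_age: pd.Series) -> pd.Series:
    # Pandas UDFs receive whole batches as Series, so the model scores
    # thousands of rows per call instead of one at a time.
    features = pd.DataFrame({
        "amount": amount,
        "merchant_risk": merchant_risk,
        "account_age": account_age,
    })
    return pd.Series(model.predict_proba(features)[:, 1])


df = spark.table("data_lake.transactions")
scored = df.withColumn("score", score("amount", "merchant_risk", "account_age"))
scored.write.mode("overwrite").saveAsTable("data_lake.transaction_scores")
```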
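
On the monitoring side, one common way to quantify input drift from the logged data is the Population Stability Index (PSI) per feature. The sketch below is a generic implementation with an illustrative alert threshold; it is not our exact monitoring code, which tracks several metrics.

```python
# drift_check.py -- generic Population Stability Index (PSI) sketch for
# comparing logged production inputs against the training distribution.
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline (training) sample and a recent production sample."""
    # Bin edges are derived from the baseline distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Guard against empty bins before taking logs.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Illustrative check: alert when a feature's PSI crosses a pre-defined threshold.
training_sample = np.random.normal(0.0, 1.0, 10_000)
production_sample = np.random.normal(0.3, 1.0, 10_000)
psi = population_stability_index(training_sample, production_sample)
if psi > 0.2:  # 0.2 is a commonly used "significant shift" rule of thumb
    print(f"Drift alert: PSI = {psi:.3f}")
```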

Moving Forward

Building a holistic ML Platform has become more of an integration challenge, given the plethora of tools and technologies already available. As we transitioned from one ML Platform to another, one key lesson we learned is to identify and define the key components of your ML flow and to standardize the interactions between them. In our case, we decoupled the training of models from the usage of models in different modes (batch and online) and further defined the interactions between them. The work on our ML Platform is not yet done, but we hope that splitting the platform into the above components gives us the flexibility to adapt to new use cases.

Ritesh Agrawal

Senior Machine Learning Engineer, Varo Money; Contributor and Maintainer of sklearn-pandas library