Towards MLOps: Technical capabilities of a Machine Learning platform

Theofilos Papapanagiotou
Prosus AI Tech Blog
25 min read · Apr 26, 2021

Table of contents

  1. Introduction
    1.1 The workflows of data science and software development are different
    1.2 The ML pipeline has to include Continuous Training
    1.3 Model drift
  2. Feature Store
    2.1 Centralised data access
    2.2 Data Versioning
    2.3 Data pipelines
    2.4 Data labeling
    2.5 Feature repository and data discovery
  3. Training pipeline
    3.1 Model and experiment management
    3.2 Pipeline orchestration
    3.3 Automatic feature engineering, HPO and NAS
    3.4 Distributed training
    3.5 Coding environments and standards
  4. Production
    4.1 Model serving
    4.2 Data validation, model validation, evaluation and monitoring
    4.3 Deployment strategies
  5. Conclusion

1. Introduction

Large organisations rely increasingly on continuous ML pipelines to keep their ever-increasing number of machine-learned models up-to-date. Reliability and scalability of these ML pipelines are needed because disruptions in the pipeline can result in model staleness and other negative effects on the quality of downstream services supported by these models¹.

When a data science team runs only a few ML models, a manual process for training and deployment is sufficient, especially when these models do not have to be retrained frequently, or when they are used in a one-off exercise such as clustering your customers into user segments or building a taxonomy for the product catalogue.

As the number of models increases in most organisations, we need to introduce the concept of Continuous Training (CT), which brings automation in the execution of an ML pipeline. This is a vital capability to keep your models up to date when the world around them changes, and this blog post summarises the categories of the tools that enable CT.

Many of the principles developed in the DevOps/SRE community, such as CI/CD and monitoring, can be introduced into the data science workflow with the help of the machine learning platform².

The set of best practices for businesses to operationalise ML workloads is known as MLOps. In the Prosus group, we measure MLOps in levels of adoption of the techniques mentioned above³. The transition from a manual process to CT, and later to CI/CD for ML, reflects the maturity with which our portfolio companies deliver models and eventually grow their businesses.

Deploying a model to production is just the beginning of the lifecycle of a machine learning model.

1.1 The workflows of data science and software development are different

There are several key differences between the software development process and the data science workflow, resulting in different processes and tools for the latter. The software building process delivers an artifact, and this artifact provides a functionality, a feature. The test/QA process signs off on the desired functionality, and the artifact is deployed to production. While operating the application, the discovered bugs are fed back into the development process, closing the loop of the continuous workflow.

Software engineering workflow

The data science workflow delivers a model through a selection process over combinations of algorithms and parameters. The model is measured on metrics (such as accuracy) and remains open to improvement. It is deployed in production to infer knowledge from the data. This inference process is sensitive to the input data, and this dynamic relationship between the delivered model and the data reflects the difference between the two workflows.

Data science workflow

1.2 The ML pipeline has to include Continuous Training

Building and maintaining a platform to support the development and serving of ML models requires careful orchestration of many components. The fact that data are part of the workflow introduces the need for Continuous Training (CT), as an addition to Continuous Integration (CI) and Continuous Delivery (CD). This extra capability marks the step from DevOps to MLOps.

Continuous Training positioned in the ML pipeline

The standardisation of a general-purpose ML platform was first introduced in the TFX paper. It describes the building blocks of an ML platform in the form of components and presents the productivity boost resulting from implementing it. Deploying TFX can lead to reduced custom code and faster experiment cycles. It can also reduce the time to production from the order of months to weeks and provide platform stability that minimises disruptions.

1.3 Model drift

Another reason for the requirement of Continuous Training is the need for retraining in order to maintain high-quality predictions.

Inspired by the Databricks article on “Productionizing Machine Learning: From Deployment to Drift Detection”

In production environments like Prosus businesses (payments, food delivery, classifieds, online education), the schema and distribution of the data change dynamically (new trends, habits). This behaviour is defined as data drift. When the statistical properties of the target variable change, the very concept of what you are trying to predict changes as well. For example, the definition of a fraudulent transaction could change over time as new ways are developed to conduct such illegal transactions. This type of change results in concept drift.

Model drift is the combination of data drift and concept drift, as the whole environment in which a model operates influences its performance. Drift detection is addressed by monitoring the following (a minimal statistical check is sketched after the list):

  • Schema & distribution of incoming data on the training dataset
  • Distribution of labels of the training dataset
  • Schema & distribution of incoming data of the inference requests
  • Distribution of predictions
  • Quality of predictions
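
As an illustration of the first items in this list, a minimal data drift check can compare the distribution of a single numeric feature between the training data and a window of recent inference inputs. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy; the significance threshold, window size and synthetic data are assumptions for illustration only.

```python
# Minimal drift check: compare a feature's training vs. serving distribution.
# The 0.05 threshold and the synthetic samples are placeholders to tune per use case.
import numpy as np
from scipy import stats

def feature_drifted(train_values, serving_values, alpha=0.05):
    """Two-sample Kolmogorov-Smirnov test on one numeric feature."""
    statistic, p_value = stats.ks_2samp(train_values, serving_values)
    return p_value < alpha, statistic

rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)    # training-time distribution
serving = rng.normal(loc=0.4, scale=1.0, size=1_000)   # recent inference inputs (shifted)
drifted, ks_stat = feature_drifted(train, serving)
print(f"drifted={drifted}, ks_statistic={ks_stat:.3f}")
```

The same idea extends to label and prediction distributions; in practice a scheduled monitoring job runs such checks per feature and raises an alert (or triggers retraining) when drift persists.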

2. Feature store

A feature is defined as a measurable property of a data sample. Example features are a pixel of an image, a word of a text, a person’s age, the average number of payments last week. They can either come directly from raw data tables and files, or they can be derived values computed from some data sources. Because features to train ML models often encode business knowledge and context about the users or products, they can take time to understand and create. Having a way to store, share and ensure the quality of features for reliable (re-)use can save a lot of time.

Feature engineering and feature usage processes decoupled by a feature store

2.1 Centralised data access

There are many forms of data storage systems: data lake, data warehouse, data mart, and a feature store. They represent different levels of data management, and their differences are discussed below.

Data lake: the general-purpose storage of unstructured raw data like logs, images, emails, feeds and also structured data feeds. The input process of the raw data to the data lake doesn’t filter anything, and its size keeps growing.

Data warehouse (DWH): where a business keeps the processed data, the modeled and structured data. All parts of an organization utilize the data warehouse to store and retrieve operational metrics, business indicators, etc.

Data mart: a simple data warehouse with data from a specific number of sources used by one particular business department.

Broken data is the most common cause of problems in production ML systems.

The feature store is a special form of a data warehouse, used by data scientists to access data and manage features for ML. Like a DWH, it contains processed data, the features, which are consumed directly by ML models. Unlike a DWH, it provides robustness in retrieval: a low-latency access channel to enable fast inference, as well as the cost-optimised storage required by a frequent model retraining process.

The process of collecting and transforming data into features is done once, and others discover and reuse it. This sharing ability speeds up the ML workflow, as data engineers don’t have to re-implement that collection and transformation, which otherwise constitutes a significant portion of the effort spent. It also allows training other models by selecting different features and generating clean training data.

The feature store essentially decouples the process of feature engineering from the consumption of these features (e.g. during model development). The feature engineering activity relies on domain experts that publish features in the feature store (and usually has a registry to support their discovery, see section 2.5 below). These features exist independently from the model development process so that they can be reused.

A key technical characteristic is also the split between training and serving consumption of features. A common consumption SDK ensures consistency across these two activities, although it is backed by two different storage technologies:

  • A low-latency online store, often in the form of a Redis cache, to serve fresh, real-time features at inference time.
  • A cost-optimised offline store, holding historical data to support model training.

Popular feature stores include Michelangelo Palette (Uber), Feast (Gojek and Google), Zipline (Airbnb) and Hopsworks (Logical Clocks).
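
As a hedged sketch of the training/serving split described above, the snippet below uses Feast's Python SDK to fetch the same feature from the offline store (for training) and from the online store (at inference time). The feature and entity names are invented, and argument names (for example features vs. feature_refs) differ between Feast releases, so treat this as a sketch rather than copy-paste code.

```python
# Sketch of dual feature consumption with Feast (feature/entity names are invented).
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a feature repository definition

# Offline store: point-in-time correct historical features for model training.
entity_df = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2021-04-12", "2021-04-12"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:avg_daily_trips"],
).to_df()

# Online store: low-latency lookup of the same feature at inference time.
online_features = store.get_online_features(
    features=["driver_hourly_stats:avg_daily_trips"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```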

2.2 Data Versioning

The data versioning ability of a feature store is a prerequisite to implement MLOps, the same way that code versioning is required to practice DevOps. There are two main approaches to implement data versioning, the git-style and the time-travel.

DVC is a popular data versioning system that uses git to store metadata about data files. Pachyderm, an ML platform on Kubernetes, also uses a git-style method for versioning data. That brings the power of Git and Git branches to trying out different ideas, effectively acting as an experiment management system.

There are problems with these git-style solutions, though. Git does not scale to store large volumes of data. It tracks immutable files and doesn’t store the differences between them. Finally, it cannot answer time-travel queries on data of the form “give me the value of that feature as it was on the 1st of January”.

Time-travel is a required capability in order to support incremental feature engineering. A transactional data lake provides this capability by storing versions of the schema and the data as transactions. That allows the computation of features that have changed in a particular time period. Common transactional data lakes are Delta lake (Databricks) and Apache Hudi.
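
For illustration, a time-travel query of this kind with Delta Lake and PySpark might look like the sketch below. The table path is a placeholder and a SparkSession with the Delta Lake package configured is assumed; Apache Hudi offers equivalent query options.

```python
# Sketch: read a feature table "as of" a past date with Delta Lake time travel.
# Placeholder path; assumes a SparkSession configured with the Delta Lake package.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-time-travel").getOrCreate()

# "Give me the value of that feature as it was on the 1st of January."
features_jan1 = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-01-01")
    .load("s3://feature-store/driver_hourly_stats")
)

# Alternatively, pin an exact table version instead of a timestamp.
features_v12 = (
    spark.read.format("delta")
    .option("versionAsOf", 12)
    .load("s3://feature-store/driver_hourly_stats")
)
```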

2.3 Data pipelines

A data pipeline is tightly coupled with the data platform, the data lake. It is implemented with a workflow manager, specialised to run tasks on large amounts of data. A workflow manager usually comes with a scheduler to orchestrate the activities, has a graph describing each data flow and uses operators to perform the ETL tasks.

The concept of Extract/Transform/Load (ETL) is used by businesses in the world of analytics, BI and reports, and has evolved over the last decades with the growth of data engineering and big data. The automatic transformation of raw data to features can be done either on a schedule or by triggers.

Data processing using such methods is a fundamental element for any organisation that has transformed from process-driven to data-driven. Beyond analyst dashboards, in the vision of a model-driven business where data science is embedded in the organisation, data processing technology remains an enabler, as it provides the input layer of data used to build the models that define the business.

DataOps and MLOps workflow, as discussed in this Logical Clocks article about “MLOps with a Feature Store”.

The choice of the technology and tool that delivers this functionality is crucial, as it’s a dependency of the ML pipeline. Traditionally, Hadoop-based data lakes would have a workflow manager like Oozie or Azkaban to perform such activities. Projects like Airflow and Luigi dominated that space by providing an independent tool outside of the Hadoop ecosystem, as companies moved their data and workloads to cloud-based data lakes. Currently, Airflow is the leading workflow management system for data processing.
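
A minimal Airflow DAG for a scheduled raw-data-to-features transformation could look like the sketch below; the task bodies, names and schedule are placeholders, and the imports follow the Airflow 2.x layout.

```python
# Sketch of a scheduled ETL DAG in Airflow 2.x; extract/transform/load bodies are stubs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull raw events from the data lake

def transform():
    ...  # compute features from the raw events

def load():
    ...  # write the features to the feature store

with DAG(
    dag_id="raw_to_features",
    start_date=datetime(2021, 4, 1),
    schedule_interval="@daily",   # or leave unscheduled and trigger it externally
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```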

2.4 Data labeling

Labeled data is frequently required to develop machine learning models. When this labeled data is not available, a data labeling activity may need to happen as part of an ML project, to create an initial training dataset. Within the Prosus group, our classifieds business OLX has identified the importance of the labeling activity also at the end of the ML pipeline. For example, the OLX moderation team labels accounts flagged as fraudsters: they essentially validate the predicted labels of the fraud detection models, setting up the ground truth dataset for the next iterations of CT.

Mechanisms to annotate data in such a way can be:

  • internal to an organisation
  • automatic through models
  • through external, public or private crowdsourcing

Based on the type of data to be labeled, there are plenty of open-source tools available in curated lists on GitHub. It is also possible to have a custom-made UI for a particular labeling or validation task. Having bots report the labeling or inference results instead of a UI is also popular: OLX uses Slack channels that consume inference queues, and the issue-label-bot of Kubeflow takes this a step further by collecting validation through thumbs-up/down reactions on the comments that carry the inference result.

In the crowdsourcing space, many companies offer data labeling as a service, like scale.com and SageMaker Ground Truth. Frequently their business model is to also use this data to train models that can automate the labeling process. As a more traditional method, Amazon Mechanical Turk and similar services offer just the marketplace where workers meet requesters.

2.5 Feature repository and data discovery

The role of the feature store is twofold: It is a storage facility of the features produced through the feature engineering activity, as well as a marketplace where users can reuse the produced features. It is essentially a place where people share, annotate, curate, discover and consume processed data.

The toolset that delivers this functionality can range from a free-form, wiki-style documentation system to a structured, well-organised database system. It can provide federated access to data lakes through engines like Presto and Hive, as well as maintain the social graph of the consumers, like Lyft’s Amundsen.

The portfolio companies of Prosus are quite mature in this respect, reflecting how seriously data governance is taken. At OLX, the Data Catalog tool is a discovery and access management portal which also handles the data classification, data subscription and approval flow functions; hence it is custom-built and maintained. At iFood, our food delivery business in Brazil, the adoption of Amundsen helped the organisation move away from a read-the-docs-based curated zone to a more structured, graph-based knowledge base.

3. Training Pipeline

3.1 Model and experiment management

Model and experiment management is about tracking the model lineage, the various versions and the key events in the life of ML models. It ensures reproducibility, provides visual exploration of models and their results, enables collaboration and, eventually, Continuous Training. Examples of model lineage are events such as when a model was trained, created, deployed, re-evaluated, retrained or published to the catalogue, as well as the model version, tags and description. The tags of these events can be annotations like the project, the user, the timestamp and details about original/updated values. Model-training-specific information like variables, properties, scoring, performance metrics, history, function and location should also be stored.

The implementation is usually done via libraries that offer recording and retrieving of such metadata associated with ML developer and data scientist workflows. Such a library is an integral part of an ML platform; the TFX reference architecture calls it ML-Metadata (MLMD). This functionality is supported by a database and API to record and retrieve the metadata. Other than the programmatic interaction with this dataset through the API, a dashboard also supports visual exploration.

Besides Continuous Training, the metadata store also supports Hyperparameter Optimization (HPO) and Neural Architecture Search (NAS). These techniques explore the space of hyperparameters and architectures to find an optimal configuration for a model, and the metadata store supports them by recording the explored combinations.

There is a variety of tools to support this functionality, ranging from open-source tools focused on experiment tracking, like Sacred and DVC, to commercial and cloud solutions like Weights & Biases, Comet and Neptune. Experiment management, generalised as model management, is also offered as a feature of ML platforms like Kubeflow, Polyaxon, Pachyderm, IBM Watson Studio, Microsoft MLOps, Databricks MLflow and H2O.
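
As one concrete example, MLflow's tracking API records parameters, metrics and artifacts per run, which the UI can later query and compare. The sketch below is minimal; the experiment name, parameters and metric values are made up.

```python
# Minimal experiment-tracking sketch with MLflow; names and values are illustrative.
import mlflow

mlflow.set_experiment("fraud-detection")

with mlflow.start_run(run_name="xgb-baseline"):
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)

    # ... train the model and evaluate it on a validation set ...
    val_auc = 0.91  # placeholder value produced by the evaluation above

    mlflow.log_metric("val_auc", val_auc)
    mlflow.log_artifact("confusion_matrix.png")  # assumes this file was produced earlier
```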

3.2 Pipeline orchestration

Machine learning models are usually trained using large datasets. The training process needs to be distributed across multiple processing units in order to finish within the time period required by the business need. For that purpose, computer science brings the concept of execution strategy with data, task and model parallelism.

In data parallelism, the same operation runs in parallel over data partitions; this is optimal for batch algorithms. In task parallelism, different tasks run in parallel, which makes it optimal for independent subtasks. In model parallelism, often implemented with parameter servers, the model updates are computed over data partitions and the model is periodically synchronised; this is optimal for mini-batch algorithms.
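
A toy illustration of data parallelism: the same transformation is applied to each data partition in parallel with Python's multiprocessing. The partitions and the transformation are placeholders; real pipelines delegate this to Spark or to the training framework.

```python
# Toy data parallelism: apply the same operation to each data partition in parallel.
from multiprocessing import Pool

def transform_partition(partition):
    # Placeholder "operation", e.g. a feature computation over one shard of rows.
    return [x * 2 for x in partition]

if __name__ == "__main__":
    partitions = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
    with Pool(processes=3) as pool:
        results = pool.map(transform_partition, partitions)
```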

For the implementation of the execution strategy, the concept of a scheduler is utilised: a workflow engine for orchestrating parallel jobs. The growth of big-data processing technology delivered data workflow engines that have been massively adopted, like Luigi and later Airflow. These workflow management systems are based on a central scheduler that manages the execution on connected workers.

Luigi, from Spotify, manages workflows as tasks and handles the dependencies between them, so it creates the form of a pipeline. A task does its job and generates a result, which is taken by the next task as input, and so on.

Our orchestration framework, Luigi, was built for data pipelines and had trouble dealing with the unique constraints of machine learning workflows.

In Airflow, from Airbnb, the pipeline is defined in code as a task graph, and the tasks do not have the concept of input and output. The tasks use operators that are programmatically started and can be part of loops that split the execution across partitioned data. That essentially enables data parallelism.

Given our successful use of Airflow for ETL work, we were excited to use the scheduler for these ML workloads, but encountered numerous issues when using Airflow in a way it was not designed to function. ML DAGs require a much larger degree of task parallelism.

Airflow and Luigi are excellent schedulers for data pipelines because they have task-based execution: the jobs are executed sequentially. That is also the reason why they are not a good fit for ML pipelines, where we need data-driven execution.

Tekton and Argo are native Kubernetes pipeline technologies and, as such, orchestrate Kubernetes resources. The tasks of a Tekton or Argo DAG are containers. The deployment of a DAG is done via an API or CLI, which makes the DAG declarative. That enables the freedom of being cloud-agnostic: moving to another cloud provider in the future is just a matter of applying the declaration.

When a task is executed with the same input/parameters as a previous execution of the pipeline, its results can be fetched from the cache, avoiding burning compute time to obtain the same results again and again. This is the reusability capability of the training pipeline.
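
For instance, with Kubeflow Pipelines (which runs on Argo) each Python function below becomes a containerised task, the dependency between tasks is declared through their inputs and outputs, and repeated runs with identical inputs can be answered from the step cache when caching is enabled. This is a sketch against the KFP 1.x SDK; the component names and logic are invented.

```python
# Sketch of a container-per-task pipeline with the Kubeflow Pipelines 1.x SDK.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def preprocess(rows: int) -> int:
    # Placeholder preprocessing step.
    return rows

def train(rows: int) -> str:
    # Placeholder training step; returns a (fake) model location.
    return f"s3://models/run-with-{rows}-rows"

preprocess_op = create_component_from_func(preprocess)
train_op = create_component_from_func(train)

@dsl.pipeline(name="toy-training-pipeline")
def pipeline(rows: int = 10000):
    pre = preprocess_op(rows)
    train_op(pre.output)   # declares the data dependency between the two tasks

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(pipeline, "pipeline.yaml")
    # A run can then be submitted through the KFP API or UI on the cluster.
```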

The selection of a task-driven or a data-driven pipeline orchestrator depends on the case. When dealing with a stationary data domain and Continuous Training is not required, the traditional time-based schedulers can be a more efficient choice, as these tools are already mastered by data support teams.

3.3 Automatic feature engineering, HPO and NAS

We observe that a slow-down factor in the ML pipeline is feature engineering, and the time that data engineers spend on generating useful features for model training.

Feature engineering automation covers both the generation and the selection of features. Tools that automate feature engineering include Featuretools, TPOT and TSFRESH. These tools are offered as libraries and implemented as data transformation components in the ML pipeline.

From Thorben Jensen’s presentation on “Automating feature engineering for supervised learning”, PyData Berlin 2019

For a particular problem, we might need to train different models with different parameters, which shows up as a delay in the manual ML pipeline. Here comes the automation of the search for the model architecture (NAS) and for the hyperparameters (HPO). The AutoML concept supports the exploration and discovery of the configuration that gives optimal results in a particular case. Instead of considering which NN architecture or which hyperparameter configuration to use, data scientists shift their effort into choosing a search method for the HPO or NAS activity. The delay now sits in the computation, since the exploration space is large and the search process takes a long time.

Most of the tools support Bayesian optimization, Hyperband, random search, grid search, NAS with RL, PSO and ASHA. Tools in that space are either standalone or part of an all-in-one platform, commercial or open-source, cloud-based or on-premise. They aim to require low configuration effort. Tools worth mentioning are Featuretools, Keras Tuner, Hopsworks Maggy and Kubeflow Katib.
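
The core idea of the search can be illustrated with plain random search; the sketch below uses scikit-learn's RandomizedSearchCV purely as an illustration, while the dedicated tools above add early stopping, distributed trials and NAS on top of the same principle.

```python
# Illustrative hyperparameter search: random sampling over a parameter space.
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": randint(2, 12),
    },
    n_iter=20,          # number of sampled configurations to evaluate
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```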

3.4 Distributed training

To scale out the training process itself, the concept of the Parameter Server was introduced¹⁰. In that case, both data and workloads are distributed over worker nodes, while the server node maintains globally shared parameters, like the weight updates during NN model training. The first versions of the TensorFlow framework brought this functionality to the training process but required extra effort to change the training code in order to utilise the framework.

Horovod, from Uber, simplified that effort with the use of the MPI framework for the communication between the workers. Recent versions of the TF framework provide a selection of different distribution strategies, offering both simplicity of usage and easy switching between strategies. ML platforms like Kubeflow, Polyaxon and Hopsworks utilise these methods and distribute the workers as Kubernetes containers.
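
A hedged sketch of the strategy-based approach in recent TensorFlow versions: switching from single-machine multi-GPU training to multi-worker training is largely a matter of swapping the strategy object. The model and the in-memory dataset below are toy placeholders.

```python
# Sketch: synchronous data-parallel training with tf.distribute strategies.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # single machine, multiple GPUs
# e.g. tf.distribute.MultiWorkerMirroredStrategy() for training across machines

with strategy.scope():
    # Toy binary classifier; in practice the model and data come from the pipeline.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

x = np.random.rand(1024, 20).astype("float32")        # placeholder features
y = (np.random.rand(1024) > 0.5).astype("float32")    # placeholder labels
model.fit(x, y, epochs=2, batch_size=64)
```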

3.5 Coding environments and standards

As new ML development frameworks rise and take over traditional tools, it is not recommended to impose strict standards on the languages/frameworks supported by an ML platform. The platform should be modular enough to support the addition of new frameworks. TensorFlow, PyTorch, MXNet and scikit-learn, though, are must-haves.

Literate programming is still a way some data scientists develop models, and it has become popular again recently with the adoption of Jupyter notebooks. The ability to run notebooks in the ML platform for training and deployment, without necessarily managing the underlying resources, is required. Sharing Jupyter notebooks across the organisation became popular through the Knowledge Repo, a tool produced by Airbnb to share data stories in the form of blog posts. The developers of the ML pipelines should not be forced to use notebooks, though, as data scientists with an engineering background are more likely to use standard code modules and development tools instead.

The produced components of the pipelines should be reusable, the way TFX shows. A data transformation or a cross-validation module in the form of a container should be reusable by other ML pipelines. The coding environment should provide this ability, irrespective of the programming language or framework used. For example, the data analysis can be done in pandas, the data transformation with Spark in Scala, the training with PyTorch in Python and some visualisations with ggplot in R. To achieve this, it is recommended to use the container as the unit that implements each component of the pipeline.

A codebase structure is also recommended, as we observe with cookiecutter-data-science in the open-source community. The aim is to create projects from project templates by using the method of scaffolding. That creates consistency, which enables collaboration: a data scientist from one team can onboard faster onto another project. The ML platform team of iFood developed their own codebase templating system, called BruceML, which also enables data scientists to deliver their models. It does more than project scaffolding, so it can be compared to the arena CLI tool of Kubeflow.

The peripheral ecosystem of the ML platform should enable data scientists to reuse the learnings of the mature software engineering world. For example, the Continuous Delivery tool should promote GitOps, allowing ML engineers to create declarative configurations and environments in a version-controlled fashion. This way the ML pipeline can be automated, easy to understand and auditable, covering the lifecycle of the model.

Finally, although the underlying resource allocation can be agnostic to the ML pipeline and its developer, the ML platform should allow the management of its resources from within the pipeline. An example is managing only the Kubernetes resources of a project’s particular namespace, from within the Jupyter notebook which implements the ML pipeline.

4. Production

4.1 Model serving

The model serving component is responsible for the largest part of the platform’s cost in most organisations.

The process of deploying a model into production is part of its overall lifecycle. It literally involves taking the model file and saving it in a place where user traffic will reach it. The serving component loads the weights and parameters from that model file and exposes an interface to accept traffic, in the form of inference requests.

To optimise the cost of the serving component, the choice of the hosting technology is crucial. Running many models in a virtual machine allows the maximum use of the resources (CPU/GPU memory) for the inference requests. Running one model per container spreads the inference load through an orchestration layer.

The serving component itself has model lifecycle management capabilities, to support the rollout of versions. It is specific to the development framework used to produce the model, and each one can load specific model (file) types. The unified model format (ONNX) is not massively adopted yet. Leading serving components are the TensorFlow model server, NVIDIA TensorRT, Seldon and KFServing.

Serverless helps cost optimization

Having many versions of the models deployed gives flexibility to the deployment strategy, although the cost of the serving platform is multiplied by the number of different running versions in the case of containers. Autoscaling is an important feature of the underlying infrastructure, to cope with the demand at peak hours and reduce costs at off-peak hours. The ability to scale down to zero (serverless) for models that do not receive constant traffic contributes to cost optimisation.

Handling model serving in the form of microservices requires a container orchestrator. A key component of the orchestrator is the framework that manages the deployment strategy. A service mesh combines the serving process with the routing process into one entity, which brings the additional benefits of security and monitoring through sidecar injection. The overall strategy is coupled with the business experiment management process and tooling.

Like the lambda architecture in traditional data processing, inference requests can arrive in real time or in batch (archive) mode. Handling such a dual nature of traffic requires equivalent storage technology that abstracts both stream and batch processing behind one interface.

Deployed models are often exposed as:

  • gRPC (mostly for models which take images as input, as it uses protobuf, Google’s efficient binary serialisation format)
  • REST (for traditional ML and NLP models; it is the universal way of communication, based on HTTP)
  • messaging-based decision microservices via API wrappers, queueing and/or streaming services, mostly for batch processing

The data for scoring are usually unstructured, in the form of images or text for deep learning models, and more structured, in the form of transactions, for classic machine learning models. Robust transport of large payloads (images, text) requires different transfer protocols than small, structured data. HTTP and JSON offer a simple transport medium, which can be very expensive for large requests. Protocol buffers and gRPC offer a robust alternative for such transport, which a serving component should support. The choice of the protocol should consider the source of the request, whether it is an HTTP/gRPC-based API or a queue-based system; the latter is commonly used for batch-based inference in many OLX cases.
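
For example, a REST inference call against a TensorFlow Serving endpoint follows the /v1/models/<name>:predict convention with a JSON body of instances. The host, model name and feature vector below are placeholders.

```python
# Sketch of a REST inference request against TensorFlow Serving (placeholder host/model).
import json
import requests

url = "http://model-server:8501/v1/models/fraud_detector:predict"
payload = {"instances": [[0.12, 3.0, 42.0, 1.0]]}   # one feature vector

response = requests.post(url, data=json.dumps(payload), timeout=1.0)
response.raise_for_status()
predictions = response.json()["predictions"]
```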

A model serving technology encapsulates the complexity of:

  • Autoscaling
  • Networking
  • Health checking
  • Server configuration

Operational metrics are generated by the serving component and annotated with the model metadata. This brings the observability capability to the platform. These metrics are stored in time aggregates, usually at 1-second resolution for the most recent days, with a retention policy and rotation schema. The annotation with model metadata allows operators to break down the performance metrics by model hyperparameter or architecture and compare them. The number of predictions per second and the p95 latency are typical operational metrics; the number of images classified in each class and the confidence levels are typical model serving metrics.
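
As a minimal sketch, a custom serving wrapper could export such operational metrics with the Prometheus Python client; the metric names, labels and port below are assumptions, and off-the-shelf serving components expose comparable metrics out of the box.

```python
# Sketch: exposing prediction count and latency from a serving process.
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter(
    "model_predictions_total", "Number of predictions served", ["model_version"]
)
LATENCY = Histogram(
    "model_prediction_latency_seconds", "Prediction latency", ["model_version"]
)

def predict(model, features, model_version="v3"):
    start = time.perf_counter()
    result = model(features)                      # the actual inference call
    LATENCY.labels(model_version).observe(time.perf_counter() - start)
    PREDICTIONS.labels(model_version).inc()
    return result

start_http_server(8000)   # Prometheus scrapes the metrics from :8000/metrics
```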

A basic requirement of an ML platform is being able to serve models in a scalable and consistent manner.

There is a need to support inference for multiple models in a single, multi-tenant environment to which any of the organisation’s models can be deployed. Different organisations and platforms standardise on different types of model artifacts for deployment, including binaries, code, parameters, and full containers.

Kubernetes is a popular choice for distributed inference environments due to its built-in container and microservice management features like routing, load balancing, and declarative autoscaling.

KFServing enables serverless inferencing on Kubernetes. It provides performant, high abstraction interfaces for common ML frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX, as well as prediction, pre-processing, post-processing and explainability out of the box.

The layers of the KFServing tech stack

The KFServing stack relies on Knative for serverless, Istio for service mesh, and Kubernetes for orchestration.

Seldon Core is an open-source model serving framework for Kubernetes that is incorporated into platforms like Kubeflow, IBM Fabric for Deep Learning (FfDL), and Red Hat OpenShift.

Seldon tech stack

Cloud-based ML platform offerings like AWS SageMaker, Microsoft Azure Machine Learning, and Google AI Platform typically support one-click deployment of trained models.

4.2 Data validation, model validation, evaluation and monitoring

A model in production can run within a normal operating range, observed via metrics visualised through the monitoring solutions. When the operational behaviour of the model changes, Continuous Training, as introduced earlier, solves problems like the statistical drift of the training data.

Parts of the ML pipeline include the incoming data schema validation, the model evaluation, the monitoring and the logging of the inference requests. Each of these tasks is a separate component in the pipeline.

We have extensively discussed drift detection in the model drift section earlier. In addition to that, data schema validation can infer the schema of the incoming data, calculate statistics and detect data anomalies. Using it as a component of the pipeline, in the form of TFDV, and storing the results in the metadata store ensures the observability of the model lifecycle.
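
A hedged sketch with TensorFlow Data Validation: infer a schema from training statistics and check a batch of recent serving data against it. The CSV files are placeholders for whatever data sources the pipeline actually uses.

```python
# Sketch: schema inference and anomaly detection with TensorFlow Data Validation.
import pandas as pd
import tensorflow_data_validation as tfdv

train_df = pd.read_csv("train.csv")      # placeholder training data
serving_df = pd.read_csv("serving.csv")  # placeholder recent serving data

train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(train_stats)

serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
anomalies = tfdv.validate_statistics(serving_stats, schema=schema)
tfdv.display_anomalies(anomalies)        # empty output means no schema anomalies
```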

A model validator, as another component of the ML pipeline, supports the validation of the exported models, to ensure they are good enough to be pushed to production. It does that by comparing the newly built models of the Continuous Training pipeline against a baseline (such as the current version of the model). The model evaluator performs a deep analysis of the training results on the eval dataset and emits the results to the metadata store. Their performance is compared on the defined metrics (like AUC or loss), and according to the criteria a model is marked as good to move to the next step of the pipeline: the deployment to production.
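
The promotion gate itself can be as simple as a metric comparison against the baseline. A minimal sketch, assuming AUC as the metric and an illustrative improvement threshold:

```python
# Sketch: block a candidate model that does not beat the deployed baseline on AUC.
def should_promote(candidate_auc: float, baseline_auc: float, min_gain: float = 0.002) -> bool:
    """Promote only if the candidate improves on the baseline by at least `min_gain`."""
    return candidate_auc >= baseline_auc + min_gain

baseline_auc = 0.912    # evaluation result of the model currently in production
candidate_auc = 0.917   # evaluation result of the newly trained model

if should_promote(candidate_auc, baseline_auc):
    print("Mark the candidate as good and hand it to the deployment step.")
else:
    print("Keep the baseline and record the failed comparison in the metadata store.")
```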

The mature world of DevOps brings us the best tools to store, consume and visualise metrics and logs.

Time-series databases are common storage for such information, from Graphite and TSDB to Prometheus, which also dominates the storage of metrics for other Kubernetes applications. Grafana is the de facto visualisation tool when it comes to operational metrics.

The inference requests, together with the responses, are stored in the form of logs. These logs represent the current knowledge (request) and the new knowledge (response), which is the essence of our business. For log storage and analysis, both open-source and commercial solutions can provide this functionality, like the ELK stack and Splunk. For each scoring (or inference) activity (hence metric & log entry), there should be a reference to the metadata of the model used. Modern platforms capture these logs in the form of events, following the CloudEvents specification. Such events can be delivered through various industry-standard protocols (e.g. HTTP, AMQP, MQTT, SMTP), open-source protocols (e.g. Kafka, NATS), or platform/vendor-specific protocols (AWS Kinesis, Azure Event Grid). The all-in-one platforms have the open-source tools integrated.

There are a few different ways you can monitor your predictions. The ideal way is to log the predictions that you make, then join them back to the outcomes that you observe later on as part of the running of the system, and see whether you got the prediction right or wrong¹¹.

Questions a data scientist might have when observing a model in production include the following (a small pandas sketch answering them follows the list):

  • What is the mean/std of the resulting predictions?
  • What is the relative frequency of the various values that categorical variable X can take?

4.3 Deployment strategies

Platforms can help ensure safe and effective model deployment by

  1. automating the implementation of phased deployments and
  2. automating the management of online experiments

These two share direct analogs with the deployment strategies DevOps teams have been refining to support the safe rollout of microservices. The following three techniques rely on version-aware automated model deployment mechanisms, as well as load balancing and traffic management features provided by the platform.

  1. A/B testing allows for competing models to be tested on live traffic.
  2. Blue-green testing allows for the quick rollback of new models that fail on deployment.
  3. Canary deployments allow for the testing of new models with small amounts of traffic until they’re validated.

A deployment framework handles the traffic coming to the model using a strategy. It can be the full production traffic, a part of it, or a mirror of it. The simplest method is switching the traffic of the requests to the new version of the model. More advanced management of the selection between different model versions is introduced with a deployment strategy. Deployment strategies include blue/green deployments, A/B testing and canary releases. Blue/green is the deployment method of building a fully blown image of production with the new versions of the components and switching over the traffic when ready. The A/B testing strategy is focused on experimenting with new features. In a canary release, a new version of the model is released to a small share of the inference requests, and the results are monitored to observe the difference from the previous version. These three strategies can also coexist.
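
The traffic-splitting logic behind a canary release is conceptually simple, even though in practice it is delegated to the service mesh discussed next. A toy sketch of weighted routing between two model versions (weights and version names are placeholders):

```python
# Toy canary router: send a small, configurable share of requests to the new version.
import random

ROUTES = {"v1": 0.95, "v2-canary": 0.05}   # weights must sum to 1.0

def choose_model_version(routes=ROUTES) -> str:
    versions, weights = zip(*routes.items())
    return random.choices(versions, weights=weights, k=1)[0]

# Each incoming inference request picks a backend according to the weights.
picked = [choose_model_version() for _ in range(10_000)]
print({v: picked.count(v) / len(picked) for v in ROUTES})
```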

As discussed earlier in the model serving section, the service mesh is the modern way of implementing these strategies. Leading service meshes are Istio, Linkerd and Consul. A comparison table with the features of each service mesh can be seen in the Weaveworks Flagger repo.

5. Conclusion

Building an MLOps platform is an action our portfolio companies take in order to accelerate and manage the workflow of their data scientists in production. This workflow is reflected in ML pipelines and includes the three main tasks of feature engineering, training pipelines and serving in production.

For feature engineering, a low-latency access channel is required to enable fast inference, and cost-optimised storage is required to power frequent model retraining. These are the basic building blocks of a feature store.

Model training requires a pipeline orchestrator, as it consists of subsequent tasks with dependencies between them, which makes the whole pipeline prone to errors. Software building pipelines are different from data pipelines, which are in turn different from ML pipelines. In the machine learning flow, the ML engineer writes code to train models, uses the data to evaluate them and then observes how they perform in production in order to improve them. Hence the implementation of such a workflow requires a combination of the orchestration technologies we’ve discussed above.

TFX presents this pipeline and proposes the use of components that perform these subsequent tasks. It defines a modern, complete ML pipeline, from building the features, to running the training, evaluating the results, and deploying and serving the model in production.

Kubernetes is the most advanced system for orchestrating containers, the de facto tool to run workloads in production, and the cloud-agnostic solution that saves you from cloud vendor lock-in and hence optimises your costs.

Kubeflow is positioned as the way to do ML on Kubernetes, by implementing TFX. It provides a coding environment by implementing Jupyter notebooks in the form of Kubernetes resources. The cloud providers are also contributors to this project and implement their data loading mechanisms across KF’s components. The orchestration is implemented via KF Pipelines and the serving of the model via KFServing. The metadata across its components is specified in the specs of the Kubernetes resources throughout the platform.

This native way of technology integration is what makes us classify Kubeflow as the crucial design pattern we recommend for building an ML platform and delivering high-quality ML projects.

I’d like to thank my colleagues from the Prosus AI team for their suggestions and help in structuring this into a blog post format. Please feel free to ask questions or provide suggestions in the comments section, or reach out to us at datascience@prosus.com.
