Fundamentals of MLOps — Part 1 | A Gentle Introduction to MLOps

Tezan Sahu
Sep 5 · 17 min read

Ever wondered how organizations build, deploy, maintain, adapt, retrain and redeploy large-scale AI-powered applications? In today’s fast-paced industry, maintaining and deploying scalable applications while being able to adapt quickly to the changing consumer requirements is of utmost importance.

Through this 4-blog series on Fundamentals of MLOps, you will be introduced to some of the core ideas behind combining the long-established practices of DevOps with the emerging field of Machine Learning. You will be exposed to the various stages of an ML model lifecycle, including data versioning, experimentation, evaluation & monitoring. To consolidate these principles, you will also get an opportunity to build & deploy end-to-end ML pipelines by leveraging various ML Operations Management tools & frameworks such as DVC, PyCaret, MLFlow & FastAPI.

By the end of this series, you will also have trained, experimented with & deployed a production-ready ML model on AWS, & served it using FastAPI.

Contents

  • The Rise & Adoption of DevOps Principles
  • MLOps Demystified
  • ML Workflow Lifecycle
  • MLOps Principles
  • Benefits of MLOps Solutions
  • Challenges Associated with MLOps
  • Tools & Infrastructure for MLOps
  • Adoption of MLOps
  • Concluding Remarks
  • Additional References

Introduction

  • It takes far longer to deploy ML models to production than it does to create them
  • The actual ML code makes up just a small portion of real-world ML systems, while the surrounding infrastructure in the production environment is extensive and complicated
Image Source: Hidden Technical Debt in Machine Learning Systems

Historically, ~85% of ML models that are built never reach production. Surveys & reports also suggest that only ~60% of projects make it from prototype to production, and that is at organizations that already have decent experience with AI. Moreover, the delivery times of such products & services are usually measured in months, whereas ideally they should be measured in hours (or days, at most). According to the 2020 State of Enterprise Machine Learning report, the main challenges faced by people developing systems with ML capabilities are scale, version control, model reproducibility, and aligning stakeholders.

These studies should be enough to drive home the fact that, straightforward as it may seem, turning academic ML research into production-ready & usable ML systems requires far more consideration than one might otherwise imagine.

Ultimately, ML systems boil down to developing and implementing computer code. Thus, in spite of some stark contrasts, it should come as no surprise that the management of these systems is comparable to traditional software development. To understand how one can improve the operationalization (deployment & maintenance) of ML systems (through MLOps), we first need to understand some of the basic principles of DevOps, which have proved to increase an organization’s ability to deliver applications and services faster than traditional software development processes.

The Rise & Adoption of DevOps Principles

DevOps refers to a set of practices & tools that combine software development (Dev) & IT operations (Ops), with the primary goal of optimizing the flow of value from an idea to the end-user. This is usually achieved through:

  • Shortening the software system development lifecycle
  • Providing continuous delivery with high software quality

DevOps typically aims to overcome the institutional divide between the team that writes the code (Dev) and the team that manages the infrastructure and tools used to run and manage the product (Ops).

DevOps Practices & Tools

  • Continuous development
  • Continuous testing
  • Continuous integration (CI)
  • Continuous delivery
  • Continuous deployment (CD)
  • Continuous monitoring
  • Infrastructure as code

Followers of DevOps practices often use certain DevOps-friendly tools as part of their DevOps “toolchain”. The goal of these tools is to further streamline, shorten, and automate the various stages of the software delivery workflow. The image below clearly depicts the DevOps lifecycle stages & the important tools associated with each stage.

Image Source: DevOps without DevOps Tools

The DevOps Culture and Mindset

Image Source: Created by Author

Although DevOps may refer to a lot of technical solutions, those practices still need people to carry them out. Hence, a successful DevOps implementation must first focus on people, collaboration, and mindset. This could be a big shift in thinking. Following are some of the essential values for a DevOps mindset:

  • Focus on our stakeholders and their feedback rather than simply changing for the sake of change
  • Strive to always innovate and improve beyond repeatable processes and frameworks
  • Inspire and share collaboratively instead of creating a silo
  • Measure performance across the organization, not just in a line of business
  • Promote a culture of learning through lean quality deliverables, not just tools and automation

Adoption of DevOps

DevOps is here to stay — for very good reasons. Many believed it was impossible, yet DevOps has succeeded in bringing together business users, developers, test engineers, security engineers, and system administrators in a unified process focused on satisfying client needs.

Motivation for Combining ML & DevOps

MLOps Demystified

Wikipedia defines MLOps as:

MLOps is the process of taking an experimental Machine Learning model into a production system.

Although it may at first seem as if MLOps = ML + DevOps (because it pulls heavily from the concept of DevOps), that isn’t the most accurate representation of MLOps. In spite of their similarities, there is a critical difference setting DevOps & MLOps apart: while software code is (relatively) static, data is always changing, which means ML models must constantly be learning and adapting to newer inputs. The complexity of this environment, as well as the fact that machine learning models are composed of both code and data, is what distinguishes MLOps as a new and distinct field.

Since Data Engineering provides important tools and concepts that are indispensable to solving the puzzle of ML in production, MLOps can be defined as follows:

MLOps is a set of practices that lies in the intersection between ML, DevOps & Data Engineering, which aims to deploy and maintain ML systems in production reliably and efficiently

Image Source: The Road to MLOps: Machine Learning as an Engineering Discipline

Some other ML-specific challenges that MLOps caters to are mentioned below:

  • Data & Hyperparameter versioning
  • Iterative experimentation & evaluation of models
  • Production monitoring to ensure the performance of the model with new/unseen data
  • Dynamic scaling of computing power (infrastructure) in production

With the plethora of tools & opportunities that it provides for building & deploying end-to-end ML systems, MLOps is gaining a lot of traction among Data Scientists, ML Engineers, and AI enthusiasts. While MLOps is relatively nascent, the data science community generally agrees that it is an umbrella term for best practices and guiding principles around machine learning & not a single technical solution.

ML Workflow Lifecycle

Image Source: ML Lifecycle: High-Level Overview
Image Source: The ML-Lifecycle: Detailed View

It is evident that a classic ML workflow comprises 3 major phases, namely Data Preparation, Model Training & Tuning (this is the core of an ML workflow), and Deployment & Monitoring. The sections below dive slightly deeper into the steps associated with each of these phases.

Before understanding the 3 major phases of ML workflows, it is imperative to first develop a thorough understanding of the business case & critically evaluate the requirement of ML to address the problem statement.

Data Preparation

  • Exploratory Data Analysis: It refers to the essential process of conducting preliminary studies on data in order to unravel patterns, identify anomalies, test hypotheses, and validate assumptions using summary statistics and graphical representations.
  • Data Wrangling: It is the act of cleansing and integrating chaotic and complicated data sets for easy access and analysis. It also involves the correction of errors in data such as missing value imputation & treatment of outliers.
  • Data Labeling & Annotation: It is the act of adding one or more relevant and useful labels to raw data (pictures, text files, videos, etc.) so as to offer context for the training of a machine learning model.
  • Feature Engineering: It is the process of extracting features (characteristics, traits, and attributes) from raw data using domain expertise. It is an essential step for classical ML algorithms, but it may not be explicitly required for deep learning tasks (since in DL, essential features are expected to be automatically inferred & used by the layers of the neural network).
  • Data Splitting: It involves splitting the data into training, validation, and test datasets to be used during the core machine learning stages to produce the ML model
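To make two of the steps above a little more concrete, here is a minimal sketch of missing value imputation (Data Wrangling) and Data Splitting using only the Python standard library. The function names, the median fill strategy, and the 70/15/15 split ratio are assumptions chosen for illustration; real projects usually rely on libraries such as pandas or scikit-learn for this.

```python
import random
from statistics import median


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows reproducibly, then cut them into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left becomes the test set
    return train, val, test


ages = impute_missing([34, None, 29, 41, None, 38])
rows = list(range(100))  # stand-in for 100 data records
train, val, test = split_dataset(rows)
print(ages)
print(len(train), len(val), len(test))
```

Fixing the random seed is a small design choice that matters later: it keeps the split identical across runs, which is one ingredient of reproducible experiments.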

Model Training & Tuning

  • Model Training: It is the process of applying the machine learning algorithm on training data to train an ML model.
  • Model Validation: It is the process of systematically comparing model outputs to independent real-world observations in order to assess their quantitative and qualitative correspondence with reality, before serving the model in production to the end-user.
  • Hyperparameter Tuning: A hyperparameter is a model argument whose value is set before the learning process begins. Hyperparameter tuning involves the iterative approach of choosing a set of hyperparameters for a learning algorithm until optimality is achieved.
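The interplay between training, validation, and hyperparameter tuning can be sketched with a toy example. The snippet below is an illustration under stated assumptions, not a recommended setup: it uses a one-feature ridge regression without intercept (so that it can be solved in closed form), a hand-made dataset, and a plain grid search. The hyperparameter is `alpha`, the regularization strength, which is fixed before each training run.

```python
def fit_ridge(xs, ys, alpha):
    """Train: closed-form weight for one-feature ridge regression (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mse(w, xs, ys):
    """Validate: mean squared error of the predictions w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, val, alphas):
    """Tune: train one model per alpha and keep the one with the lowest validation error."""
    results = []
    for alpha in alphas:
        w = fit_ridge(train[0], train[1], alpha)
        results.append((mse(w, val[0], val[1]), alpha, w))
    return min(results)  # tuples compare by validation error first


# Toy data: the training targets are noisy, the validation targets follow y ≈ 2x.
train_set = ([1, 2, 3, 4], [2.4, 4.3, 6.5, 8.6])
val_set = ([5, 6], [10.0, 12.1])

best_error, best_alpha, best_weight = grid_search(train_set, val_set, [0.0, 0.1, 1.0, 10.0])
print(best_alpha, round(best_weight, 3), round(best_error, 3))
```

Note that the model is always selected on the validation set, never on the training set; the held-out test set would only be used once, for the final estimate.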

Deployment & Monitoring

  • Model Packaging: It is the process of exporting the fully trained ML model into a specific format (like PMML, PFA, ONNX, etc.) so that it can be consumed by the end-user.
  • Model Serving: It refers to the process of deploying the packaged ML model in a production environment. This can be achieved in 2 major ways:
      • Model-as-a-Service: The model is deployed into a simple framework to provide a REST API endpoint that responds to requests in real-time
      • Embedded Model: The model is packaged into an application & then published
  • Performance Monitoring: It is the process of observing the ML model performance based on live and previously unseen data so as to capture signals that trigger the need for potential retraining of the model.
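As a rough illustration of Performance Monitoring, the sketch below tracks the accuracy of a deployed model over a sliding window of recent predictions and raises a flag when it drops below a threshold. The class name, the window size, and the threshold are hypothetical choices for this example; production systems typically monitor several metrics as well as the input data itself.

```python
from collections import deque


class AccuracyMonitor:
    """Track rolling accuracy on live data and signal when retraining may be needed."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # oldest outcomes fall out automatically
        self.threshold = threshold

    def record(self, prediction, actual):
        """Store whether a live prediction matched the label that arrived later."""
        self.outcomes.append(prediction == actual)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self):
        # Only judge once the window is full, to avoid noisy early alerts.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy < self.threshold


monitor = AccuracyMonitor(window=5, threshold=0.8)
for prediction, actual in [(1, 1), (0, 0), (1, 1), (1, 0), (0, 0)]:
    monitor.record(prediction, actual)
flag_before = monitor.needs_retraining()  # 4 of 5 correct, not below the threshold

monitor.record(1, 0)  # another miss slides into the window
flag_after = monitor.needs_retraining()
print(flag_before, flag_after)
```

In a real pipeline, the flag would act as the signal that triggers the retraining workflow rather than just being printed.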

MLOps Principles

Reproducibility: It ensures that, given the same input, each phase of data processing, ML model training, and ML model deployment yields identical outputs. It involves versioning not only the code but also the data, hyperparameters & other metadata, along with effective documentation at each stage of the workflow. This allows every production model to be audited & reproduced. Some of the key practices that ensure reproducibility are Versioning & Experiment Tracking.
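One way to picture versioning of code, data & hyperparameters together is to derive a single fingerprint from all three. The following is a minimal sketch using only the standard library; the function name and the record layout are assumptions made for this example, and it is not how DVC or MLFlow work internally.

```python
import hashlib
import json


def fingerprint_run(code_version, data_bytes, hyperparams):
    """Build a record that identifies exactly what went into a training run."""
    record = {
        "code_version": code_version,  # e.g. a git commit hash (hypothetical value below)
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "hyperparams": hyperparams,
    }
    # sort_keys gives a canonical form, so key order never changes the id.
    canonical = json.dumps(record, sort_keys=True)
    record["run_id"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return record


data_v1 = b"age,income\n34,52000\n29,48000\n"
data_v2 = b"age,income\n34,52000\n29,48000\n41,61000\n"

run_a = fingerprint_run("abc123", data_v1, {"lr": 0.01, "epochs": 10})
run_b = fingerprint_run("abc123", data_v1, {"epochs": 10, "lr": 0.01})
run_c = fingerprint_run("abc123", data_v2, {"lr": 0.01, "epochs": 10})
print(run_a["run_id"] == run_b["run_id"], run_a["run_id"] == run_c["run_id"])
```

Two runs share an id only if the code version, the data, and the hyperparameters all match, which is what makes a stored model traceable back to its exact inputs.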

Collaboration: Successful implementation of MLOps, like DevOps, involves people working together — the collaboration is usually between data scientists, ML engineers, business analysts and IT operations professionals. MLOps encourages teams to make transparent the whole process of creating an ML model, from data extraction through model deployment and monitoring.

Scalability: MLOps enables organizations to scale in order to address critical issues by making ML initiatives more efficient and effective. This implies the enhancement in the ability to train a larger number of models & also use models with high-scale data in production.

Continuous X: The lifecycle of a trained model is fully determined by the use-case and the dynamic nature of the underlying data. Without continuous processes, data scientists will have to spend a lot of effort each time developing manual and ad-hoc models. MLOps encourages the following practices:

  • Continuous Integration (CI) adds testing and validating data and models to the testing and validating code and components.
  • Continuous Delivery (CD) refers to the delivery of an ML training pipeline that deploys another ML model prediction service automatically.
  • Continuous Training (CT) is a characteristic specific to ML systems that automatically retrains ML models for re-deployment.
  • Continuous Monitoring (CM) is concerned with the monitoring of production data as well as model performance indicators that are linked to business metrics of end-users
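To give a flavour of what "testing and validating data and models" in CI could look like, here is a minimal sketch of two checks that a pipeline might run before promoting a model. The schema format, the function names, and the tolerance value are all assumptions made for illustration.

```python
def validate_rows(rows, schema):
    """Return a list of human-readable problems found in a batch of data."""
    problems = []
    for i, row in enumerate(rows):
        for column, (expected_type, low, high) in schema.items():
            value = row.get(column)
            if not isinstance(value, expected_type):
                problems.append(f"row {i}: {column} has type {type(value).__name__}")
            elif not low <= value <= high:
                problems.append(f"row {i}: {column}={value} outside [{low}, {high}]")
    return problems


def passes_quality_gate(candidate_score, production_score, tolerance=0.01):
    """Allow promotion only if the candidate is not meaningfully worse than production."""
    return candidate_score >= production_score - tolerance


# Hypothetical schema: column -> (type, minimum, maximum)
schema = {"age": (int, 0, 120), "income": (float, 0.0, 1e7)}
batch = [
    {"age": 34, "income": 52000.0},
    {"age": -3, "income": None},  # deliberately broken record
]

problems = validate_rows(batch, schema)
print(problems)
print(passes_quality_gate(0.91, 0.90), passes_quality_gate(0.85, 0.90))
```

A CI job would fail the build when `problems` is non-empty or when the quality gate returns `False`, in the same way that failing unit tests block a code change.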

Automation: The objective of MLOps is to reduce the time & cost of pushing models into production. This translates to automating the end-to-end ML workflow pipeline so that it runs without manual intervention, typically by means of automated triggers. Automated testing helps discover problems early, so that errors can be fixed quickly.

Image Source: Verta launches new ModelOps product for hybrid environments

The following table summarizes how the various principles of MLOps can be applied to the usual ML workflow lifecycle stages.

Benefits of MLOps Solutions

  • Rapid innovation through robust machine learning lifecycle management
  • Creation of reproducible workflows and models
  • Clear direction & measurable benchmarks provided to data scientists
  • Easy deployment of high precision models in any location
  • Machine learning resource management system and control
  • Open communication between data science teams and operations team, leading to opening up of bottlenecks
  • Effective governance of data & process
  • Improvement in quality of models due to continuous & focused feedback
  • Rigorous automated testing & validation that helps reduce model bias while improving explainability

Evaluating the Effectiveness of MLOps

  • Deployment Frequency: It depends on model retraining requirements & level of automation of the deployment process
  • Lead Time for Changes: It depends on the duration of exploratory data analysis, model selection & training, and the number of manual steps
  • Mean Time to Restore (MTTR): It depends on the number and duration of manually performed model debugging & model deployment steps
  • Change Failure Rate: It can be expressed as the difference between the performance metrics of the currently deployed ML model and those of the previous model
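The four metrics above can be computed from a simple deployment log. The sketch below is only an illustration: the `Deployment` record, its fields, and the sample dates are hypothetical, and a change is counted as a "failure" when the deployed model's accuracy is lower than that of the previously deployed model, which is one possible reading of the definition above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed: datetime                  # when the change was ready
    deployed: datetime                   # when it reached production
    accuracy: float                      # live metric of the deployed model
    restored: Optional[datetime] = None  # when a bad deployment was fixed


def summarize(deployments, period_days):
    n = len(deployments)
    lead_times = [d.deployed - d.committed for d in deployments]
    # Assumption: a change "fails" if accuracy drops relative to the previous model.
    failures = [
        cur for prev, cur in zip(deployments, deployments[1:])
        if cur.accuracy < prev.accuracy
    ]
    restore_times = [d.restored - d.deployed for d in failures if d.restored]
    return {
        "deployments_per_week": n * 7 / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / n,
        "change_failure_rate": len(failures) / n,
        "mttr": sum(restore_times, timedelta()) / len(restore_times) if restore_times else None,
    }


log = [
    Deployment(datetime(2021, 1, 1, 9), datetime(2021, 1, 2, 9), 0.90),
    Deployment(datetime(2021, 1, 8, 9), datetime(2021, 1, 8, 21), 0.92),
    Deployment(datetime(2021, 1, 15, 9), datetime(2021, 1, 16, 9), 0.88,
               restored=datetime(2021, 1, 16, 13)),
    Deployment(datetime(2021, 1, 22, 9), datetime(2021, 1, 22, 21), 0.93),
]

report = summarize(log, period_days=28)
print(report)
```

Tracking these numbers over time gives a rough indication of whether automation is actually shortening the path from a change to a healthy model in production.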

Challenges Associated with MLOps Adoption

  • Organizations are hesitant to incorporate machine learning into their processes, since it is difficult to rely on models for tasks that people used to perform.
  • The assessment and consideration of model risks when implementing a machine learning model
  • Shortage of specialists in the current talent market who are equally well versed in competencies at the intersection of Data Science, DevOps and IT
  • Over-reliance on particular tools, which might lead people to favor those that provide short-term benefits over those that provide long-term benefits
  • Negligence of test automation and more focus on CI/CD deployments

The key to overcoming these issues is to be aware of them, build a solid knowledge base & a broad perspective, and learn to use MLOps principles and practices. This aids the implementation of a fully automated, integrated framework and, as a result, enhances business verticals.

Tools & Infrastructure for MLOps

Here is a curated list of MLOps tools & frameworks that cater to the various portions of the MLOps landscape. While browsing through the list, it is important to note that the MLOps landscape is rapidly evolving, with newer & more advanced tools being developed for specialized applications. The image below summarizes some of the most popular MLOps frameworks currently adopted in the industry to develop the ML infrastructure:

Image Source: Maximizing ML Infrastructure Tools for Production Workloads — Arize AI

In the upcoming parts of this series, we will dig deep into some of these tools (DVC, PyCaret & MLFlow) to gain hands-on exposure & use them for our own projects.

MLOps Technology Stack Template

Image Source: The MLOps Stack

Towards the end of this series, we will be in a position to revisit this template & fill it up with the various tools that we learn along the way.

You can download this template for personal use as well.

Adoption of MLOps

  • Netflix: It uses an in-house ML framework-agnostic library called Metaflow to rapidly experiment by training machine learning models and effectively managing data. Using the Metaflow API, their ML workloads seamlessly interact with AWS Cloud infrastructure services. It also uses an internal tool called Runway to manage all models in production & automatically alert the ML teams for models that are stale in production.
  • Uber: Teams at Uber operationalize their ML models through an internal ML-as-a-service platform called Michelangelo, which enables them to seamlessly build, deploy, and operate ML solutions at scale. They also use Manifold — a model-agnostic visual debugging tool.
  • Facebook: It has developed a brand-new platform, FBLearner Flow, that is capable of effortlessly reusing algorithms across products, scaling to perform thousands of concurrent custom experiments, and organizing experiments with simplicity. It can ingest trillions of data points every day, train thousands of models (either offline or in real-time) and then deploy them to the server fleet for live predictions.
  • Carbon: Carbon uses DataRobot to create comprehensive credit risk models, saving them a whole end-to-end process and allowing the firm to focus on sourcing the correct data and making other decisions that help drive their business ahead.

MLOps demands the understanding of data biases as well as a strict discipline within the company that decides to adopt it. As a result, each organization should build its own set of practices for adapting MLOps principles to its AI development and automation.

MLOps-Related Conferences

Concluding Remarks

In the next parts, we will get our hands dirty by implementing some of the MLOps practices that we saw in this post, using various tools & frameworks.

Following are the other parts of this Fundamentals of MLOps series:

Thank you & Happy Coding!

About the Author

Website: Tezan Sahu | Microsoft
LinkedIn: Tezan Sahu | LinkedIn
Email ID: tezansahu@gmail.com

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Tezan Sahu

Written by

Data & Applied Scientist at Microsoft | B. Tech in Mechanical Engineering (Minor in CS) from IIT Bombay | GSoC’20 with PEcAn Project