Mastering MLOps: A Step-by-Step Guide with VESSL

VESSL AI · Jul 8, 2024

In today’s rapidly changing technological landscape, Machine Learning is a potent asset that drives innovation and efficiency in business operations. From disease and fraud detection to demand forecasting and self-driving cars, ML is transforming industries by tackling complex problems and enhancing decision-making capabilities. It is not a set-it-and-forget-it process; it requires constant monitoring and updating to create models that can shift and adapt to the ever-changing world. With the dynamic nature of ML, the need for a streamlined process of creating, managing, and deploying models has placed MLOps at the center of the technological ecosystem.

What is MLOps?

MLOps, short for Machine Learning Operations, is a set of processes that support sustainable machine learning development and operation. It encompasses the entire ML lifecycle, from data loading to model deployment and monitoring. MLOps seamlessly integrates automated systems into the ML lifecycle, ensuring scalability, security, and efficiency.

Role of DevOps, DataOps, and ModelOps in MLOps

The full inventory of MLOps capabilities blends elements of DataOps, ModelOps, and DevOps for each step of the ML lifecycle.

DataOps is the practice of efficiently managing different types of data, structured or unstructured, ensuring quality and reliability. This process includes storing, labeling, and versioning data, as well as data extraction, transformation, and loading (ETL).

ModelOps focuses on training, deploying, and monitoring ML models, offering a high level of model governance. It allows for version control through the model registry and experiment tracking to ensure models remain high-performing and up-to-date.

As for DevOps, MLOps is often described as ML + DevOps, since the MLOps process relies heavily on common DevOps practices. Similar to DevOps, MLOps automates the workflows required for development, monitoring, and updating. Practices such as CI/CD, clustering, GPU resource provisioning, job scheduling, and monitoring enhance MLOps’ scalability and robustness.

(Image source: https://blogs.nvidia.com/blog/what-is-mlops/)

Different platforms offer varying combinations of these three elements depending on the scope and specialization of the platform. For example, platforms like VESSL emphasize the fundamentals of ModelOps and DevOps, providing essential tools for model training and deployment.

Step-by-Step MLOps with VESSL

The end-to-end MLOps pipeline can be broken down into five key steps. We will take a look at each of these steps and how they can be implemented on the VESSL platform.

1) Computing Resources

The first step in any end-to-end MLOps pipeline is to secure adequate computing resources. Clusters, or sets of computing nodes/computers, are often used for such projects and are managed by systems such as Kubernetes. Clusters can be hosted on the cloud or on-premise; while on-premise clusters have a fixed number of nodes, cloud clusters can implement a node pool to easily scale up resources as more instances are created.
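
To make this concrete, the sketch below uses the official Kubernetes Python client to list a cluster's nodes and their GPU capacity, which is the kind of information a scheduler or autoscaler works from. It assumes a local kubeconfig and NVIDIA's device plugin exposing the nvidia.com/gpu resource; it is a generic Kubernetes illustration, not VESSL-specific code.

    # A minimal sketch of inspecting cluster nodes with the official Kubernetes Python client.
    # Assumes a local kubeconfig and NVIDIA's device plugin; not VESSL-specific.
    from kubernetes import client, config

    config.load_kube_config()        # use config.load_incluster_config() when running inside a pod
    core_v1 = client.CoreV1Api()

    for node in core_v1.list_node().items:
        capacity = node.status.capacity
        print(f"{node.metadata.name}: {capacity.get('cpu')} CPUs, "
              f"{capacity.get('nvidia.com/gpu', '0')} GPUs")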

VESSL primarily manages three types of clusters: VESSL-managed clusters, user cloud clusters, and on-premise clusters. This allows for hybrid integration of clusters from cloud providers such as AWS and GCP as well as on-premise servers. A VESSL agent is installed on each node of a cluster to collect system metrics from machine learning workloads, providing detailed reports of node status for each cluster and streamlining troubleshooting. Additionally, the agents can automatically notify the API server to scale up or down as needed, depending on metrics such as GPU usage and workload count.

2) Data Preprocessing

The next step is data preprocessing and ETL (Extract, Transform, and Load), which prepares raw data for use. ETL is the process of extracting the necessary information from raw data, whether images, text, audio, or other formats, to derive meaningful features.
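
As a simplified illustration, here is a minimal ETL sketch in Python using pandas; the file paths and column names are hypothetical stand-ins for whatever raw data your project uses.

    # A minimal, generic ETL sketch with pandas; file paths and column names are hypothetical.
    import numpy as np
    import pandas as pd

    # Extract: load raw records from a source (could equally be an S3 object or a database query).
    raw = pd.read_csv("raw/transactions.csv")

    # Transform: clean the records and derive model-ready features.
    raw = raw.dropna(subset=["amount", "timestamp"])
    raw["timestamp"] = pd.to_datetime(raw["timestamp"])
    raw["hour"] = raw["timestamp"].dt.hour
    raw["amount_log"] = np.log1p(raw["amount"].clip(lower=0))

    # Load: write the processed features to the store the training step will read from.
    raw.to_parquet("processed/features.parquet", index=False)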

In this phase, we set up the environment needed to use the data in the subsequent steps through VESSL Dataset and Workspace. VESSL supports four types of datasets:

  • Local Files: Upload directly to VESSL Storage.
  • AWS S3: Manage datasets stored on Amazon Web Services.
  • Google Cloud Storage (GCS): Manage datasets stored on Google Cloud Platform.
  • On-Premise Datasets: Utilize datasets stored on local servers.

To create a Workspace instance, you can bring in container images or resource specs to preinstall necessary packages, Jupyter, SSH, and other tools. The workspace can then be accessed via Jupyter, through a terminal over SSH, or by integrating with VSCode.
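
As an example of getting data into such an environment, the sketch below pulls a cloud-hosted dataset from an S3 bucket with boto3. The bucket, prefix, and local paths are hypothetical, and this is a generic illustration rather than VESSL's own dataset-mounting mechanism.

    # Generic sketch of pulling an S3-hosted dataset into a workspace with boto3.
    # Bucket, prefix, and local paths are hypothetical; credentials come from the usual AWS config.
    import os
    import boto3

    s3 = boto3.client("s3")
    bucket, prefix = "my-ml-datasets", "images/train/"   # hypothetical bucket and prefix

    os.makedirs("data/train", exist_ok=True)
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):                             # skip folder placeholder objects
            continue
        s3.download_file(bucket, key, os.path.join("data/train", os.path.basename(key)))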

3) Model Training and Hyperparameter Tuning

Once the data is prepared, the model can be trained. VESSL provides two services required for this step: VESSL Run and VESSL Artifact.

VESSL Run allows you to train, fine-tune, and scale ML models with ease through a unified YAML interface, simplifying the setup and management of training and resource allocation.

VESSL Artifact provides a way to save datasets and model checkpoints produced during runs. It ensures that these artifacts can be restored later, facilitating reproducibility and continuity in ML projects.

Run and Artifact operate as a pair: a Run exports datasets and trained models to an Artifact, which can later be imported back into another Run.
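
In practice, the script a Run executes typically reads its data from a mounted input path and writes checkpoints to an output path that is then captured as an Artifact. The sketch below shows that pattern with scikit-learn; the /input and /output paths, file names, and column names are assumptions for illustration, not fixed VESSL conventions.

    # Sketch of a training script a Run might execute: read mounted data, train, export a checkpoint.
    # The /input and /output paths are illustrative mount points, not fixed VESSL conventions.
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    features = pd.read_parquet("/input/features.parquet")        # dataset mounted into the Run
    X, y = features.drop(columns=["label"]), features["label"]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

    joblib.dump(model, "/output/model.joblib")                   # captured as an Artifact after the Run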

In addition to VESSL Artifact, Runs can import from and export to various external sources, including code from GitHub, GitLab, or Bitbucket and datasets and models from VESSL or Hugging Face. You can also directly mount on-premise datasets and object storage through the Network File System (NFS), hostPath, or the recently introduced FUSE services from Google Cloud and AWS. After training, models can be uploaded to Artifact, Dataset, or object storage on AWS or GCP.

4) Model Registry and Serving

The fourth step is Model Registry and Serving. After training the model, it can be stored and version-controlled in the model registry, where it can be served, retrained, or fine-tuned. The model registry allows easy comparison and analysis of models by storing evaluation metrics and performance indicators.

VESSL Serve provides the infrastructure necessary for deploying trained models in the model registry or any open-source model from Hugging Face. It abstracts the complex steps of creating model API servers, offering easy-to-use API endpoints for production deployment.
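
From a client's point of view, a served model is just an HTTP endpoint. The sketch below posts a JSON payload to a hypothetical endpoint with the requests library; the URL, token, and payload schema are illustrative assumptions rather than VESSL Serve's documented interface.

    # Generic sketch of calling a deployed model over HTTP; the endpoint URL, token, and
    # payload schema are hypothetical, not VESSL Serve's documented API.
    import requests

    ENDPOINT = "https://serve.example.com/v1/models/fraud-detector:predict"
    headers = {"Authorization": "Bearer <your-token>", "Content-Type": "application/json"}
    payload = {"instances": [{"amount_log": 4.2, "hour": 13}]}

    response = requests.post(ENDPOINT, json=payload, headers=headers, timeout=10)
    response.raise_for_status()
    print(response.json())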

5) Pipeline

The final step is to manage the pipeline, encompassing the full development and deployment process. The pipeline is at the core of MLOps, forming an organized, iterative process to track workflows and performance and scale as needed.

The VESSL Pipeline provides end-to-end CI/CD that facilitates a seamless and efficient ML workflow, from initial data preparation to final model deployment, enhancing productivity and the ability to respond quickly to new insights or changing requirements. It chains multiple Run-Artifact steps together to continuously check model performance, and these steps can be customized to meet business needs. For added flexibility, pipelines can incorporate human-in-the-loop steps for more sophisticated, high-quality evaluation, and can be triggered manually, on a schedule via cron jobs, or through webhooks.
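
Conceptually, a pipeline is an ordered set of steps with gates between them. The sketch below wires the earlier stages together in plain Python, with a simple evaluation gate and an optional human-approval hook; it illustrates the control flow only, not VESSL Pipeline's actual interface, and the stub functions stand in for the Run/Artifact steps described above.

    # A conceptual pipeline sketch: ordered steps with an evaluation gate and a human-approval hook.
    # The stub functions stand in for the Run/Artifact steps described above; not VESSL Pipeline's API.

    def preprocess() -> str:
        # Stand-in for the ETL step; would return a path/ID of the processed dataset.
        return "processed/features.parquet"

    def train(dataset: str) -> tuple[str, float]:
        # Stand-in for the training Run; would return a checkpoint path and a validation metric.
        return "output/model.joblib", 0.93

    def human_approval(checkpoint: str) -> bool:
        # Stand-in for a human-in-the-loop gate; would pause for a reviewer's sign-off.
        return True

    def deploy(checkpoint: str) -> None:
        # Stand-in for rolling the checkpoint out behind the serving endpoint.
        print(f"deploying {checkpoint}")

    def run_pipeline(min_accuracy: float = 0.9) -> None:
        dataset = preprocess()
        checkpoint, accuracy = train(dataset)
        if accuracy < min_accuracy:                  # evaluation gate: stop on regressions
            raise RuntimeError(f"accuracy {accuracy:.3f} below threshold {min_accuracy}")
        if human_approval(checkpoint):               # optional human approval before release
            deploy(checkpoint)

    # A cron job or webhook handler would simply call run_pipeline() when triggered.
    if __name__ == "__main__":
        run_pipeline()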

Going Forth

We’ve now taken a look at what MLOps is and the five key steps involved in the process. As ML continues to establish itself at the heart of technological innovation, MLOps is set to play an increasingly vital role in the success of ML projects. Likewise, services like VESSL are at the forefront of MLOps innovation, offering comprehensive solutions to some of the most crucial challenges faced by ML practitioners and enterprises. Leveraging the capabilities of MLOps will be a key differentiator for industry-leading organizations to thrive in the machine learning landscape.

Heidi Seo, Growth Intern
