MLOps: An Introduction

daisy okacha
4 min read · Oct 6, 2022


Photo by Pankaj Patel on Unsplash

The machine learning lifecycle has many stages, and each stage can involve a different set of data, ETL processes, ML models, and numerous associated parameters. Moreover, multiple teams work at every stage. This creates the need to track and timestamp every change. MLOps has grown out of this need: we track not only the code but also the data, models, parameters, model metrics, plots, and much more.

Look at it as DevOps, but for data!

We will look at the following tools:

  1. MLflow

  2. DVC

  3. CML

1. MLflow

MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

It helps you track your ML experiments, including your models, model parameters, datasets, and hyperparameters, and reproduce them when needed.

MLflow components

These components include:

  • Tracking: This allows you to track experiments to record and compare parameters and results.
  • Models: This allows you to manage and deploy models from various ML libraries to various model serving and inference platforms.
  • Projects: This allows you to package ML code in a reusable, reproducible form to share with other data scientists or transfer to production.
  • Model Registry: This allows you to centralize a model store for managing models’ entire lifecycle stage transitions: from staging to production, with capabilities for versioning and annotating.
  • Model Serving: This allows you to host MLflow Models as REST endpoints.

Advantages

  • It is easy to set up a model tracking mechanism in MLflow.
  • It offers very intuitive APIs for serving.
  • It supports every stage of the workflow: data collection, data preparation, model training, and taking the model to production.
  • It provides standardized components for each ML lifecycle stage, easing the development of ML applications.
  • It can easily integrate with the most popular tools that data scientists use.
  • You can deploy MLflow models to various existing tools, such as Amazon SageMaker, Microsoft Azure ML, and Kubernetes.
  • It helps us save the model along with the parameters and analysis.
  • MLflow models give a standard format for machine learning model packaging.

Disadvantages

Following are some of the disadvantages of MLflow:

  • You can’t easily share experiments or collaborate on them.
  • MLflow does not have a multi-user environment.
  • Role-based access is not present.
  • It lacks advanced security features.
  • Adding extra functionality to models is not automatic and requires manual work.
  • It is neither easy nor ideal for deploying models to different platforms.

2. Data Version Control (DVC)

DVC is an open-source version control system for machine learning projects, often described as Git for ML. It versions data rather than code. DVC helps you deal with large models and data files that cannot be handled with Git alone. It stores information about the different versions of your data so you can track your ML data properly and revisit your model's performance later. You can define a remote repository to push your data and models to, enabling easy collaboration across team members.
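Under the hood, the mechanism is simple: the large file goes into a content-addressed cache, and a small pointer file containing its hash is what Git actually versions. A toy, stdlib-only sketch of that idea (not the real DVC API; real `.dvc` pointer files are YAML, and DVC's cache normally lives in `.dvc/cache`):

```python
import hashlib
import json
import shutil
from pathlib import Path

def toy_dvc_add(data_path, cache_dir=".toy_cache"):
    """Copy a data file into a content-addressed cache and write a small
    pointer file (like a .dvc file) that is safe to commit to Git."""
    data = Path(data_path)
    md5 = hashlib.md5(data.read_bytes()).hexdigest()
    cached = Path(cache_dir) / md5[:2] / md5[2:]   # cache layout keyed by hash
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(data, cached)
    pointer = Path(str(data) + ".dvc")
    pointer.write_text(json.dumps({"md5": md5, "path": data.name}))
    return md5

# usage: "version" a small file; only data.csv.dvc would go into Git,
# while the bytes themselves sit in the cache (and, in real DVC, a remote)
Path("data.csv").write_text("a,b\n1,2\n")
digest = toy_dvc_add("data.csv")
```

Because the cache is keyed by content hash, checking out an old pointer file always retrieves exactly the bytes that were versioned with it.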

Here are some of the reasons to use DVC:

  • It makes ML models reproducible and makes it easy to share results among the team.
  • It helps to manage the complexity of ML pipelines so that you can train the same model repeatedly.
  • It allows teams to maintain version files for referencing ML models and their results quickly.
  • It has the full power of Git branches.
  • Team members can get confused when datasets are not labelled according to convention; DVC helps label dataset versions properly.
  • Users can work on desktops, laptops with GPUs, and cloud resources if they need more memory.
  • It aims to exclude the need for spreadsheets, tools, and ad hoc scripts to share documents for communication.
  • You use push/pull commands to move consistent bundles of ML models, data, and code into production, remote machines, or a colleague’s computer.

Advantages

  • Along with data versioning, DVC also allows model and pipeline tracking.
  • With DVC, you don’t need to rebuild previous models or data modelling techniques to achieve the same past state of results.
  • It is easy to learn.
  • You can share your models via cloud storage, making it easier for teams to perform experiments and optimize the utilization of the shared resources.
  • It becomes easy to work with tons of models and data metrics as DVC reduces the effort to differentiate which model was trained with which data version.
  • It keeps the reports from each iteration clearly distinguished.
  • DVC works with local files, so it solves the problem of file naming for multiple versions.

Disadvantages

  • DVC has a tight coupling with pipeline management, which means if the team is already using another data pipeline tool, there will be redundancy in maintaining data.
  • DVC is lightweight, which means your team might need to manually develop extra features to make it easy to use.
  • There is a risk of incorrect pipeline configuration in DVC if your team forgets to add the output file.
  • Checking for missing dependencies in DVC is quite challenging.

3. Continuous Machine Learning (CML)

CML is a GitHub action and a series of Docker containers that help you run jobs on runners (instances) on GitHub, a cloud platform of your choice, or machines in your local network. A workflow using EC2 instances triggered by GitHub would look like this:

  1. You push a commit to GitHub with a code change or a hyperparameter tweak.
  2. GitHub launches the CML action, which uses Terraform to provision a cloud instance to your specification (the runner).
  3. CML runs a Docker container from pre-prepared CML images (or your custom image).
  4. Your arbitrary commands are executed in the Docker container using the source code from your repo (this can be a DVC pipeline).
  5. Any artefacts produced by your code can be stored in DVC in the usual way, and metrics are reported automatically back to the PR.
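The steps above map onto a short GitHub Actions workflow. A hedged sketch (the `train.py` script and `metrics.txt` file are hypothetical placeholders for your own training code; `iterative/setup-cml` is CML's official setup action):

```yaml
name: CML
on: [push]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-cml@v2
      - name: Train and report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          python train.py                # your training code (hypothetical)
          cat metrics.txt >> report.md
          cml comment create report.md   # posts the metrics back to the PR/commit
```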
