MLOps @ Data Reply — PART I

Artiom Diana
Published in Data Reply
6 min read · Apr 12, 2023

Like it or not, if you are using a computer, you are most probably also using some form of Artificial Intelligence (AI). From search-engine ranking systems to spam filters, to social media content recommendation and, more recently, text generation with systems like ChatGPT, machine learning (ML) has gained significant popularity, and more organisations are investing in building and deploying ML models.

However, managing ML models can be a daunting task. Companies using AI usually deal with large datasets, and those datasets require a complex, robust, and reliable information system. Developing and maintaining a system that interacts with Artificial Intelligence raises multiple challenges:

  • Data management: Machine learning models need large volumes of high-quality data to be trained effectively. This means that organisations need to establish robust data management processes to collect, clean, and store data.
  • Feature construction: The process of constructing features for ML model development can be time-consuming and complex. In addition, with complex transformations there is significant room for deviations to creep in between the features built for ML training and those built for inference, or between code run in different environments.
  • Model selection: With so many different machine learning algorithms and techniques available, it can be difficult to understand which solution best fits the business’ needs.
  • Model training: Training machine learning models can be computationally intensive and time-consuming; managing different model versions can become challenging.
  • Model deployment: Once a machine learning model has been trained, it needs to be (re)deployed to production environments where it can be used to make predictions.
  • Model versioning & tracking: An efficient way to track the performance of models during training and validation, and to efficiently and robustly deploy different models in different environments, is important to both model development and reliable production deployment. This is particularly true as the number of use cases and development teams increases.
  • Model Orchestration & Scaling: Horizontal and vertical auto-scaling needs to match application load and, for batch workloads, model runs need to be triggered as new data arrives.
  • Model monitoring: Even after a machine learning model has been deployed, it needs to be monitored to ensure that it continues to perform well.
  • Collaboration: Machine learning projects often involve multiple teams with different areas of expertise. Team members should be able to work on distinct parts of the system without conflicting with their peers.
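One concrete way to reduce the feature-construction skew mentioned above is to define each transformation exactly once and call the same code from both the training pipeline and the serving endpoint. The sketch below illustrates the idea; the feature names and transformations are made up for illustration, not part of any feature-store product.

```python
# Sketch: define feature logic once so training and inference cannot drift apart.
# The transformations below are illustrative placeholders.

def make_features(record: dict) -> dict:
    """Single source of truth for feature construction."""
    return {
        "amount_bucket": min(int(record["amount"]) // 100, 9),
        "is_weekend": 1 if record["day_of_week"] in ("sat", "sun") else 0,
    }

# The training pipeline and the serving endpoint both call the same function,
# so the features are identical by construction:
training_row = make_features({"amount": 250, "day_of_week": "sat"})
serving_row = make_features({"amount": 250, "day_of_week": "sat"})
assert training_row == serving_row
```

If the transformation ever changes, both sides pick up the change together, which is exactly the guarantee a feature store generalises to many teams and pipelines.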

Addressing these challenges can be time-consuming and stressful. Keeping all the previous steps manual can quickly become unmanageable and costly. It is desirable instead to have an automated, reproducible, and stable pipeline that does all the steps for you.
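Such a pipeline can be pictured as a chain of small, testable steps ending in a deployment gate. The sketch below uses a toy dataset and a toy least-squares "model"; every function name here is a hypothetical placeholder, not the API of any particular orchestration framework.

```python
# Minimal sketch of an automated ML pipeline as a chain of steps.
# All step names are illustrative, not a real framework's API.

def ingest():
    # In practice: pull and validate raw data from a source system.
    return [(x, 2 * x + 1) for x in range(100)]  # toy (feature, label) pairs

def build_features(rows):
    # In practice: cleaning, encoding, aggregation.
    return [(float(x), float(y)) for x, y in rows]

def train(data):
    # Toy "model": least-squares slope and intercept on one feature.
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    return {"slope": slope, "intercept": my - slope * mx}

def evaluate(model, data):
    # Mean absolute error of the fitted line.
    return sum(abs(model["slope"] * x + model["intercept"] - y)
               for x, y in data) / len(data)

def run_pipeline(error_threshold=0.1):
    data = build_features(ingest())
    model = train(data)
    error = evaluate(model, data)
    # Deployment gate: only promote the model if it meets the quality bar.
    return model if error <= error_threshold else None

model = run_pipeline()
```

The point is not the toy model but the shape: each step has one responsibility, the whole chain runs unattended, and a failing quality gate stops a bad model from reaching production.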

This is where MLOps comes in. MLOps, short for Machine Learning Operations, refers to a set of practices, processes, and tools designed to streamline the lifecycle of ML models. It encompasses everything from building and training ML models to deploying, monitoring, and maintaining them in production. As Figure 1 shows, MLOps is all about data, software development, and operations.

Figure 1: What is MLOps? Source: https://www.phdata.io/blog/mlops-vs-devops-whats-the-difference/

MLOps provides tools and processes to help organisations manage data effectively, ensuring that models are trained on the best possible data. It can help evaluate different models and select the one that is most appropriate for the use case. It can also optimise model training, reducing the time and resources required to get accurate results in a scalable way, ensuring that the trained models perform well in production environments. Additionally, MLOps enables monitoring models and the identification of issues before they become serious problems.

Several components help us achieve all the above. An MLOps system comprises a set of components that work together to automate and optimise the end-to-end lifecycle of ML models, as described below.

Figure 2: MLOps components

Figure 2 shows how the different components of an ML system are organised. The first box gives an overview of the ML lifecycle/workflow, and the second box the components and processes of this workflow. The ML side may be more familiar: it involves preparing data and then moving on to training. As a common rule, if something could fail, it will fail, so it is better to store the model artefacts in artefact storage during training. After careful evaluation (which can also be automated to a degree), you then hand the chosen model artefacts and metadata over to the serving component, with post-processing in place. The second box also shows the tools needed at distinct stages of the ML engineering pipeline.
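Storing artefacts during training can be as simple as writing a checkpoint together with its metadata after every epoch, so a crash never loses finished work. The sketch below is a minimal illustration; the directory layout and metadata fields are assumptions, and real systems typically delegate this to an experiment tracker.

```python
# Sketch: persist a model artefact and its metadata during training.
# Paths and metadata fields are illustrative, not a standard layout.
import json
import pickle
import tempfile
from pathlib import Path

def save_checkpoint(store: Path, model, metadata: dict) -> Path:
    run_dir = store / f"run-{metadata['run_id']}-epoch-{metadata['epoch']}"
    run_dir.mkdir(parents=True, exist_ok=True)
    # The artefact itself...
    (run_dir / "model.pkl").write_bytes(pickle.dumps(model))
    # ...and the metadata needed to trace and reproduce it later.
    (run_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return run_dir

store = Path(tempfile.mkdtemp())
toy_model = {"slope": 2.0, "intercept": 1.0}
path = save_checkpoint(store, toy_model,
                       {"run_id": "abc123", "epoch": 3, "val_mae": 0.07})
```

Because the metadata sits next to the artefact, any later deployment can answer "which data, which run, which score produced this model?" without archaeology.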

Different components address different challenges in ML systems, such as:

  • Data Management: Collecting, cleaning, labelling, and storing data for training and testing ML models.
  • Feature Management: Constructing and storing features for ML models so that they can be efficiently and reliably shared between different development teams, and between model training and inference pipelines.
  • Model Training/Development: Building, testing, and validating ML models using various algorithms and techniques.
  • Model Versioning and Tracking: Trained model objects need to be stored effectively and their associated metadata tracked. This gives clear traceability and reproducibility during model deployment, and lets development teams track progress in their model development efforts.
  • Model Deployment: Deploying trained models to production environments, such as cloud platforms, edge devices, or on-premises servers.
  • Model Orchestration and Scaling: Triggering and scaling model deployments based on the arrival of new data or application load.
  • Model Monitoring: Monitoring the performance and behaviour of the deployed models to detect and prevent issues such as data drift, model decay, and security breaches. This improves the overall performance and reliability of the models, reducing downtime and minimising the impact of issues on business operations.
  • Model Governance: Ensuring compliance with regulatory requirements, ethical standards, and best practices for ML model development and deployment.
  • CI/CD: Ensuring that code, model, and infrastructure changes are continuously tested, integrated, and deployed to production environments in a seamless and automated manner, reducing the risk of human error and ensuring consistency in the deployment process.
  • Environment Separation: Separating development, testing, and production environments to ensure that changes are thoroughly tested and validated before being released into production. This kind of environment setup enables testing new hypotheses in “sandboxes” that will not affect clients’ businesses.
  • Access Control: Implementing security measures to control who has access to sensitive data, models, and infrastructure to prevent unauthorised access, protect sensitive data, and reduce the risk of data breaches or cyber-attacks.
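To make the monitoring component concrete, a very naive data-drift check compares the mean of a live feature against its training baseline. Production systems use richer statistics (population stability index, Kolmogorov–Smirnov tests, and so on), and the 3-sigma threshold below is an arbitrary choice for illustration only.

```python
# Sketch: naive drift check on one feature's mean versus the training baseline.
# Real monitoring uses richer tests; the threshold here is arbitrary.
import statistics

def drifted(baseline: list, live: list, sigmas: float = 3.0) -> bool:
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    # Standard error of the live sample's mean under the baseline distribution.
    se = stdev / (len(live) ** 0.5)
    return abs(statistics.mean(live) - mean) > sigmas * se

baseline = [float(x % 10) for x in range(1000)]       # training distribution
stable_live = [float(x % 10) for x in range(200)]     # same distribution
shifted_live = [float(x % 10) + 5.0 for x in range(200)]  # upward shift
```

A check like this runs on every batch of production inputs; a firing alert is a cue to investigate the data source or retrain, long before accuracy metrics (which need labels) can reveal the problem.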

MLOps is an emerging field that enables engineers to write reproducible, scalable, secure, and production-ready ML pipelines. New tools covering one or more MLOps components appear every year. We believe this growth will continue, as the ML use-case landscape diversifies with new developments in the AI industry and more problems become solvable via AI. The future looks exciting!

Summary

In this article we provided an introduction to Machine Learning Operations (MLOps). MLOps is a set of practices and tools for developing and deploying machine learning models efficiently, reliably, and at scale. MLOps is used to ensure that deployed models are monitored correctly, and that development and operations teams work together effectively. Some of the key components of MLOps include version control, testing and validation, reproducibility, deployment automation, monitoring and logging, and collaboration and communication. By implementing MLOps best practices, organisations can increase the speed, reliability, and accuracy of their machine learning systems, ultimately delivering greater value to their customers and stakeholders.

Now that you know what MLOps is, stay tuned for our next article (Part II), where we showcase our Accelerator project and elaborate on how we apply MLOps best practices at Data Reply to fast track bringing value for our customers.
