ML Mesh: Distributing MLOps across your Organisation

Why and What

Sergio Zavota
17 min read · Aug 8, 2023

“Here we go again, another buzzword”.

I had this thought when I first wrote the title of this blog post, and the same might have happened to you.
One likely reason is that, in the last few years, many new terms have popped up in the Data & AI space (MLOps, DataOps, AIOps, Data Mesh and more). While for some of them it’s still unclear what they mean, for others, principles and best practices have slowly emerged and become a globally acknowledged standard.

MLOps is one of the topics whose meaning we now understand reasonably well. However, after studying its concepts and successfully implementing solutions for different customers, a question often arose: “How can we scale this solution so that it can be adopted by different data science teams, belonging to different units across the organisation?”.

I couldn’t find an answer in the literature that I considered satisfactory and exhaustive. In my opinion, the current MLOps principles and processes describe how to reliably manage the lifecycle of ML models, from conceptualisation to monitoring. Still, they don’t mention how the solution should scale across the organisation.

This post aims to introduce a new approach called ML Mesh that enriches MLOps principles to address the scalability challenge from an organisational perspective.

This first post answers the following questions:

  • Why do we need to extend MLOps with other concepts, like ML Mesh?
  • What is ML Mesh, and what are its foundational principles?

Other blog posts will follow and go more in-depth over different aspects, up to a hypothetical logical architecture and, maybe, a physical one.

Finally, I assume the reader is already familiar with MLOps and its principles; hence I’m not going to describe it. For more information about MLOps, I suggest reading “Machine Learning Operations (MLOps): Overview, Definition, and Architecture”.

Can MLOps address ML scalability challenges at the organisation level?

To answer this question, let’s take the example of an MLOps solution successfully implemented in an online travel agency called “TourStay” (ChatGPT kindly provided the company name), which allows customers to book hotels worldwide through a web application backed by a data platform.

TourStay leverages ML to enrich customers’ user experience through:

  • Recommendation engines: customise offerings based on customer specifics;
  • Chatbots: leverage Generative AI to create 24/7 AI-based customer assistants;
  • UX personalisation: adapt the website user interface for multiple customer segments.

Each one of those use cases is developed by a data science team, belonging to a different organisational unit.

The teams have yet to leverage any MLOps best practices and have struggled to build, deploy and reliably maintain models in production. TourStay has hence decided to improve its ML capabilities and create an MLOps framework for the teams.

TourStay’s strategy is summed up by the motto “Think Big, Start Small, Learn Fast”. It has therefore decided to start with a single use case, image personalisation, developed by the UX personalisation team, and to subsequently scale the solution up to onboard other use cases.

The logical architecture includes different technical components derived from applying MLOps principles. It looks like the image below (an in-depth description is provided in the whitepaper mentioned before).

Figure 1: End-to-end MLOps architecture and workflow with functional components and roles. From “Machine Learning Operations (MLOps): Overview, Definition, and Architecture”.

TourStay hires an MLOps engineer who closely works with business stakeholders, data science, data engineering and web application teams to implement this architecture.
The outcome is an ML platform that allows performing experimentation, feature engineering, model training, model deployment and model monitoring.

Thanks to this platform, the UX personalisation data science team can now reliably productionalise image personalisation models!
Measurements show that the time to deploy ML models to production has been reduced by 70%. The business stakeholders are very happy with this first success and want the other data science teams to adopt the solution as well.

At this point, a question naturally arises: “How do we do that?”.
This last question brings up many more:

  • How do we onboard the other current and future data science teams?
  • Which environments are they going to use? The same environments used by the UX personalisation data science team?
  • Is this platform going to be centralised? If so, should we allocate a team of MLOps engineers to improve and maintain the platform for all the current and future data science teams?
  • Are the same components of this architecture used to productionalise all the models in the whole organisation?
  • Under whose ownership does this platform fall?
  • How can we make sure the governance of the different models is managed properly?
  • How do we ensure that data science teams will not “reinvent the wheel”?

Despite my best efforts, I couldn’t find any public resources that address those questions.
As mentioned at the beginning of the post, MLOps defines principles to productionalise ML models, and it stops there. I believe the reason is that we are still in the early stages of ML maturity. The concept of MLOps is still relatively new, and organisations have just started to adopt it.

My answer to those questions is ML Mesh.

ML Mesh: Definition & Principles

ML Mesh can be seen as an extension of MLOps with Data Mesh principles. Specifically, we are abstracting Data Mesh principles and adapting them to the ML context. It’s not a mere “Copy & Paste” operation since the scenario and the main motivations behind the two concepts are quite different.

Not all readers may be familiar with Data Mesh. For this reason, I will present the ML Mesh principles by making little to no reference to Data Mesh whenever possible. A future blog post will describe the differences, commonalities, and integration points. If you want to know about Data Mesh in detail, I suggest reading Zhamak Dehghani’s book.

ML Mesh is defined as a set of principles that aims to distribute the ownership of MLOps processes, in a scalable way, across the whole organisation.

The principles are “ML as a Product”, “Self-Serve ML Platform”, and “Distributed Ownership”, described in the following sections.

ML as a Product

Definition

The principle of ML as a Product applies product thinking to ML models to reduce friction for the users of the models.

The practice of bringing product thinking to non-product teams is not new, and the same can be applied to data science teams when creating ML models. As an example, in the whitepaper “Machine Learning Operations (MLOps): Overview, Definition, and Architecture”, the first step of the workflow is “ML product Initiation”, where the business problem is analysed before thinking of the potential solution. This is in line with product thinking principles.

Moreover, the authors of the paper already use the term “machine learning product” when defining MLOps:

“MLOps (Machine Learning Operations) is a paradigm, including aspects like best practices, sets of concepts, as well as a development culture when it comes to the end-to-end conceptualisation, implementation, monitoring, deployment, and scalability of machine learning products”.

They also state that

“to successfully develop and run ML products, there needs to be a culture shift away from model-driven machine learning toward a product-oriented discipline”.

The users of an ML product are not only the end-users of an application. They can also be found inside the organisation, like a web application team that needs to embed the model into a web application.
It’s essential to maximise the user experience in this case as well. For instance, if the model does not provide a documented and easy-to-use interface, or it doesn’t deliver the expected outcome when consumed, collaboration issues arise between the two teams, with negative effects at the organisational level (poor cross-team collaboration leads to higher time-to-market and, hence, limited ROI).

From this point on, I will refer to “data science teams” as “ML product teams”. Later in this section, I define the ideal roles and personas an ML product team should be composed of.

ML Product Attributes

Every ML product has to incorporate the following attributes oriented around the experience of its users, like ML product teams and web application teams.

Figure 2: ML Product Usability Attributes

Observable
Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.
An ML product is then observable if it exposes different indicators (external output) relative to the state of its components (internal states).
ML product teams can access those indicators to assess the health of their ML products and quickly intervene when required. A minimal sketch of what such an indicators payload could look like follows the list below.

Indicators can be:

  • training/deployment/inference cost;
  • inference response time;
  • model quality and bias drift metrics;
  • model evaluation metrics;
  • model versions;
  • business metrics;
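
To make this concrete, here is a minimal sketch of an indicators payload that an observable ML product could expose. It is an illustrative assumption, not a prescribed schema; field names and values are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict

@dataclass
class MLProductIndicators:
    """External outputs exposed through an ML product's observability interface."""
    model_version: str
    collected_at: datetime
    training_cost_usd: float               # cost indicators
    inference_response_time_ms: float      # latency of the serving endpoint
    evaluation_metrics: Dict[str, float]   # e.g. {"auc": 0.91}
    drift_metrics: Dict[str, float]        # model quality and bias drift signals
    business_metrics: Dict[str, float]     # e.g. {"booking_conversion": 0.034}

snapshot = MLProductIndicators(
    model_version="v3",
    collected_at=datetime.utcnow(),
    training_cost_usd=412.50,
    inference_response_time_ms=85.0,
    evaluation_metrics={"auc": 0.91},
    drift_metrics={"bias_drift": 0.02},
    business_metrics={"booking_conversion": 0.034},
)
```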

Secure
An ML product is secure if:

  • it is resilient to adversarial attacks, such as input perturbations or data poisoning attacks;
  • all the processes involved in creating the product are executed in an isolated, access-controlled environment;
  • all the processes produce encrypted artefacts (models, data);

Trustworthy
A trustworthy ML product satisfies the following requirements:

  • Fairness: Can you confirm that the machine learning model does not provide a systematic disadvantage to any group of people over another based on factors like gender, orientation, age or ethnicity?
  • Explainability: Can you explain why the model made a specific decision? For instance, if someone applies for a loan, the bank should be able to explain why that person was rejected or approved clearly.
  • Privacy: Are the proper rules and policies in place for various people to access the data at different stages of the AI lifecycle?
  • Robustness: Does the model behave consistently as conditions change? Is it scalable? How do you accommodate drifting data patterns?
  • Transparency: Do you have all the facts relevant to the usage of the model? Are they captured throughout different stages of the lifecycle and readily available?

Interoperable
Interoperability allows:

  1. Integrating ML products with other systems, like web applications;
  2. Combining ML products in a sequential or parallel way to solve a business problem;

While the first point is easy to understand, the second hides several issues.

For instance, let’s consider the popular “Question Answering” use case, implemented with LLMs (Large Language Models) and a RAG (Retrieval Augmented Generation) approach.
We use at least two models:

  1. The first model converts documents and prompts into embeddings to find the top k most relevant documents, given a prompt.
  2. The second model uses the top k most relevant documents as context to provide the most accurate answer.

The most relevant documents are retrieved based on similarity metrics between the prompt and document embeddings; this operation therefore depends heavily on the first model. This coupling brings challenges, like data dependencies, that must be managed correctly.
In the example above, the first model could produce unstable embeddings, with detrimental effects on the consuming system that are costly to diagnose and address.
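
To make the composition concrete, here is a minimal sketch of the two-model chain. `embed` and `generate` stand for the inference interfaces of the two ML products; they are hypothetical callables, not a specific library’s API:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_question(prompt, documents, embed, generate, k=3):
    """Compose two ML products: an embedding model (`embed`) and an LLM (`generate`)."""
    prompt_vec = embed(prompt)
    # Model 1: rank documents by embedding similarity to the prompt.
    ranked = sorted(documents,
                    key=lambda d: cosine_similarity(embed(d), prompt_vec),
                    reverse=True)
    context = "\n".join(ranked[:k])
    # Model 2: generate the answer using the top-k documents as context.
    return generate(f"Context:\n{context}\n\nQuestion: {prompt}\nAnswer:")
```

Note how the retrieval step depends entirely on the first model: swapping or retraining it silently changes the context the second model receives.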

Addressable
An ML product must provide its consumers with a permanent, standardised, and unique address. The address must follow a global convention that helps consumers access information such as documentation and SLOs, and consume the product itself.
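
As a purely illustrative example of such a convention, suppose addresses follow a hypothetical `mlp://<domain>/<product>/<version>` scheme, with documentation and endpoints hanging off the product address:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MLProductAddress:
    """Hypothetical convention: mlp://<domain>/<product>/<version>."""
    domain: str
    product: str
    version: str

    def uri(self) -> str:
        return f"mlp://{self.domain}/{self.product}/{self.version}"

    def docs(self) -> str:
        # Documentation and SLOs hang off the product address by convention.
        return f"{self.uri()}/docs"

    def endpoint(self) -> str:
        # The address consumers call to invoke the model.
        return f"{self.uri()}/invocations"

addr = MLProductAddress("ux-personalisation", "image-personalisation", "v3")
print(addr.uri())  # mlp://ux-personalisation/image-personalisation/v3
```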

ML Product as Architectural Quantum: the Bridge between ML Mesh and MLOps

From an architectural perspective, I find it useful how Zhamak defined a data product as an architectural quantum. From Evolutionary Architecture, the definition of architectural quantum is the following:

“An architectural quantum is an independently deployable component with high functional cohesion, which includes all the structural elements required for the system to function properly”.

It’s also possible to define an ML product as an architectural quantum. In this case, the structural elements are the following.

Code:

  • Experimentation (e.g., notebooks)
  • Training (e.g., code to create the model, perform preprocessing, training/tuning, evaluation)
  • Deployment (e.g., code to package and deploy the model and make it accessible through APIs)
  • Inference (e.g., code to consume the model)
  • Monitoring (e.g., code to monitor bias drift and quality)
  • Orchestration (e.g., pipelines)
  • Policies as code (implement policies as code to apply to each ML product)
  • Interfaces as code: provide access to documentation, metrics, SLO, model versions, model endpoints to be consumed, etc.
  • Infrastructure as code: enables building, deploying and monitoring the model product’s code (e.g., instances on which preprocessing/training/deployment/monitoring jobs run).

Metadata:

  • Model Lineage (e.g., steps to train the model)
  • Model Versions (e.g., train, validation and test data, hyperparameters, metrics)
  • SLO (e.g., response time for real-time endpoints, average number of requests per hour)

Artefacts:

  • Model artefacts
  • Data
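
One way to make the quantum concrete is to sketch its structural elements as a single manifest. The following dataclass is an illustrative assumption about how the elements above could be grouped, not a standard definition:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MLProductQuantum:
    """An independently deployable ML product bundling code, metadata and artefacts."""
    # Code: everything needed to build, deploy and observe the product.
    training_code: str            # repo or path of preprocessing/training/evaluation code
    deployment_code: str          # packaging and API-exposure code
    monitoring_code: str          # bias drift and quality checks
    pipelines: List[str]          # orchestration definitions
    policies_as_code: List[str]   # access and deployment rules
    infrastructure_as_code: str   # e.g. a module provisioning training/serving jobs
    # Metadata: lineage, versions and service-level objectives.
    model_lineage: Dict[str, str]
    model_versions: List[str]
    slo: Dict[str, float]         # e.g. {"p99_response_time_ms": 200}
    # Artefacts: the trained model and the data it depends on.
    model_artefact_uri: str = ""
    data_uris: List[str] = field(default_factory=list)
```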

If you look at the components defined in the MLOps logical architecture shown in Figure 1, you can notice that almost all of them are included in the structural elements of an ML architectural quantum.

An ML architectural quantum encapsulates an MLOps architecture or, in other words, MLOps principles are implemented to create an ML product.

The case of Features and Feature Engineering processes

An ML architectural quantum encapsulates all the components of an MLOps architecture except for the feature engineering component. The reason for this choice is the following.

Data and models are highly coupled in many aspects. The outcome of an ML model can’t be effectively expressed in software logic without dependency on external data. However, it’s possible to draw boundaries between the two when adopting a distributed approach.

A boundary can be established by treating features as a product separate from the ML product. The main reason is that the relationship between features and models is many-to-many: multiple models can consume the same features during training.

Luckily, the literature already provides a way to treat features as a data product; the specific data product definition I’m referring to is the one described in the Data Mesh approach. Moreover, the two seem to fit very well together (who would have thought!).
Implementing data products through a Data Mesh approach is, however, just one of several possible solutions.

Managing data and models this way provides high decoupling and re-usability, but some critical dependencies must be carefully managed. For instance:

  • The re-training process (ML product) can be triggered whenever new features are created (data product).
  • Based on the output of the model monitoring process (ML product), data quality rules may be adjusted when creating features (data product).
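
A minimal event-driven sketch of these two dependencies is shown below. The event names and the in-memory publish/subscribe mechanism are illustrative assumptions; in practice this would typically run on a message bus:

```python
from typing import Callable, Dict, List

subscribers: Dict[str, List[Callable]] = {}

def subscribe(event: str, handler: Callable) -> None:
    subscribers.setdefault(event, []).append(handler)

def publish(event: str, payload: dict) -> None:
    for handler in subscribers.get(event, []):
        handler(payload)

# ML product side: re-train when the data product publishes new features.
subscribe("features.published",
          lambda p: print(f"Re-training triggered by feature set {p['feature_set']}"))

# Data product side: adjust quality rules when model monitoring detects drift.
subscribe("model.drift_detected",
          lambda p: print(f"Reviewing data quality rules for {p['feature_set']}"))

publish("features.published", {"feature_set": "customer_segments_v2"})
```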

That’s why I will describe in-depth the relationship between data products and ML products in a future blog post, covering all the potential dependencies and how to address them.

ML Product Team Structure

ML products solve a specific business problem; hence, they should be part of the corresponding business domain. This is reflected in the structure of ML product teams.

ML product teams must be cross-functional teams with all the domain knowledge required to create an ML product that successfully delivers business value.
Given that an ML product encapsulates MLOps practices, it makes sense to include all the roles involved in creating an MLOps solution.

However, the recommended structure raises an important concern: business and technical people don’t speak the same language. This is a well-known problem and, in general, a huge pain in the a̶s̶ neck :)

The solution relies upon creating a language shared and well understood by both parties, one that must be embedded not only in day-to-day conversations but in the code itself. Eric Evans addressed this issue quite a few years ago by introducing the concept of ubiquitous language.

Self-Serve ML Platform

Definition

In the previous section, I described the concept of ML product as an architectural quantum that encapsulates an MLOps architecture.
The next step is to understand how to scale out the lifecycle management of ML products across the whole organisation.

Different challenges could arise, such as:

  • Duplicate effort when creating ML products in a decentralised way;
  • Lack of visibility of ML products in terms of cost, ownership, usability and business value;
  • Different standards can be used to access and consume ML products;

The principle of self-serve ML platform addresses those issues by defining a platform that:

  • Empowers ML product teams in creating and maintaining high-quality ML products in a fast, frictionless and autonomous way;
  • Allows business stakeholders and technical teams to have a governed and centralised view of every ML product and its details;
  • Defines a global standard to access and consume ML products;

The platform should be seen as a set of services managed by a team that provides functionalities through standard interfaces, described in the next section.

Based on the organisation’s strategy, the platform can be implemented as either:

  • A single, tightly integrated platform, typically sold by a single vendor;
  • A mix of different technologies, open source or sold by different vendors;

Both solutions have pros and cons. The second solution is more complex and challenging to maintain, due to the different technologies to be integrated, but it offers more flexibility, for instance by enabling a multi-cloud strategy. The first solution, in contrast, is simpler to manage since its components are designed to work together, but it creates vendor lock-in.

Interfaces

As a non-exhaustive set of foundational interfaces, the platform must allow creating, deploying, monitoring, deleting and describing ML products.

Create ML Product
ML product teams should be able to quickly set up an ML product by leveraging the “Create ML Product” interface. The concept of “template” comes in handy in this case.

As an example, consider the image personalisation use case. To simplify, let’s assume that this use case can be implemented with just one model that provides the best images based on customer segment data. The model needs to be invoked in real time to process millions of requests per day. Those requirements can be encapsulated in a pre-defined template called “image-personalisation” that provides a skeleton configuration covering repositories, pipelines, training code, etc. In this way, ML product teams just need to deploy a new template with one click and start working on it.
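
A hypothetical template definition for this use case might look like the following. The schema and the `create_ml_product` interface are illustrative assumptions, not a specific platform’s API:

```python
# A hypothetical template the platform team could maintain.
image_personalisation_template = {
    "name": "image-personalisation",
    "repositories": ["training", "inference"],   # skeleton repos to scaffold
    "pipeline": {
        "steps": ["preprocess", "train", "evaluate", "deploy"],
    },
    "serving": {
        "mode": "real-time",                     # millions of requests per day
        "autoscaling": {"min_instances": 2, "max_instances": 50},
    },
    "monitoring": ["bias_drift", "model_quality", "response_time"],
}

def create_ml_product(template: dict, team: str) -> str:
    """Hypothetical 'Create ML Product' interface: scaffold a product from a template."""
    print(f"Scaffolding '{template['name']}' for team '{team}'...")
    return f"mlp://{team}/{template['name']}/v1"

product_uri = create_ml_product(image_personalisation_template, "ux-personalisation")
```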

Templates are maintained by the ML platform team and used by the ML product teams accessing the platform to create a new ML product.

Templates play a critical role in speeding up the work of ML product teams; that’s why I will dive deep into explaining how the platform manages them in a future blog post.

Deploy ML Product
Deploying ML products should be as easy as creating them. The platform leverages the addressability attribute of ML products to allow data scientists to deploy them without worrying about infrastructure operations.

This operation should be restricted so that only senior and lead roles can execute it. This can be achieved by creating policies as code (part of the code structural elements of an ML product).
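
As a minimal policies-as-code sketch, assuming hypothetical role names, the restriction could be expressed as follows:

```python
from typing import Set

# Policy: only senior/lead roles may call the "Deploy ML Product" interface.
DEPLOY_ALLOWED_ROLES = {"senior_data_scientist", "lead_data_scientist", "mlops_lead"}

def can_deploy(user_roles: Set[str]) -> bool:
    return bool(user_roles & DEPLOY_ALLOWED_ROLES)

def deploy_ml_product(product_uri: str, user_roles: Set[str]) -> None:
    if not can_deploy(user_roles):
        raise PermissionError(f"Deployment of {product_uri} denied by policy")
    print(f"Deploying {product_uri}...")

deploy_ml_product("mlp://ux-personalisation/image-personalisation/v3",
                  {"lead_data_scientist"})
```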

Delete ML Product
There are specific scenarios in which an ML product is no longer useful. A typical reason is that a product is entirely replaced by another one that solves the same business problem in a better way, and it’s more practical to create a brand-new product than a new version of the old one. In this case, the platform needs to provide ML product teams with the ability to delete products.
Like the “Deploy ML Product” interface, this one should also be restricted.

Describe ML Product
The platform leverages the observability attribute of ML products to allow business stakeholders and technical teams to check and validate different metrics. This practice improves overall maintainability and model understanding, and helps standardise business stakeholders’ decision-making processes.
A great way to organise metrics and other indicators for each ML product is through Model Cards.
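
As an illustration, a model card returned by the “Describe ML Product” interface could look like the payload below. The fields follow the spirit of Model Cards but are illustrative assumptions:

```python
# A minimal model-card payload the "Describe ML Product" interface could return.
model_card = {
    "model": "image-personalisation",
    "version": "v3",
    "owner_team": "UX Personalisation",
    "intended_use": "Select the best images per customer segment on the booking page",
    "evaluation": {"ndcg@10": 0.62},                  # model evaluation metrics
    "fairness": {"segment_parity_gap": 0.03},         # trustworthiness indicators
    "monitoring": {"bias_drift": 0.02, "p99_response_time_ms": 180},
    "business": {"booking_conversion": 0.034},
}
```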

Note: all the platform interfaces, except for “Create ML Product”, access the interfaces the ML products expose.

Distributed Ownership

Definition

Defining ownership boundaries between the ML platform and the ML products is essential.

ML product teams should be owners of the end-to-end lifecycle of a product and be able to work independently in delivering it. This means that every ML Product has to be isolated from the others. For instance, the deployment of one product should not disrupt other products.

If multiple ML products are composed to create a higher-order product, then the same team should own all of them. Those ML products are tightly coupled, and an update to one could drastically change the output of the others.

The ML platform team should work independently to create new capabilities that ML product teams will use. In other words, the platform doesn’t manage the lifecycle of any products; they are entirely independent.

By distributing the ownership of products and platform in this way, dependencies between teams are reduced and, consequently, time-to-market decreases as well.

ML Product and Platform Relationship

The relationship between ML products and the platform can be described from a team perspective and from an architectural perspective. A good way to represent it is through a hub-and-spoke model. This model is generally considered centralised; however, when applied to ML Mesh, the centralised part only concerns the management of templates, interfaces and toolsets, as described in the following sections.

Figure 3: Hub&Spoke ML Mesh architecture

Team Perspective
The hub is the platform team, while the spokes are the product teams.

The hub:

  • Creates and shares templates;
  • Exposes interfaces to create and interact with ML products;
  • Provides toolsets to speed up the creation of ML products and avoid duplicate effort;

The spokes interact with the hub by leveraging the interfaces described in the previous section.

Architectural Perspective
The only connection point between the platform and products is the interfaces. This is why a robust standard is needed when defining them. ML products expose interfaces used by the platform (interfaces as code defined previously as structural elements of the ML product architectural quantum).

In this case, the hub is the ML platform, while the spokes are the several ML products. The spokes communicate with the hub through the interfaces.
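
To illustrate the idea of a single, standardised connection point, here is a sketch of a contract every ML product could expose to the platform. Method names are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from typing import Dict

class MLProductInterface(ABC):
    """Hypothetical contract every ML product exposes to the platform.

    The platform only talks to products through this interface,
    never to their internals.
    """

    @abstractmethod
    def describe(self) -> Dict:
        """Return documentation links, SLOs, versions and indicators."""

    @abstractmethod
    def deploy(self, version: str) -> str:
        """Deploy a model version and return the serving endpoint address."""

    @abstractmethod
    def delete(self) -> None:
        """Decommission the product and release its resources."""
```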

The following picture shows the ML Mesh logical architecture that considers teams, the ML platform and the ML products.
The platform exposes different interfaces ML product teams use to create and manage ML products. The ML products also expose interfaces used by the ML platform.

Note that “ML Product Team A” owns “ML Product 1” and “ML Product 2”. The two are connected to compose a higher-level ML product.

Figure 4: ML Mesh Logical Architecture

Distributed Ownership and Organizational Structure

It’s worth spending a few words on how implementing distributed ownership relates to the company’s organisational structure.
For instance, a decentralised approach like this is not compatible with a purely functional organisational structure. Such organisations are extremely rare, though, so this is not much of a problem.

It would not make sense to point to one specific structure as perfectly compatible with ML Mesh, but a minimum requirement is to have business unit functions that can be mapped to ML product teams.

Conclusions

This post introduced a new concept, called ML Mesh, that aims to distribute MLOps practices across an organisation in a scalable way.

We started by introducing the challenge: currently, no practices explain how to scale MLOps to be adopted by many different data science teams in an organisation.
The proposed solution to this problem is ML Mesh. The post introduced ML Mesh and its principles, “ML as a Product”, “Self-Serve ML Platform”, and “Distributed Ownership”.

As mentioned at the beginning of the post, this is just the first of a series of posts. Many aspects need to be described in more detail, starting from the principles themselves. My goal is to provide an additional perspective to the current state of MLOps and to receive feedback from people who have already scaled (or are still in the process of scaling) the MLOps solution they have implemented.

Indeed, practices to scale up ML solutions are already in place. For instance, Uber built Michelangelo to scale its ML processes across the organisation. If you look closely, that solution, in its form, applies the principles that ML Mesh describes (yes, expect another article mapping ML Mesh principles to Uber’s ML solution).

What’s missing is putting those practices on paper, and ML Mesh might fill this gap.
