MLOps Operating Models: finding the right fit

Eduardo Ordax
Published in Marvelous MLOps
9 min read · Feb 4, 2024

As enterprise businesses embrace machine learning (ML) across their organizations, manual workflows for building, training, and deploying ML models tend to become bottlenecks to innovation. To overcome this, enterprises need to shape a clear operating model defining how multiple personas, such as data scientists, data engineers, ML engineers, IT, and business stakeholders, should collaborate and interact; how to separate concerns, responsibilities, and skills; and, finally, how to use the technology optimally.

Image generated by the author using DALL-E

Indeed, there is no universally applicable operating model for Machine Learning (ML). The ideal ML operating model varies depending on factors such as the client’s MLOps maturity, the scale of the business, and industry-specific considerations like the geographical distribution of data sources or teams. It’s important to recognize that this operating model is not a fixed entity; instead, it evolves over time, adapting to the changing demands and capabilities required for scaling ML effectively.

Often, ML projects begin modestly, with a few individuals or small teams led by local business units or departments. However, as the project’s scope expands, a shift towards a more centralized model with greater oversight and IT involvement becomes necessary. Eventually, when the focus shifts towards scalability and industrialization, the organization may find it imperative to establish a hybrid approach. This combines a centralized factory model that places strong emphasis on data and solution governance with the integration of data science teams into the various business units.

Drawing from my experience working with customers globally, optimizing the organization of your teams is a pivotal step in achieving substantial business outcomes. It profoundly influences the synergy between business and data teams, fostering effective communication, and facilitating the seamless exchange of information among diverse data teams within the organization.

In broad strokes, there are three primary approaches to structuring your data team, plus some combinations or hybrid approaches. As a reference, one piece of advice that usually works well when establishing the right operating model is to base the choice on the number of models/use cases in production:

Decentralised Operating Model

Decentralised Operating Model (Diagram created by the author)

In a decentralised model, resources are distributed across various silos, with each unit operating independently and with little visibility into ML activities beyond its own scope. Despite having numerous individuals with expertise, they may not be strategically aligned or working collaboratively towards common goals.

Pros of a Decentralised Operating Model

  • Enhanced Resource Oversight: Business functions have improved control and visibility over their own resources, ensuring more efficient allocation and management.
  • Proximity of Business Skills to Teams: Business skills sit closer to the data and ML teams. This proximity fosters a deeper understanding of the business context within the teams, facilitating better alignment of ML initiatives with strategic objectives.
  • Clarity in Business Understanding: There is greater clarity regarding how ML solutions align with the business’s needs and goals. This clarity enhances the relevance and impact of the solutions.
  • Business-Led Solutions: ML solutions are perceived as being “business-led,” which simplifies the adoption process and promotes a sense of ownership and commitment among business stakeholders.

Cons of a Decentralised Operating Model

  • Isolated Individuals: ML and analytics practitioners are geographically scattered or organized in a decentralized manner, making collaboration and coordination challenging.
  • Limited Visibility of Analytics Activities: There is a lack of transparency and visibility into ongoing analytics activities, making it difficult to assess progress and outcomes.
  • Challenging Governance and Standards Management: Establishing and maintaining governance policies and standards for analytics initiatives is a complex task due to the dispersed nature of the organisation.
  • Limited Knowledge and Asset Sharing: Knowledge and assets related to analytics are not effectively shared across the organization, which hinders efficiency and innovation.

Functional Operating Model

Functional Operating Model (Diagram created by the author)

In a functional organizational model, data science teams are often distributed across various business functions, divisions, or product lines within the organization.

Pros of a Functional Operating Model

  • Resource Alignment with Demand: Resources are strategically placed in alignment with the immediate needs of various business functions or product lines within the enterprise.
  • Specialization: Team members can specialize in their respective functional areas, allowing for in-depth expertise tailored to specific business needs.

Cons of a Functional Operating Model

  • Coordination Challenges: Coordinating efforts and knowledge sharing across dispersed teams can be complex, potentially leading to inefficiencies.
  • Dispersed Team: The data science team is spread across different parts of the organization, making centralized management and collaboration more difficult.
  • Lack of Common Framework: The absence of a common technical and functional framework can result in inconsistencies in approaches, tools, and methodologies.
  • Functional Dominance: The busiest or most resource-intensive business unit may dominate the functional team’s priorities, potentially neglecting the needs of other units.
  • Scaling Challenges: Scaling data science initiatives across the entire enterprise can be challenging due to the decentralized nature of the functional teams.

Centralised Operating Model

Centralised Operating Model (Diagram created by the author)

In a centralized organizational model, all teams, resources, tools, and data are consolidated in a single location, and access to these assets is restricted to the teams that form part of the centralized unit.

Pros of a Centralised Operating Model

  • Effective Governance and Standards: Centralized control makes it easier to establish and enforce governance policies and standards consistently across the organization.
  • Resource Centralization: All resources, including teams, tools, and data, are concentrated in a single location, enhancing management and oversight.
  • Strategic Management: Central management provides a clearer strategic direction for the team, ensuring alignment with the organization’s mission and goals.
  • Facilitated Collaboration and Asset Sharing: Collaboration and sharing of assets among team members are more straightforward, fostering efficiency and knowledge sharing.

Cons of a Centralised Operating Model

  • Limited Co-Creation Opportunities: The centralized nature may limit opportunities for data science teams to co-create solutions in close collaboration with business stakeholders.
  • Distance from Business Processes: Teams may be somewhat removed from certain business processes and may not have an in-depth understanding of them.
  • Challenging Solution Adoption: The centralized model might make it more challenging to adopt solutions since the team may be somewhat distant from the day-to-day operations and needs of the business units.

Federated Operating Model

Federated Operating Model (Diagram created by the author)

In a federated operating model, certain shared services functions, such as code repositories, CI/CD pipelines, and the ML platform, are managed by a centralized team, while individual business units are overseen by decentralized teams. This approach aims to combine the strengths of both centralized and decentralized models.
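To make that split of responsibilities concrete, here is a minimal Python sketch of a federated setup, assuming the shared services listed above are owned by a central platform team while business units own their own use cases. The team names, use cases, and ownership lists are illustrative assumptions, not a reference implementation.

```python
from dataclasses import dataclass, field

# Shared services owned by the central platform team (as described above).
SHARED_SERVICES = ["code repositories", "CI/CD pipelines", "ML platform"]


@dataclass
class BusinessUnitTeam:
    """A decentralized team building models on top of the shared services."""
    name: str
    use_cases: list[str] = field(default_factory=list)
    owns: list[str] = field(
        default_factory=lambda: ["feature logic", "model code", "business validation"]
    )


@dataclass
class FederatedOperatingModel:
    """Central team runs shared services; business units own their use cases."""
    shared_services: list[str]
    business_units: list[BusinessUnitTeam]

    def responsibilities(self) -> dict[str, list[str]]:
        """Return who owns what across the organization."""
        split = {"central platform team": self.shared_services}
        for bu in self.business_units:
            split[f"{bu.name} (business unit)"] = bu.owns
        return split


# Hypothetical example: two business units sharing one central platform.
model = FederatedOperatingModel(
    shared_services=SHARED_SERVICES,
    business_units=[
        BusinessUnitTeam("retail", use_cases=["churn prediction"]),
        BusinessUnitTeam("logistics", use_cases=["demand forecasting"]),
    ],
)
print(model.responsibilities())
```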

Pros of a Federated Operating Model

  • Contextual Understanding: Similar to the decentralized model, teams in the federated approach maintain a deep understanding of their respective business partners’ needs, ensuring relevance and alignment.
  • Skill Development and Transferability: When all teams use the same solutions, it becomes easier for data engineers or data scientists to transfer between teams, as they don’t have to learn new tools but can instead focus on understanding the specific business context.
  • Economies of Scale and Avoidance of Duplicative Work: Centralizing expertise and resources allows the organization to benefit from economies of scale and prevent redundant efforts across different units.

Cons of a Federated Operating Model

  • Potential Bureaucracy: There is a risk that the federated approach might introduce bureaucratic processes, potentially slowing down responsiveness to the unique needs and challenges of individual business units. This could impede innovation.
  • Innovation May Be Affected: The extent of innovation depends on the priorities and incentives of the core data infrastructure team. If this team is not motivated to continually enhance the infrastructure, it may not meet the evolving needs of the business units.
  • Possible Boredom for Engineers: In some cases, data engineers or team members may find the federated model less stimulating, particularly if their work primarily involves maintaining existing infrastructure rather than innovating or solving novel challenges.

Federated + CoE Operating Model

Center of Excellence Operating Model (Diagram created by the author)

In a Center of Excellence (CoE) model, operations and activities are centrally coordinated and tracked from a single CoE, even though the majority of the team is distributed across different areas of the organization.

Pros of a Federated + CoE Operating Model

  • Resource Coordination with Local Embedding: Resources are coordinated by the central location but are embedded within the business units. This fosters closer relationships between the CoE and the various business functions.
  • Alignment with Strategic Goals: The CoE model facilitates better alignment with the enterprise’s strategic goals, particularly those related to AI and analytics. It enhances the organization’s ability to strategically leverage these technologies.
  • Enterprise-Wide Initiatives: The centralized coordination encourages the creation of more initiatives that span across the entire enterprise, promoting collaboration and knowledge sharing.

Cons of a Federated + CoE Operating Model

  • Governance Challenges: Ensuring proper governance of resources and tracking the development needs of individuals requires careful maintenance to avoid potential inefficiencies or misalignment.
  • Risk of Becoming a Resource Pool: There is a risk that the CoE could transform into a resource pool primarily focused on providing talent rather than actively driving strategic growth initiatives.
  • Overemphasis on Technical Excellence: There is a danger that the CoE may become too narrowly focused on technical excellence, potentially neglecting the overarching business priorities and goals.

Factory Operating Model

Factory Operating Model (Diagram created by the author)

In a Factory model, teams are structured to prioritize the industrialization of AI and data science solutions, encompassing the entire lifecycle. This model can be organized centrally or dispersed based on functional capabilities, such as data management, experimentation, testing, or industrialization.
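As a rough illustration of that structure, the sketch below groups hypothetical squads by the lifecycle capabilities mentioned above and routes a use case through the same standardized stages. The squad names, activities, and stage order are assumptions, not a prescribed setup.

```python
# Illustrative Factory-model sketch: squads grouped by the lifecycle
# capabilities mentioned above. Names and activities are hypothetical.
FACTORY_SQUADS = {
    "data management": ["ingest and catalogue data", "enforce data quality"],
    "experimentation": ["feature engineering", "model training and tuning"],
    "testing": ["offline evaluation", "robustness and bias checks"],
    "industrialization": ["packaging and deployment", "monitoring and retraining"],
}


def route_use_case(use_case: str) -> list[str]:
    """Every use case flows through the same standardized stages,
    which is what makes reuse and scaling easier in a Factory model."""
    return [f"{use_case} -> {squad}: {', '.join(tasks)}"
            for squad, tasks in FACTORY_SQUADS.items()]


for step in route_use_case("demand forecasting"):
    print(step)
```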

Pros of a Factory Operating Model

  • Facilitates Scalability: The Factory model is well-suited for deploying AI and data science solutions at scale, ensuring efficient operations as the organization grows.
  • Enhanced Synergies and Reusability: It makes it easier to identify synergies across solutions, leading to more effective asset reuse and knowledge sharing.
  • Specialized Resource Allocation: Resources are assigned to squads based on their technical specialties, optimizing skill utilization and project efficiency.
  • Integrated Business Resources: Business resources, such as product owners, can be seamlessly integrated when needed, ensuring alignment with business objectives.

Cons of a Factory Operating Model

  • Greater Organizational Overheads: Implementing the Factory model requires additional organizational and governance overhead, which can be resource-intensive.
  • Coordination and Buy-In: Achieving success with this model demands coordination and buy-in across multiple business units, involving both business and IT stakeholders.
  • Need for Experienced Resources: Starting and running the factory effectively may initially require more experienced and highly skilled resources, potentially posing challenges in resource availability.

Conclusions

In summary, discovering an effective MLOps operating model is pivotal, ensuring that data teams don’t feel overwhelmed or misaligned while delivering quick results to the business. Finding the right structure for your data organization involves crucial factors such as business alignment, speed, collaboration, and maintenance costs. These considerations are vital when establishing your organizational framework for a successful MLOps adoption.

It is also crucial to emphasize that operating models should be dynamic and adaptable over time. In the initial stages, when the number of use cases is limited, decentralized operating models often deliver efficiency and rapid results. However, as the organization grows and the demand for scaling ML increases, federated approaches tend to become more effective. This transition can be a critical factor in determining the success of companies implementing machine learning across their organization. Flexibility in adjusting the operating model to evolving needs is paramount to achieving long-term success with machine learning.
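As a rough illustration of that evolution, here is a minimal Python sketch that maps the number of use cases in production to a candidate operating model. The thresholds are hypothetical placeholders and should be calibrated against the factors discussed earlier (maturity, scale, industry).

```python
def suggest_operating_model(use_cases_in_production: int) -> str:
    """Rough heuristic following the guidance above: start decentralized,
    move towards federated or factory setups as the ML footprint grows.
    The thresholds are hypothetical and must be calibrated per organization."""
    if use_cases_in_production < 5:    # hypothetical threshold
        return "decentralised or functional"
    if use_cases_in_production < 20:   # hypothetical threshold
        return "centralised or federated"
    return "federated + CoE or factory"


for n in (2, 12, 50):
    print(f"{n} use cases in production -> {suggest_operating_model(n)}")
```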


I work as the AI/ML Go to Market EMEA Lead at AWS, where I assist customers around the world in harnessing the full potential of Artificial Intelligence.