Implementing Machine Learning Initiatives as Pods

VIKRANT SINGH
4 min read · Aug 29, 2023


Machine learning has been established for many years now, and over that span numerous organizations have built expertise in the field. Many are now striving to set up Centers of Excellence (COE) for data science. In this piece, we will delve into the process of setting up an AI COE within an organization. The insights shared are personal perspectives, drawn from my experience collaborating with diverse enterprises.

The AI COE comprises three principal components: data, an MLOps platform, and ML teams. Among these, data is the pivotal factor for executing any machine learning project. In a typical organization, data is distributed across various databases: Enterprise Resource Planning (ERP) data sits in an Oracle database, sales data resides in an MS SQL database, and other data might be housed in separate cloud systems apart from the primary data warehousing solution. Setting up a data warehouse is therefore the first and most critical step in the COE journey. Centralizing data from these diverse systems into a single database facilitates seamless experimentation with ML use cases. Data must be ingested continually from the disparate sources to ensure access to the most up-to-date information.
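As a minimal sketch of that ingestion step, the snippet below appends a nightly CSV extract, staged in Cloud Storage, to a warehouse table using the google-cloud-bigquery client. The project, bucket, and table names are illustrative placeholders, not real resources.

```python
from google.cloud import bigquery

# All names below are hypothetical placeholders.
client = bigquery.Client(project="abc-analytics")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the schema from the file
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Append a nightly ERP extract, staged in Cloud Storage, to the warehouse.
load_job = client.load_table_from_uri(
    "gs://abc-staging/erp/orders_extract.csv",
    "abc-analytics.warehouse.erp_orders",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
```

In practice a scheduler (Cloud Composer, cron, or a DevOps pipeline) would run a job like this per source system.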

The subsequent stride entails establishing a unified MLOps platform accessible to all ML teams within the organization. This platform is run by a team primarily composed of MLOps professionals who advise data scientists on selecting appropriate tools and frameworks across the entire data science lifecycle. Ensuring strict data isolation within this MLOps platform is imperative. The MLOps team manages compute resources, and any use case slated for development is onboarded via this team. The example later in this piece will make this workflow concrete.

The third and final facet involves ML pods. Rather than operating as a single ML team, a more effective approach involves creating separate pods. Each pod concentrates on distinct use cases, thereby avoiding overlap. Initially, a pod comprises a data engineer, a data scientist, and a product owner. This team’s objective revolves around identifying opportunities to implement ML models — not merely for the sake of implementation, but to address pain points or bolster the organization’s revenue.

The journey of an ML model, progressing from prototype to production, involves a sequence of execution steps:

  1. Identifying the use case
  2. Data exploration
  3. Prototyping/Proof of Concept (POC)
  4. Managing expectations
  5. Quantifying impact in terms of value or human hours
  6. Operationalizing the use case at production scale
  7. Running and providing support

Rather than a different team owning each of these steps, a single pod manages all of them for a given use case. The pod's membership, however, may evolve over time depending on the current step.

To illustrate this process, consider an example using Google Cloud Platform (GCP) services. While this example may seem to involve manual steps, it’s worth noting that DevOps tools could automate many of these processes. A basic familiarity with GCP services is assumed for comprehension.

Imagine a company named ABC, primarily operating in the food delivery sector. All of their data is ingested into BigQuery, a data warehouse, primed for exploration. Let’s introduce an ML pod, named ‘Alpha Pod,’ within this context. Alpha Pod’s objective is to predict delivery times for orders placed in the system. The pod consists of three members: a data engineer, a data scientist, and a product owner with an in-depth understanding of the data.

Upon identifying the problem and creating the pod, the team approaches the MLOps team for onboarding. The MLOps team sets up JupyterLab within the Vertex AI Workbench, restricting access to Alpha Pod members. The data engineer ensures necessary data access for the use case is assigned to the JupyterLab-associated service account. With access established, the data scientist embarks on data exploration, with the option to consult the product owner for a deeper grasp of the data.
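To make the access step concrete, here is a hedged sketch of how the data engineer might grant the Workbench service account read access to a BigQuery dataset. The project, dataset, and service-account names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="abc-analytics")  # hypothetical project
dataset = client.get_dataset("abc-analytics.deliveries")  # hypothetical dataset

# BigQuery dataset ACLs list service accounts under the userByEmail entity type.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="alpha-pod-wb@abc-analytics.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```

Scoping access at the dataset level keeps the pod isolated from data belonging to other pods, which is the isolation property mentioned earlier.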

Transitioning to the prototyping phase, the data scientist experiments with diverse models and creates a prototype. This uses the full production data rather than a subset; ‘prototype’ here refers to manual training within the notebook. Once the model is refined, the focus shifts to ‘setting expectations.’ The data scientist consults extensively with the product owner to agree on metric expectations. For instance, Root Mean Square Error (RMSE) might serve as the performance metric, and the data scientist communicates the minimum RMSE the model is expected to achieve, based on current performance.
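A minimal prototyping sketch follows, assuming a hypothetical orders table with a few numeric features: pull the data from BigQuery into a DataFrame, fit a simple regressor, and report the holdout RMSE in minutes.

```python
from google.cloud import bigquery
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

client = bigquery.Client(project="abc-analytics")
df = client.query(
    "SELECT distance_km, prep_time_min, hour_of_day, delivery_minutes "
    "FROM `abc-analytics.warehouse.orders`"  # hypothetical table and columns
).to_dataframe()

X = df.drop(columns=["delivery_minutes"])
y = df["delivery_minutes"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"Holdout RMSE: {rmse:.1f} minutes")
```

The RMSE printed here is exactly the number the data scientist would anchor the expectation-setting conversation around.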

The subsequent ‘Quantifying Impact’ stage involves computing the monetary impact attributable to the model. This estimate reflects the anticipated value generated over a specific timeframe, be it a week, quarter, or year.
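As a hedged illustration of such an estimate, consider the arithmetic below; every figure is a made-up assumption for the ABC example, not real data.

```python
# A back-of-the-envelope estimate; every figure here is an assumption.
orders_per_week = 50_000       # weekly order volume
late_refund_rate = 0.03        # share of orders refunded over missed ETAs
avg_refund_cost = 4.50         # dollars lost per refunded order
expected_reduction = 0.30      # fraction of those refunds better ETAs avoid

weekly_saving = (
    orders_per_week * late_refund_rate * avg_refund_cost * expected_reduction
)
print(f"Estimated weekly saving: ${weekly_saving:,.0f}")  # -> $2,025
```

Even a rough number like this gives the product owner something concrete to weigh against the cost of operationalizing the model.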

With all prior stages validated and the use case approved for production, attention shifts to ‘operationalization.’ This pivotal phase involves deploying the model in production and requires additional team members, predominantly MLOps professionals and visualization/web engineers. The MLOps engineers transform the prototype code into a production-ready format and deploy the model on Vertex AI with the requisite compute resources, setting up automated training, continuous integration, deployment, and validation. Once the model is operational, the final phase, ‘run and support,’ begins.
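A condensed sketch of the deployment step using the Vertex AI SDK is shown below. The artifact path and feature values are placeholders, and a real setup would drive this from a CI/CD pipeline rather than run it by hand.

```python
from google.cloud import aiplatform

aiplatform.init(project="abc-analytics", location="us-central1")

# Register the exported model artifact and serve it behind an endpoint.
model = aiplatform.Model.upload(
    display_name="alpha-pod-delivery-eta",
    artifact_uri="gs://abc-models/delivery-eta/v1",  # hypothetical path
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)
endpoint = model.deploy(machine_type="n1-standard-4")

# Online prediction for one order: distance_km, prep_time_min, hour_of_day.
print(endpoint.predict(instances=[[3.2, 14.0, 19.0]]).predictions)
```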

In this closing phase, a run-support team comprising an MLOps engineer and a visualization expert assumes responsibility for managing multiple use cases in production. Sustaining them requires minimal effort because the production pipelines retrain models automatically when drift is detected.
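One way such a drift trigger might look is sketched below: compute a Population Stability Index (PSI) on a key feature and submit a prebuilt Vertex AI training pipeline when drift exceeds a rule-of-thumb threshold. The data arrays, threshold, and pipeline template path are all assumptions for illustration.

```python
import numpy as np
from google.cloud import aiplatform

def psi(expected, actual, bins=10):
    """Population Stability Index between training and live feature values."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Stand-ins for the training distribution and recent production traffic.
train_distances = np.random.default_rng(0).gamma(2.0, 2.0, 10_000)
live_distances = np.random.default_rng(1).gamma(2.5, 2.0, 2_000)

if psi(train_distances, live_distances) > 0.2:  # common rule-of-thumb cutoff
    aiplatform.init(project="abc-analytics", location="us-central1")
    aiplatform.PipelineJob(
        display_name="alpha-pod-retrain",
        template_path="gs://abc-pipelines/delivery_eta.json",  # hypothetical
    ).submit()
```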

This approach allows the simultaneous operation of multiple pods, optimizing resource utilization as per varying demands.

Thank you for reading. Stay connected by following my Medium profile or subscribing to my blog for updates. For further engagement, connect with me on LinkedIn at https://www.linkedin.com/in/vkrntkmrsngh/.
