Speeding up deployment with shared data science environments

Vechtomova Maria
Published in Marvelous MLOps
Sep 28, 2023

The goal of MLOps is to enable data science teams to deploy machine learning models with as little overhead as possible and to reduce time to production. A shared data science environment can come in handy to achieve this goal.

What is a shared data science environment?

A data science environment is the set of infrastructure needed to deploy machine learning models. In this article, we focus on the MLOps components.

So in the case of a shared environment, we are talking about:

  • a set of repositories that follow a certain naming convention
  • shared CI/CD
  • shared orchestration system
  • shared model registry
  • shared container registry
  • shared compute/serving

This is not just one environment, but a set of environments (see our article about deployment strategies) with access to production data.

Why do you need a shared data science environment?

In some scenarios, having a shared environment for data scientists might be beneficial: for example, when you have teams building machine learning models for a specific business domain, and the data is shared across the models within that domain.

For retailers, typical business domains would be demand forecasting, pricing, quality & control, loyalty & personalization.

If a team is busy with loyalty & personalization, they may work on:

  • personalized offers (personalized discounts for products the customer is likely to buy to expand the range of purchased products)
  • cross-sell (offering other products customers may like based on the products in the basket)
  • top products in the assortment based on the customer’s purchase history.

Example of a shared environment

Let’s see how deployment in a shared environment can be achieved with our tool stack (GitHub, GitHub Actions, Databricks, MLflow, AKS, ACR) for the Loyalty & Personalization (LP) team.

To train a machine learning model on Azure Databricks and serve it on AKS, the Loyalty & Personalization team needs to have 2 Service Principals (one for the production and one for the preproduction environment) with the permissions described below. A Service Principal (SPN) is an application within Azure Active Directory that is authorized to access resources in Azure.
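
To make this concrete, here is a minimal sketch (not the article's actual code) of how a CI/CD job could authenticate as such an SPN and call the Databricks REST API using the azure-identity package; the environment variable names are assumptions:

```python
import os

import requests
from azure.identity import ClientSecretCredential

# SPN credentials, typically injected from GitHub organization secrets (assumed names).
credential = ClientSecretCredential(
    tenant_id=os.environ["AZURE_TENANT_ID"],
    client_id=os.environ["AZURE_CLIENT_ID"],
    client_secret=os.environ["AZURE_CLIENT_SECRET"],
)

# 2ff814a6-... is the application ID of the Azure Databricks resource in Azure AD.
token = credential.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token

# Call the Databricks REST API on behalf of the SPN, e.g. to list jobs.
workspace_url = os.environ["DATABRICKS_HOST"]  # e.g. https://adb-<id>.azuredatabricks.net
response = requests.get(
    f"{workspace_url}/api/2.1/jobs/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()
```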

Handling the secrets

Since environments and service principals are shared, we do not want to store credentials in each separate repository: if the credentials expire or change, we would have to update them everywhere.

We propose using GitHub organization secrets in the following way:

  • All the repositories that belong to the Loyalty & Personalization team follow the naming convention (they start with lp-).
  • The required GitHub organization secrets, which start with LP_<environment>, are applied to the repositories that start with lp- (see the sketch after this list).
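
A rough sketch of how such a scoped organization secret can be created through the GitHub REST API (the organization name, token, and secret name are assumptions; the encryption step follows GitHub's documented libsodium sealed-box approach):

```python
from base64 import b64encode

import requests
from nacl import encoding, public

ORG = "my-org"            # assumption: your GitHub organization
TOKEN = "<github-token>"  # assumption: token or GitHub App installation token with admin:org scope
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}
API = "https://api.github.com"


def encrypt_secret(org_public_key: str, value: str) -> str:
    """Encrypt a secret value with the org public key using a libsodium sealed box."""
    key = public.PublicKey(org_public_key.encode("utf-8"), encoding.Base64Encoder())
    return b64encode(public.SealedBox(key).encrypt(value.encode("utf-8"))).decode("utf-8")


# 1. Fetch the organization's public key used for secret encryption.
key_info = requests.get(f"{API}/orgs/{ORG}/actions/secrets/public-key", headers=HEADERS).json()

# 2. Collect the IDs of all repositories that follow the lp- naming convention.
repos = requests.get(f"{API}/orgs/{ORG}/repos", headers=HEADERS, params={"per_page": 100}).json()
lp_repo_ids = [repo["id"] for repo in repos if repo["name"].startswith("lp-")]

# 3. Create or update the organization secret, visible only to the selected repositories.
requests.put(
    f"{API}/orgs/{ORG}/actions/secrets/LP_PRD_AZURE_CREDENTIALS",
    headers=HEADERS,
    json={
        "encrypted_value": encrypt_secret(key_info["key"], "<spn-client-secret>"),
        "key_id": key_info["key_id"],
        "visibility": "selected",
        "selected_repository_ids": lp_repo_ids,
    },
).raise_for_status()
```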

However, there are 2 problems with this setup:

  • Azure credentials expire. How do we update them automatically?
  • What if we want to create a new repository? How do we update the scope of the organization secret?

GitHub Actions workflows to the rescue

  1. Updating Azure credentials

To update the Azure credentials organization secret, we need a parent SPN (that has permission to manage secrets for all SPNs belonging to various data science teams) and a GitHub app.

We have a separate SPN management repository where the parent SPN is authenticated via OpenID Connect. In that repository, a workflow is triggered monthly to check whether any of the child SPN secrets will expire within a month. If so, a new secret is created, the GitHub organization secrets are updated using a GitHub App with the necessary permissions, and the old Azure secrets are deleted.
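
A sketch of the expiry check, assuming the parent SPN has already obtained a Microsoft Graph token with permission to manage the child applications (the object ID, token variable, and display name are illustrative):

```python
import os
from datetime import datetime, timedelta, timezone

import requests

GRAPH = "https://graph.microsoft.com/v1.0"
HEADERS = {"Authorization": f"Bearer {os.environ['GRAPH_TOKEN']}"}  # parent SPN's Graph token
CHILD_APP_OBJECT_ID = "<object-id-of-an-lp-child-spn>"              # assumption

app = requests.get(f"{GRAPH}/applications/{CHILD_APP_OBJECT_ID}", headers=HEADERS).json()
soon = datetime.now(timezone.utc) + timedelta(days=30)

for cred in app.get("passwordCredentials", []):
    # endDateTime looks like "2024-05-01T12:00:00Z"; keep only the date-time prefix.
    expires = datetime.fromisoformat(cred["endDateTime"][:19]).replace(tzinfo=timezone.utc)
    if expires < soon:
        # Create a replacement secret on the child SPN...
        new_secret = requests.post(
            f"{GRAPH}/applications/{CHILD_APP_OBJECT_ID}/addPassword",
            headers=HEADERS,
            json={"passwordCredential": {"displayName": "rotated-by-workflow"}},
        ).json()["secretText"]
        # ...push new_secret to the GitHub organization secret (see the earlier snippet)...
        # ...and delete the expiring credential.
        requests.post(
            f"{GRAPH}/applications/{CHILD_APP_OBJECT_ID}/removePassword",
            headers=HEADERS,
            json={"keyId": cred["keyId"]},
        )
```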

  2. Updating the organization secret scope

We recommend creating a dedicated GitHub team for each of the data science teams, and a separate cookie-cutter repository from which members of a given GitHub team can trigger a workflow to create repositories.

The organization secret is applied to a repository only if the repository was created via the cookie-cutter workflow. In this way, the usage of the cookie cutter can be enforced, together with branch protection rules. See our article on cookie-cutter workflows.
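
A sketch of the step in the cookie-cutter workflow that adds the freshly created repository to the scope of the team's organization secrets (the organization, token, and secret names are assumptions that follow the LP_<environment> convention from above):

```python
import requests

ORG = "my-org"            # assumption: your GitHub organization
TOKEN = "<github-token>"  # assumption: GitHub App installation token with admin:org scope
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}
API = "https://api.github.com"

new_repo = "lp-cross-sell"  # repository just created by the cookie-cutter workflow

# Look up the numeric ID of the new repository.
repo_id = requests.get(f"{API}/repos/{ORG}/{new_repo}", headers=HEADERS).json()["id"]

# Add the repository to the selected-repository scope of each relevant org secret.
for secret_name in ["LP_PRD_AZURE_CREDENTIALS", "LP_PRE_AZURE_CREDENTIALS"]:
    requests.put(
        f"{API}/orgs/{ORG}/actions/secrets/{secret_name}/repositories/{repo_id}",
        headers=HEADERS,
    ).raise_for_status()
```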

This is not just a cookie cutter: after creating the repository, you will have a fully functional deployment workflow with all the permissions & guardrails needed to deploy your model.

Going to production will take minutes (given that the model code is available and tested) instead of the days or weeks you might otherwise spend requesting permissions and setting up secrets.

In large organizations, you will very likely depend on other teams for that part of the process, teams with their own priorities and backlogs. If that dependency can be minimized, go for it.

Conclusions & considerations

Shared environments can help to significantly speed up the deployment process for data science projects, but will not work for all use cases. For example, if a project requires PII data, your security officer may ask you to isolate the environment.

Another important point is the ability to separate deployment costs per project when a shared environment is used. In the case of Databricks, this can be done by adding tags to Databricks jobs.
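
For example, a sketch of a job definition with cost-allocation tags via the Databricks Jobs API 2.1 (job, notebook, and tag names are illustrative); Databricks propagates job tags to the clusters it creates, which makes per-project cost reporting possible:

```python
import os

import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

job_spec = {
    "name": "lp-cross-sell-training",
    # Tags used later to split costs per team/project.
    "tags": {"team": "loyalty-personalization", "project": "cross-sell"},
    "tasks": [
        {
            "task_key": "train",
            "notebook_task": {"notebook_path": "/Repos/lp-cross-sell/train"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1,
            },
        }
    ],
}

requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=job_spec).raise_for_status()
```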
