Essentials of an MLOps Platform Part 3: Services

Rik Kraan
Published in bigdatarepublic
Jul 6, 2023 · 7 min read


An MLOps platform is a collaborative working environment for machine learning engineers that should facilitate iterative experimentation and model deployment. Once it has been decided which applications should be available on the platform (described in part 1) and the basic infrastructure is in place (described in part 2), it’s time to provision the MLOps platform so that users can start building machine learning capabilities. This post provides a hands-on tutorial on provisioning an MLOps platform, including how to handle authorization and how to use the GitOps paradigm to keep your applications up-to-date and self-managed.

Photo by Pan Yunbo on Unsplash

Authentication

Before installing any application on the platform, user authentication should be strictly managed. By design, the platform will expose some public endpoints (e.g. to access a JupyterHub instance). The authentication process should ensure that these endpoints are only accessible to the intended user base and should make it easy to grant (and revoke) access for specific users. Although the concepts are cloud-agnostic, in this reference example the platform is deployed on Google Cloud Platform (GCP) and uses GCP’s Identity-Aware Proxy (IAP) service to authenticate users within a specific domain platform-wide.

On the platform, we have two types of applications:

  1. Applications that do not need to be aware of a user’s identity (e.g. MLflow, which can be accessed by all users). Note: in some situations, it might be desirable to make a certain part of MLflow accessible only to specific teams within the organization.
  2. Applications that need to know the identity of the user in order to serve user-specific content (e.g. giving access to a specific set of monitoring metrics).

For the first type of application, IAP authentication is sufficient. The second type of application requires some additional configuration.

The ideal scenario would be to propagate the identity of a user by collecting the JSON Web Token (JWT) that is created by IAP. In this scenario, a user only needs to sign in once with their Google account and can thereafter access all applications to which they have been given access. However, it turns out that fetching the JWT and using it for user identification is not something most applications provide out of the box. The applications on the platform therefore use the JWT where possible; otherwise, they fall back on OAuth for application-specific authentication, which requires an additional login with a Google account.

ArgoCD

ArgoCD is (as its name suggests) an application for the continuous delivery of applications on a Kubernetes cluster using the GitOps paradigm. ArgoCD is implemented as a Kubernetes controller that continuously monitors applications running in the cluster and compares their current state with the target state (a specified branch in a Git repository). Once the controller detects a difference between the current and the target state, it marks the application as OutOfSync and updates it automatically (or manually, if so configured) to the desired state. In addition, it can prune resources that are no longer defined in Git and recover resources that have been manually deleted.

For our platform, this ensures that our applications are always updated to the latest state and that any changes are rolled out to all users. While we intend to make all the platform’s applications self-managed (including ArgoCD itself), ArgoCD has to be installed once when creating the cluster (see part 2 of the series), after which it will maintain itself.

To start working with ArgoCD, we need to do three things:

  1. Implement user authentication
  2. Allow ArgoCD to read from a repository
  3. Install and configure ArgoCD

Configure user authentication

First, we need to create OAuth credentials to enable ArgoCD to authenticate users. Unfortunately, it’s not possible to create these credentials in an automated way using Terraform: creation has to be done in the GCP console. Once stored in GCP Secret Manager, the credentials can be added to our cluster in the argocd namespace as an ExternalSecret (we use the Kubernetes External Secrets Operator to access GCP Secret Manager from Kubernetes) by applying the following yaml.
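The exact manifest depends on how the External Secrets Operator is configured; the sketch below assumes a ClusterSecretStore named gcp-secret-store that points to GCP Secret Manager, and hypothetical Secret Manager entry names for the OAuth client ID and secret.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: argocd-google-oauth
  namespace: argocd
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-store            # assumed store pointing to GCP Secret Manager
  target:
    name: argocd-google-oauth
    template:
      metadata:
        labels:
          app.kubernetes.io/part-of: argocd   # required so ArgoCD can reference the secret
  data:
    - secretKey: client_id
      remoteRef:
        key: argocd-oauth-client-id           # hypothetical Secret Manager entry
    - secretKey: client_secret
      remoteRef:
        key: argocd-oauth-client-secret       # hypothetical Secret Manager entry
```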

With these secrets available on our cluster, we can reference them in ArgoCD’s main ConfigMap (argocd-cm) by applying the following yaml.
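A minimal sketch of the relevant part of argocd-cm. The $argocd-google-oauth:client_id syntax tells ArgoCD to resolve the value from the secret created above (this works for secrets labeled app.kubernetes.io/part-of: argocd); the external URL is a placeholder.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  url: https://argocd.example.com   # placeholder external URL of ArgoCD
  oidc.config: |
    name: Google
    issuer: https://accounts.google.com
    clientID: $argocd-google-oauth:client_id
    clientSecret: $argocd-google-oauth:client_secret
    requestedScopes: ["openid", "profile", "email"]
```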

In addition, role-based access control (RBAC) can be used to manage roles (e.g. admin) for specific users (or user groups). The example below adds two admins to ArgoCD, but this can also be done via Google Groups.
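ArgoCD reads its RBAC rules from the argocd-rbac-cm ConfigMap; below is a sketch with two hypothetical admin accounts.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  policy.default: role:readonly     # everyone else gets read-only access
  policy.csv: |
    g, alice@example.com, role:admin
    g, bob@example.com, role:admin
```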

Allow ArgoCD to read our repository

As mentioned before, ArgoCD continuously monitors Git repositories for updates. It can only do so if it is granted read access to the repository it should monitor. For each repository that should be monitored, an SSH key pair has to be created manually and registered with the repository host (e.g. as a deploy key in GitHub). The private key should be added to the cluster so that ArgoCD can use it. In our case, we added the key as a secret in the GCP console. The secret looks similar to:

{"name":"mlops-architecture","sshPrivateKey":"-----BEGIN OPENSSH PRIVATE KEY-----\nSECRET!! -----END OPENSSH PRIVATE KEY-----","url":"your-repo-url"}

After adding the secret to the GCP console, it can be added to the cluster by creating an ExternalSecret. If the resulting Kubernetes secret is labeled with argocd.argoproj.io/secret-type: repository, ArgoCD will detect it and use it when trying to fetch a repository’s latest status. Below is an example of a yaml file that creates such a secret.
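A sketch under the same assumptions as before (a gcp-secret-store ClusterSecretStore and a hypothetical Secret Manager entry holding the JSON shown above); dataFrom/extract unpacks the JSON keys (name, sshPrivateKey, url) into the Kubernetes secret.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: mlops-architecture-repo
  namespace: argocd
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-store
  target:
    name: mlops-architecture-repo
    template:
      metadata:
        labels:
          argocd.argoproj.io/secret-type: repository   # tells ArgoCD this is a repository credential
  dataFrom:
    - extract:
        key: mlops-architecture-repo   # hypothetical Secret Manager entry holding the JSON above
```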

Install and configure ArgoCD

After setting up the authentication and putting all secrets in place (within the right namespace), the next step is to install ArgoCD itself. The simplest method is to apply the standard install.yaml file, which sets up ArgoCD with all default settings. The last step in the process is to make ArgoCD accessible to users from their browsers. To make this happen, we have to create a VirtualService that configures the Ingress controller (in our reference example this is Istio, see part 2 of the series) to route traffic for a certain domain name to the ArgoCD application.
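A sketch of such a VirtualService; the hostname and gateway name are placeholders for whatever was set up in part 2. Note that when TLS is terminated at the gateway, argocd-server is typically run with its --insecure flag so it serves plain HTTP inside the mesh.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: argocd
  namespace: argocd
spec:
  hosts:
    - argocd.example.com              # placeholder domain name
  gateways:
    - istio-system/public-gateway     # assumed shared Istio gateway from part 2
  http:
    - route:
        - destination:
            host: argocd-server.argocd.svc.cluster.local
            port:
              number: 80
```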

Applications

With user authentication and ArgoCD in place, it’s time to actually make the platform useful and provision it with applications that users can work with. Before installing any applications, we decided that all applications should be provisioned in the same format and chose Helm charts. By adhering to this strategy, setting up a new application on the platform is easy and can be done in three steps:

  1. Create a Helm chart (or extend an existing one) and add configuration parameters.
  2. Create an ArgoCD manifest and apply it to the cluster.
  3. Add an ingress rule (if necessary) to enable users to reach the application.

One of the applications installed on the platform is JupyterHub; below is a hands-on tutorial on how to install it on the platform.

Create the JupyterHub application

First, we leverage the already existing JupyterHub Helm chart. The Chart.yaml contains just the following code.
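A sketch of a wrapper chart that pulls in the upstream JupyterHub chart as a dependency; the chart version is an example and should be pinned to whatever release fits your setup.

```yaml
apiVersion: v2
name: jupyterhub
description: JupyterHub for the MLOps platform
version: 0.1.0
dependencies:
  - name: jupyterhub
    version: "2.0.0"                               # example version, pin as appropriate
    repository: https://hub.jupyter.org/helm-chart/
```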

In addition, a values.yaml file is created in the same directory to configure the application. In this example, users are redirected to the /lab URL by default and get a storage capacity of 2Gi. This can be extended with any of the settings found in the official repo.
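A minimal sketch of such a values.yaml; because JupyterHub is pulled in as a chart dependency, its values are nested under the jupyterhub key. The proxy setting is an assumption that follows from routing traffic through Istio.

```yaml
jupyterhub:
  singleuser:
    defaultUrl: "/lab"       # redirect users to JupyterLab by default
    storage:
      capacity: 2Gi          # per-user persistent storage
  proxy:
    service:
      type: ClusterIP        # assumed: traffic enters through Istio, not a LoadBalancer
```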

In addition to these configurations, JupyterHub needs to identify users in order to start up their own notebook servers and point them to those servers. As described before, in an ideal scenario the JWT created by IAP would be used, but this functionality is not provided out of the box. Therefore an additional identification process is configured. To facilitate the login process, we leverage OAuth so that users can log in with the same account they use to authenticate with IAP. Below is an example file that creates a new ExternalSecret for JupyterHub. It fetches the client_id and client_secret that were created in the GCP console before (see above). Note: it is not possible to pass these credentials via the normal values.yaml.
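A sketch under the same assumptions as the ArgoCD ExternalSecret above; how the resulting secret is wired into the chart (e.g. as environment variables for the hub’s authenticator) depends on the authenticator configuration you use.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: jupyterhub-oauth
  namespace: jupyterhub
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-store
  target:
    name: jupyterhub-oauth
  data:
    - secretKey: client_id
      remoteRef:
        key: jupyterhub-oauth-client-id       # hypothetical Secret Manager entry
    - secretKey: client_secret
      remoteRef:
        key: jupyterhub-oauth-client-secret   # hypothetical Secret Manager entry
```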

ArgoCD manifest

After setting up the configuration, the application can be installed by applying a manifest to the Kubernetes cluster. This manifest is very simple and essentially just points to the right repository and branch.
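A sketch of such an Application manifest; the repository URL and chart path are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: jupyterhub
  namespace: argocd
spec:
  project: default
  source:
    repoURL: git@github.com:your-org/mlops-architecture.git   # placeholder repo URL
    targetRevision: main
    path: charts/jupyterhub        # placeholder path to the chart
  destination:
    server: https://kubernetes.default.svc
    namespace: jupyterhub
  syncPolicy:
    automated:
      prune: true                  # delete resources removed from Git
      selfHeal: true               # revert manual changes in the cluster
    syncOptions:
      - CreateNamespace=true
```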

Ingress rule

To enable users to access JupyterHub through their browsers, we have to create a VirtualService that configures how incoming traffic should be routed. This works identically to the VirtualService described in the ArgoCD section.
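For completeness, a sketch of the JupyterHub variant; the hostname and gateway are again placeholders, and traffic is routed to proxy-public, the public-facing service created by the JupyterHub chart.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: jupyterhub
  namespace: jupyterhub
spec:
  hosts:
    - jupyter.example.com             # placeholder domain name
  gateways:
    - istio-system/public-gateway     # assumed shared Istio gateway from part 2
  http:
    - route:
        - destination:
            host: proxy-public.jupyterhub.svc.cluster.local
            port:
              number: 80
```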

Conclusion

This blog provides an overview of how to provision an MLOps platform and install ArgoCD to let applications be self-managed. Although managing applications with ArgoCD is fairly straightforward in the end, the initial configuration does require some manual steps. In our opinion, the most important thing for creating a successful platform is to adhere to a fixed format for installing and configuring applications, in order to make the process reproducible for all engineers working on the platform. It is worth mentioning that some open-source packages do not provide out-of-the-box functionality for identifying users from a JWT, something that will probably be added in the future.
