Install Kubeflow with Confidential Computing VMs on Microsoft Azure*

Leverage secure and confidential virtual machines (VMs) with Intel® Software Guard Extensions in Kubeflow Deployments

Benjamin Consolvo
7 min read · Aug 21, 2023

Many machine learning applications must ensure the confidentiality and integrity of the underlying code and data. Until recently, security has primarily focused on encrypting data at rest in storage or in transit across a network, but not data in use. Intel® Software Guard Extensions (Intel® SGX) provide a set of instructions that allow you to securely process and preserve the application code and data. Intel SGX does this by creating a trusted execution environment (TEE) within the CPU. TEEs allow user-level code from containers to allocate private regions of memory, called enclaves, and to execute the application code directly on the CPU.

With the Microsoft Azure* confidential computing platform, you can deploy both Windows* and Linux* virtual machines leveraging the security and confidentiality provided by Intel SGX. These machines are powered by 3rd Generation Intel® Xeon® Scalable processors and use Intel® Turbo Boost Max Technology 3.0 to reach 3.5 GHz. This tutorial will walk through how to set up Intel SGX nodes on an Azure Kubernetes* Service (AKS) cluster. We will then install Kubeflow*, the machine learning toolkit for Kubernetes that you can use to build and deploy scalable machine learning pipelines.

This module is a part of the Intel® Cloud Optimization Modules for Microsoft Azure, a set of cloud-native open source reference architectures that are designed to facilitate building and deploying Intel®-optimized AI solutions on leading cloud providers, including Amazon Web Services (AWS)*, Microsoft Azure*, and Google Cloud Platform* (GCP).

Each module, or reference architecture, includes a complete instruction set and all source code published on GitHub*. Before starting this tutorial, ensure that you have downloaded and installed the prerequisites. Then from a new terminal window, use the command below to log into your account interactively with the Microsoft Azure command-line interface.

az login
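The prerequisites are not listed in this article, but the commands used later call the az, kubectl, kustomize, git, and python3 command-line tools. As a minimal sketch (the function name missing_tools is our own and not part of any of these tools), the helper below reports which tools are not yet on your PATH:

```shell
# Print the name of every tool passed as an argument that is not on the PATH.
# missing_tools is a hypothetical helper name used only for this sketch.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Example usage (the tool list is inferred from the commands in this tutorial):
# missing_tools az kubectl kustomize git python3
```

If the function prints nothing, every tool in the list was found.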

Next, create a resource group that will hold the Azure resources for our solution. We will call our resource group intel-aks-kubeflow and set the location to eastus.

# Set the names of the Resource Group and Location
export RG=intel-aks-kubeflow
export LOC=eastus
# Create the Azure Resource Group
az group create -n $RG -l $LOC

To set up the AKS cluster with confidential computing nodes, we will first create a system node pool and enable the confidential computing add-on. The confidential computing add-on will configure a DaemonSet for the cluster that will ensure each eligible VM node runs a copy of the Azure device plugin pod for Intel SGX.

The command below will provision a node pool using a standard virtual machine from the Dv5 series, which runs on 3rd Gen Intel Xeon Scalable processors. This is the node that will host the AKS system pods, like CoreDNS and metrics-server. The command will also enable managed identity for the cluster and provision a standard Azure Load Balancer. If you have already set up an Azure Container Registry, you can attach it to the cluster by adding the parameter --attach-acr <registry-name> to the command.

# Set the name of the AKS cluster
export AKS=aks-intel-sgx-kubeflow

# Create the AKS system node pool
az aks create --name $AKS \
--resource-group $RG \
--node-count 1 \
--node-vm-size Standard_D4_v5 \
--enable-addons confcom \
--enable-managed-identity \
--generate-ssh-keys -l $LOC \
--load-balancer-sku standard

Once the system node pool has been deployed, we will add the Intel SGX node pool to the cluster. The command below provisions two four-core Intel SGX nodes from the Azure DCsv3 series. It also adds a node label to the pool with the key intelvm and the value sgx, which will be referenced in the Kubernetes nodeSelector to assign the Kubeflow pipeline pods to an Intel SGX node:

az aks nodepool add --name intelsgx \
--resource-group $RG \
--cluster-name $AKS \
--node-vm-size Standard_DC4s_v3 \
--node-count 2 \
--labels intelvm=sgx
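To illustrate how the intelvm=sgx label is consumed, here is a minimal sketch that prints a pod manifest whose nodeSelector matches the label. The function name, the pod name sgx-label-test, and the busybox image are placeholders chosen for this example and are not part of the original module.

```shell
# Print a pod manifest that is pinned to the Intel SGX node pool via nodeSelector.
# The pod name and image are hypothetical placeholders for illustration only.
sgx_test_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sgx-label-test
spec:
  nodeSelector:
    intelvm: sgx
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "echo scheduled on an SGX node"]
  restartPolicy: Never
EOF
}

# Example usage, once the cluster credentials have been configured:
# sgx_test_pod_manifest | kubectl apply -f -
```

Because the selector must match a node label exactly, a pod created from this manifest can only be scheduled onto nodes in the intelsgx pool.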

Once the confidential node pool has been set up, obtain the cluster access credentials and merge them into your local .kube/config file using the command below.

az aks get-credentials -n $AKS -g $RG

We can verify that the cluster credentials were set correctly by executing the command below. This should return the name of your AKS cluster.

kubectl config current-context

To ensure that the Intel SGX VM nodes were created successfully, run:

kubectl get nodes

You should see two agent nodes in the Ready state whose names begin with aks-intelsgx.

To ensure that the DaemonSet was created successfully, run:

kubectl get pods -A

In the kube-system namespace, you should see two pods running that begin with the name sgx-plugin. If you see the above pods and node pool running, this means that your AKS cluster is now ready to run confidential applications, and we can begin installing Kubeflow.
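The two checks above can also be scripted. The sketch below assumes the default column layout of kubectl get nodes and kubectl get pods -A; the function names are our own. Each function reads kubectl output on standard input and prints a count, which should be 2 for the cluster built in this tutorial.

```shell
# Count Ready nodes whose names start with aks-intelsgx.
# Reads the output of `kubectl get nodes --no-headers` on stdin.
count_ready_sgx_nodes() {
  awk '$1 ~ /^aks-intelsgx/ && $2 == "Ready" { n++ } END { print n + 0 }'
}

# Count Running sgx-plugin pods in the kube-system namespace.
# Reads the output of `kubectl get pods -A --no-headers` on stdin.
count_running_sgx_plugins() {
  awk '$1 == "kube-system" && $2 ~ /^sgx-plugin/ && $4 == "Running" { n++ } END { print n + 0 }'
}

# Example usage:
# kubectl get nodes --no-headers | count_ready_sgx_nodes
# kubectl get pods -A --no-headers | count_running_sgx_plugins
```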

Install Kubeflow on an Azure Kubernetes Service (AKS) Cluster

To install Kubeflow on an AKS cluster, first clone the Kubeflow Manifests GitHub repository.

git clone https://github.com/kubeflow/manifests.git

Change the directory to the newly cloned manifests directory.

cd manifests

As an optional step, you can change the default password for the Kubeflow Dashboard. First, generate a bcrypt hash of your new password using the command below:

python3 -c 'from passlib.hash import bcrypt; \
import getpass; \
print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

Open common/dex/base/config-map.yaml and paste the newly generated hash as the hash value, at around line 22:

    staticPasswords:
    - email: user@example.com
      hash:

Next, change the Istio Ingress Gateway from a ClusterIP to a LoadBalancer. This will configure an external IP address that we can use to access the dashboard from our browser.

Navigate to common/istio-1-16/istio-install/base/patches/service.yaml and change the specification type to LoadBalancer at around line 7.

apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
  namespace: istio-system
spec:
  type: LoadBalancer
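If you prefer to script this change, the sketch below makes the same edit with sed. It assumes the file currently contains the line type: ClusterIP, as described above; the function name is our own.

```shell
# Rewrite "type: ClusterIP" to "type: LoadBalancer" in the given YAML file.
# A backup of the original file is kept next to it with a .bak suffix.
set_service_type_loadbalancer() {
  sed -i.bak -E 's/^([[:space:]]*type:[[:space:]]*)ClusterIP[[:space:]]*$/\1LoadBalancer/' "$1"
}

# Example usage, run from the manifests directory:
# set_service_type_loadbalancer common/istio-1-16/istio-install/base/patches/service.yaml
```

The pattern is anchored to a whole line so that only the service type is rewritten, and the indentation of the original line is preserved.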

For AKS clusters, we also need to disable the AKS admission enforcer for the Istio webhook. Navigate to common/istio-1-16/istio-install/base/install.yaml and add the following annotation at around line 2694.

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: istio-sidecar-injector
  annotations:
    admissions.enforcer/disabled: 'true'
  labels:

Next, we will update the Istio Gateway to configure the Transport Layer Security (TLS) protocol. This will allow us to access the dashboard over HTTPS. Navigate to common/istio-1-16/kubeflow-istio-resources/base/kf-istio-resources.yaml and, at the end of the file (around line 14), paste the following contents:

    tls:
      httpsRedirect: true
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
    - "*"
    tls:
      mode: SIMPLE
      privateKey: /etc/istio/ingressgateway-certs/tls.key
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt

Now we are ready to install Kubeflow. We will use kustomize to install the components with a single command. You can also install the components individually.

while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Note: It may take several minutes for all components to be installed, and some may fail on the first try. This is inherent to how Kubernetes and kubectl work (e.g., a custom resource (CR) can only be created after its custom resource definition (CRD) becomes ready). The solution is to simply re-run the command until it succeeds.
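The loop above retries indefinitely. As an alternative sketch, the retry helper below (a hypothetical function, not part of kustomize or kubectl) stops after a fixed number of attempts, so a persistent error such as a typo in one of the edited YAML files surfaces instead of looping forever.

```shell
# Run a command until it succeeds or the attempt limit is reached.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example usage, wrapping the install pipeline in a hypothetical helper:
# apply_kubeflow() { kustomize build example | awk '!/well-defined/' | kubectl apply -f -; }
# retry 20 10 apply_kubeflow
```

The attempt count and delay in the usage example are arbitrary; adjust them to suit your cluster.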

Once the components have been installed, verify that all of the pods are running by using:

kubectl get pods -A
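With dozens of pods across several namespaces, it can be easier to list only the ones that still need attention. The sketch below assumes the default column layout of kubectl get pods -A, where the fourth column is STATUS; the function name is our own.

```shell
# Print "namespace/name status" for every pod that is neither Running nor Completed.
# Reads the output of `kubectl get pods -A --no-headers` on stdin.
list_unready_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example usage:
# kubectl get pods -A --no-headers | list_unready_pods
```

Empty output means every pod reports Running or Completed. Note that this looks only at the STATUS column, so a Running pod whose containers are not all ready is not flagged.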

Optional: If you created a new password for Kubeflow, restart the dex pod to ensure it is using the updated password.

kubectl rollout restart deployment dex -n auth

Finally, create a self-signed certificate for the TLS Protocol using the external IP address from the Istio load balancer. To get the external IP address, use the following command:

kubectl get svc -n istio-system
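To capture just the IP address rather than reading it from the table, kubectl can print a single field with a JSONPath expression. This is a minimal sketch that assumes the service is named istio-ingressgateway, as in the service.yaml patch edited earlier; the function name is our own.

```shell
# Print the external IP of the istio-ingressgateway service.
# The output is empty while Azure is still provisioning the load balancer.
get_istio_ip() {
  kubectl get svc istio-ingressgateway -n istio-system \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Example usage:
# ISTIO_IP=$(get_istio_ip) && echo "$ISTIO_IP"
```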

Create a file named certificate.yaml and paste in the contents below, replacing <Istio IP address> with the external IP address of the Istio load balancer:

nano certificate.yaml

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: istio-ingressgateway-certs
  namespace: istio-system
spec:
  secretName: istio-ingressgateway-certs
  ipAddresses:
    - <Istio IP address>
  isCA: true
  issuerRef:
    name: kubeflow-self-signing-issuer
    kind: ClusterIssuer
    group: cert-manager.io
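Rather than editing the placeholder by hand, the same manifest can be generated with the IP address filled in. This is a sketch only: the function name is hypothetical, and 203.0.113.10 in the usage line is a documentation address standing in for your own external IP.

```shell
# Print the Certificate manifest with the given IP address filled in.
# Usage: render_istio_certificate <external-ip>
render_istio_certificate() {
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: istio-ingressgateway-certs
  namespace: istio-system
spec:
  secretName: istio-ingressgateway-certs
  ipAddresses:
    - $1
  isCA: true
  issuerRef:
    name: kubeflow-self-signing-issuer
    kind: ClusterIssuer
    group: cert-manager.io
EOF
}

# Example usage (replace the address with your own external IP):
# render_istio_certificate 203.0.113.10 > certificate.yaml
```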

Then, apply the certificate:

kubectl apply -f certificate.yaml

Verify that the certificate was created successfully:

kubectl get certificate -n istio-system

Now we are ready to launch the Kubeflow Dashboard. To log in to the dashboard, enter the Istio IP address in your browser. The first time you access the dashboard, your browser may show a security warning because we are using a self-signed certificate. You can either replace it with a certificate issued by a certificate authority (CA), if you have one, or click Advanced and proceed to the website. The Dex login screen should then appear. Enter your username and password. The default username for Kubeflow is user@example.com and the default password is 12341234.

Summary

In this tutorial, we went over how to install Kubeflow on an Azure Kubernetes Service cluster with a confidential computing node pool. You are now ready to build and deploy scalable machine learning pipelines on Kubeflow. In the next tutorial, we will go over how to set up your Kubeflow Pipelines to ensure the pods are scheduled onto an Intel SGX VM node.

Additional Resources

  1. Move on to the next tutorial on setting up your Kubeflow pipeline here.
  2. Access the full source code on GitHub here.
  3. Register for Office Hours here for help on your implementation.
  4. Learn more about all of our Intel Cloud Optimization Modules here.
  5. Come chat with us on our Intel DevHub Discord server to keep interacting with fellow developers.

References

This work was re-published for Medium from Kelli Belcher’s article found here.

Want to Connect?

Get in touch with me on LinkedIn if you have any questions.

Benjamin Consolvo

AI Software Engineering Manager at Intel. I like to write on topics in AI to help other developers along their coding journey.