Using Aliyun GPU Share in an Azure AKS Cluster
(NOTE: All files to support this article can be found in this GitHub repository)
I faced the need to use the Aliyun Scheduler Extender to run several PODs requesting GPU resources in an AKS cluster. I had a hard time getting it to work: even though I was following this guide, which explains the process well, it simply did not work for me, and I was left wondering why it was published as a successful guide.
Well, in the end I hit the wall: since AKS is a managed K8S service, some things must be done differently. In this case, we cannot touch the default scheduler configuration, so we have to add a secondary scheduler and use it to handle the PODs requesting GPU resources.
The Aliyun Scheduler Extender requires three main components to work:
- A Custom Scheduler capable of scheduling PODs that request aliyun.com/gpu-mem resources.
- The Scheduler Extender, responsible for determining whether a single GPU device on the node can provide enough GPU memory during the global scheduler's Filter and Bind phases, and for recording the GPU allocation result in the Pod spec annotations for subsequent use at Bind time.
- The Device Plugin, responsible for allocating the GPU device according to the decision of the GPU Share Scheduler Extender recorded in the Pod spec annotations.
For more information, please refer to the Aliyun documentation.
If you’re looking for a quick spoiler on how the story ends, I can tell you there are two approaches: one is enabling SSH access to your worker nodes and adding a static POD in /etc/kubernetes/manifests, together with a configuration file for the scheduler and the scheduler.conf acting as the kubeconfig; the other is creating a Deployment for the scheduler POD with some Service Accounts that allow the custom scheduler to access the K8S API.
If you follow the above-mentioned guide, be aware that the command instantiating the scheduler sets the leader-elect parameter to true, causing this secondary scheduler to fail to acquire the lease (AKS always holds the leadership) and remain inoperable within the cluster.
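The fix is simply to disable leader election for this secondary scheduler, which boils down to a flag like the following in the scheduler's command (a sketch; a fuller command appears later in the static POD manifest):
kube-scheduler --kubeconfig=/etc/kubernetes/scheduler.conf --leader-elect=false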
Also, setting a scheduler name and referencing it in the POD manifests proves valuable for proper operation.
Now, for the patient readers willing to see the whole story in development, let's get our hands dirty and put Aliyun to work!
Beginning with the basics: getting an AKS cluster
In this repository I am leaving some basic Terraform code to deploy an AKS cluster. The explanation and further details of it are out of the scope of this article, but I will just mention that, after cloning the repo, you can cd into the terraform folder and run the following commands (not all are necessary):
terraform init
terraform validate
terraform plan
terraform apply -auto-approve
This should get you a resource group named aliyun-test and an AKS cluster named aliyun-aks; you can change those names and other parameters, like the location, which defaults to centralus, in the variables file.
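If you'd rather not edit the variables file, Terraform also lets you override a variable on the command line; for instance, assuming the variable is indeed named location:
terraform apply -auto-approve -var="location=eastus"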
After some 10 minutes you should have the cluster running:
Observe also that:
- You need a Service Principal with proper privileges over your Azure subscription to run this code, or you can choose to run the script with an az login made in your bash instance by removing the service principal parameters from the azurerm block.
- You must have quota for the VM type Standard_NC4as_T4_v3, and you need this one or any other featuring an nVidia GPU; otherwise there's no point in trying to install Aliyun, is there?
- Beware that GPU VMs do cost money, much more than the other VMs.
- A VM is created and placed in the same VNET as the cluster; we use this VM as a Jumpbox to manage the cluster. It gets installed with kubectl and a nice plugin that shows the current context and namespace in the command line. You must have SSH keys to connect to it after the VM is created, and you'll need to be inside the VM's VNET if you want to access the worker nodes via SSH.
- The provided code does not follow any good practices, but as "Master Jedis" and/or "Rockstars" aren't likely to rely on articles like this, we're good ;)
Now, you need a connection to the recently created cluster. If you use a direct connection via the "connect" button of the AKS cluster you won't be able to SSH into the nodes, so you will not be able to follow the first approach of putting files into the cluster and you should jump straight to Aliyun: The easy way; if you're interested in the static POD approach, just keep going.
Now, let's first connect to the created Jumpbox, using an SSH key you must have:
It's also out of the scope of this article to get into the details of how to SSH into a VM, but, once you've gotten the Jumpbox public IP address, which we'll call JUMPBOX_PUBLIC_IP, you can run the following command (we're assuming here you have a valid SSH key in ~/.ssh/aliyun_rsa and used that one to configure the VMs):
ssh -i ~/.ssh/aliyun_rsa adminuser@JUMPBOX_PUBLIC_IP
Once you run that, if successful, you should see the following:
Observe that the plugin shows the current context and namespace (aliyun-aks and default respectively). From here, you should be able to run kubectl commands:
Note also that we've defined an alias for kubectl as k, which is commonly used; we'll use it from now on for agility on the command line. Let's now get into the deal with Aliyun.
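In case you want the same alias in your own shell, something like this in your ~/.bashrc does the trick (a minimal sketch; the Jumpbox provisioning may wire it up differently):
source <(kubectl completion bash)          # enable kubectl autocompletion
alias k=kubectl
complete -o default -F __start_kubectl k   # make the completion work for the alias too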
Aliyun: Let’s do it the hard way first
Now that we have the AKS cluster and access to it, we can proceed with the Aliyun Scheduler Extender installation. For this hard way of doing it, we require SSH access to the worker nodes to be able to place files within the Linux filesystem.
You need to perform the following commands from a terminal where you can perform an az login. You can do it from the Jumpbox, but you won't see a browser window opening automatically as this machine does not have a GUI, so if you want to perform the az login from the Jumpbox you will have to manually copy the provided URL and paste the provided code in a browser:
If you decide to do this from your local computer you can follow the usual way.
Now, to enable SSH connections to the AKS nodes I followed this guide, which can be summarized with the following commands:
CLUSTER_INFRA_RESOURCE_GROUP=$(az aks show --resource-group "aliyun-test" --name "aliyun-aks" --query nodeResourceGroup -o tsv)
SCALE_SET_NAME=$(az vmss list --resource-group "$CLUSTER_INFRA_RESOURCE_GROUP" --query "[?contains(name, 'default')].name | [0]" -o tsv)
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/aliyun_rsa
az vmss extension set --resource-group "$CLUSTER_INFRA_RESOURCE_GROUP" --vmss-name "$SCALE_SET_NAME" --name VMAccessForLinux --publisher Microsoft.OSTCExtensions --version 1.4 --protected-settings "{\"username\":\"azureuser\", \"ssh_key\":\"$(cat ~/.ssh/aliyun_rsa.pub)\"}"
az vmss update-instances --instance-ids '*' --resource-group "$CLUSTER_INFRA_RESOURCE_GROUP" --name "$SCALE_SET_NAME"
The first two lines get the name of the infrastructure resource group created for the AKS cluster and the name of the Virtual Machine Scale Set (VMSS) created for the cluster's workers; you can get those values from the portal too.
The third line creates an SSH key that is injected into the VMSS in the next two lines, where the Microsoft.OSTCExtensions VMAccessForLinux extension gets installed. Observe that the SSH key aliyun_rsa appears here again; it does not have to be the same as the one used for the Jumpbox (the commands above actually create a new one because we did this from the Jumpbox), but if you do it from your local computer you can reuse the SSH key you already have.
I recommend running the commands one by one, verifying that the values you get for the cluster parameters are correct:
Now, let's SSH into one AKS node. For this we can run the command k get no -owide to get the IP of the node, and then run the following:
k get no -owide
ssh -i ~/.ssh/aliyun_rsa azureuser@10.1.1.4
Of course, 10.1.1.4 is an IP in my cluster; in yours it may differ.
Ok, now we're inside the VM that acts as a worker node for this cluster, woohoo! Observe that in AKS there's no such thing as a master node. Well, let me explain: there has to be a control plane, of course, and master nodes do exist, but they are provided and managed by the Azure cloud, you cannot see or access them, and you do not get charged (at least nominally) for them.
Now, let's see the PODs in the kube-system namespace, where the K8S-related workloads usually run and where we'll place our scheduler:
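For reference, that's just the usual pod listing scoped to that namespace:
k get po -n kube-system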
Let's now manually "install" the scheduler for aliyun.com/gpu-mem resources. BTW, those resources are GB of GPU memory that a POD can request; they can only be integers (there's no such thing as 1 milli-aliyun) and the maximum value is the amount of memory in GB of your particular GPU.
Now, the first thing you can do is copy the kubeconfig that has been installed on the Jumpbox to the worker node using the scp command. For this we'll copy the config, certificate-authority, client-certificate and client-key files from the .kube folder into azureuser's home on the worker node you have an SSH connection to:
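A sketch of that copy, run from the Jumpbox and assuming the node's IP is 10.1.1.4 as before (adjust the IP to yours):
cd ~/.kube
scp -i ~/.ssh/aliyun_rsa config certificate-authority client-certificate client-key azureuser@10.1.1.4:/home/azureuser/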
Three things here:
- We're using the kubeconfig generated with a Service Principal during the Jumpbox provisioning, because we do not have access to the full set of keys to generate more authentication certs. Of course, you can create a certificate request and sign it as a K8S object, but that's yet another step added to the clumsiness.
- The certs must be in the same dir as the config file (we'll rename it to scheduler.conf) or you will have to edit it accordingly.
- We had an SSH connection to the worker before, but you need to be on the Jumpbox to SCP into the node.
Now we should see the copied files in the node's home:
We've also created the kube-scheduler.yaml and scheduler-policy-config.yaml files that we'll use to run the scheduler. You can either download those files with curl or paste their content with vim from here.
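To give you an idea of what goes into kube-scheduler.yaml, here's a minimal sketch of a static POD manifest along those lines (the image tag, flags and verbosity are assumptions on my part; use the actual file from the repository, and note the legacy policy flags only exist on older kube-scheduler versions):
apiVersion: v1
kind: Pod
metadata:
  name: aliyun-scheduler
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-scheduler
    image: registry.k8s.io/kube-scheduler:v1.22.0   # match your cluster's version
    command:
    - kube-scheduler
    - --kubeconfig=/etc/kubernetes/scheduler.conf               # the renamed kubeconfig
    - --policy-config-file=/etc/kubernetes/scheduler-policy-config.yaml
    - --use-legacy-policy-config=true
    - --scheduler-name=Aliyun                                   # the name PODs reference via schedulerName
    - --leader-elect=false                                      # AKS holds the lease, so don't compete for it
    - --v=4
    volumeMounts:
    - name: etc-kubernetes
      mountPath: /etc/kubernetes
      readOnly: true
  volumes:
  - name: etc-kubernetes
    hostPath:
      path: /etc/kubernetes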
Now, copy the config file and corresponding certs to the /etc/kubernetes directory with the following commands:
sudo cp certificate-authority client-certificate client-key config /etc/kubernetes/
cd /etc/kubernetes/
sudo mv config scheduler.conf
Observe that we've renamed the config file to scheduler.conf, as that is the name referenced within the kube-scheduler.yaml file.
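For reference, scheduler-policy-config.yaml is what wires the scheduler to the GPU Share extender. A minimal sketch, with the URL and options taken from the upstream gpushare-scheduler-extender docs (treat them as assumptions and prefer the file in the repo):
apiVersion: v1
kind: Policy
extenders:
- urlPrefix: "http://127.0.0.1:32766/gpushare-scheduler"   # where the extender service listens
  filterVerb: filter          # extender endpoint called at Filter time
  bindVerb: bind              # extender endpoint called at Bind time
  enableHttps: false
  nodeCacheCapable: true
  managedResources:
  - name: aliyun.com/gpu-mem  # the resource this extender manages
    ignoredByScheduler: false
  ignorable: false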
Copy also the scheduler-policy-config.yaml file to the /etc/kubernetes directory, and the kube-scheduler.yaml file to the /etc/kubernetes/manifests directory:
sudo cp scheduler-policy-config.yaml /etc/kubernetes/
sudo cp kube-scheduler.yaml /etc/kubernetes/manifests/
You should now see the custom scheduler POD running:
Ok, that was hard, huh? 😅 Now let's make this easy.
Aliyun: The easy way
NOTE: If you already went through the hard way you don't need to create the custom scheduler again; if you want to test the easy way you should delete the /etc/kubernetes/manifests/kube-scheduler.yaml file to avoid conflicts.
You may want to continue with the steps of installing the Scheduler Extender, Device Plugin DaemonSet and Device Plugin RBAC components if you followed the hard instructions to set up the Aliyun scheduler; however, for those seeking an easy way to get the scheduler working, I've got it all covered.
In the GitHub repository with support files for this article you'll find the aliyun-as-deployment.yaml file, which contains manifests to create the set of K8S resources needed to run the Aliyun scheduler; to apply it you just run this command:
k create -f https://raw.githubusercontent.com/dsatizabal/aliyun-aks/main/manifests/aliyun-as-deployment.yaml
Observe the results before and after running that command:
Neato! No more banana leaves! We can now get the Aliyun scheduler running in a soft n' easy way, very quickly. I recommend taking a look at the logs of the POD created for the Aliyun scheduler; you may find useful info, as we've set the verbosity of the logs to 4.
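Something like this will show them (the Deployment name here is an assumption; check the actual name with k get deploy -n kube-system):
k -n kube-system logs deploy/aliyun-scheduler --tail=50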
Before moving on, a logical question someone may ask is: why spend so much time SSH-ing into the nodes and placing static PODs if it's so easy to run the Custom Scheduler as a Deployment? Well, two things: (1) at the beginning, I must admit, I was not aware it was this easy; (2) someone may want to use static PODs instead of Deployments.
Ok, now let’s wrap up and see the Aliyun working. 😉
Installing the Scheduler Extender, Device Plugin and RBAC components
We now need to install the Scheduler Extender POD and the Device Plugin for Aliyun; just run the following commands:
k create -f https://raw.githubusercontent.com/dsatizabal/aliyun-aks/main/manifests/gpushare-schd-extender.yaml
k create -f https://raw.githubusercontent.com/dsatizabal/aliyun-aks/main/manifests/device-plugin-rbac.yaml
k create -f https://raw.githubusercontent.com/dsatizabal/aliyun-aks/main/manifests/device-plugin-ds.yaml
The order in which you run the above commands doesn't matter; the result will be the following:
Now we're ready to test the Scheduler Extender, expecting PODs that request aliyun.com/gpu-mem resources to be properly scheduled on the GPU node. Run the following commands:
k config set-context --current --namespace=default
k get po
k create -f https://raw.githubusercontent.com/dsatizabal/aliyun-aks/main/manifests/aliyun-test.yaml
k get po
Observe that we've switched namespaces to default and there are no PODs running, but after creating the test POD and waiting a few minutes we see the Aliyun POD running:
Moreover, if you get the YAML of the test POD you’ll see the annotations that the Aliyun scheduler extender adds to it:
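They look something like this (annotation names taken from the upstream gpushare-scheduler-extender project; the values are illustrative for a 15 GB GPU and a 2 GB request):
metadata:
  annotations:
    ALIYUN_COM_GPU_MEM_ASSIGNED: "true"                    # the assignment has been honored by the Device Plugin
    ALIYUN_COM_GPU_MEM_ASSUME_TIME: "1620000000000000000"  # when the extender made the decision
    ALIYUN_COM_GPU_MEM_DEV: "15"                           # total memory (GB) of the chosen GPU device
    ALIYUN_COM_GPU_MEM_IDX: "0"                            # index of the allocated GPU device
    ALIYUN_COM_GPU_MEM_POD: "2"                            # GPU memory (GB) requested by this POD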
Note also that, for this particular POD, we've set the schedulerName parameter to Aliyun, which is the name we gave our custom scheduler:
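A minimal sketch of what such a manifest looks like (the image and the request size are illustrative; the real one is aliyun-test.yaml in the repo):
apiVersion: v1
kind: Pod
metadata:
  name: aliyun-test
spec:
  schedulerName: Aliyun            # route this POD to our custom scheduler
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # illustrative GPU-capable image
    command: ["sleep", "infinity"]
    resources:
      limits:
        aliyun.com/gpu-mem: 2      # request 2 GB of GPU memory (integers only)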
Conclusions
Ok, it has been kind of a long read up to here. I wanted to show the whole process I went through to get this running: at the beginning I struggled to get the static POD running, but later I came across the easier way of running the scheduler as a Deployment, and from there I was able to get the Scheduler Extender working.
One final observation: in our example the AKS cluster has a single node with a GPU, but you may find use cases with several different nodepools where you need to precisely schedule PODs on the GPU-enabled nodes.
So, that's all folks! Any comments are welcome, and I hope you've found this article useful!