Creating a Linux and Windows Kubernetes Cluster
Author: W. Jenks Gibbons
For a foundational understanding of Windows containers and Kubernetes, see “An Introduction to Kubernetes on Windows”. To use Windows containers with Kubernetes, we need a multi-OS cluster. I will discuss how to create one in Azure Kubernetes Service (AKS). There are various ways to create an AKS cluster; the three I will discuss today are:
- The Azure portal: a web-portal for using Microsoft’s public cloud service
- The Azure CLI: the command line interface for Azure
- Terraform: infrastructure-as-code software
The Azure Portal
For users who are new to Azure, the portal may be the simplest way to start creating a Linux cluster, as it provides a graphical interface. The portal instructions, however, do not cover creating a Windows node pool as part of the cluster. Additionally, users will soon realize that the portal is not the most efficient way to create an AKS cluster. For that, we need to look at one of the following methods.
The Azure CLI
I have found the Azure CLI to be a good solution. Once the CLI is installed we are ready to go.
Service Principal
Before we get to the CLI commands to create a cluster, we will create an Azure service principal object. This is what we will use to log in to our Azure subscription so we can use the CLI; it requires no user input, which allows for automation. In the Azure portal, in “Azure Active Directory (AD)”, add a new service principal, noting the secret you create. To use it, go to your subscription -> Access Control (IAM) and assign the Contributor role to the service principal.
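If you prefer to stay on the command line, the service principal can also be created with the Azure CLI after an interactive az login. This is a minimal sketch; the name aks-automation is just an illustration. The command prints the appId, password and tenant values used for the login below.
#create a service principal and assign it the Contributor role on the subscription
az ad sp create-for-rbac --name aks-automation --role Contributor \
  --scopes /subscriptions/<subscription_id>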
All commands here were run in bash on a Mac.
1. Log in to Azure using the service principal.
- The values you need to log in below can be found in the Overview of the AD application you created.
#log in to Azure using a service principal
az login --service-principal -u <Application (client) ID> -p <secret_from_client_credentials> --tenant <Directory (tenant) ID>
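To confirm the login worked and that the expected subscription is active, an optional check:
#verify the active subscription for this session
az account show --output table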
2. Create a resource group in the region where the cluster will be located.
- Resource group: “a container that holds related resources for an Azure solution”
- Regions/Locations: list Azure regions
az account list-locations -o table
#create resource group
az group create --name <resource_group_name> --location <region>
3. Create the Linux side of the cluster. The control plane only runs on Linux, therefore this will include one control plane node and any Linux worker nodes. NOTE: this example provides no ssh/rdp access to the hosts. This is a basic example that does not include many possible cluster configuration options.
- Resource group: see above
- Nodes: Linux nodes (control plane and any worker nodes)
- Kubernetes version: list Kubernetes versions for an Azure region/location
az aks get-versions --location <region/location> --output table
#create cluster
az aks create --resource-group <resource_group> --name <cluster_name> --node-count <nodes> --kubernetes-version <k8_version> --network-plugin azure
At this point, we have a Linux cluster with the number of nodes requested. To take a look, we need credentials for kubectl. You can also take a look using “Cloud Shell” from the cluster in Kubernetes services in the portal.
- Resource group: see above
- Cluster name: the name of the cluster
- -f: write the kubeconfig to /tmp/config, overwriting an existing one (--overwrite-existing)
#get kubeconfig
az aks get-credentials --resource-group <resource_group> --name <cluster_name> -f /tmp/config --overwrite-existing
$ kubectl --kubeconfig /tmp/config get nodes \
> -o custom-columns="NODE NAME":.metadata.name,\
> "OPERATING SYSTEM":.status.nodeInfo.operatingSystem
NODE NAME                           OPERATING SYSTEM
aks-nodepool1-28168528-vmss000000   linux
Using the cluster in its current state, we can launch Linux pods. However, we still can’t launch Windows pods; for that, we need a Windows node pool.
- Resource group: see above
- Cluster name: name of the cluster
- Nodepool name: name of the Windows nodepool
- Node count: number of Windows worker nodes
#add a windows nodepool
az aks nodepool add --resource-group <resource_group> --cluster-name <cluster_name> --os-type Windows --name <nodepool_name> --node-count <nodes>
$ kubectl --kubeconfig /tmp/config get nodes \
> -o custom-columns="NODE NAME":.metadata.name,\
> "OPERATING SYSTEM":.status.nodeInfo.operatingSystem
NODE NAME                           OPERATING SYSTEM
aks-nodepool1-28168528-vmss000000   linux
akswnp000000                        windows
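To double-check the node pools from the Azure side as well, you can optionally list them with the CLI:
#list the node pools in the cluster
az aks nodepool list --resource-group <resource_group> --cluster-name <cluster_name> --output table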
When we no longer need the cluster, we can tear it down.
- Resource group: see above
#delete cluster
az group delete --name <resource_group> --yes --no-wait
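Because --no-wait returns immediately, the deletion continues in the background. If you want to confirm it finished, you can optionally poll the resource group:
#returns true while the resource group (and cluster) still exists, false once deleted
az group exists --name <resource_group>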
Using the above commands, we can create a multi-OS AKS cluster one line at a time. We can also put all the creation commands in a script:
#!/bin/bash
#export SERVICE_PRINCIPLE, CLIENT_SECRET and TENANT_ID before running
#so the environment variables below are available, e.g.
#export SERVICE_PRINCIPLE=<AZURE_SERVICE_PRINCIPLE>
#name of resource group
resource_group="aks"
#az aks get-versions --location <region/location> --output table
azure_region="westus3"
#cluster name
cluster_name="aks"
#only control plane
linux_node_count=1
#example version
kubernetes_version=1.23.12
#name of windows node pool
windows_nodepool_name="wnp"
#one windows worker node
windows_node_count=1
#log in to Azure using a service principal
az login --service-principal -u $SERVICE_PRINCIPLE -p $CLIENT_SECRET \
--tenant $TENANT_ID
#create resource group
az group create --name $resource_group --location $azure_region
#create cluster
az aks create --resource-group $resource_group --name $cluster_name \
--node-count $linux_node_count --kubernetes-version $kubernetes_version \
--network-plugin azure
#get kubeconfig and write to config in /tmp
az aks get-credentials --resource-group $resource_group --name $cluster_name \
-f /tmp/config --overwrite-existing
#add a windows nodepool
az aks nodepool add --resource-group $resource_group \
--cluster-name $cluster_name --os-type Windows --name $windows_nodepool_name \
--node-count $windows_node_count
$ kubectl --kubeconfig /tmp/config get nodes \
> -o custom-columns="NODE NAME":.metadata.name,\
> "OPERATING SYSTEM":.status.nodeInfo.operatingSystem
NODE NAME                           OPERATING SYSTEM
aks-nodepool1-28168528-vmss000000   linux
akswnp000000                        windows
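To run the script, export the three credential values it expects and execute it. The filename create-aks.sh is just an example:
#provide the credentials the script reads from the environment
export SERVICE_PRINCIPLE=<Application (client) ID>
export CLIENT_SECRET=<secret_from_client_credentials>
export TENANT_ID=<Directory (tenant) ID>
#make the script executable and run it
chmod +x create-aks.sh
./create-aks.sh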
Terraform
The last solution we will look at is Terraform, which, in my experience, is the best of the three. This is a basic example of creating a multi-OS cluster, based on Jay Dave’s repository located here. A basic understanding of Terraform is assumed, as I am simply going to call out several aspects of the repository rather than explain how it was written or put together. For more information on the providers used, see the documentation (e.g. the azurerm provider).
To begin, let’s take a look at the files in the repository:
- helm*: Helm values files for the Datadog Kubernetes agents. They are not required for creating the cluster; they are used for monitoring metrics, traces and logs and are out of scope for the current topic.
- main.tf: the main configuration for the module
- variables.tf: the variable definitions for the module
To start, we will examine the variables in variables.tf in more detail.
variable "location" {}
variable "cluster_name" {}
variable "dns_prefix" {}
variable "datadog_api_key" {}
variable "statsd_host_port" {}
variable "jmx_datadog_agent" {}
variable "enable_win_node_pool" {}
#https://portal.azure.com/#view/Microsoft_Azure_Billing/SubscriptionsBlade
variable "subscription_id" {
  description = "Enter Subscription ID for provisioning resources in Azure"
}
#ad enterprise app
variable "client_id" {
  description = "Enter Client ID for Application creation in Azure AD"
}
#ad enterprise app
variable "client_secret" {
  description = "Enter Client secret for Application in Azure AD"
}
#https://portal.azure.com/#view/Microsoft_AAD_IAM/ActiveDirectoryMenuBlade/~/Overview
variable "tenant_id" {
  description = "Enter Tenant ID / Directory ID of your Azure AD"
}
If the variables are not defined prior to running terraform apply, Terraform will ask for values on stdin. To avoid this, variables can be defined in the file terraform.tfvars.
- variable location: the Azure region/location as noted above
- variable cluster_name: cluster name as noted above
- variable dns_prefix: prefix for DNS
- variable datadog_api_key, variable statsd_host_port, variable jmx_datadog_agent: Datadog agent related variables, out-of-scope for this topic
- variable enable_win_node_pool: boolean option to create a Windows node pool
- variable subscription_id: the Azure subscription ID (https://portal.azure.com/#view/Microsoft_Azure_Billing/SubscriptionsBlade)
- variable client_id: the SERVICE_PRINCIPLE from above
- variable client_secret: the CLIENT_SECRET from above
- variable tenant_id: the TENANT_ID from above (https://portal.azure.com/#view/Microsoft_AAD_IAM/ActiveDirectoryMenuBlade/~/Overview)
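As an alternative to terraform.tfvars, Terraform also reads variables from environment variables prefixed with TF_VAR_, which can be convenient for keeping secrets out of files. A small sketch, reusing the credentials exported earlier:
#Terraform maps these to var.subscription_id, var.client_id, var.client_secret and var.tenant_id
export TF_VAR_subscription_id=<subscription_id>
export TF_VAR_client_id=$SERVICE_PRINCIPLE
export TF_VAR_client_secret=$CLIENT_SECRET
export TF_VAR_tenant_id=$TENANT_ID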
Next, we will examine the configuration for the main module in main.tf.
- resource "azurerm_resource_group": this should look familiar from above; it is used to create the resource group
- resource "azurerm_virtual_network" and resource "azurerm_subnet": create the network and subnet
- resource "azurerm_kubernetes_cluster": this is used to create the control plane. As noted in “An Introduction to Kubernetes on Windows”, the control plane only runs on Linux. The option network_plugin = "azure" is used, as noted here, because it is required for Windows Server-based nodes.
- resource "azurerm_kubernetes_cluster_node_pool" "linux_node_pool": creates a Linux node pool with node_count worker nodes.
- resource "azurerm_kubernetes_cluster_node_pool" "win_node_pool": creates a Windows node pool with node_count worker nodes.
- count = var.enable_win_node_pool ? 1 : 0: allows us to use the boolean variable discussed earlier to optionally create a Windows node pool. I made this conditional because I rarely use Windows containers, so I usually don’t need the additional resources the node pool provides.
Now that we understand the files that comprise the module (NOTE: if you use this repository, remove the helm* files if you don’t use Datadog or don’t want the agent installed), I will create a cluster with it.
As noted here, terraform init “initializes a working directory containing Terraform configuration files. This is the first command that should be run after writing a new Terraform configuration or cloning an existing one from version control. It is safe to run this command multiple times.”
$ terraform init
Initializing the backend...
Initializing provider plugins...
- Reusing previous version of hashicorp/helm from the dependency lock file
- Reusing previous version of hashicorp/azurerm from the dependency lock file
- Using previously-installed hashicorp/helm v2.7.1
- Using previously-installed hashicorp/azurerm v3.33.0
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
As noted here, terraform plan “creates an execution plan, which lets you preview the changes that Terraform plans to make to your infrastructure.” As noted above, if variables are not defined in a terraform.tfvars file, you will need to provide the values on stdin.
$ terraform plan
var.client_id
Enter Client ID for Application creation in Azure AD
Enter a value:
Your terraform.tfvars will look similar to the one below, with the relevant values filled in.
subscription_id =
client_id =
client_secret =
tenant_id =
location = "West US 3"
cluster_name = "gfse-cluster"
dns_prefix = "vgfse"
datadog_api_key = ""
statsd_host_port = ""
jmx_datadog_agent = ""
enable_win_node_pool = false
Because I will not install the Datadog agents, the helm* files have been removed from my directory, and datadog_api_key, statsd_host_port and jmx_datadog_agent are empty as they are not needed. Since enable_win_node_pool = false, the initial run will not create a Windows node pool.
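Before planning, you can optionally check that the configuration is consistently formatted and syntactically valid (validate requires terraform init to have run first):
#format the configuration files and check them for errors
terraform fmt
terraform validate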
As noted here, terraform apply “executes the actions proposed in a Terraform plan.” The -auto-approve flag instructs “Terraform to apply the plan without asking for confirmation.” The result will be a lot of output to stdout, such as the plan, what the plan will do, and the status.
$ terraform apply -auto-approve
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
+ create
<= read (data resources)
Terraform will perform the following actions:
# data.azurerm_kubernetes_cluster.credentials will be read during apply
# (depends on a resource or a module with changes pending)
<= data "azurerm_kubernetes_cluster" "credentials" {
+ aci_connector_linux = (known after apply)
...
Plan: 5 to add, 0 to change, 0 to destroy.
azurerm_resource_group.rg: Creating...
azurerm_resource_group.rg: Creation complete after 1s [id=/subscriptions/<id>/resourceGroups/gfse-resources]
azurerm_virtual_network.network: Creating...
azurerm_kubernetes_cluster.control_plane: Creating...
azurerm_virtual_network.network: Creation complete after 4s [id=/subscriptions/<id>/resourceGroups/gfse-resources/providers/Microsoft.Network/virtualNetworks/gfse-network]
azurerm_subnet.subnet1: Creating...
azurerm_subnet.subnet1: Creation complete after 4s [id=/subscriptions/<id>/resourceGroups/gfse-resources/providers/Microsoft.Network/virtualNetworks/gfse-network/subnets/gfse-subnet]
azurerm_kubernetes_cluster.control_plane: Still creating... [10s elapsed]
azurerm_kubernetes_cluster.control_plane: Still creating... [20s elapsed]
...
azurerm_kubernetes_cluster_node_pool.linux_node_pool: Still creating... [3m30s elapsed]
azurerm_kubernetes_cluster_node_pool.linux_node_pool: Creation complete after 3m36s [id=/subscriptions/<id>/resourceGroups/gfse-resources/providers/Microsoft.ContainerService/managedClusters/gfse-cluster/agentPools/linux]
Apply complete! Resources: 5 added, 0 changed, 0 destroyed.
Based on the configuration we reviewed earlier, we should have a cluster with a control plane, one Linux worker node and no pods in the default namespace. To view this we need the kubeconfig.
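How the kubeconfig lands on disk depends on the module (the plan output above shows a data source reading the cluster credentials). If your copy of the repository doesn’t write one out, you can fall back to the Azure CLI, using the resource group and cluster names from the apply output and terraform.tfvars:
#fetch the kubeconfig for the Terraform-created cluster
az aks get-credentials --resource-group gfse-resources --name gfse-cluster \
  -f ./config --overwrite-existing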
$ kubectl --kubeconfig ./config get nodes \
> -o custom-columns="NODE NAME":.metadata.name,\
> "OPERATING SYSTEM":.status.nodeInfo.operatingSystem
NODE NAME                         OPERATING SYSTEM
aks-default-15032064-vmss000000   linux
aks-linux-42883043-vmss000000     linux
$ kubectl --kubeconfig ./config get pods
No resources found in default namespace.
We can run this again to increase the number of Linux worker nodes and add a Windows node pool by setting node_count = 2 in resource "azurerm_kubernetes_cluster_node_pool" "linux_node_pool" and changing enable_win_node_pool = false to enable_win_node_pool = true.
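As an aside, because command-line variables take precedence over terraform.tfvars, the Windows node pool could also be enabled for a single run without editing the file (the node_count change in main.tf still has to be made by hand):
#override enable_win_node_pool from terraform.tfvars for this run only
terraform apply -auto-approve -var="enable_win_node_pool=true"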
The output will show the changes:
terraform apply -auto-approve
...
Plan: 1 to add, 1 to change, 0 to destroy.
...
Apply complete! Resources: 1 added, 1 changed, 0 destroyed.
Let’s look at the updates:
$ kubectl --kubeconfig ./config get nodes \
> -o custom-columns="NODE NAME":.metadata.name,\
> "OPERATING SYSTEM":.status.nodeInfo.operatingSystem
NODE NAME                         OPERATING SYSTEM
aks-default-15032064-vmss000000   linux
aks-linux-42883043-vmss000000     linux
aks-linux-42883043-vmss000001     linux
akswin000000                      windows
As you can see, we now have an additional Linux worker node and one Windows worker node.
To shut the cluster down, pass the -destroy flag to apply:
$ terraform apply -auto-approve -destroy
...
Apply complete! Resources: 0 added, 0 changed, 6 destroyed.
At this point, we have reviewed several different ways to create a multi-OS AKS cluster.
If you have questions about how to schedule Windows pods see “An Introduction to Kubernetes on Windows”.
Have fun!
Thanks to Mike Harner for reviewing.