
Current limitations running Azure AKS in production

A year ago, the company I freelance for decided to start its cloud journey in partnership with Azure. For our project, we agreed on AKS as the managed Kubernetes service because we didn't want to spend a lot of time setting up and managing clusters.
Overall we are still very happy with the service: running Kubernetes on our own would require colleagues to take care of scaling and updates, and that would slow our development down. But even though we are happy, we have run into some limitations, and that is what this article is about.


1. Provisioning time


A cluster can be bootstrapped with Terraform, Azure Resource Manager templates, the Azure CLI or the web portal. After you submit the command you have to be patient, because provisioning takes on average about 15 minutes until you can access your cluster. Steven Acreman from KubeDex has made a very detailed comparison of provisioning times (see here). This is especially annoying if you just want to spin up a test cluster to evaluate something. And because new machines also take a long time to come up, autoscaling is less effective: you need some buffer capacity to absorb the impact of slowly starting nodes.
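For reference, a minimal throwaway cluster can be created with a few CLI commands. This is just a sketch; the resource group, cluster name, region and sizes are placeholders:

# Create a small test cluster (all names and sizes are placeholders)
az group create --name aks-test-rg --location westeurope
az aks create \
  --resource-group aks-test-rg \
  --name aks-test \
  --node-count 3 \
  --node-vm-size Standard_D2s_v3 \
  --generate-ssh-keys

# Fetch kubectl credentials once provisioning has finished
az aks get-credentials --resource-group aks-test-rg --name aks-test

Even for a cluster this small, the az aks create call is where you end up waiting the better part of those 15 minutes.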


2. Persistent Volume mount time

An issue we stumble over a lot is that the Azure persistent volume driver sometimes takes longer than 7 minutes to reattach a persistent volume from one node to another after a pod of a StatefulSet has been removed. This does not happen every time, but often enough to delay our updates. Maybe the AKS 1.12.4 update will resolve this; after we have updated, I will revise this section.
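When it happens, the delay is visible as volume attach and mount events on the replacement pod. A quick way to watch it (pod and namespace names are placeholders):

# Inspect the events of the replacement pod, e.g. FailedAttachVolume / FailedMount warnings
kubectl describe pod my-statefulset-0 --namespace my-namespace

# Or watch all recent events in the namespace, sorted by time
kubectl get events --namespace my-namespace --sort-by=.lastTimestamp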


3. Missing network policies


Update: It seems network policies are now available natively in preview (see here). An important note: this seems to work only with newly created clusters.

Missing network policies are a very big issue for us, because right now we have just a single cluster and without them the only way to isolate network zones is Istio. But since Istio still has some issues with StatefulSets and Kubernetes health checks, embedding the whole cluster into the mesh is a difficult approach. The underlying problem is that Azure AKS uses its own Azure CNI driver to connect the cluster with Azure networking, and this driver does not support Kubernetes network policies yet, so you simply cannot use them. I found one blog post about this (see here) describing how azure-npm could be used to enable the feature on Azure AKS. Azure-npm is a plugin from the former Azure Container Service (ACS) that runs as a privileged DaemonSet and has access to xtables. An excerpt of its pod spec looks like this:

tolerations:
- key: CriticalAddonsOnly
  operator: Exists
nodeSelector:
  beta.kubernetes.io/os: linux
containers:
- name: azure-npm
  image: containernetworking/azure-npm:v1.0.13
  securityContext:
    privileged: true
  env:
  - name: HOSTNAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  volumeMounts:
  - name: xtables-lock
    mountPath: /run/xtables.lock
  - name: log
    mountPath: /var/log
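The full manifest from the blog post also contains the DaemonSet metadata and the hostPath volumes backing the two mounts; the excerpt above only shows the interesting part of the pod spec. Once saved to a file it is applied like any other manifest (file and object names below are assumptions based on the post):

# Deploy the azure-npm DaemonSet (file name is an example)
kubectl apply -f azure-npm.yaml

# Check that one pod was scheduled per node
# (assumes the DaemonSet is named azure-npm and lives in kube-system)
kubectl get daemonset azure-npm --namespace kube-system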

I will first try this out in a test cluster to verify it doesn't interfere with our current setup. Also, Azure AKS has already indicated that they are working on native network policy support (see updates here).
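Whichever route ends up working, azure-npm or the native preview, the policies themselves are plain Kubernetes NetworkPolicy objects. As a sketch, a default deny ingress policy for a namespace (the namespace name is a placeholder) could be applied like this:

# Deny all ingress traffic to pods in my-namespace unless another policy allows it
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-namespace
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF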


4. Cluster update duration

Updating your cluster in AKS can take a while, because the nodes are updated one after another and from what I have seen the update takes about 10 minutes per node. So if you run a 40-node cluster, you will have to wait several hours for the update to go through. Also, on our first attempt at updating to 1.12.4, the API server seemed to crash during the update, which left us with an unresponsive cluster and no DNS, so no internal service communication was possible anymore. (This should only occur while updating to 1.12.4, because there is a switch from kube-dns to CoreDNS, as far as I have seen.)
Support fixed the issue on the API server, we ran the update again and got the same error a few nodes later. The reason we updated so early was the hope that it would fix the persistent volume issue described above. (See the Kubernetes 1.12.4 release notes.)
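For reference, the upgrade itself is triggered through the CLI and then rolls through the nodes one after another; resource group and cluster name are placeholders:

# List the Kubernetes versions the cluster can be upgraded to
az aks get-upgrades --resource-group aks-prod-rg --name aks-prod --output table

# Trigger the rolling upgrade (nodes are drained and updated one after another)
az aks upgrade --resource-group aks-prod-rg --name aks-prod --kubernetes-version 1.12.4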


On Monday we will try the update again, and if the persistent volume issue is gone, I will update this article.


5. Restriction to a single node type (just one pool)

Another very annoying issue is that AKS is currently bound to a single node type. So if you select the recommended node type (D2s v3, 2 cores, 7 GB RAM), you have to stick with it; if you need larger nodes, the only option is to spin up a second cluster. You can resize the VMs in your AKS cluster yourself, but after you update your cluster the nodes will be reset. Azure is also working on support for multiple node pools, but for now we have to stick with what we got. This is the most requested feature on their ideas board (see here).
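Within that single pool, what you can change through the service itself is the node count; resource group and cluster name below are placeholders:

# Scale the single node pool up or down (names are placeholders)
az aks scale --resource-group aks-prod-rg --name aks-prod --node-count 5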


Summary

Overall, our team likes the AKS service; over the last year it was very stable and we had no major issues. But the further we move on and the more we want to improve our cluster setup for production usage, the more limitations we currently see. Azure seems to be working heavily on these, but for now we have to live with them or provision our own cluster. I don't want to bash Azure AKS with this article; I just want to show which limitations we struggle with, to help others decide whether to take the managed service or provision their own cluster. Do you run Azure AKS in production? Are there any issues I forgot, or do you have a different opinion?