Current limitations running Azure AKS in production
Update November 5th, 2019:
Node Pools and Availability Zones are stable now. https://github.com/Azure/AKS/releases
Update May 6th, 2019: A couple of days ago, Azure released an update that unlocks Kubernetes 1.13.5 and brings Node Pools as a preview feature. Since this article was written (on version 1.12), we have had two incidents with persistent volumes, caused by errors between the Azure-internal Kubernetes controller and the Kubernetes API. Because of inconsistencies between the two APIs, we were unable to deploy or delete anything with a persistent volume attached. The only way to resolve these issues was to search through logs, manually drain some nodes, and then detach the volumes by hand. Azure support told me that the persistent volume claim bugs should be fixed in Kubernetes 1.13.5, so I will look into it in the upcoming weeks, once all our clusters are upgraded. For now, we won’t enable the Node Pool feature, because it would require us to enable options account-wide. Additionally, Azure now supports Network Policies as a preview feature, but this has to be enabled at cluster creation. The preview features can be found here.
A year ago, the company I freelance for decided to start its cloud journey in partnership with Azure. For our project, we settled on AKS as a managed Kubernetes service because we don’t want to spend a lot of time setting up and managing clusters.
Overall, we are still very happy with the service: running Kubernetes on our own would require more effort from colleagues to take care of scaling and updating, and that would slow down our development. But even though we are happy, we have run into some limitations, and that is what this article is all about.
1. Provisioning time
A cluster can be bootstrapped with Terraform, Azure Resource Manager templates, the Azure CLI, or the web portal. After you submit the command, you need to be patient, because it takes about 15 minutes on average until you can access your cluster. Steven Acreman from KubeDex has made a very detailed comparison of this issue (see here). This is especially annoying if you want to spin up a test cluster to evaluate something. Also, because new machines take a long time to start as well, autoscaling is less effective: you need to keep some buffer capacity to reduce the impact of slow-starting machines.
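To make the buffer argument concrete, here is a minimal sketch of how one might estimate the spare capacity needed. The function name and all numbers are hypothetical and not measurements from our clusters; it simply assumes that demand grows steadily while a new node is still starting.

```python
import math


def buffer_nodes(node_start_minutes: float, minutes_per_extra_node: float) -> int:
    """Estimate how many spare nodes cover demand growth while a new node starts.

    node_start_minutes: assumed time until a freshly requested node is usable.
    minutes_per_extra_node: assumed time in which load grows by one node's worth.
    """
    return math.ceil(node_start_minutes / minutes_per_extra_node)


# Hypothetical example: a node needs 10 minutes to start, and load grows by
# one node's worth every 5 minutes, so two spare nodes are needed.
print(buffer_nodes(10, 5))
```

The slower the machines start, the larger this buffer becomes, which is the cost of slow provisioning in an autoscaled cluster.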
2. Persistent Volume mount time
An issue we stumble over a lot is that the Azure persistent volume driver sometimes takes longer than 7 minutes to reattach a persistent volume from one node to another after a pod of a StatefulSet has been removed. This does not happen every time, but often enough to delay updates. Maybe the AKS 1.12.4 update will resolve this. After we have upgraded, I will update this section.
3. Missing Network security policies
Update: As it seems, Network Security Policies are now natively available in preview (see here). An important note: this seems to work only with newly created clusters.
Missing network security policies are a very big issue for us, because right now we have just a single cluster, and without them the only way to isolate network zones is Istio. But since Istio still has some issues with StatefulSets and Kubernetes health checks, embedding the whole cluster into the mesh is a very difficult approach. The problem here is that AKS has its own Azure CNI driver to connect the cluster with Azure networking. This driver doesn’t support Kubernetes Network Policies right now, so you can’t use them. I found one blog post about this (see here) that describes how azure-npm could be used to enable the feature on AKS. Azure-npm is a plugin from the former Azure Container Service (ACS) that runs as a privileged DaemonSet and has access to xtables.
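The DaemonSet manifest includes, among other entries, a toleration with the key CriticalAddonsOnly, a container named azure-npm, a HOSTNAME environment variable, and volumes named xtables-lock and log.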
I will first try this out in a test cluster to verify that it doesn’t interfere with our current setup. Also, Azure has already indicated that they are working on native support for Network Security Policies (see updates here).
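For readers who have not used Network Policies yet, the following is a minimal sketch of the kind of isolation they allow once a policy engine is active in the cluster. It builds a standard Kubernetes default-deny ingress policy as JSON, which kubectl accepts just like YAML. The function name, policy name, and the namespace "backend" are placeholders chosen for illustration, not something from our setup.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Build a NetworkPolicy that blocks all incoming traffic in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # An empty podSelector matches every pod in the namespace.
            "podSelector": {},
            # Listing "Ingress" without any ingress rules denies all ingress.
            "policyTypes": ["Ingress"],
        },
    }


# "backend" is a hypothetical namespace name.
print(json.dumps(default_deny_ingress("backend"), indent=2))
```

The printed JSON could be saved to a file and applied with kubectl apply -f. Without a driver that enforces Network Policies, the API server accepts such an object but it has no effect.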
4. Update cluster duration
Updating your cluster in AKS may take a while, because the nodes are updated one after another and, from what I have seen, each node takes about 10 minutes. So if you run a 40-node cluster, you will have to wait several hours for the update to go through. Also, on our first attempt to update to 1.12.4, it seemed like the API server crashed during the update, which led to an unresponsive cluster with no DNS. That was very annoying, because no internal service communication was possible anymore. (As far as I have seen, this should only occur when updating to 1.12.4, because that version switches from KubeDNS to CoreDNS.)
Support fixed the issue on the API server, and we ran the update again and got the same error a few nodes later. The reason we updated so early was the hope that it would fix the persistent volume issues described above. (See the Kubernetes 1.12.4 release notes.)
On Monday we will try the update again, and if the persistent volume issue is gone, I will update this article.
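The duration estimate from this section can be worked out in a few lines. This is only a back-of-the-envelope sketch: the 10 minutes per node is what I observed, not a documented guarantee, and the function name is made up for illustration.

```python
def upgrade_minutes(node_count: int, minutes_per_node: int = 10) -> int:
    """Total upgrade time when nodes are updated strictly one after another."""
    return node_count * minutes_per_node


total = upgrade_minutes(40)
hours, minutes = divmod(total, 60)
print(f"{total} minutes = {hours} h {minutes} min")  # 400 minutes = 6 h 40 min
```

Because the nodes are processed sequentially, the total grows linearly with the cluster size.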
5. Restriction to same node types (Just one pool)
Another very annoying issue is that AKS is currently bound to one node type. So if you select the recommended node type (D2s v3: 2 cores, 7 GB RAM), you have to stick with it. If you need larger nodes, the only option is to spin up a second cluster. You can resize the VMs in your AKS cluster, but after you update the cluster, the nodes will be reset. Azure is working on supporting multiple node pools, but for now, we have to stick with what we have. This is the most requested feature on their ideas board. (See here)
Overall, our team likes the AKS service; over the last year it was very stable and we had no major issues. But the further we go and the more we want to improve our cluster setup for production use, the more limitations we currently see. Azure seems to be working hard on these, but for now, we have to live with them or provision our own cluster. I don’t want to bash AKS with this article; I just want to show which limitations we struggle with, to help others decide whether to use the managed service or provision their own cluster. Do you run AKS in production? Are there any issues I forgot, or do you have a different opinion?