AKS vs reality

Tomáš Kukrál
Jan 4, 2019 · 6 min read


I’ve been using AKS (managed Kubernetes from Azure) for the last few months to run some personal projects and a few production workloads. I’ll share my operational experience in this article.

The most important advantage of AKS should be easy Kubernetes provisioning and lifecycle management (LCM). The basic assumption of managed Kubernetes is that the cluster is operated and managed by the cloud provider and you just consume the API. It means you aren’t responsible for monitoring Kubernetes nodes, updates and day-to-day operations, because Azure should handle all of that. It’s managed and you are paying for it, so forget about SSHing to nodes and fixing things yourself. If you break anything (like overloading a node with pods that have no memory limits defined), it should be clearly reported back to you.

Ok, so you have deployed a new Kubernetes cluster using Terraform or the web portal, and you are looking for a link to download the kubeconfig file. Save your time, because there is no such link. I was expecting an experience similar to GKE: just click a download kubeconfig button and use the standard kubectl command. Azure requires the az command installed on your machine to generate a kubeconfig and connect to your cluster. It is surprising to me that the CLI, API and web portal provide different sets of features.
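
For reference, generating the kubeconfig goes through the az CLI; a minimal sketch, with placeholder resource group and cluster names:

# log in and merge credentials for the cluster into ~/.kube/config
az login
az aks get-credentials --resource-group my-resource-group --name my-aks-cluster

# only then does the standard tooling work
kubectl get nodes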

An essential part of running Kubernetes clusters is scaling, so you are able to change the cluster size according to your requirements. Some managed Kubernetes offerings can autoscale according to allocated and requested pod resources, but this isn’t supported in AKS. I hope it will be possible in the future. However, missing autoscaling isn’t a huge problem for me because semi-manual scaling is possible by changing the cluster size.
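
For the record, the semi-manual scaling is a single CLI call; a sketch with placeholder names:

# resize the agent pool to 5 nodes
az aks scale --resource-group my-resource-group --name my-aks-cluster --node-count 5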

A much more serious issue is how scaling works, because you can only change the cluster size, which is more than suboptimal from my point of view. Let’s imagine this scenario: you are using an AKS cluster with F2 instances (4 GB RAM) to run CI jobs, and a regular CI job allocates less than 2 GB RAM (sum of pod resources). It works fine until some job becomes more complicated and requires a pod with more than 4 GB.
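
To make the scenario concrete, here is a hypothetical CI pod requesting 5 GB; on a cluster of F2 nodes the scheduler will keep it Pending forever, no matter how many nodes you add:

# hypothetical CI job pod; the 5Gi request can never fit on a 4GB F2 node
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: big-ci-job
spec:
  containers:
  - name: builder
    image: alpine:3.8
    command: ["sleep", "3600"]
    resources:
      requests:
        memory: "5Gi"
EOF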

Scaling AKS cluster

There is no way to add a different flavor to the cluster; you are locked to the instance flavor that was selected during cluster creation. GKE solves this problem with node pools, so any flavors can be mixed within a cluster. The only way to add bigger instances to an AKS cluster is to delete the current cluster and redeploy it with bigger instances. Not very cloud native, is it?
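
For comparison, a sketch of the GKE approach (cluster and pool names are placeholders):

# add a pool of bigger machines to an existing GKE cluster;
# nothing comparable exists in AKS
gcloud container node-pools create big-pool --cluster my-gke-cluster --machine-type n1-standard-8 --num-nodes 1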

Another problem is replacing a failed node. A node can fail due to your or Azure’s mistake, and I was expecting an easy way to destroy it and start a new one. I don’t care about local data since all our critical data must be external or stored on PVs, so replacing the node quickly while fixing the root cause of the problem is the best way to go. We have found a dirty workaround for replacing a failed node: scale down to 1 node (which deletes all nodes except the first one) and then scale back to your previous cluster size (see the sketch after the list below). However, there are two very unpleasant issues:

  1. Your cluster is underscaled during this operation, which will probably leave it overloaded until you scale back up to the previous size.
  2. The first node can’t be replaced because scaling to zero nodes isn’t supported, so you need to disable scheduling on the first node to work around it.
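
The whole workaround looks roughly like this; a sketch assuming a three-node cluster and the usual aks-agentpool node naming:

# if the broken node is the first one it can't be replaced this way,
# so just disable scheduling on it instead
kubectl cordon aks-agentpool-12345678-0

# otherwise: scale down to a single node, which deletes every node except the first one...
az aks scale --resource-group my-resource-group --name my-aks-cluster --node-count 1

# ...and scale back up to the previous size to get fresh replacements
az aks scale --resource-group my-resource-group --name my-aks-cluster --node-count 3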

Let’s take a deeper look at how AKS nodes are configured. First we need to SSH to the node. I was hoping to avoid this (see the first paragraph), but I’m not aware of any other method since the Azure portal is very limited. There is a document describing the procedure. It basically starts a pod in the cluster, installs openssh-client into the pod, generates SSH keys, puts the private key into the pod and connects to the node. It’s very complicated and uncomfortable compared to the simple SSH button in the GKE console.
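
From memory, the documented procedure boils down to roughly the following; treat it as a sketch and check the current document before copying anything:

# start a throwaway pod and install an SSH client inside it
kubectl run -it aks-ssh --image=debian
apt-get update && apt-get install -y openssh-client   # inside the pod

# from another terminal: look up the pod name and copy your private key into it
kubectl get pods | grep aks-ssh
kubectl cp ~/.ssh/id_rsa <aks-ssh-pod-name>:/id_rsa

# back inside the pod: fix permissions and SSH to the node's internal IP
chmod 0600 /id_rsa
ssh -i /id_rsa azureuser@10.240.0.4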

A second way to reach a faulty Kubernetes node is to assign a public IP address to it and adjust the security group to allow SSH connections to this particular node.
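
I haven’t scripted this, but it means touching the automatically created MC_* resource group; a rough sketch with placeholder names (look up the actual NIC, ipconfig and NSG names first, e.g. with az network nic list):

# the node resources live in the MC_* resource group
RG=MC_my-resource-group_my-aks-cluster_westeurope

# create a public IP and attach it to the failed node's NIC
az network public-ip create -g $RG -n node-debug-ip
az network nic ip-config update -g $RG --nic-name aks-agentpool-12345678-nic-2 -n ipconfig1 --public-ip-address node-debug-ip

# allow SSH in the network security group, ideally only from your own IP
az network nsg rule create -g $RG --nsg-name aks-agentpool-12345678-nsg -n allow-ssh --priority 1000 --protocol Tcp --destination-port-ranges 22 --access Allow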

Docker is using the default /var/lib/docker graph directory, so be aware that images can fill up the system disk. A fresh node has about 7 GB of the root filesystem used. I’d like to have the option to dedicate a separate disk to Docker, but it isn’t possible. I’m expecting Azure to monitor filesystem usage and use proper settings for kubelet garbage collection to avoid the filesystem filling up (a sketch of such settings follows the df output below).

Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           395M  1.7M  393M   1% /run
/dev/sda1        30G  7.2G   22G  25% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/sdb1        32G   48M   30G   1% /mnt
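
What I’d expect here are the standard kubelet image garbage collection settings; a sketch of the relevant flags (the values are my guesses, not what AKS actually configures):

# image cleanup starts above the high threshold and stops below the low one
--image-gc-high-threshold=85
--image-gc-low-threshold=80

# evict pods before the image filesystem runs completely dry
--eviction-hard=imagefs.available<10%,nodefs.available<10%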

Docker logging configuration seems to be fine already:

{
  "live-restore": true,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}

There is a script /usr/local/bin/health-monitor.sh used for checking node health. The script used to start a container every few seconds to check Docker, but it seems to have been changed to a different monitoring method. We have seen the Docker daemon being restarted by this script in an attempt to heal the node, which made the situation even worse. However, the script seems to have been adjusted in the latest AKS and it works fine for now.

It’d be nice if AKS nodes could run ntp, because our users reported clock skew problems. There is no ntp daemon running.
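
A quick way to confirm that on a node, assuming the usual Ubuntu tooling is present:

# shows whether any time synchronization is active
timedatectl status

# check the usual suspects; none of them were running on our nodes
systemctl status ntp chrony systemd-timesyncd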

You can decide to run AKS with Advanced networking, which provides more configuration options than Basic networking. However, only 30 pods can be started per node due to the CNI design. This isn’t a problem in itself because it’s stated in the documentation, and you should plan for it and choose the right VM sizing. In our case we had to deploy a larger number of smaller instances rather than a few big ones in order not to hit the limit of pod IP addresses.
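
If I remember correctly, the per-node limit can at least be raised at cluster creation time; a hedged sketch:

# --max-pods raises the Advanced networking default of 30 pods per node;
# it has to be chosen when the cluster is created
az aks create --resource-group my-resource-group --name my-aks-cluster --network-plugin azure --max-pods 60 --node-count 3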

We have been using an AKS cluster to run the Gitlab executor, and this kind of workload was a huge challenge for the cluster. The executor was starting many pods with more than five containers inside. We have experienced the Kubernetes control plane being unavailable, nodes broken by a filled-up filesystem, and broken connections between the control plane and the nodes. The error message below was observed countless times.

ERROR: Job failed (system failure): error dialing backend: dial tcp: lookup aks-agentpool-47932935-2 on 172.30.0.10:53: server misbehaving

AKS is using an SSH tunnel between the nodes and the control plane. This tunnel is needed for all connections originating from the control plane and targeting the nodes. It’s necessary, for example, for metric scraping, kubectl logs and kubectl exec, so the SSH tunnel is a pretty critical part of the cluster. When connectivity broke, it was sometimes necessary to restart the tunnel manually (as sketched below the pod listing).

NAME                          READY     STATUS    RESTARTS   AGE
tunnelfront-994bb8445-klhb6   1/1       Running   0          8h
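
Restarting it manually just means deleting the tunnelfront pod and letting its deployment in kube-system recreate it:

# pod name taken from the listing above; the deployment recreates it
# and the tunnel comes back
kubectl -n kube-system delete pod tunnelfront-994bb8445-klhb6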

We started some clusters pre-GA and one of them got stuck while doing a Kubernetes upgrade. We worked with support to fix this cluster and find the root cause of the problem, and they seem to have had some problem with the Kubernetes control plane update (they are managing it using Helm). This is what they were able to get:

helm history 5ae1bff3edd7aa0001e96632
REVISION  UPDATED                   STATUS      CHART                     DESCRIPTION
1         Thu Apr 26 12:03:52 2018  SUPERSEDED  kube-control-plane-0.2.0  Install complete
2         Thu Apr 26 12:10:45 2018  FAILED      kube-control-plane-0.2.0  Upgrade “5ae1bff3edd7aa0001e96632” failed: timed out wait…

Then the backend team tried to roll back to a different release, failed with a release not found error, and recommended recreating the cluster.

This certainly isn’t what you want to do with a production cluster, so I hope they fixed all these critical issues before going GA.

There is no changelog for AKS, so the only way to follow changes in the environment is your own investigation. This isn’t very convenient, because a cluster deployed tomorrow can have different settings than a cluster deployed today. I don’t need a git-like history for AKS deployments, but it’s critical that all changes affecting workloads are properly planned and announced beforehand.

We have seen incompatible changes in the AKS API caused by adding required parameters, which can break your deployment pipeline completely.

This is my personal experience from running AKS clusters for the last 9 months. AKS seems to be a pretty nice service, but it has very limited scaling capabilities and an unclear service roadmap. We switched the most demanding workload (Gitlab runner) to GKE in August, so I don’t have hands-on experience with the latest updates. I’m sure the AKS team is working hard, but they have a long way to go to catch up with GKE.

Please let me know if you are aware of any better solutions to the problems described above. I’m looking forward to seeing them solved.
