PSA: Update your Azure Kubernetes Service (AKS) service principal’s password
A tale of vague documentation and service outages
The situation
I had just stepped out for dinner when my phone started buzzing like mad:
Incident Opened, Error percentage (More than none), Error percentage > 0% for at least 5 minutes
Followed by:
[Outage Monitor] www.website.com
My fiance sighed and I turned the car around.
The false promise of fewer outages
I’m not a fan of unpleasant surprises, especially when it comes to cloud services. The whole premise of moving to cloud services is the off-loading of infrastructure work from the in-house enterprise to the “professionals” at the Infrastructure-as-a-Service (IaaS) cloud provider so that you can reduce cost, save time, and lower risk. One of the many ways that cloud services help to lower risk is the removal of many of the “single points of failure” that exist in home-grown infrastructure. IaaS Providers do this in a variety of well-engineered ways:
- Virtual Machines (ephemerality)
- In-region zone redundancies
- Multi-regional deployments
But, what happens when the IaaS Provider creates its own intentional single points of failure?
The IaaS provider under discussion here is Microsoft's Azure, specifically Azure Kubernetes Service (AKS). Built on top of the open-source Kubernetes project (originally designed at Google, now maintained by the CNCF), AKS has steadily grown in popularity since its General Availability in June of 2018. While AKS has gone through its fair share of outages, issues, and growing pains, the service has been generally good. Unfortunately, this is not a tale of "good" service.
The Symptoms
Reviewing the error logs associated with the previously received alerts indicated connectivity issues not only inside the AKS cluster but also externally, between the cluster and the rest of the world. Some alerts involved failures to connect to an Azure Database for MySQL instance; others referred to connectivity problems between core application pods in the cluster. The first thought that occurred was: "There may be another Azure DNS outage." After checking the Azure status page to see if the issue was cluster-side or Azure-side, it was determined that the issue must exist in the cluster; there were no Azure status alerts. Additionally, two things were noted in the cluster:
- A larger than anticipated number of a certain type of job was stuck in an `ImagePullBackOff` state.
- A larger than anticipated number of nodes existed.
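A quick way to quantify the first symptom is to filter the pod list by status. This is a sketch, assuming the stuck pods are spread across multiple namespaces:

```shell
# Count pods stuck in image-pull failure states across all namespaces.
# With --all-namespaces, STATUS is the fourth column of the output.
kubectl get pods --all-namespaces --no-headers \
  | awk '$4 == "ImagePullBackOff" || $4 == "ErrImagePull"' \
  | wc -l
```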
The diagnostic procedure
A `kubectl get pods` returned a long list of job pods stuck in `ImagePullBackOff`.
And upon describing one of these pods, a message like the below was returned:

```
Failed to pull image "myacr.azurecr.io/my-image:mytag": rpc error: code = Unknown desc = Error response from daemon: Get https://myacr.azurecr.io/v2/my-image/manifests/my-tag: unauthorized: authentication required
```
Additionally, a `kubectl get nodes` listed far more nodes than expected.
Some important background:
- This cluster normally has exactly 8 nodes running
- A cluster-autoscaler is configured to scale up new nodes when there are jobs that need hardware to run on.
- The cluster-autoscaler has a limit of 14 nodes that it can scale up to; once it reaches this limit, it stops scaling up new nodes.
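For context, these bounds are part of the cluster-autoscaler configuration on the cluster itself. A sketch of how such limits are set, using the same resource-group and cluster-name variables that appear in the reset command later on:

```shell
# Adjust the cluster-autoscaler's bounds on an AKS cluster that already
# has the autoscaler enabled (values mirror the cluster in this story).
az aks update \
  --resource-group $AKS_RESOURCE_GROUP \
  --name $AKS_CLUSTER_NAME \
  --update-cluster-autoscaler \
  --min-count 8 \
  --max-count 14
```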
There are a few things to notice:
- There’s a sudden problem with pulling images.
- There’s a larger number of nodes listed than should be allowed.
- Some of the nodes listed in the `kubectl get nodes` output don't exist in Azure Resource Manager.
The first attempted fix
The first attempted fix was to walk back through the AKS and ACR integration tutorial. Potentially there was a permissions API bug or, perhaps more likely, someone had deleted a permission in Azure while reviewing user access. After spending some time looking into the service principal's permissions, they seemed properly configured: the service principal in question had both "Reader" and "AcrPull" on the necessary ACR.
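That permission check can be scripted rather than clicked through in the portal. A sketch, assuming a registry named `myacr` and the service principal's ID in `$AKS_SP_ID`:

```shell
# List the roles the cluster's service principal holds on the registry;
# "AcrPull" should appear for image pulls to work.
ACR_ID=$(az acr show --name myacr --query id --output tsv)
az role assignment list \
  --assignee $AKS_SP_ID \
  --scope $ACR_ID \
  --query "[].roleDefinitionName" \
  --output tsv
```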
The second attempted fix
Next, to reduce the number of nodes being spun up, old jobs were manually deleted: `kubectl delete job job-name`. As they were deleted, behind the scenes, the cluster-autoscaler started tearing down nodes as the pressure on the cluster was alleviated. Surprisingly, as both jobs and nodes were removed, normal functionality was restored to the cluster and the network issues vanished.
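Deleting jobs one at a time gets tedious. A sketch of a bulk cleanup of successfully completed jobs, assuming they live in the current namespace:

```shell
# Find every Job whose status shows one successful completion, then
# delete them so the cluster-autoscaler can release their nodes.
kubectl get jobs \
  -o jsonpath='{range .items[?(@.status.succeeded==1)]}{.metadata.name}{"\n"}{end}' \
  | xargs -r kubectl delete job
```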
The solution
It is important to note that the node count in Azure dropped, but the node count in `kubectl get nodes` didn't. This, along with the failure to pull images, was a strong indicator that the cluster was somehow failing to communicate with Azure.
This led to the realization that something was up with the cluster's service principal. It wasn't permissions, but something else. After reading through the docs, I happened to come across this short passage:
By default, AKS clusters are created with a service principal that has a one-year expiration time. As you near the expiration date, you can reset the credentials to extend the service principal for an additional period of time. You may also want to update, or rotate, the credentials as part of a defined security policy. This article details how to update these credentials for an AKS cluster. — https://docs.microsoft.com/en-us/azure/aks/update-credentials
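Checking for this condition, and fixing it, takes two commands. A sketch, assuming the service principal's application ID is in `$AKS_SP_ID`:

```shell
# Show when each of the service principal's credentials expires.
az ad sp credential list --id $AKS_SP_ID --query "[].endDate" --output tsv

# Issue a fresh password and capture it for the update-credentials step.
AKS_SP_SECRET=$(az ad sp credential reset \
  --name $AKS_SP_ID \
  --query password \
  --output tsv)
```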
At this point, confidence that this was the problem was near 99.9%. Following along with the docs, the credentials for the service principal were reset. Next, an update of the cluster was attempted via:
```shell
az aks update-credentials \
  --resource-group $AKS_RESOURCE_GROUP \
  --name $AKS_CLUSTER_NAME \
  --reset-service-principal \
  --service-principal $AKS_SP_ID \
  --client-secret $AKS_SP_SECRET
```
```
Deployment failed. Correlation ID: 4a2baea3-c6c6-4f89-8abf-eff8900455ce. Internal server error
```
To add insult to injury, this failed after a grueling 15 minutes. Luckily, it seemed to be only a partial failure: the credentials were actually reset, images started pulling, and the node lists in Azure Resource Manager and `kubectl get nodes` fell back into sync.
The Summary
To summarize, this outage occurred as a result of the expiration of a service principal credential that is haphazardly mentioned exactly once, at the end of a paragraph, in the starter documentation for AKS.
When you create a brand new AKS cluster, it is assigned a new service principal (a special user) that is used to interact with all of the other Azure services. This service principal does a variety of things, but one of its most common uses is to pull container images from AKS's complementary service, Azure Container Registry (ACR). When you first get started with AKS and ACR, you are directed to this introductory article on how to pair the two. It is at this point that the outage time-bomb is unwittingly created: the service principal comes with a credential expiration date of exactly one year from creation.
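For what it's worth, the pairing itself can now be done in one command. A sketch, assuming a registry named `myacr` and the same resource-group and cluster variables as in the reset command above:

```shell
# Attach the registry to the cluster; the CLI creates the AcrPull role
# assignment for the cluster's service principal behind the scenes.
az aks update \
  --resource-group $AKS_RESOURCE_GROUP \
  --name $AKS_CLUSTER_NAME \
  --attach-acr myacr
```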
The Solution
Documentation is absolutely the problem as well as the solution. This is a UX problem. Users are lazy; they don't read things. Technical users reading technical documentation are especially lazy. This is not something worthy of a single sentence at the end of a paragraph. This needs a big warning sign that says:
Hey you! Your token expires in a year. If you don’t think about this, your infrastructure will break and you will ruin one of your days in the future.
Additionally, as a nice-to-have, automated emails indicating the proximity to expiration would be useful as well.
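Until such alerts exist, a scheduled check is easy to build yourself. A sketch, assuming GNU `date` and an `endDate` value like the one reported by `az ad sp credential list` (the date below is hypothetical):

```shell
# Whole days between two ISO dates (second argument plays "today").
days_until() {
  local end_s now_s
  end_s=$(date -u -d "$1" +%s)
  now_s=$(date -u -d "$2" +%s)
  echo $(( (end_s - now_s) / 86400 ))
}

# Hypothetical endDate pulled from the service principal's credentials.
remaining=$(days_until "2020-06-01" "2020-05-16")
if [ "$remaining" -lt 30 ]; then
  echo "Service principal expires in $remaining days - rotate it now"
fi
```

In practice the second argument would be `$(date -u +%F)` and the warning would go to email or a pager rather than stdout.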