PSA: Update your Azure Kubernetes Service (AKS) service principal’s password

A tale of vague documentation and service outages

Damien Retzinger
Graycore Engineering
6 min read · Aug 12, 2019


The situation

I had just stepped out for dinner when my phone started buzzing like mad:

Incident Opened, Error percentage (More than none), Error percentage > 0% for at least 5 minutes

Followed by:

[Outage Monitor] www.website.com

My fiance sighed and I turned the car around.

The false promise of fewer outages

I’m not a fan of unpleasant surprises, especially when it comes to cloud services. The whole premise of moving to cloud services is the off-loading of infrastructure work from the in-house enterprise to the “professionals” at the Infrastructure-as-a-Service (IaaS) cloud provider so that you can reduce cost, save time, and lower risk. One of the many ways that cloud services help to lower risk is the removal of many of the “single points of failure” that exist in home-grown infrastructure. IaaS Providers do this in a variety of well-engineered ways:

  1. Virtual Machines (ephemerality)
  2. In-region zone redundancies
  3. Multi-regional deployments

But, what happens when the IaaS Provider creates its own intentional single points of failure?

The IaaS provider under discussion here is Microsoft’s Azure, specifically Azure Kubernetes Service (AKS). Built on top of the open-source Kubernetes project (originally developed at Google), AKS has steadily grown in popularity since its General Availability in June of 2018. While AKS has gone through its fair share of outages, issues, and growing pains, the service has been generally good. Unfortunately, this is not a tale of “good” service.

The Symptoms

Reviewing the error logs associated with the alerts indicated connectivity issues not only inside the AKS cluster but also between the cluster and the rest of the world. Some alerts involved failures to connect to an Azure Database for MySQL instance; others referred to connectivity problems between core application pods in the cluster. The first thought was: “There may be another Azure DNS outage.” The Azure status page showed no alerts, which suggested the issue was cluster-side rather than Azure-side. Additionally, two things stood out in the cluster:

  1. A larger than anticipated number of a certain type of job was stuck in an ImagePullBackOff state.
  2. A larger than anticipated number of nodes existed.

The diagnostic procedure

A kubectl get pods returned something like this:

11 pods in the `ImagePullBackOff` state

Describing one of these pods returned a message like the one below:

Failed to pull image “myacr.azurecr.io/my-image:mytag”: rpc error: code = Unknown desc = Error response from daemon: Get https://myacr.azurecr.io/v2/my-image/manifests/my-tag: unauthorized: authentication required

Additionally, a kubectl get nodes looked like this:

17 Kubernetes nodes, some in the `Ready` and some in the `NotReady` state, in a cluster with an enforced maximum of 14 nodes
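For reference, the checks up to this point boil down to a few standard kubectl commands (a sketch; the pod name below is a placeholder):

# List pods; the failing ones show up with a STATUS of ImagePullBackOff
kubectl get pods

# Inspect the events on one of the failing pods (pod name is illustrative)
kubectl describe pod my-job-pod-abc123

# List the nodes and compare the count against what the autoscaler should allow
kubectl get nodes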

Some important background:

  1. This cluster normally has exactly 8 nodes running.
  2. A cluster-autoscaler is configured to scale up new nodes when there are jobs that need hardware to run on.
  3. The cluster-autoscaler has a limit of 14 nodes that it can scale up to; once it reaches this limit, it stops scaling up nodes.

There are a few things to notice:

  1. There’s a sudden problem with pulling images.
  2. There’s a larger number of nodes listed than should be allowed.
  3. Some of the nodes listed in the kubectl get nodes output don’t exist in Azure Resource Manager (one way to spot this mismatch is sketched below).
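One way to see that mismatch, assuming a VM-backed node pool and the usual MC_* naming of the node resource group (the names below are placeholders):

# Nodes as the cluster sees them
kubectl get nodes

# VMs actually backing the node pool, as Azure sees them
az vm list --resource-group MC_myResourceGroup_myAKSCluster_eastus -o table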

The first attempted fix

The first attempted fix was to walk back through the AKS and ACR integration tutorial. Potentially, there was a permission API bug or, perhaps more likely, someone had deleted the permission in Azure while reviewing user permissions. After some time spent looking into the service principal’s permissions, though, everything seemed properly configured: the service principal in question had both “Reader” and “AcrPull” on the necessary ACR.
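Verifying that amounts to something like the following (a sketch; the resource group, cluster, and registry names are placeholders):

# The client ID of the service principal the cluster runs as
SP_ID=$(az aks show --resource-group myResourceGroup --name myAKSCluster --query servicePrincipalProfile.clientId -o tsv)

# The resource ID of the container registry
ACR_ID=$(az acr show --name myacr --query id -o tsv)

# The role assignments the service principal holds on the registry
az role assignment list --assignee $SP_ID --scope $ACR_ID -o table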

The second attempted fix

Next, to reduce the number of nodes being spun up, old jobs were manually deleted with kubectl delete job job-name. As they were deleted, the cluster-autoscaler started tearing down nodes behind the scenes as pressure on the cluster was relieved. Surprisingly, as both jobs and nodes were removed, normal functionality returned to the cluster and the network issues vanished.
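In practice, that cleanup looked roughly like this (the job name is illustrative):

# Find the jobs that are piling up
kubectl get jobs

# Delete a stuck job by name; as its pods disappear, the cluster-autoscaler
# scales the now-idle nodes back down on its own
kubectl delete job my-stuck-job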

The solution

It is important to note that the node count in Azure dropped, but the node count in the kubectl get nodes output didn’t.

This, together with the failure to pull images, was a strong indicator that the cluster was somehow failing to communicate with Azure.

This led to the realization that something was up with the cluster’s service principal. It wasn’t permissions, but something else. After reading through the docs, I happened to come across this short passage:

By default, AKS clusters are created with a service principal that has a one-year expiration time. As you near the expiration date, you can reset the credentials to extend the service principal for an additional period of time. You may also want to update, or rotate, the credentials as part of a defined security policy. This article details how to update these credentials for an AKS cluster. — https://docs.microsoft.com/en-us/azure/aks/update-credentials

At this point, confidence that this was the problem was near 99.9%. Following along with the docs, the service principal’s credentials were reset and the cluster was updated to use the new secret via az aks update-credentials (the commands are sketched at the end of this section). The update attempt came back with:

Deployment failed. Correlation ID: 4a2baea3-c6c6-4f89-8abf-eff8900455ce. Internal server error

To add insult to injury, this error arrived only after a grueling 15 minutes. Luckily, it turned out to be only a partial failure: the credentials were in fact reset, images started pulling again, and the Azure Resource Manager view and kubectl get nodes fell back into sync.
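For reference, the rotation steps in that doc amount to roughly the following (a sketch; the resource group and cluster names are placeholders, and the exact az ad sp flags vary a bit between CLI versions):

# The client ID of the service principal the cluster runs as
SP_ID=$(az aks show --resource-group myResourceGroup --name myAKSCluster --query servicePrincipalProfile.clientId -o tsv)

# Reset the service principal's credentials and capture the new secret
SP_SECRET=$(az ad sp credential reset --name $SP_ID --query password -o tsv)

# Hand the new secret to the cluster; this is the step that churned for
# roughly 15 minutes and returned the error above, despite mostly succeeding
az aks update-credentials --resource-group myResourceGroup --name myAKSCluster --reset-service-principal --service-principal $SP_ID --client-secret $SP_SECRET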

The Summary

To summarize, this outage occurred as a result of the expiration of the cluster service principal’s credentials, something mentioned haphazardly, exactly once, at the end of a paragraph in the AKS starter documentation.

When you create a brand new AKS cluster, it is assigned a new service principal (a special user) that is used to interact with all of the other Azure services. This service principal does a variety of things, but one of its most common uses is to pull container images from the complementary service to AKS, Azure Container Registry (ACR). When you first get started with AKS and ACR, you are directed to this introductory article on how to pair the two. It is at this point that the outage time-bomb is unwittingly created: the service principal’s credentials expire exactly one year after creation.
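If you want to know how long your own cluster has left, something like the following shows the credential expiration date (a sketch; the resource names are placeholders, and the exact output fields differ between az CLI versions):

# The client ID of the cluster's service principal
SP_ID=$(az aks show --resource-group myResourceGroup --name myAKSCluster --query servicePrincipalProfile.clientId -o tsv)

# List its credentials; the expiry (end) date is included in the output
az ad sp credential list --id $SP_ID -o table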

The Solution

Documentation is absolutely the problem as well as the solution. This is a UX problem. Users are lazy; they don’t read things. Technical users reading technical documentation are especially lazy. This is not something that deserves a single sentence at the end of a paragraph. It needs a big warning sign that says:

Hey you! Your token expires in a year. If you don’t think about this, your infrastructure will break and you will ruin one of your days in the future.

Additionally, as a nice-to-have, automated emails warning that the expiration date is approaching would be useful.

The experience of others

  1. A Stack Overflow post
  2. Azure’s own feedback forums
  3. A GitHub issue describing how this problem manifests in another context
  4. An article about the EXACT SAME PROBLEM in ACS (the “precursor” product to AKS)
