Debugging and Fixing Deployments on Kubernetes: a tale of Helm and Tiller on the Alibaba Cloud

How to deal with ErrImagePull(s), ImagePullBackOff(s), Great Firewalls, mirrors, lack of information and live happily ever after

Edoardo Nosotti
Apr 15, 2020 · 6 min read
Photo by Hosea Georgeson on Unsplash

Helm is a widely used package manager for Kubernetes. It simplifies the deployment of many common applications and it also helps to create deployment packages for complex custom applications with dependencies. So installing Helm is one of the first steps many admins and DevOps engineers take during the deployment of a new Kubernetes cluster.

I have deployed several Kubernetes clusters by now and Helm had always worked out of the box, until I deployed one in mainland China. The Helm installation failed consistently, but I eventually worked out a solution. The steps I took to debug and fix the issue may well apply to several other scenarios, so I am sharing this story with you.

When helm init runs, it deploys Tiller, the server-side component Helm uses to operate on the Kubernetes cluster. Tiller runs in a pod like any other application and is placed in the kube-system namespace, together with the other base services of the cluster.

So the first step to debug an installation gone wrong is to run the kubectl get pod command and check the status of the pod(s) being deployed:

$ kubectl get pod -n kube-system
NAME                        READY   STATUS
...
tiller-deploy-{random-id}   0/1     ImagePullBackOff

Tip: do not forget to specify the namespace with -n if you are not targeting the default namespace, or use handy tools such as kubens to avoid typing namespaces all the time.
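
For example, with kubens (part of the kubectx project) you can switch the default namespace once and drop the -n flag afterwards:

$ kubens kube-system
$ kubectl get pod

The second command now targets kube-system without any extra flags.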

The status flags ImagePullBackOff and ErrImagePull reveal that something went wrong at the very beginning of the installation: getting the desired Docker image from its source, the so-called registry.

This can happen for a number of possible reasons:

  • A wrong registry address, image name or version has been specified
  • The Kubernetes cluster was not provided with a valid set of credentials to pull the image from the registry, in case the registry is private (see the sketch after this list)
  • The network configuration (routing, firewall rules, or the “Security Group” on public clouds) for the Kubernetes cluster is blocking the outbound connection to the registry
  • The outbound connection to the registry is blocked at a higher level, possibly out of your control
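
The second case, a private registry, is usually fixed by storing the registry credentials in the cluster and referencing them from the pod spec. A minimal sketch, where registry.example.com and the secret name my-registry-cred are hypothetical placeholders:

$ kubectl create secret docker-registry my-registry-cred \
    --docker-server=registry.example.com \
    --docker-username={username} \
    --docker-password={password} \
    -n kube-system

The secret can then be referenced in the imagePullSecrets section of the pod spec, or attached to the service account used by the deployment.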

To dig further into the issue I ran kubectl describe pod to try and figure out what exactly went wrong with this installation:

$ kubectl describe pod tiller-deploy-{random-id} -n kube-system
...
Events:
Type     Reason   Age                   From                               Message
----     ------   ----                  ----                               -------
...
Warning  Failed   7m23s (x14 over 57m)  kubelet, cn-{region}.{private-ip}  Failed to pull image "gcr.io/kubernetes-helm/tiller:{version}": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Normal   Pulling  2m18s (x15 over 57m)  kubelet, cn-{region}.{private-ip}  Pulling image "gcr.io/kubernetes-helm/tiller:{version}"

So the cluster was not able to pull the image from the gcr.io registry. This did not come as a surprise: the gcr.io domain belongs to Google Cloud, and many Google services have been blocked in China over time.

“Elementary, my dear Watson!”
— Sherlock Holmes
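
If you want to confirm the block independently of Kubernetes, you can try reaching the registry endpoint directly from one of the cluster nodes (the 10-second timeout below is an arbitrary choice):

$ curl -m 10 https://gcr.io/v2/

If the request times out instead of returning an HTTP response, the node simply cannot reach gcr.io and no Kubernetes setting will change that.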

In some other cases the reasons why your cluster is not able to pull images might not be as obvious. So if a deployment fails, I suggest trying to deploy a common application from the Docker Hub first.

Nginx is a good starting point:

$ kubectl create deployment --image nginx test-nginx

The command above will create a new deployment, named “test-nginx”, and pull the standard Nginx image from the Docker Hub. If it works, then your cluster is working and its network rules allow it to at least access the public network.
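
You can then check the outcome and remove the test deployment once it has served its purpose (kubectl create deployment labels the pods with app={deployment-name}, which is what the selector below relies on):

$ kubectl get pod -l app=test-nginx
$ kubectl delete deployment test-nginx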

The easiest, one-size-fits-all fix for such an issue would be to mirror the image to a registry of your own. The Alibaba Cloud offers the Container Registry service to host your own images. Being part of the Alibaba Cloud platform itself, this registry is easily accessible from clusters running in their network.
Yet this approach adds moving parts, maintenance effort and also (admittedly very small) resource costs to the solution.
The Alibaba Cloud also provides an automated image-syncer which takes away the effort of keeping images up to date with their sources, but deploying it would only shift the additional moving parts, maintenance effort and costs somewhere else. So I went looking for a better option.
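
For reference, mirroring an image by hand is just a pull, tag and push sequence. A minimal sketch, assuming you have created a Container Registry namespace ({your-namespace} below is a placeholder) and you run it from a machine that can reach both registries:

$ docker pull gcr.io/kubernetes-helm/tiller:v2.14.1
$ docker tag gcr.io/kubernetes-helm/tiller:v2.14.1 registry.cn-hangzhou.aliyuncs.com/{your-namespace}/tiller:v2.14.1
$ docker login registry.cn-hangzhou.aliyuncs.com
$ docker push registry.cn-hangzhou.aliyuncs.com/{your-namespace}/tiller:v2.14.1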

When you adopt a “turnkey” Kubernetes solution from a major cloud provider you are likely to get a custom implementation of Kubernetes, such as the Azure AKS or the Alibaba Container Service, rather than the reference code. Likewise, the cloud vendors might offer customized or mirrored versions of some common components. So, when deploying a Kubernetes cluster, it is worth looking for provider-specific resources first.

Searching for “alibabacloud Tiller” I found an official Alibaba Cloud documentation page which pointed me in the right direction. Alibaba has its own mirrors of the Tiller images. They maintain them for their customers, they are free to use and they are spread across several regions to improve reachability and performance. The documentation, though, is pretty outdated (2018): at the time of writing* it suggests a very old version of Tiller and does not properly highlight two very important points:

  • from the cluster nodes, you can pull Docker images using internal VPC addresses, taking advantage of the private and efficient Alibaba network. The code example shows registry.cn-hangzhou.aliyuncs.com, but carefully reading the page you can find that registry-vpc.{REGION_NAME}.aliyuncs.com endpoints are also available.
  • an old version of Helm and Tiller is mentioned, and no information on how to determine the latest available version is provided

(*) here is a snapshot of the current version of the page for future reference

tl;dr, how to fix the Tiller installation

I could not find further information on the Tiller versions available on the Alibaba Cloud, but their Docker registry has a publicly accessible endpoint, so using the tags in the official Helm code repository I could try pulling the images on my machine, going backwards until I found one available:

$ docker pull registry.cn-hangzhou.aliyuncs.com/acs/tiller:v2.16.5
Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/acs/tiller:v2.16.5 not found: manifest unknown: manifest unknown
$ docker pull registry.cn-hangzhou.aliyuncs.com/acs/tiller:v2.16.0
Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/acs/tiller:v2.16.0 not found: manifest unknown: manifest unknown
...
$ docker pull registry.cn-hangzhou.aliyuncs.com/acs/tiller:v2.14.1
v2.14.1: Pulling from acs/tiller
e7c96db7181b: Pull complete
def2a4ea1207: Pull complete
eba3e5d4aab0: Pull complete
94e8118fc9e9: Pull complete
Digest: sha256:f8002b91997fdc2c15a9c2aa994bea117b5b1683933f3144369862f0883c3c42
Status: Downloaded newer image for registry.cn-hangzhou.aliyuncs.com/acs/tiller:v2.14.1
registry.cn-hangzhou.aliyuncs.com/acs/tiller:v2.14.1

…so v2.14.1 was the latest version available (at the time of writing).
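
If you ever need to repeat this search, the backwards probing is easy to script. A quick sketch, where the tag list is illustrative and should be taken from the tags in the Helm repository:

#!/bin/sh
# Try candidate tags from newest to oldest, stop at the first one the registry serves.
for tag in v2.16.5 v2.16.0 v2.15.2 v2.14.3 v2.14.1; do
  if docker pull "registry.cn-hangzhou.aliyuncs.com/acs/tiller:${tag}" >/dev/null 2>&1; then
    echo "latest available tag: ${tag}"
    break
  fi
done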

Helm offers the option to specify a custom source for the Tiller images upon initialization:

$ helm init --tiller-image registry-vpc.cn-hangzhou.aliyuncs.com/acs/tiller:v2.14.1

Please note that when I pulled images on the local machine I used:

registry.cn-hangzhou.aliyuncs.com/acs/tiller:v2.14.1

but I deployed Tiller on Kubernetes using:

registry-vpc.cn-hangzhou.aliyuncs.com/acs/tiller:v2.14.1

to take advantage of the internal network, as the “pull” action during deployment happens on the cluster.

You can (and should) also replace the region name (cn-hangzhou) with the region where your cluster is deployed.
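
Once helm init completes, you can verify that Tiller is up and that the client can talk to it:

$ kubectl get pod -n kube-system | grep tiller
$ helm version

helm version should now report both a Client and a Server version.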

Enjoy your working Tiller now! ;)

The solution shown above promotes the advantages of using managed resources and, of course, I strongly recommend that. That said, keeping a backup copy of third-party components, a practice commonly known as vendoring, can still save you from really awkward situations.
I always pull dependencies from official sources or switch to trusted mirrors when appropriate, such as in the scenario above, but I like to keep backup copies as well, just in case.
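
For Docker images, vendoring can be as simple as keeping a compressed archive of the image around:

$ docker pull registry.cn-hangzhou.aliyuncs.com/acs/tiller:v2.14.1
$ docker save registry.cn-hangzhou.aliyuncs.com/acs/tiller:v2.14.1 | gzip > tiller-v2.14.1.tar.gz

Should the mirror ever disappear, docker load restores the image and you can re-tag and push it wherever you need:

$ gunzip -c tiller-v2.14.1.tar.gz | docker load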
