Deploying Anthos Cluster on BareMetal - Beware Proxy Servers Involved - Part 3
Anthos clusters on bare metal is software to create, manage, and upgrade Kubernetes clusters on your own hardware in your own data center.
This article is Part 3 of a three-part series on deploying an Anthos cluster on bare metal.
Part 1
Part 2
Once you have created your Google project and have all the prerequisites described in Part 2 of this series in place, log in to your admin workstation and follow the QuickStart guide as defined here.
Once you have generated the config file that holds all the configuration for the cluster, make sure you fill in the appropriate fields, such as projectID, proxy.url, loadBalancer.vips, and so on.
Once we start creating the cluster, the bmctl, preflight check, and node installation logs can usually be found in a path similar to ~/anthos-project/bmctl-workspace/caxe-cluster/log.
The bmctl preflight checks verify the proposed cluster installation against the following major conditions:
The Linux distribution and version are supported.
Google Container Registry is reachable.
The VIPs are available.
The cluster machines have connectivity to each other.
Load balancer machines are on the same Layer 2 subnet.
And here we hit our first problem:
[2022-11-02 11:18:41+0000] Creating bootstrap cluster... ⠸ create kind cluster failed: error creating bootstrap cluster: failed to pull image "gcr.io/anthos-baremetal-release/kindest/node:v0.14.0-gke.11-v1.24.2-gke.1900": command "docker pull gcr.io/anthos-baremetal-release/kindest/node:v0.14.0-gke.11-v1.24.2-gke.1900" failed with error: exit status 1
This error was actually caused by the MITM proxy server, as the following command confirmed:
$ docker pull gcr.io/anthos-baremetal-release/kindest/node:v0.14.0-gke.11-v1.24.2-gke.1900
Error response from daemon: Get "https://gcr.io/v2/": x509: certificate signed by unknown authority
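One way to double-check that a proxy is re-signing TLS traffic is to look at who issued the certificate that the workstation actually receives for gcr.io. The following is a minimal sketch, assuming OpenSSL 1.1.0 or later (for the -proxy option of s_client). The function names are hypothetical, and the assumption that a genuine gcr.io certificate names Google Trust Services as its issuer should be verified from an unproxied network.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of bmctl or Docker.

# show_issuer HOST PROXY_HOST:PORT
# Prints the issuer of the certificate presented for HOST when the
# connection is tunnelled through the given HTTP proxy.
show_issuer() {
  local host="$1" proxy="$2"
  openssl s_client -connect "${host}:443" -servername "${host}" \
      -proxy "${proxy}" </dev/null 2>/dev/null |
    openssl x509 -noout -issuer
}

# is_google_issuer ISSUER_LINE
# Succeeds when the issuer line mentions Google Trust Services
# (assumption: that is the CA behind the real gcr.io certificate).
is_google_issuer() {
  case "$1" in
    *"Google Trust Services"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (proxy address is a placeholder):
#   issuer="$(show_issuer gcr.io proxy.example.internal:3128)"
#   is_google_issuer "$issuer" || echo "certificate re-signed by: $issuer"
```

If the issuer turns out to be an internal corporate CA, the x509 error from docker pull is explained: the Docker daemon simply does not trust that CA.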
The simple solution here would have been to set "insecure-registries" : ["gcr.io"] in /etc/docker/daemon.json. However, it wasn’t working at all. While the product team asked us to set up a proxy pass-through for gcr.io, after a lot of troubleshooting we found that the KIND cluster runs on containerd and has its own config file, which does not reflect the containerd configuration of our admin workstation.
It took a bit of digging through GitHub issues and Stack Overflow questions to make sense of the config.toml file that the cluster uses. Adding the insecure flag in its TOML syntax, which looks totally different from the usual daemon.json, is what finally got rid of the error:
[plugins."io.containerd.grpc.v1.cri".registry.configs."gcr.io".tls]
insecure_skip_verify = true
[plugins."io.containerd.grpc.v1.cri".registry.configs."gcr.io".auth]
Now the question is where to make the change. It really wasn’t simple: we had to copy our custom file into the kind cluster node, start a bash session in it, stop the containerd service, unmount the default config.toml file, replace it with our custom file, and restart containerd. Bingo!!! That was a success, but we still had a lot of issues ahead of us.
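The sequence of steps above can be sketched as a small script. This is an illustration under assumptions, not the exact commands we ran: the node container name (bmctl-control-plane), the /tmp staging path, and the /etc/containerd/config.toml location are hypothetical, so check docker ps and the node's mounts first. It uses docker exec instead of an interactive bash session, and setting DRY_RUN=1 prints the commands without executing them.

```shell
#!/usr/bin/env bash
# Sketch only: container name and paths are assumptions.

# run CMD...: execute the command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# swap_containerd_config NODE CUSTOM_TOML
# Replaces the containerd config inside the kind node container NODE
# with the local file CUSTOM_TOML, stopping at the first failing step.
swap_containerd_config() {
  local node="$1" custom="$2"
  run docker cp "$custom" "${node}:/tmp/config.toml" &&
  run docker exec "$node" systemctl stop containerd &&
  run docker exec "$node" umount /etc/containerd/config.toml &&
  run docker exec "$node" cp /tmp/config.toml /etc/containerd/config.toml &&
  run docker exec "$node" systemctl start containerd
}

# Example:
#   DRY_RUN=1; swap_containerd_config bmctl-control-plane ./config.toml
```

Previewing with DRY_RUN=1 is worthwhile here, because the bootstrap cluster is short-lived and a wrong container name or path only shows up as another failed run.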
FYI: We found a better solution to this. We shall discuss that later.
Now that the images were being pulled, the next series of errors came from the Ansible jobs that run as part of the preflight checks. One of the errors was:
I1130 14:49:50.891124 2241567 console.go:73] - Could not detect a supported package manager from the following list: ['portage', 'rpm', 'pkg', 'apt'], or the required Python library is not installed. Check warnings for details.
Python DNF package error:
fatal: [10.x.x.x]: FAILED! => {"changed": false, "cmd": "dnf install -y python2-dnf", "msg": "Could not import the dnf python module using /usr/bin/python (2.7.18 (default, Feb 10 2022, 14:26:12) [GCC 8.5.0 20210514 (Red Hat 8.5.0-10)]). Please install `python2-dnf` package or ensure you have specified the correct ansible_python_interpreter.", "rc": 1, "results": [], "stderr": "Error: Unable to find a match: python2-dnf\n", "stderr_lines": ["Error: Unable to find a match: python2-dnf"], "stdout": "Last metadata expiration check: 1:40:28 ago on Thu 01 Dec 2022 01:19:29 PM CST.\nNo match for argument: python2-dnf\n", "stdout_lines": ["Last metadata expiration check: 1:40:28 ago on Thu 01 Dec 2022 01:19:29 PM CST.", "No match for argument: python2-dnf"]}
Since we were on RHEL, and the Ansible jobs appeared to be looking for package managers the host did not have, the errors made little sense to us. After we shared the logs with the product team, they suggested updating Python to 3.9, as the Ansible version used within these preflight checks is 2.9.16, which depended on Python 3.9.2. However, upgrading didn’t really help. The next solution they provided was to set the environment variable ansible_python_interpreter=OUR_PYTHON_PATH.
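The dnf error above comes down to which interpreter can import the dnf Python module, so it can help to find that interpreter before choosing a value for OUR_PYTHON_PATH. The sketch below is only an assumption about how one might check this; the helper names are hypothetical and the candidate paths in the example are typical RHEL 8 locations, not values taken from our setup.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and candidate paths are assumptions.

# can_import INTERPRETER MODULE
# Succeeds when INTERPRETER can import the Python module MODULE.
can_import() {
  "$1" -c "import $2" >/dev/null 2>&1
}

# first_interpreter_with MODULE INTERPRETER...
# Prints the first listed interpreter that exists and can import MODULE;
# fails if none of them can.
first_interpreter_with() {
  local module="$1" py
  shift
  for py in "$@"; do
    if command -v "$py" >/dev/null 2>&1 && can_import "$py" "$module"; then
      command -v "$py"
      return 0
    fi
  done
  return 1
}

# Example, run on the target node:
#   first_interpreter_with dnf /usr/bin/python3 \
#       /usr/libexec/platform-python /usr/bin/python
```

In the log above, Ansible had picked /usr/bin/python (2.7.18), which has no dnf bindings, which is consistent with the interpreter override being the fix.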
And this time it worked. But the long chain of errors didn’t stop: the next failure came from an Ansible job that was failing on the control node, which meant the bootstrap cluster was stable but the standalone cluster was facing issues.
I1206 10:50:05.146777 3783103 logs.go:82] "msg"="Waiting for pod to finish" "Name"="bm-system-10.x.x.x-machine-init-630d3985432c54ff9edb6f46dkv5" "Phase"="Running"
I1206 10:50:06.880910 3783103 logs.go:82] "msg"="Cluster reconciling:" "message"="Get \"https://10.y.y.y:443/api?timeout=32s\": dial tcp 10.y.y.y:443: connect: connection reset by peer" "name"="caxe-cluster" "reason"="ReconciliationError"
At this point we were told once again that MITM proxies are not supported and that supporting them would be a feature request for Anthos, so it would be best to have a pass-through proxy for Google URLs. Every step in the Ansible job was failing because of the MITM proxy.
This is when we dug into the Ansible pods to find out more about them. We saw that one of the flags passed to them is -use-registry-mirror, which gave us the idea that using a private registry might help.
So we created a private registry to host the gcr images, and it actually let us get rid of all that mumbo jumbo around config.toml. That was good news.
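To illustrate the idea of hosting the gcr images privately, here is a minimal sketch that copies one image into a private registry with plain Docker commands. It assumes it runs on a machine where Docker can pull from gcr.io. The registry address and the choice to keep the original path under it are hypothetical, and bmctl has its own registry mirror workflow that this does not reproduce.

```shell
#!/usr/bin/env bash
# Sketch only: registry address and path layout are assumptions.

# mirror_ref IMAGE REGISTRY
# Rewrites a gcr.io image reference so it points at REGISTRY,
# keeping the rest of the path and the tag unchanged.
mirror_ref() {
  local image="$1" registry="$2"
  echo "${registry}/${image#gcr.io/}"
}

# mirror_image IMAGE REGISTRY
# Pulls IMAGE, retags it for REGISTRY, and pushes it there.
mirror_image() {
  local image="$1" registry="$2" target
  target="$(mirror_ref "$image" "$registry")"
  docker pull "$image" &&
  docker tag "$image" "$target" &&
  docker push "$target"
}

# Example (registry address is a placeholder):
#   mirror_image \
#     gcr.io/anthos-baremetal-release/kindest/node:v0.14.0-gke.11-v1.24.2-gke.1900 \
#     registry.example.internal:5000
```

Keeping the original repository path under the private registry makes it easier to map a failing image name in the logs back to its mirrored copy.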
However, we still needed to figure out the errors caused by certain packages not being downloaded because of TLS issues, as seen in the logs of the preflight jobs.
Therefore, we ran the image used by the Ansible jobs, started a shell session in it, and dissected everything inside. That is where we found all the packages the jobs needed, and we installed them manually on the control node, because the Ansible job logs showed failures caused by TLS issues while trying to reach fedora.org.
And finally, we had a beautiful cluster running, and that marked the end of all the never-ending errors.
Well, that was everything you’d need for creating a standalone cluster using Anthos 1.13 on BareMetal, which felt more like an angry Bear roaring heavy Metal.
That’s it, Folks!!!! See you in the next one.
For any additional queries please visit Blue Altair or send an email here.