Deploying an Anthos Cluster on Bare Metal - Beware When Proxy Servers Are Involved - Part 1
Anthos clusters on bare metal is software to create, manage, and upgrade Kubernetes clusters on your own hardware in your own data center.
This article is Part 1 of a 3-part series on deploying an Anthos cluster on bare metal.
Part 2
Part 3
Anthos can run on your existing virtualized infrastructure and bare metal servers without a hypervisor layer. Anthos simplifies your application stack and reduces the costs associated with licensing a hypervisor. With Anthos you get a consistent managed Kubernetes experience with upgrades validated by Google.
Some terminology:
1. Admin Workstation: A VM from which we manage our Kubernetes cluster and applications.
2. Bootstrap Cluster: A temporary cluster that hosts the Kubernetes controllers needed to create or upgrade the actual Anthos cluster.
3. Control Nodes: VMs that form the control plane of the cluster.
4. Worker Nodes: VMs that are responsible for running your applications.
5. Standalone Cluster: A type of Anthos cluster that can administer itself and also run workloads, but can't create or manage other user clusters.
However, as simple as the documentation makes the installation look, it is no cakewalk when your infrastructure runs behind a proxy server, which is practically standard in the majority of companies for several reasons.
Suddenly it was not a walk in the park: the installation was failing at every single step. Having to pull in the Anthos product support team, only to conclude after many debug sessions that the product does not support a Man-in-the-Middle (MITM) proxy and that such support is a feature request, was a nightmare.
So this blog is about a Standalone Cluster built with Anthos version 1.13 on RHEL 8.6 VMs hosted on vSphere, where connectivity to the internet had to go through MITM proxy servers, and that is where cluster creation kept failing due to certificate validation issues.
Here is the list of the major issues we faced. If you find one similar to yours, keep reading ;)
# RHEL supports Podman but Preflight checks for Docker
[2022-10-28 11:20:49+0000] Error creating cluster: error to parse the target cluster: error parsing cluster config: 1 error occurred:
* Docker version too old: got 4.1.1, want at least 19.03.0
# GCR images failing to download
error creating bootstrap cluster: failed to pull image "gcr.io/anthos-baremetal-release/kindest/node:v0.14.0-gke.11-v1.24.2-gke.1900"
# Preflight looking for unsupported package managers on RHEL
I1130 14:49:50.891124 2241567 console.go:73] - Could not detect a supported package manager from the following list: ['portage', 'rpm', 'pkg', 'apt']
# Logs that don't tell much about the actual problem
I1206 10:50:06.880910 3783103 logs.go:82] "msg"="Cluster reconciling:" "message"="Get \"https://10.y.y.y:443/api?timeout=32s\": dial tcp 10.y.y.y:443: connect: connection reset by peer" "name"="caxe-cluster" "reason"="ReconciliationError"
Now, getting an HTTPS request through a MITM proxy is usually a simple fix. For instance:
1. adding "insecure-registries" : ["some-public-registry"]
to your daemon.json
and the proxy environment variables in /etc/systemd/system/docker.service.d/http-proxy.conf
to configure docker
2. adding proxy
variables to /etc/yum.conf
to configure your package manager
3. adding http_proxy
,https_proxy
,no_proxy
in the environment variables along with the --insecure
flag for curl
commands
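Put together, the starting configuration looked roughly like the sketch below. The proxy address, registry name, and `no_proxy` entries are placeholders for illustration; substitute the values for your own environment.

```bash
# Trust the registry fronted by the MITM proxy (standard Docker daemon config path)
cat > /etc/docker/daemon.json <<'EOF'
{
  "insecure-registries": ["some-public-registry"]
}
EOF

# Proxy environment variables for the Docker daemon
mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/http-proxy.conf <<'EOF'
[Service]
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="HTTPS_PROXY=http://proxy.example.com:3128"
Environment="NO_PROXY=localhost,127.0.0.1,.internal.example.com"
EOF
systemctl daemon-reload && systemctl restart docker

# Proxy for the package manager (yum/dnf)
echo "proxy=http://proxy.example.com:3128" >> /etc/yum.conf

# Proxy variables for the shell session, plus a quick connectivity test
export http_proxy=http://proxy.example.com:3128
export https_proxy=http://proxy.example.com:3128
export no_proxy=localhost,127.0.0.1,.internal.example.com
curl --insecure https://gcr.io/v2/   # --insecure because the MITM proxy re-signs the certificate
```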
That, at least, is what we did to get started, and it should ideally be enough to spin up the bootstrap cluster.
However, things soon start to get scary when you discover that the bootstrap cluster is a KIND cluster that runs on its own `containerd` runtime, and there is no way for you to configure proxy environment variables for it inside the bootstrap cluster. As a result, our KIND cluster was unable to download container images from Google's public registry.
Sorry to burst your bubble, but the `containerd` configuration of your admin workstation has no impact on the bootstrap cluster's environment.
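If you want to see the symptom for yourself, the bootstrap KIND node is just a container on the admin workstation, so you can exec into it and attempt a pull with its own CRI tooling. The container name and image reference below are placeholders, and this assumes `crictl` is present in the node image (it is in stock kind node images):

```bash
# List containers on the admin workstation and spot the kind control-plane node
docker ps --format '{{.Names}}'

# Try pulling any gcr.io image with the node's own containerd -- behind the
# MITM proxy this fails, no matter how the host's Docker proxy is configured
docker exec -it <bootstrap-node-container> \
  crictl pull gcr.io/<image-the-cluster-needs>:<tag>
```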
Not only that, but the bootstrap cluster runs various `ansible` jobs on your control nodes, called preflight checks, and if they fail your standalone cluster won't start.
I know there is a `--force` flag to ignore preflight check failures. However, some `ansible` jobs within the preflight apparently must pass in order to create the standalone cluster.
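For reference, the flag goes on the create call, something like the sketch below (the cluster name is a placeholder); as noted, it only waves through the non-blocking checks.

```bash
# Create the standalone cluster while ignoring non-critical preflight failures
bmctl create cluster -c my-standalone-cluster --force
```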
There are also certain packages that these `ansible` jobs try to install on the control nodes, and yet again they fail because of their inability to handle the MITM proxy.
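For context on why those installs break: yum/dnf validates the proxy's re-signed certificates against the system trust store. The usual RHEL way to make the OS trust a MITM CA is sketched below (the certificate path is a placeholder); whether that alone is enough for the Anthos preflight jobs is another matter.

```bash
# Add the MITM proxy's CA certificate to the RHEL system trust store
cp mitm-proxy-ca.crt /etc/pki/ca-trust/source/anchors/
update-ca-trust extract
```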
This has been the most interesting debugging of my career so far. If you want to know more about how we solved it, I'll see you in Part 2 of this series.