My automated RKE update pipeline broke with version 0.2.x — my fault
I’m using an automated build pipeline to install, update, and destroy my Kubernetes test environments based on Rancher Kubernetes Engine (RKE). This worked perfectly until this week.
Let me briefly explain the important parts of my pipeline before I get into the details:
- checkout the cluster.yml from a git repository
- extract the kube_config_cluster.yml from the secured repository cache
- download the latest stable RKE binary from GitHub (I do this in my test environment because I want to stay up to date at all times)
- run “rke up/remove”
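The steps above can be sketched roughly like this. The repository URL, cache path, and asset name are placeholders of mine, not the author's actual values; the GitHub `releases/latest/download` redirect is a convenient way to always fetch the newest stable binary:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values -- substitute your own repo and cache location.
CONFIG_REPO="https://example.com/infra/cluster-config.git"
SECURE_CACHE="/secure-cache"

# GitHub redirects "latest/download" to the newest stable release asset.
rke_download_url() {
  echo "https://github.com/rancher/rke/releases/latest/download/rke_linux-amd64"
}

pipeline() {
  workdir=$(mktemp -d)
  cd "$workdir"
  git clone --depth 1 "$CONFIG_REPO" .            # step 1: checkout cluster.yml
  cp "$SECURE_CACHE/kube_config_cluster.yml" .    # step 2: restore kubeconfig
  curl -fsSL -o rke "$(rke_download_url)"         # step 3: latest stable RKE
  chmod +x rke
  ./rke up --config cluster.yml                   # step 4: run rke up
}

# Only run when explicitly requested, so the file can be sourced safely.
if [ "${RUN_PIPELINE:-0}" = "1" ]; then
  pipeline
fi
```

Note that fetching `releases/latest` means every run silently picks up new major versions, which is exactly what made this pipeline fragile.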
As I said, this worked perfectly for quite some time. Earlier this week I pushed a new version of my cluster.yml to update my Kubernetes cluster to a newer version. The pipeline started and failed a few minutes later with the following error:
```
Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy
```
I started debugging but couldn’t find anything wrong. The whole Kubernetes cluster, including etcd, looked healthy. After some time I realized that Rancher had released a new RKE version, 0.2.x (until now I had used 0.1.x, but because of step 3 of my pipeline the build always ran with the latest available stable version).

So why does this even matter? With the new RKE version, Rancher introduced a new way to store the cluster state: they moved it from a ConfigMap entry (0.1.x) to a file called cluster.rkestate (0.2.0), which lives next to the cluster.yml. Because I wasn’t aware of this file, my pipeline didn’t store it anywhere, and therefore every “rke up” created a fresh cluster.rkestate file, which led to the error described above. After changing my pipeline configuration to also cache the state file, the update finished successfully without any issues.
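The fix boils down to treating cluster.rkestate like the kubeconfig: restore it from the cache before “rke up” and save it back afterwards. A minimal sketch, assuming a simple directory-based cache (the paths and helper names are mine, not the author's):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Restore cached files into the working directory before "rke up".
# cluster.rkestate may legitimately be absent on the very first run.
restore_state() {  # usage: restore_state CACHE_DIR WORK_DIR
  cp "$1/kube_config_cluster.yml" "$2/"
  if [ -f "$1/cluster.rkestate" ]; then
    cp "$1/cluster.rkestate" "$2/"
  fi
}

# Save the (possibly updated) state back to the cache after the run,
# so the next "rke up" sees the existing cluster instead of a new one.
save_state() {     # usage: save_state WORK_DIR CACHE_DIR
  cp "$1/kube_config_cluster.yml" "$1/cluster.rkestate" "$2/"
}
```

Without the `save_state` step, each run starts from an empty state, and RKE behaves as if it were provisioning a brand-new cluster against already-provisioned nodes, which is what produced the etcd health error.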
What have we learned from this? Always read the release notes. 😏