I have been lucky enough to face a situation where a cluster crashed, losing two of its three master nodes to hardware failure. What made matters worse was that there was no backup of the etcd data. Initially, we believed the only option was to rebuild everything from YAML files. However, we discovered a way to salvage the cluster through some manual tweaks. I have written down the steps, reproduced in a similar environment, for all the other lucky folks out there.
To start, we have a healthy cluster running.
Now, we stop two of the masters from the back-end virtual machine panel.
After a while, we lose the connection to the API server because etcd has become unhealthy.
Now we move on to the only remaining master node to see if there is any luck.
With two masters down, etcdctl returns an error exactly like the one in the screenshot above.
So now the rescue begins. Before these steps finally worked, we had spent a few nights on them. I hope you are not the next lucky one.
# Back up the entire member directory (WAL logs + snapshots) first
cp -r /var/lib/etcd/member /home/core/member.bak
# Keep a separate copy of the bolt database file for the recovery
cp /var/lib/etcd/member/snap/db /home/core/snapshot-recovery.db
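The backup step above can be sketched end-to-end. This is an illustrative dry run, not the exact production procedure: the temporary directories and the fake `db` file stand in for the real `/var/lib/etcd` data dir and bolt database, so you can rehearse the copy safely on any machine. On a real node, make sure etcd is stopped (on a kubeadm cluster it runs as a static pod, so moving its manifest out of `/etc/kubernetes/manifests` stops it) before copying, so the files are consistent.

```shell
# Simulated etcd data directory and backup destination (illustrative paths)
ETCD_DATA=$(mktemp -d)
BACKUP_DIR=$(mktemp -d)

# Fake member layout mirroring /var/lib/etcd/member
mkdir -p "$ETCD_DATA/member/snap" "$ETCD_DATA/member/wal"
echo "fake-bolt-db" > "$ETCD_DATA/member/snap/db"

# 1. Full copy of the member directory (WAL + snapshots)
cp -r "$ETCD_DATA/member" "$BACKUP_DIR/member.bak"

# 2. Separate copy of the bolt database, which is what the
#    recovery will work from
cp "$ETCD_DATA/member/snap/db" "$BACKUP_DIR/snapshot-recovery.db"

ls "$BACKUP_DIR"
```

Keeping both copies matters: `member.bak` preserves the WAL in case you need to roll back to exactly where you started, while `snapshot-recovery.db` is the single file the restore operates on.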