Health Checks in Rancher using TCP
There are currently two ways to run health checks in Rancher: HTTP and TCP (http://docs.rancher.com/rancher/latest/en/cattle/health-checks/). The former is relatively straightforward, but it requires a web server with a route that responds with a 2xx or 3xx status. TCP checks are a nice, lightweight option for services that don’t run a web server.
We built a very simple TCP server that we use in a lot of our services. It has one task: allowing an agent to open a connection to a specified port. It promptly closes the connection after it’s opened; that’s all the health check needs to see.
Below is what our script looks like:
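The script itself is not reproduced in this text, so here is a minimal sketch of a server with the behavior described above: listen on a port, accept each connection, and close it immediately. The function names and the convention of passing the port as a command-line argument are assumptions made for illustration, not necessarily what our production script does.

```python
import socket
import sys


def make_listener(port, host="0.0.0.0"):
    """Return a TCP socket listening on host:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts without waiting for the old socket to time out.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock


def accept_and_close(listener):
    """Accept one connection, close it right away, return the peer address."""
    conn, addr = listener.accept()
    conn.close()
    return addr


def serve_forever(port):
    """Answer health checks until the process is killed."""
    listener = make_listener(port)
    while True:
        accept_and_close(listener)


if __name__ == "__main__":
    # Assumed CLI for this sketch: python health_check.py 12345
    if len(sys.argv) == 2 and sys.argv[1].isdigit():
        serve_forever(int(sys.argv[1]))
    else:
        print("usage: health_check.py PORT", file=sys.stderr)
```

Saved as `health_check.py`, this could be started with `python health_check.py 12345`. A successful TCP handshake is all the check looks for, so the server never needs to read or write any data.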
Below is an example script that can be defined as a container’s entrypoint. It executes `health_check.py` as a background process and then continues with whatever else your container’s entrypoint should do (in this case, executing `my_awesome_script.py`):
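The entrypoint script is likewise not reproduced in this text. Entrypoints are often shell scripts; to stay in one language, the sketch below does the same job in Python. It assumes that `health_check.py` takes its port as its first argument and that both scripts sit in the working directory. The helper names and the `HEALTH_CHECK_PORT` constant are hypothetical.

```python
import os
import subprocess
import sys

HEALTH_CHECK_SCRIPT = "health_check.py"
MAIN_SCRIPT = "my_awesome_script.py"
HEALTH_CHECK_PORT = 12345  # assumed; must match `port` in rancher-compose


def health_check_command(port, script=HEALTH_CHECK_SCRIPT):
    """Build the command line that launches the health check server."""
    return [sys.executable, script, str(port)]


def start_background(command):
    """Start a child process and return immediately without waiting for it."""
    return subprocess.Popen(command)


def main():
    # Refuse to start if either script is missing from the working directory.
    missing = [p for p in (HEALTH_CHECK_SCRIPT, MAIN_SCRIPT) if not os.path.exists(p)]
    if missing:
        print("entrypoint: missing " + ", ".join(missing), file=sys.stderr)
        return 1
    start_background(health_check_command(HEALTH_CHECK_PORT))
    # Replace this process with the real workload; the health check server
    # keeps running as a child of the same process ID.
    os.execv(sys.executable, [sys.executable, MAIN_SCRIPT])


if __name__ == "__main__":
    main()
```

Because the workload replaces the entrypoint process, the container exits when `my_awesome_script.py` exits, while the health check server runs alongside it for the container's lifetime.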
This container is now TCP health-check-compatible! Your rancher-compose file would look something like:
my-worker-service:
  scale: 2
  health_check:
    port: 12345
    interval: 4000
    initializing_timeout: 30000
    unhealthy_threshold: 3
    strategy: recreate
    healthy_threshold: 2
    response_timeout: 2000
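Before deploying, it can help to confirm that the port really does accept connections. The probe below is a rough stand-in for what a TCP check does: it tries to open a connection within a timeout and reports success or failure. The `tcp_check` name is hypothetical, and this is not Rancher's actual checker; the default timeout simply mirrors the `response_timeout` value above.

```python
import socket


def tcp_check(host, port, timeout_ms=2000):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout_ms / 1000.0):
            return True
    except OSError:
        return False
```

For example, `tcp_check("localhost", 12345)` should return `True` while the health check server is running in the container and `False` once it has stopped.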
Health Checks in the Wild
This method has been particularly useful to us with spot instances on AWS (see https://medium.com/@pitrho/reduce-server-costs-with-spot-instances-e6abd8da1bff for more about spot instances). Currently, when a host disappears, it stays in a ‘reconnecting’ state in Rancher, and the services on that host do not automatically move elsewhere.
Imagine the following scenario:
You have a service with a scale of 2 that schedules onto spot instances. Two new spot instances (Hosts A and B) connect to Rancher, and your service schedules correctly. One of your instances (Host B) gets shut down and another (Host C) spins up in its place. Rancher now has 3 hosts: Host A (active), Host B (reconnecting), Host C (active). Your service now has only 1 container running, on Host A.
If you deploy your services with health checks and the recreate strategy, then after your unhealthy_threshold has been met, Rancher will try to recreate the second container, find that Host C is available, and schedule the container accordingly.
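As a back-of-the-envelope estimate of how long that takes, the sketch below models the time between a container disappearing and the unhealthy threshold being reached. It assumes checks start at a fixed interval and that every failed check waits out the full response timeout (plausible when the host is gone and connections hang rather than being refused). This is a simplified model, not Rancher's documented scheduling, and `detection_window_ms` is a hypothetical helper.

```python
def detection_window_ms(interval_ms, unhealthy_threshold, response_timeout_ms):
    """Return (fastest, slowest) time in ms to reach the unhealthy threshold."""
    # Fastest: the container dies just before a check starts, so the first
    # failure begins immediately and the last begins (threshold - 1) intervals later.
    fastest = (unhealthy_threshold - 1) * interval_ms + response_timeout_ms
    # Slowest: the container dies just after a check starts and passes, so a
    # full interval elapses before the first failing check begins.
    slowest = unhealthy_threshold * interval_ms + response_timeout_ms
    return fastest, slowest


# Values from the rancher-compose example above.
print(detection_window_ms(4000, 3, 2000))  # (10000, 14000)
```

Under these assumptions, the example settings would flag the lost container roughly 10 to 14 seconds after Host B disappears, after which the recreate strategy can place a replacement on Host C.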
by Gilman Callsen • @gcallsen • Founder & CTO of Rho AI