Health Checks in Rancher using TCP

There are currently two ways to run health checks in Rancher: HTTP and TCP (http://docs.rancher.com/rancher/latest/en/cattle/health-checks/). The former is relatively straightforward, but it requires a web server with a route that responds with a 2xx or 3xx status. TCP checks are a nice, lightweight option for services that don’t run a web server.

We built a very simple TCP server that we use in a lot of our services. It has one task: letting the health check agent open a connection to a specified port. It closes the connection as soon as it’s opened; that’s all the health check needs to see.

Below is what our script looks like:

health_check.py
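A minimal sketch of a server along these lines, using only the Python standard library. It is not necessarily the exact script we run: the function names `open_listener` and `serve_health_checks`, the `max_connections` argument (there only so the loop can be stopped in tests), and the default port 12345 are assumptions made for illustration.

```python
import socket

# Assumed default; it should match the `port` in your health_check config.
HEALTH_CHECK_PORT = 12345


def open_listener(host="0.0.0.0", port=HEALTH_CHECK_PORT):
    """Create a TCP socket that listens on host:port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts without waiting for the old socket to time out.
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(5)
    return server


def serve_health_checks(server, max_connections=None):
    """Accept connections and close each one right away.

    Runs forever when max_connections is None; otherwise returns the
    number of connections handled once the limit is reached.
    """
    handled = 0
    while max_connections is None or handled < max_connections:
        conn, _addr = server.accept()
        conn.close()  # a successful connect is all the check needs
        handled += 1
    return handled
```

Calling `serve_health_checks(open_listener())` at the bottom of the file would make it run until the container stops.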

Below is an example script that can be defined as a container’s entrypoint. It executes `health_check.py` as a background process and then continues with whatever else your container’s entrypoint should do (in this case, executing `my_awesome_script.py`):

worker-up.sh
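The same idea can be sketched in Python for readers who would rather keep the entrypoint in one language. This is a stand-in, not the contents of `worker-up.sh`: the function name `run_with_health_check` is hypothetical, and it assumes both scripts can be started as ordinary child processes.

```python
import subprocess
import sys


def run_with_health_check(health_cmd, main_cmd):
    """Start health_cmd in the background, run main_cmd, return its exit code."""
    # The health check server keeps running while the main command works.
    health_proc = subprocess.Popen(health_cmd)
    try:
        # Blocks until the main command finishes.
        return subprocess.call(main_cmd)
    finally:
        # Stop the background server once the main work is done.
        health_proc.terminate()
        health_proc.wait()
```

An entrypoint built on this might end with `sys.exit(run_with_health_check([sys.executable, "health_check.py"], [sys.executable, "my_awesome_script.py"]))`, so the container exits with the worker’s status.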

This container is now TCP health-check-compatible! Your rancher-compose file would look something like:

    my-worker-service:
      scale: 2
      health_check:
        port: 12345
        interval: 4000
        initializing_timeout: 30000
        unhealthy_threshold: 3
        strategy: recreate
        healthy_threshold: 2
        response_timeout: 2000
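To make these settings concrete, here is a rough sketch of what a TCP check and its thresholds amount to. It is an illustration only, not Rancher’s implementation: `tcp_check` and `evaluate` are hypothetical names, and the streak-counting logic is an assumption based on the documented meaning of `healthy_threshold` and `unhealthy_threshold` (consecutive successes or failures before the state changes).

```python
import socket


def tcp_check(host, port, response_timeout_ms=2000):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=response_timeout_ms / 1000.0):
            return True
    except OSError:
        # Refused, unreachable, or timed out: the check fails.
        return False


def evaluate(results, healthy_threshold=2, unhealthy_threshold=3):
    """Fold a sequence of check results (True/False) into a final state."""
    state = "initializing"
    ok_streak = 0
    fail_streak = 0
    for ok in results:
        if ok:
            ok_streak += 1
            fail_streak = 0
            if ok_streak >= healthy_threshold:
                state = "healthy"
        else:
            fail_streak += 1
            ok_streak = 0
            if fail_streak >= unhealthy_threshold:
                state = "unhealthy"
    return state
```

With the values above, a container that passes two checks and then fails three in a row would end up unhealthy, which is the point where the `recreate` strategy applies.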

This method has been particularly useful to us with spot instances on AWS (see https://medium.com/@pitrho/reduce-server-costs-with-spot-instances-e6abd8da1bff for more about spot instances). Currently, when a host disappears, it stays in a ‘reconnecting’ state in Rancher, and the services that were running on that host do not automatically move elsewhere.

Imagine the following scenario:

You have a service with a scale of 2 that schedules onto spot instances. Two new spot instances (Hosts A and B) connect to Rancher, and your services schedule correctly. One of your instances (Host B) gets shut down and another (Host C) spins up in its place. Rancher now has 3 hosts: Host A (active), Host B (reconnecting), and Host C (active). Your service now has only 1 container running, on Host A.

If you deploy your services with health checks and the `recreate` strategy, then once the `unhealthy_threshold` has been met, Rancher will try to recreate the second container, find that Host C is available, and schedule the container there.

by Gilman Callsen • @gcallsen • Founder & CTO of Rho AI


Rho AI builds customized data science products and services to solve real-world problems.
