Avoid Blind Wait In DevOps Code

DevOps code occasionally needs to check a status and wait before running further steps. For example: wait for service A to be up, then start service B; confirm a TCP port is listening, then send requests.
For simplicity, or under time pressure, people often use a blind wait like “sleep 10”. This is certainly not good enough. How can we improve on it at an affordable cost?
Original Article: http://dennyzhang.com/blind_wait
Let’s examine the automation requirement below, which is quite common in the daily life of DevOps. You’re asked to start service1, then service2. However, you can only start service2 after service1 is up and running well. Otherwise, service2 may fail to start or behave unexpectedly.
Solution v1.0: blind wait
service service1 start
sleep 10
service service2 start
Here we wait for a while (10 seconds) in between. The good news is that it works in most cases. However, this implementation has two drawbacks:
- No Guarantee Of Assumption. Even after waiting 10 seconds, we can’t be sure service1 is up, so starting service2 may still fail. Worse, running the following steps on this false assumption may lead to unexpected situations.
- Waste Of Time. Let’s say service1 usually takes less than 4 seconds to start. Then we always waste over 6 seconds on the blind wait.
To improve this, we can keep polling the status of service1. Though it’s usually hard to claim a service is 100% healthy, we can make a safe trade-off: if “service XXX status” reports running, or the TCP port is listening, we can say the service is probably OK.
Solution v2.0: wait with bash loop
service service1 start
# Poll once per second until the TCP port is listening, or give up at the timeout
tcp_port=8080
timeout_seconds=10
check_pass=false
for ((i=0; i<timeout_seconds; i++)); do
    if lsof -i "tcp:$tcp_port" | grep -qi listen; then
        echo "$tcp_port is listening"
        check_pass=true
        break
    fi
    sleep 1
done
if $check_pass; then
    echo "check pass"
else
    # Stop here: starting service2 without service1 is what we want to avoid
    echo "check fail" >&2
    exit 1
fi
service service2 start
With around 20 extra lines of code, we solve the problem. So is it good enough now? Not really! This requirement is very common, so repeating the loop in every script means a lot of code duplication.
What we need is a common wait mechanism. If the condition is met, it reports OK. If the check keeps failing until the timeout, it reports ERROR. Here comes a general tool: wait_for.sh

Solution v3.0: wait in a simple and clever way
service service1 start
wait_for.sh "nc -z -v -w 5 127.0.0.1 8080" 10
service service2 start
Simple and easy, isn’t it?
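The internals of wait_for.sh are not shown in this post. As a rough illustration only, here is a minimal sketch of how such a helper could work, written as a shell function. The function name wait_for, its argument order (check command, then timeout in seconds), the default timeout, and the exact OK/ERROR messages are assumptions for this example, not the actual script.

```shell
# Hypothetical sketch of a generic wait helper; NOT the real wait_for.sh.
# Usage: wait_for "<check command>" [timeout_seconds]
wait_for() {
    local check_command="$1"
    local timeout_seconds="${2:-10}"   # assumed default of 10 attempts
    local attempt=0

    while [ "$attempt" -lt "$timeout_seconds" ]; do
        # Run the check quietly; exit status 0 means the condition is met
        if eval "$check_command" >/dev/null 2>&1; then
            echo "OK: $check_command"
            return 0
        fi
        sleep 1
        attempt=$((attempt + 1))
    done

    # The check never passed within the allowed number of attempts
    echo "ERROR: $check_command" >&2
    return 1
}
```

Sourced into a deployment script, it could be called between the two service commands as: wait_for "nc -z -w 5 127.0.0.1 8080" 10 || exit 1. Because the check is passed in as a command string, the same helper would also cover other conditions, such as a “service XXX status” check. Note that the timeout here counts attempts with a one-second pause each, so a slow check command makes the real wall-clock wait longer.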