Writing Goavail: A Cloud Monitoring and Fast DNS Failover Agent

Background

At Nitro we operate our cloud software offerings in AWS. However, since we strive to be vendor neutral, we run our ingress traffic through HAProxy instances with Keepalived instead of the AWS ELBs. You can read more about Keepalived elsewhere, but the short of it is that Keepalived uses VRRP to detect when HAProxy is down and fails over an AWS EIP with minimal downtime. We additionally leverage Cloudflare’s CDN-like proxying service. In this way we can DNS round robin between multiple HAProxy instances and be assured that each EIP will be associated with a live process.

HAProxy with Keepalived

This is all great; we’ve achieved uptime similar to the ELBs, but with more control and flexibility over our load balancing and ingress/egress traffic. Unfortunately, as is always the case, these benefits come with a tradeoff: Keepalived is still susceptible to network partitions and to the unlikely occurrence of an AWS zone outage.

DNS failover would work well for this use case, but unfortunately it’s not offered through Cloudflare. We looked at other options, but they either brought too much additional overhead or felt over-engineered. With a Nitro hack week coming up, this seemed like the perfect opportunity to write a home-grown IP monitoring and failover agent.

Considerations

The idea is to deploy multiple agents in disparate regions (i.e. different vantage points), all monitoring the same critical endpoints of our Production environment. These agents should be as lightweight and cheap as possible, but also not prone to false positives. They should be peer-aware and able to communicate and agree with one another before taking action.

The EIPs should be monitored continuously, regardless of their state. Should an EIP become unavailable, the agent should interface with Cloudflare to remove the associated A records. Once the EIP is back online, the agent should again interface with Cloudflare to re-add the appropriate A records.
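The remove-then-re-add behaviour described above can be sketched as a small state tracker that only touches DNS when an IP changes state. This is a minimal sketch and not Goavail's actual code: the `DNSUpdater` interface, its `RemoveRecord` and `AddRecord` methods, the `Monitor` type, and the `logUpdater` stand-in are all hypothetical names introduced for illustration. A real agent would back the interface with calls to Cloudflare's API.

```go
package main

import "fmt"

// DNSUpdater is a hypothetical abstraction over the DNS provider.
// It is an assumption for this sketch, not part of Goavail or Cloudflare's API.
type DNSUpdater interface {
	RemoveRecord(ip string) error
	AddRecord(ip string) error
}

// Monitor remembers whether each IP's A record is currently published.
type Monitor struct {
	dns       DNSUpdater
	published map[string]bool
}

// NewMonitor assumes every listed IP has its record published at startup.
func NewMonitor(dns DNSUpdater, ips []string) *Monitor {
	m := &Monitor{dns: dns, published: make(map[string]bool)}
	for _, ip := range ips {
		m.published[ip] = true
	}
	return m
}

// Observe takes the latest health result for an IP and updates DNS
// only when the result differs from the currently published state.
func (m *Monitor) Observe(ip string, healthy bool) error {
	switch {
	case !healthy && m.published[ip]:
		if err := m.dns.RemoveRecord(ip); err != nil {
			return err
		}
		m.published[ip] = false
	case healthy && !m.published[ip]:
		if err := m.dns.AddRecord(ip); err != nil {
			return err
		}
		m.published[ip] = true
	}
	return nil
}

// logUpdater is a stand-in DNSUpdater that only records the calls it receives.
type logUpdater struct{ calls []string }

func (l *logUpdater) RemoveRecord(ip string) error {
	l.calls = append(l.calls, "remove "+ip)
	return nil
}

func (l *logUpdater) AddRecord(ip string) error {
	l.calls = append(l.calls, "add "+ip)
	return nil
}

func main() {
	dns := &logUpdater{}
	m := NewMonitor(dns, []string{"203.0.113.10"}) // documentation-range IP
	for _, healthy := range []bool{true, false, false, true} {
		if err := m.Observe("203.0.113.10", healthy); err != nil {
			fmt.Println("dns update failed:", err)
		}
	}
	fmt.Println(dns.calls)
}
```

Because `Observe` acts only on transitions, the IP keeps being checked while it is offline without repeated remove calls, and the record is re-added on the first healthy observation.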

These DNS updates should be fast. Since we proxy most of our endpoints through Cloudflare, updates to DNS records are transparent and much faster than pure DNS updates where the EIP itself might be cached at the client.

How We Use It

We have Goavail agents deployed in three droplets across separate regions. They’re all aware of each other and send periodic heartbeats (the tool uses Hashicorp’s memberlist library). Since the WAN is assumed to be unreliable, the agents are configured to trigger an event only after five unsuccessful pings. At that point the agent will notify its peers that the EIP is potentially down. Each agent that also observes the EIP being offline will notify its peers as well. If an agent observes that the EIP is down and receives agreement from both of its peers (the number of required agreements is configurable), it will update the appropriate Cloudflare records and take the EIP offline. Since we’re proxying in Cloudflare, this will almost immediately route all traffic through the remaining EIPs.
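The decision rule above (five failed pings locally, then agreement from both peers) can be sketched as follows. This is an illustration under stated assumptions, not the code in the Goavail repo: the `Tracker` type and its methods are hypothetical, peer names such as `agent-b` are made up, and the sketch leaves out the transport entirely (in Goavail the peer messaging goes through memberlist).

```go
package main

import "fmt"

// Tracker holds the failover decision state for a single monitored IP.
// All names here are hypothetical and exist only for this sketch.
type Tracker struct {
	threshold   int             // consecutive failed pings before suspecting the IP
	peersNeeded int             // peer agreements required before acting
	failures    int             // current run of consecutive failed pings
	peerReports map[string]bool // peers that currently report the IP as down
}

func NewTracker(threshold, peersNeeded int) *Tracker {
	return &Tracker{
		threshold:   threshold,
		peersNeeded: peersNeeded,
		peerReports: make(map[string]bool),
	}
}

// RecordPing updates the failure run. It returns true exactly once per
// outage, when the run first reaches the threshold, which is the moment
// the agent would notify its peers.
func (t *Tracker) RecordPing(ok bool) bool {
	if ok {
		t.failures = 0
		return false
	}
	t.failures++
	return t.failures == t.threshold
}

// PeerReport stores or clears a peer's view of the IP.
func (t *Tracker) PeerReport(peer string, down bool) {
	if down {
		t.peerReports[peer] = true
		return
	}
	delete(t.peerReports, peer)
}

// ShouldFailover is true only when this agent sees the IP as down
// and enough peers agree.
func (t *Tracker) ShouldFailover() bool {
	return t.failures >= t.threshold && len(t.peerReports) >= t.peersNeeded
}

func main() {
	t := NewTracker(5, 2) // the values described in this post
	for i := 0; i < 5; i++ {
		if t.RecordPing(false) {
			fmt.Println("threshold reached, notifying peers")
		}
	}
	t.PeerReport("agent-b", true)
	t.PeerReport("agent-c", true)
	fmt.Println("fail over:", t.ShouldFailover())
}
```

Requiring both the local observation and the peer reports means a single agent with a bad network path cannot take an IP out of rotation on its own.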

Deployed Goavail Agents

Our settings are deliberately conservative: we don’t expect very many of these events, and we don’t want to trigger anything if the EIP isn’t really offline. But the settings are configurable for other scenarios too. Check out the GitHub repo for more technical details.

Conclusion

This has been a brief summary of why Goavail was built and how we’re using it today. What started as a personal hack week project has now grown into a site reliability tool we use to monitor our Production environment.

Like what you read? Give Ben Parli a round of applause.
