Outage Postmortem — 20 August 2016
On August 20, the healthchecks.io service experienced a 24 hour outage. During this time, it was unable to process any incoming pings. This caused a large number of checks to go into the “down” state. This, in turn, caused a large number of “Your check has gone down” alerts to be sent out.
The direct cause was an unattended upgrade which restarted the PostgreSQL database process. The database restart alone should have only caused a brief downtime. Unfortunately, one of the database clients was not prepared to deal with unexpectedly closed database connections. It continuously tried to use the closed connection–and failed every time.
Meanwhile, the single maintainer of healthchecks.io website (which is me, Pēteris) was out and about, unreachable by email, and unaware of any of the incoming monitoring alerts or twitter messages.
The healthchecks.io service consists of the following components, each running on a separate virtual machine hosted at DigitalOcean, running Ubuntu 16.04:
- a server running the PostgreSQL database
- a server running the ping listener service. This is a NGINX web server reverse-proxying a small, custom Go program. Let’s call the program “hchk.go”. The hchk.io address points to this server.
- a server running the Django website. This is a caddy web server, reverse-proxying an uwsgi process. The healthchecks.io address points to this server.
- a server running the Django management command for sending out alerts.
Out of the box, an Ubuntu 16.04 server comes with unattended security updates enabled. Late August 19, the database server installed a security update for PostgreSQL. It then proceeded to restart the database process. During the restart, any open database connections are of course closed.
After the database restart, the website and the alerting service continued to work normally. They use Django framework and psycopg2 database driver, which together take care of opening new connections as is necessary.
The hchk.go program, however, is written in a different programming language (Go), and uses a different database driver (pgx). The program was not tested in the scenario where the database closes a connection. Its error handling for failed SQL operations was effectively a “write the error to a log file and march on”. So, when the database restarted, hchk.go was stuck with a dead connection, and returned HTTP 400 “Bad Request” messages to each and every incoming ping. A simple restart of the hchk.go process fixed the immediate problem.
To make it clear, there is nothing wrong with the Go programming language or the database driver used. They were just used incorrectly by the author of the hchk.go program (me). The reason for writing a custom Go program in the first place is performance. Before implementing the hchk.go program, server’s CPU usage was starting to get dominated by the incoming pings. CPU was mostly spent running Python code. After the switch, CPU usage dropped significantly, and currently CPU is mostly spent on TLS handshakes.
The healthchecks.io website has a few provisions for monitoring itself:
- The virtual machines are monitored by NewRelic
- The healthchecks.io website is monitored by UptimeRobot
- The alert sending service is monitored by both Cronitor and Dead Man’s Snitch
The monitoring worked as expected: within 5 minutes after the database restart there were alerts sent out to me. Unfortunately no-one was around for the next 24 hours to see the alerts. Had I checked my email right after the issue started, it would still have taken about 300 kilometers and 4+ hours to get to a PC with the necessary SSH keys.
- Disable unattended upgrades. Done on the database server. Will do on the other servers during next upgrade cycle.
- Fix the hchk.go application to handle closed or misbehaving database connections. Done.
- Keep mobile roaming data enabled next time I go on a trip abroad, and read emails.
- Set up a small laptop with development and deployment tools, and take it on the longer trips.
In closing: I apologize to all healthchecks.io users for any inconvenience caused. For a monitoring service, any downtime is unacceptable.
Being run on a shoestring budget, healthchecks.io can only offer a best effort availability. I, however, welcome the challenge and will aim to make the best of the resources available!