First fire drill and post-mortem

Todd Wolfson · Find Work · Feb 13, 2017

We just had our first fire drill and we haven’t even launched yet. Thankfully it was relatively easy to patch, it was great practice, and it highlighted some areas for improvement.

What happened

Our metrics monitoring service, Librato, had been sending us email alerts since a deployment yesterday (Friday Feb 10 at 6:00PM).

We typically wouldn’t deploy this late (a good rule of thumb is to stop deploying 2 hours before you plan to leave), but we aren’t live yet.

We finally saw them on Saturday Feb 11 at 4:15PM, logged in, shut off the alert, and started poking around.

Librato was right — memory usage was high but we weren’t sure whether this was natural application growth or something recent.

Memory growth over past 3 days

One point of frustration was that deploy times weren’t marked on this graph; they would have made our initial triage much easier.

The first thing to rule out was rogue processes on the server. I ssh’d in, ran htop, and everything looked good. Unfortunately, I don’t have a screenshot of this. We had high “RES” (resident memory) but only for expected processes (e.g. Node.js).
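As an aside, if you want to confirm from inside the app what htop’s RES column is showing, Node.js exposes its own resident set size via `process.memoryUsage()`. A minimal sketch (the label and logging interval are arbitrary):

```js
// Log this process' resident set size (what htop reports as RES)
// using Node's built-in process.memoryUsage(); no dependencies needed.
function logMemory(label) {
  const { rss, heapUsed } = process.memoryUsage();
  const toMB = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  console.log(`[${label}] RSS: ${toMB(rss)}MB, heap used: ${toMB(heapUsed)}MB`);
}

// e.g. log once a minute from the master and each worker
setInterval(() => logMemory(`pid ${process.pid}`), 60 * 1000);
```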

Next was to find out which process was the culprit. Since only the Node.js processes were high, we started there. The patch came in 2 parts:

  • We use cluster to run multiple worker processes. The master process was loading our application content (only workers should do this), bloating it from 20MB to 120MB.
  • In yesterday’s deployment, we added Maxmind for GeoIP lookup so we can infer user timezones. This bloated all of our processes (1 server master, 2 server workers, and 1 queue worker) from 70MB to 120MB. We stopped loading Maxmind in our queue worker, since it doesn’t need it. (A rough sketch of the resulting setup is below.)
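For the cluster side of the patch, here’s a rough sketch of the structure we ended up with. Module names, worker counts, and the GeoIP wiring are illustrative stand-ins, not our actual code:

```js
// server.js: minimal sketch of the post-patch process layout (names are stand-ins)
const cluster = require('cluster');

if (cluster.isMaster) {
  // The master only forks and supervises workers. It no longer requires our
  // application content, which is what had bloated it from ~20MB to ~120MB.
  const WORKER_COUNT = 2;
  for (let i = 0; i < WORKER_COUNT; i += 1) {
    cluster.fork();
  }
  cluster.on('exit', function () { cluster.fork(); });
} else {
  // Only workers load the application (and, for server workers, the GeoIP data)
  const app = require('./app');
  app.listen(process.env.PORT || 3000);
}
```

The queue worker lives outside of cluster entirely, so its fix was simply to stop requiring the GeoIP module.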

After some Vagrant reproduction and branch-based deployments, we verified that the error had been resolved and our memory usage was back to normal:

Post-resolution `htop`
Post-resolution Librato

Time of resolution: Saturday Feb 11 at 5:30PM

Impact

Typically post-mortems include how many users were affected, queue delays, or something similar. Since we haven’t launched yet, this section isn’t strictly necessary, but we’ll do it nonetheless:

  • Users affected: 0
  • Additional impact: None

Improvements

As with any post-mortem, it’s useless if we don’t learn from our mistakes. Here’s our plan to prevent future issues as well as resolve them faster:

  • Be more diligent about memory usage when programming
  • Enable more interrupt-based alerts (e.g. SMS, push notifications)
  • Add deployment markers to Librato for easier triage
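On the last point: Librato supports annotation streams via its REST API, so deploy markers can likely be posted from our deploy script with something like the sketch below. The stream name (“deployments”), env variables, and payload fields are assumptions; verify against Librato’s annotations docs:

```js
// post-deploy-annotation.js: sketch of posting a deployment marker to Librato's
// annotations API (POST /v1/annotations/:stream_name). Stream name, env vars,
// and payload fields are assumptions; check Librato's docs for specifics.
const https = require('https');

const body = JSON.stringify({
  title: 'Deploy ' + (process.env.GIT_SHA || 'unknown'),
});

const req = https.request({
  hostname: 'metrics-api.librato.com',
  path: '/v1/annotations/deployments',
  method: 'POST',
  auth: process.env.LIBRATO_EMAIL + ':' + process.env.LIBRATO_TOKEN,
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(body),
  },
}, function (res) {
  console.log('Librato responded with status ' + res.statusCode);
});

req.on('error', function (err) { console.error('Failed to post annotation:', err); });
req.write(body);
req.end();
```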
