First fire drill and post-mortem

Todd Wolfson · Find Work · Feb 13, 2017

We just had our first fire drill and we haven’t even launched yet. Thankfully it was relatively easy to patch, it was great practice, and it highlighted some areas for improvement.

What happened

Our metrics monitoring service, Librato, had been sending us email alerts since a deployment yesterday (Friday Feb 10 at 6:00PM).

We typically wouldn’t deploy this late (a good rule of thumb is to stop deploying 2 hours before you plan to leave), but we aren’t live yet.

We finally saw them on Saturday Feb 11 at 4:15PM, logged in, shut off the alert, and started poking around.

Librato was right — memory usage was high but we weren’t sure whether this was natural application growth or something recent.

Memory growth over past 3 days

One point of frustration was that deploy times weren’t marked on this graph; they would have made our initial triage much easier.

The first thing to rule out was rogue processes on the server. I ssh’d in, ran htop, and everything looked good. Unfortunately, I don’t have a screenshot of this. We had high “RES” (resident memory) but only for expected processes (e.g. Node.js).
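As an aside, if you want to confirm from inside the app what htop’s RES column is showing, Node.js exposes its own resident set size via `process.memoryUsage()`. A minimal sketch (the label and logging interval are arbitrary):

```js
// Log this process' resident set size (what htop reports as RES)
// using Node's built-in process.memoryUsage(); no dependencies needed.
function logMemory(label) {
  const { rss, heapUsed } = process.memoryUsage();
  const toMB = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  console.log(`[${label}] RSS: ${toMB(rss)}MB, heap used: ${toMB(heapUsed)}MB`);
}

// e.g. log once a minute from the master and each worker
setInterval(() => logMemory(`pid ${process.pid}`), 60 * 1000);
```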

Next was to find out which process was the culprit. Since only the Node.js processes were high, we started there. The patch came in 2 parts:

  • We use cluster to run multiple worker processes. The master process was loading our application content (only workers should do this), bloating it from 20MB to 120MB.
  • In yesterday’s deployment, we added Maxmind for GeoIP lookup so we can infer user timezones. This bloated all of our processes (1 server master, 2 server workers, and 1 queue worker) from 70MB to 120MB. We stopped loading Maxmind in our queue worker, since it doesn’t need it. (A rough sketch of the resulting setup is below.)
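For the cluster side of the patch, here’s a rough sketch of the structure we ended up with. Module names, worker counts, and the GeoIP wiring are illustrative stand-ins, not our actual code:

```js
// server.js: minimal sketch of the post-patch process layout (names are stand-ins)
const cluster = require('cluster');

if (cluster.isMaster) {
  // The master only forks and supervises workers. It no longer requires our
  // application content, which is what had bloated it from ~20MB to ~120MB.
  const WORKER_COUNT = 2;
  for (let i = 0; i < WORKER_COUNT; i += 1) {
    cluster.fork();
  }
  cluster.on('exit', function () { cluster.fork(); });
} else {
  // Only workers load the application (and, for server workers, the GeoIP data)
  const app = require('./app');
  app.listen(process.env.PORT || 3000);
}
```

The queue worker lives outside of cluster entirely, so its fix was simply to stop requiring the GeoIP module.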

After some Vagrant reproduction and branch-based deployments, we verified that the error had been resolved and our memory usage was back to normal:

Post-resolution `htop`
Post-resolution Librato

Time of resolution: Saturday Feb 11 at 5:30PM

Impact

Typically post-mortems include how many users were affected, queue delays, or something similar. Since we haven’t launched yet, this section isn’t strictly necessary, but we’ll do it nonetheless:

  • Users affected: 0
  • Additional impact: None

Improvements

As with any post-mortem, it’s useless if we don’t learn from our mistakes. Here’s our plan to prevent future issues as well as resolve them faster:

  • Be more diligent about memory usage when programming
  • Enable more interrupt-based alerts (e.g. SMS, push notifications)
  • Add deployment markers to Librato for easier triage
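On the last point: Librato supports annotation streams via its REST API, so deploy markers can likely be posted from our deploy script with something like the sketch below. The stream name (“deployments”), env variables, and payload fields are assumptions; verify against Librato’s annotations docs:

```js
// post-deploy-annotation.js: sketch of posting a deployment marker to Librato's
// annotations API (POST /v1/annotations/:stream_name). Stream name, env vars,
// and payload fields are assumptions; check Librato's docs for specifics.
const https = require('https');

const body = JSON.stringify({
  title: 'Deploy ' + (process.env.GIT_SHA || 'unknown'),
});

const req = https.request({
  hostname: 'metrics-api.librato.com',
  path: '/v1/annotations/deployments',
  method: 'POST',
  auth: process.env.LIBRATO_EMAIL + ':' + process.env.LIBRATO_TOKEN,
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(body),
  },
}, function (res) {
  console.log('Librato responded with status ' + res.statusCode);
});

req.on('error', function (err) { console.error('Failed to post annotation:', err); });
req.write(body);
req.end();
```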
