First fire drill and post-mortem
We just had our first fire drill and we haven’t even launched yet. Thankfully it was relatively easy to patch, it was great practice, and it highlighted some areas for improvement.
What happened
Our metrics monitoring service, Librato, had been sending us email alerts since a deployment yesterday (Friday Feb 10 at 6:00PM).
We typically wouldn’t deploy this late (a good rule of thumb: stop deploying 2 hours before you plan to leave), but we aren’t live yet.
We finally saw them on Saturday Feb 11 at 4:15PM, logged in, shut off the alert, and started poking around.
Librato was right — memory usage was high but we weren’t sure whether this was natural application growth or something recent.
One point of frustration was that this graph had no deploy markers; they would have made our initial triage much easier.
The first thing to rule out was rogue processes on the server. I `ssh`’d in, ran `htop`, and everything looked good. Unfortunately, I don’t have a screenshot of this. We had high “RES” (resident memory) but only for expected processes (e.g. Node.js).
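Beyond `htop`, a quick in-process check would have told us the same thing: Node reports its own resident set size (the `htop` “RES” column) via `process.memoryUsage()`. A minimal sketch:

```javascript
// Minimal sketch: log this process's resident set size (what htop
// shows as RES) from inside Node itself.
const rssMB = Math.round(process.memoryUsage().rss / 1024 / 1024);
console.log(`pid ${process.pid}: RSS ${rssMB}MB`);
```

Logging this periodically from each worker would have surfaced the memory jump without anyone having to `ssh` in.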
Next was finding out which process was the culprit. Since only Node.js was high, we focused there. The patch came in two parts:

- We use `cluster` to run multiple worker processes. The `master` process was loading our application content (only `workers` should do this), bloating it from 20MB to 120MB.
- In yesterday’s deployment, we added Maxmind for GeoIP lookup so we can assume user timezones. This bloated all of our processes (1 `server` master, 2 `server` workers, and 1 `queue` worker) from 70MB to 120MB. We removed the Maxmind load from our `queue` worker since it doesn’t need it.
After some Vagrant reproduction and branch-based deployments, we verified that the error had been resolved and our memory usage was back to normal:
Time of resolution: Saturday Feb 11 at 5:30PM
Impact
Typically post-mortems include how many users were affected, queue delays, or something similar. Since we haven’t launched yet, this section isn’t strictly necessary, but we’ll do it nonetheless:
- Users affected: 0
- Additional impact: None
Improvements
As with any post-mortem, it’s useless unless we learn from our mistakes. Here’s our plan to prevent future issues and to resolve them faster:
- Be more diligent memory-wise when programming
- Enable more interrupt-based alerts (e.g. SMS, push notifications)
- Add deployment markers to Librato for easier triage
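For the deploy markers, Librato exposes an annotations endpoint that a deploy script can POST to. A hedged sketch (the endpoint path and `title` field are from our reading of their API docs; the `deploys` stream name is our own choice):

```javascript
// Build the request a deploy hook would send to Librato's annotations
// API; actually sending it needs basic auth (LIBRATO_USER / LIBRATO_TOKEN).
function deployAnnotation(stream, sha) {
  return {
    method: 'POST',
    url: `https://metrics-api.librato.com/v1/annotations/${stream}`,
    body: { title: `Deploy ${sha}` },
  };
}

console.log(deployAnnotation('deploys', 'abc1234').url);
```

Wired into the deploy script, each release would then show up as a vertical marker on the memory graph, answering the “was it the deploy?” question at a glance.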