Instapaper Outage Cause & Recovery
Brian Donohue

Hi Brian, thanks for the detailed and thoughtful write-up. This sounds like it was a difficult problem to troubleshoot and resolve, and it was no doubt very stressful for all involved. Transparency about process and mistakes is always appreciated; it helps us all get better and avoid repeating the same errors. This incident also reveals some of the potential downsides of using a hosted database solution in which you do not have access to the underlying OS and filesystem.

The decision to correct the error and restore service using the existing technology stack was the right one; it is never a good idea to make dramatic changes while dealing with an outage. Now that the issue is resolved, it would be appropriate to re-evaluate the architecture and prioritize additional operational work to improve the platform. You have already identified some important technical debt: the service is running in EC2 Classic and thus does not have access to the most up-to-date, stable, and supported Amazon services. Moving to VPC is prudent, and it also provides a foundation for migrating to Aurora, if that is determined to be a good decision.

Furthermore, your write-up seems to indicate that there is no full-time operational staff (e.g. Pinterest SREs or others) with direct responsibility for the Instapaper service. Without oversight from the operations side, it is easy to end up with benign or accidental neglect of work that is critical to service health, such as monitoring, patching, and backup/recovery. Clearly you have all of these things to some extent today, and I don’t mean to criticize your level of engagement, only to suggest that dedicated operations engineers could take over some of that burden and increase site reliability.

Thanks for continuing to provide such a valuable service, and best of luck!
