Great postmortem. You guys made some very poor decisions regarding your infrastructure that reflect a lack of experience. Of course, true experience is only gained from _exactly_ the sort of thing that happened. Bad admins try to CYA (“it was an AWS problem”), fix the problem quietly, and then later make the same mistakes or have their successors make the same mistakes. You deserve a lot of credit for the honest evaluation and for learning the appropriate lessons, and even more credit for sharing those lessons with the rest of the world.
Of course, having mistakes bite you is unfortunate (even if somewhat deserved). Having it happen at 3am is really unfortunate (especially if you’re not in the same time zone as a lot of your customers). Having a problem with your alerting system at 3am when things break is really, really unfortunate. And having a full AWS lockout at the time…well, there aren’t enough “reallys” for that one. You guys definitely deserve all the #opshugs, and probably a large number of beers afterward.