The Time Our Security Engineer Made Stan Too Secure
Here at Stan, we always aim for stan-out-of-ten execution. In this case, we should have settled for 9. Here is the story of how our security architect, Kc, caused an outage by making Stan a little too secure.
How We Found Out
It was shortly after team lunch during “Securithon”, a two-day session we hold once a month to improve our security posture and site reliability. Ironically, we were about to become as unreliable as we could possibly be.
Ricky, one of our engineers, noticed the staging environment returning unusual 403 responses while he was trying to QA a feature. It was only the staging environment, so he gave the team a casual heads-up. Shiva, our head of engineering, in turn informed Kc, since the two of them had previously discussed implementing AWS WAF rules. Very shortly after, two other engineers confirmed that the issue existed in production as well... uh ohhh, we had an outage.
Diagnosis
It turned out that about an hour before lunch (and Ricky’s discovery), Kc had been investigating ways to defend against malicious inputs while the engineering team got around to fixing some platform vulnerabilities. AWS WAF had some managed rules that seemed applicable and effective at preventing the exploits. After some API testing, Kc turned on the rules in the staging environment and, about 40 minutes later, applied the same rules to production (you see where this is going).
The rules turned out to be too aggressive and produced false positives: legitimate requests were blocked at the firewall level. For a brief moment, our application had become so secure that nobody could use it!
Immediate Actions
Kc deleted the rules from the WAF and started investigating how we could still get proactive web application security without sacrificing uptime.
What did we learn?
- Even security changes at the infrastructure or backend level should trigger QA testing.
- Since the testing of the AWS WAF rules was limited to the problem they were intended to solve, Kc missed the side effects visible from the UI and, with them, the other functionality that could be impacted.
- When deploying AWS WAF rules as an interim solution for existing vulnerabilities, ensure they are tightly scoped.
- The scope of the rules was too wide and impacted endpoints that should not have been impacted. Specifically, since the vulnerable endpoints were known, the rules should have been restricted to those endpoints. Since this was intended to be an interim fix, it didn’t need to protect against “future” or “variant” attacks.
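To make the "tightly scoped" idea concrete, here is a minimal sketch of what a scoped rule entry could look like in the AWS WAFv2 rule format. It only builds the rule as a Python dictionary and does not call AWS. The rule group name (`AWSManagedRulesCommonRuleSet`), the path `/api/vulnerable-endpoint`, and the helper names are assumptions for illustration, not the rules or endpoints from this incident. The `count_only` switch is our own addition: it sets the rule group to count matches instead of blocking them, which is one way to observe a rule before enforcing it.

```python
from typing import Any


def path_match(path_prefix: str) -> dict[str, Any]:
    """Build a WAFv2 statement matching URI paths that start with path_prefix."""
    return {
        "ByteMatchStatement": {
            "SearchString": path_prefix.encode("utf-8"),
            "FieldToMatch": {"UriPath": {}},
            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
            "PositionalConstraint": "STARTS_WITH",
        }
    }


def scoped_managed_rule(
    name: str,
    priority: int,
    rule_group: str,
    path_prefixes: list[str],
    count_only: bool = True,
) -> dict[str, Any]:
    """Build a managed rule group entry that only inspects the given paths."""
    if not path_prefixes:
        raise ValueError("at least one path prefix is required to scope the rule")

    matches = [path_match(prefix) for prefix in path_prefixes]
    # A single path needs no wrapper; several paths are combined with OR.
    scope_down = (
        matches[0] if len(matches) == 1 else {"OrStatement": {"Statements": matches}}
    )

    return {
        "Name": name,
        "Priority": priority,
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                "Name": rule_group,
                # Requests outside these paths skip the rule group entirely.
                "ScopeDownStatement": scope_down,
            }
        },
        # Count = record matches without blocking; None = keep the group's own actions.
        "OverrideAction": {"Count": {}} if count_only else {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }


# Hypothetical example: one known-vulnerable endpoint, observed in count mode first.
rule = scoped_managed_rule(
    name="interim-scoped-rule",
    priority=0,
    rule_group="AWSManagedRulesCommonRuleSet",
    path_prefixes=["/api/vulnerable-endpoint"],
)
```

An entry like this would go into the `Rules` list of a web ACL, for example through boto3's `wafv2` client and `update_web_acl`. Starting in count mode and switching to `count_only=False` once QA has signed off is a design choice in this sketch, not a description of what we actually deployed.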
Just like Icarus, we flew too close to the sun and learned that day that too much excitement can lead to unexpected problems. What a bummer.
🤔 Do you have a similar story? Share it with us! Pinky promise, we won’t laugh (too hard).
Until next time,
The Stangineering Team