The Time Our Security Engineer Made Stan Too Secure

Stangineering Team
Engineering @ Stan
3 min read · May 24, 2024

Here at Stan, we always aim for stan-out-of-ten execution. In this case, we should have settled for 9. Here is the story of how our security architect, Kc, caused an outage by making Stan a little too secure.

How We Found Out

It was shortly after team lunch during “Securithon”, a two-day session held once a month and focused on improving our security posture and site reliability. Ironically, we were about to become as unreliable as we could possibly be.

Ricky, one of our engineers, soon noticed the staging environment returning unusual 403 responses while he was trying to QA a feature. It was only the staging environment, so he casually notified the team. Shiva, our head of engineering, in turn informed Kc, as the two had previously discussed implementing AWS WAF rules. Very shortly after, two other engineers confirmed that the issue existed in production as well... uh ohhh, we have an outage.

Diagnosis

It turned out that about an hour before lunch (and Ricky’s discovery), Kc had been investigating ways to defend against malicious inputs until the engineering team got around to fixing some platform vulnerabilities. AWS WAF had some managed rules that seemed applicable and effective at preventing the exploits. After some API testing, Kc turned on the rules in the staging environment and, about 40 minutes later, implemented the same rules in production (you see where this is going).

It appears these rules were too aggressive and produced false positives, blocking legitimate requests at the firewall level. For a brief moment, our application had become so secure that nobody could use it!

Immediate Actions

Kc deleted the rules from the WAF and then investigated how we could still get proactive web application security without sacrificing uptime.
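One guard against a repeat is a quick smoke check that replays a few ordinary requests right after a WAF change and flags any that come back 403. The sketch below is only an illustration, not what we actually ran: the function names, the base URL, and the paths are all hypothetical, and it assumes a blocked request shows up as a plain 403, as it did for us.

```python
from typing import Callable, Iterable
from urllib import error, request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:
        # urllib raises on 4xx/5xx responses; the status code is still what we want.
        return exc.code


def find_blocked(
    base_url: str,
    paths: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> list[str]:
    """Return the paths that answer 403 (assumed here to mean 'blocked by the WAF')."""
    return [path for path in paths if fetch(base_url + path) == 403]
```

Calling `find_blocked("https://staging.example.com", ["/", "/login"])` (a made-up host and made-up routes) right after enabling a rule would return the routes that are now being rejected. The `fetch` parameter is there so the check can be exercised without a network.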

What did we learn?

  • Even security changes at the infrastructure or backend level should trigger QA testing.
  • Because the AWS WAF rules were tested only against the problem they were meant to solve, Kc missed their side effects on the UI and on other functionality they could affect.
  • When deploying AWS WAF rules as an interim solution for existing vulnerabilities, ensure they are tightly scoped.
  • The rules were scoped too widely and affected endpoints they should never have touched. Since the vulnerable endpoints were known, the rules should have been restricted to those endpoints. And since this was meant as an interim fix, it didn’t need to protect against “future” or “variant” attacks.
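To make “tightly scoped” concrete, here is a rough sketch of a WAFv2 rule that applies a managed rule group only to requests whose path starts with a given prefix, using a scope-down statement. The rule group named here (`AWSManagedRulesCommonRuleSet`), the rule and metric names, and the example path are assumptions for illustration; they are not the rules or endpoints from our incident.

```python
def scoped_managed_rule(path_prefix: str, priority: int = 0) -> dict:
    """Build a WAFv2 rule dict that limits a managed rule group to one URI path prefix."""
    return {
        "Name": "interim-scoped-managed-rules",  # hypothetical rule name
        "Priority": priority,
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                # Assumed rule group; use whichever one addresses the known issue.
                "Name": "AWSManagedRulesCommonRuleSet",
                # Only requests matching this statement are evaluated by the group.
                "ScopeDownStatement": {
                    "ByteMatchStatement": {
                        "SearchString": path_prefix.encode(),
                        "FieldToMatch": {"UriPath": {}},
                        "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                        "PositionalConstraint": "STARTS_WITH",
                    }
                },
            }
        },
        # Keep the rule group's own actions rather than overriding them.
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "interimScopedManagedRules",  # hypothetical metric name
        },
    }
```

A dict like `scoped_managed_rule("/api/v1/upload")` (a made-up path) would be one entry in the `Rules` list of a web ACL, for example when calling boto3's `wafv2` `update_web_acl`. Every other endpoint then never reaches the managed rules at all.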

Just like Icarus, we flew too close to the sun, and we learned that day that too much excitement can lead to unexpected problems. What a bummer.

🤔 Do you have a similar story? Share it with us! Pinky promise, we won’t laugh (too hard).

Until next time,

The Stangineering Team
