A while back I was working for a mid-size client that used AWS as their infrastructure provider. For a long time there had been little governance over the AWS account: development teams had full access to account functions and often provisioned their applications manually. As the organization's cumulative ecosystem grew to around 300 EC2 instances, manual provisioning was gradually phased out across teams in favor of automation, thanks to services like CloudFormation, Elastic Beanstalk, and Ansible. The broad privileges granted in the early days, however, remained in place.
On one occasion, a set of AWS access keys was accidentally leaked to a public GitHub repository, with devastating results: a malicious bot found the keys and wreaked havoc across the organization.
Below is the timeline of the incident, relative to the first event (timestamps in hh:mm format):
An AWS access key ID and secret access key are published to a developer’s personal public repository on GitHub.
Less than a minute later, a malicious bot scanning GitHub repositories finds the published keys. Using them, it creates its own highly privileged AWS user within the account, which it then uses for all subsequent actions. Its first task is to remove all existing users and disable existing access keys.
At this point the organization has lost control over the account, but is not yet aware of it.
The AWS security team notifies the company’s IT team by email of a potential account compromise. The team confirms it and, with AWS’s help, recovers access to the account. There is much confusion about what has happened and what the impact is; nobody yet knows about the newly created malicious user or the actions it is invoking.
The bot starts deleting EC2 instances. To accomplish this, it spins up a Lambda function that performs the ‘clean-up’ of resources from the inside.
All EC2 instances have been removed. The bot now attempts to spin up new instances (likely for the purpose of mining Bitcoin), but is blocked by AWS’s malicious-behavior detection mechanism. A true war of the machines!
The IT team detects the malicious user and deletes it. From this point on, a long, difficult and highly manual recovery process takes place.
All production systems were impacted by downtime. Of the 15 production applications, about a third were recovered the same day, with the rest unavailable for up to 3 days. The most problematic were legacy services that had been in maintenance mode for more than a year and were deployed manually; for some of them, the necessary skills were no longer present in the team.
No sensitive data was compromised during the incident. Given the breadth of the privileges the bot captured, that was luck rather than anything else; stealing data was likely not the attacker’s focus.
The incident caused major disruption across all the teams, derailing ongoing development and diverting resources to analysis and recovery efforts. For weeks afterwards, teams kept stumbling on components, like test services, still not recovered after the incident, which lowered productivity for a long time.
It is tempting to put the blame on the individual who leaked the access keys to the wide internet, but it is not productive. Mistakes are in our nature; rather than wasting energy trying to eliminate them or hunting for a scapegoat, spend the time establishing solid guardrails. We developers are like ants: we look for the path of least resistance and follow it, so make it easier to do the right thing and incidents will be less likely.
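One such guardrail is scanning commits for credentials before they ever leave a developer’s machine. As a minimal sketch of the idea (real tools such as git-secrets or trufflehog are far more thorough), the script below flags strings shaped like AWS access key IDs, using AWS’s own documented example key for illustration:

```python
import re
import sys

# AWS access key IDs are "AKIA" (long-term) or "ASIA" (temporary)
# followed by 16 uppercase alphanumerics, e.g. the documented
# example key AKIAIOSFODNN7EXAMPLE.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

def find_leaked_key_ids(text: str) -> list[str]:
    """Return all substrings of `text` shaped like AWS access key IDs."""
    return ACCESS_KEY_ID.findall(text)

if __name__ == "__main__":
    # Scan files passed as arguments (e.g. from a pre-commit hook)
    # and exit non-zero if anything suspicious is found.
    hits = []
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="ignore") as f:
            for lineno, line in enumerate(f, 1):
                for key in find_leaked_key_ids(line):
                    hits.append(f"{path}:{lineno}: {key}")
    if hits:
        print("Possible AWS access keys found:")
        print("\n".join(hits))
        sys.exit(1)
```

Wired into a pre-commit hook, this makes the safe path the default: the commit fails before the secret can reach a remote.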
Least privilege principle
I’m not a fan of over-focusing on preventive measures, as they tend to result in point solutions that disproportionately favor the case at hand and, over time, produce overly complex systems. However, there are well-established security patterns that should be followed. One of them is the ‘Principle of Least Privilege’, which requires that users and services can access only the resources they absolutely need.
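As a hypothetical illustration, an IAM policy scoped this way might let a deployment role start and stop only its own application’s instances, rather than granting account-wide EC2 access (the account ID, region, and tag name below are made up for the example):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:StartInstances", "ec2:StopInstances"],
      "Resource": "arn:aws:ec2:eu-west-1:123456789012:instance/*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/app": "billing-service" }
      }
    }
  ]
}
```

A key leaked from a role like this could still cause damage, but nothing like the account-wide destruction described above.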
Prevention, while a good driver for constant improvement, will never fully eliminate incidents. Assuming that something will eventually go wrong (and it will!) gives you the mental space to treat recovery as a first-class operational tool, rather than a plan B. Prioritizing improvement of Mean Time to Repair (MTTR) over Mean Time Between Failures (MTBF) therefore yields better results.
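A back-of-the-envelope calculation shows why. Steady-state availability is MTBF / (MTBF + MTTR), so halving recovery time buys exactly the same availability as doubling the time between failures; the numbers below are illustrative only:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative baseline: a failure every 30 days, one day to recover.
base = availability(720, 24)

# Halving MTTR yields the same availability as doubling MTBF...
faster_recovery = availability(720, 12)
fewer_failures = availability(1440, 24)

# ...but shortening recovery is usually the cheaper, more tractable lever.
```

In other words, the two levers are mathematically symmetric, but investing in recovery also pays off for the failure modes you never anticipated.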
For extra points, employ tools like Chaos Monkey or hold periodic ‘war games’ in which various attack scenarios are exercised. This is also a good opportunity to evolve and document recovery processes and to ensure proper communication channels within the organization are readily available. Be sure to include all the services, especially the legacy ones.
Adopt a more aggressive strategy for legacy services: either spend the effort to phase them out completely or invest in automation and monitoring. Being in limbo with old services, with low operational maturity and no solid plan for sunsetting, is common and puts the organization in a very vulnerable position.
Ensure complete recovery
Naturally, production apps impacting real users take priority during recovery. Reaching a ‘green’ state for all production services is, however, not sufficient. It’s important to recover all components of the ecosystem that participate in the end-to-end development cycle: think test services, CI agent pools, build monitors, etc. Without this work being prioritized, the team’s productivity will keep taking unexpected hits for many weeks to come.