Many engineers have found themselves in the unenviable position of being handed the keys to an AWS environment with no documentation, no training, and no explanation of its contents. Whether an employee leaves the company, teams are restructured, or your company acquires another, you will need to quickly audit the account and get up to speed on its operation. Even worse, many of these inherited accounts are running production infrastructure that must be kept running during the transition period. Now that you’re responsible for this account, you will also be responsible for keeping it secure.
There is a wealth of documentation, training, guides, and other resources available online to learn about security in AWS cloud environments. But many of those resources assume that you are either building an account from scratch, were intimately involved in building the account from its inception, or can take great liberty in applying destructive changes. In our case, the reality is that you’re likely staring at eight years of accumulated infrastructure with absolutely no idea of what’s running or how to make changes without causing a production outage.
I’ve written this guide to help you filter through the mess, isolate the changes you need to make, and start to tame your environment. While I’ll assume that you have AWS experience, we’ll start with the security basics, along with changes that won’t impact running services, before moving to making tweaks that will require a bit more investigation and preparation. Our goal is to quickly triage the situation, implement the lowest risk but most impactful changes first, and then work our way toward a concrete security policy that can be used longer-term.
Note: The absolute best-case scenario when inheriting an account is to spin up a separate new account and migrate applications over time. However, I recognize that that is a pipe dream for many accounts, hence this guide was born.
This guide is not a substitute for a properly-designed security program. Instead, it is designed to be a quick-start guide for the first 30–90 days after assuming ownership of an account that may not have been properly managed previously.
Step 1: Get Stable Access
If you’re lucky, the target account is already configured to work with your organization’s Single Sign-On (SSO) provider. In reality, you’re more likely to have a sticky note with an email address and a password on it. Our first step is to confirm access to the account and embed our own user to avoid losing access. This step is especially crucial if you’re taking over an account because a previous employee left the company.
If you were given a user account and password to sign in with, it’s possible this is the root user account. This is not a good practice, but first we need to stabilize our access by running through the following steps:
- Log in with the email and password to determine if you’re using the root account.
- If the credentials you have are not the root credentials, and you can’t access them elsewhere, you will need to contact AWS support to regain that access. This is extremely important to do quickly; AWS will need to verify your identity, which will likely require you to submit documentation on behalf of your company. You do not want anyone else outside of your company to have root access to the account; everything you do in the next steps will be pointless if someone else has this access.
- If the credentials you have are the root credentials, change them immediately. You can do this by clicking “My Security Credentials” from the account menu at the top right of the dashboard. If possible, change the email to a distribution list that you are a member of and that can receive external email. Choose a very strong password.
- If SSO is not currently set up, create a new IAM user using your own email and password and attach the “AdministratorAccess” managed IAM policy to it. We’ll talk more about SSO later, but for now an email-based user is sufficient.
- Enable MFA for your new user.
Step 2: Stop Using the Root User
Your goal from this point forward is to stop using the root user entirely. To do this safely, we will need to make sure that nothing else is using the root user programmatically and then create an MFA token for the account which you will lock in a safe somewhere.
- Log in as the root user (hopefully for the last time).
- Determine if anyone (or anything) is using the root user’s account via the API by checking the root user’s security credentials.
- Again, with luck, you won’t see any in-use credentials here. But experience tells me that you’re more likely to find two in-use access keys that have been used within the last 24 hours and will spend the rest of your existence attempting to track these keys down.
- If you find keys that have not been used in a reasonable timeframe, delete them. If you’re unsure (maybe the key was used 112 days ago and you have a hunch it’s hardcoded on a production server in a closet at headquarters 1500 miles away), then take a note to come back to this because we need to fix it ASAP.
- Next, enable MFA on the root user account from the same page. Take a screenshot of the QR code key material and save it to Vault (or your company’s secret store), then share the key with your boss or other trusted team members. Do not save the QR code on your phone (or, if you need it to get the initialization codes, delete it immediately after).
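If you'd rather not eyeball timestamps, the "delete or investigate" decision for access keys can be scripted. This is a minimal sketch in Python; the 90-day threshold and the `classify_key` helper are my own assumptions, not anything AWS provides:

```python
from datetime import datetime, timedelta, timezone

# Threshold is an assumption -- pick whatever window you're willing
# to bet your job on, per the advice above.
STALE_AFTER = timedelta(days=90)

def classify_key(last_used, now=None):
    """Classify an access key by its last-used timestamp.

    last_used: datetime of last use, or None if never used.
    Returns "delete" for never-used or stale keys, "investigate" otherwise.
    """
    now = now or datetime.now(timezone.utc)
    if last_used is None or now - last_used > STALE_AFTER:
        return "delete"
    return "investigate"
```

Keys that come back as "investigate" go on the list of things to track down in Step 5.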
Step 3: Update Billing Information
While finance may be happy with someone else paying for your AWS usage for as long as it takes them to discover the charge, you want to get this information changed quickly. This info will be used by AWS to help identify you if you need to recover the account, and you don’t want to get into a digital stalemate if the previous owner tries to claim ownership because their credit card is still footing the bill.
You’re probably going to need to involve finance for this one, so get a fruit basket queued up so they prioritize your ticket and don’t faint when you explain the incoming $142k/month charge they’re about to see.
Once you get the correct billing info, make sure to add it and then remove all other payment methods including bank accounts and credit cards.
If the account is a member of an existing AWS Organization (and you can confirm it’s not one owned by your company), leave the Organization. If your company uses Organizations, be very careful about joining it at this stage; it’s possible that existing Service Control Policies may affect running services or workflows. If billing must be handled through the Organization, you’ll need to discuss adding the account with the Organization admin for this use case to take advantage of “billing only” features.
Once your billing information is changed, it’s time to log out of the root account and switch to using the IAM user created earlier.
Step 4: Enable CloudTrail Logging and Monitoring
Keep in mind that at this point, you still have no idea who or what has access to the account, what is running, and what kind of activity is occurring in it. Let’s fix this by turning on AWS CloudTrail.
- Open the CloudTrail console and determine if an existing trail is configured. If it is, you’ll want to verify that the logs are being sent to a location you have access to. If you don’t recognize the location, modify the trail to send its logs to your organization’s centralized S3 bucket used for log collection. If your organization doesn’t have such a bucket, configure CloudTrail to log to a bucket in your own account for now.
- Make sure to turn on CloudTrail’s optional security features, including encryption at rest and log file integrity validation.
- This is also a good time to set up some basic metric alerts for critical security activity within the account. Ideally, these monitors would run in your organization’s centralized logging environment (e.g. Splunk), but if that’s not possible, you can configure this trail to send its logs to CloudWatch, where you can configure metric alerts. I recommend setting up alerts for CloudTrail and IAM changes at a minimum.
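As a sketch of what those alerts should catch, here's a small Python filter over raw CloudTrail log files. The event names are real CloudTrail event names for the services mentioned above, but the set shown is illustrative and far from exhaustive:

```python
import json

# A starter set of events worth alerting on: CloudTrail tampering
# and IAM changes. Extend this per your own threat model.
SENSITIVE_EVENTS = {
    "StopLogging", "DeleteTrail", "UpdateTrail",     # CloudTrail tampering
    "CreateUser", "DeleteUser", "AttachUserPolicy",  # IAM changes
    "CreateAccessKey", "PutUserPolicy",
}

def flag_sensitive_events(log_file_contents):
    """Return (eventName, caller arn) pairs worth alerting on,
    given the JSON body of a CloudTrail log file."""
    records = json.loads(log_file_contents)["Records"]
    return [
        (r["eventName"], r.get("userIdentity", {}).get("arn", "unknown"))
        for r in records
        if r["eventName"] in SENSITIVE_EVENTS
    ]
```

In practice you'd express the same match as a CloudWatch Logs metric filter rather than polling files yourself, but the logic is the same.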
There are many other AWS security services that may be helpful at this point, such as Amazon GuardDuty, AWS Config, and AWS Security Hub.
One challenge you may have at this stage is separating true security incidents from the noise. These services tend to produce thousands of findings in a busy environment, which could lead you on an endless goose chase. I recommend enabling them in “audit mode” where possible, and returning later once the account is more carefully pruned.
Step 5: Clean Up IAM Entities
I once did some consulting work for a company that had close to 1,200 IAM users in their account, each with access keys. I nearly bit off my tongue during that walkthrough. If you’re in this situation, it’s easy to put these steps off until later. But it’s truly important to get a handle on IAM. A single user or access key with excessive permissions could compromise the entire environment. Our goal in this step is to clean up users that have not been used in a while, delete access keys where possible, and begin to at least scope the policies attached to each user.
- Download the IAM Credential Report for your account, which will contain a number of very important details for each IAM user.
- Start by isolating the easiest users to delete: those who have neither a password (i.e. non-console users) nor access keys or attached certificates. These users have no value (to us anyway; their parents likely still love them). Look for all of the following fields and values:
- password_enabled: false
- access_key_1_active: false
- access_key_2_active: false
- cert_1_active: false
- cert_2_active: false
- Once you’ve deleted these users, it’s time to move on to ones that do not have passwords but may have access keys used sufficiently long ago. “Sufficient” in this case is defined as “the length of time you’re willing to bet your job on a service not being used.” I’ve seen some franken-services arise after years of inactivity, so be careful.
- Next up are users who don’t have access keys, but do have passwords used sufficiently long ago. If Bob hasn’t logged in since Steve Jobs was at the helm at Apple, chances are he doesn’t need this account. Check the password_last_used field for this exercise.
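The first pass above (users with nothing active at all) is easy to script. You can fetch the report with `aws iam generate-credential-report` followed by `aws iam get-credential-report` (the body is base64-encoded CSV); this sketch assumes the decoded CSV and the field names listed above:

```python
import csv
import io

# The five fields from the credential report that indicate any form
# of usable credential.
DEAD_FIELDS = ("password_enabled", "access_key_1_active",
               "access_key_2_active", "cert_1_active", "cert_2_active")

def deletable_users(report_csv):
    """Given the decoded IAM credential report CSV, return usernames
    with no password, no active access keys, and no active certs."""
    rows = csv.DictReader(io.StringIO(report_csv))
    return [
        row["user"] for row in rows
        if row["user"] != "<root_account>"  # root appears in the report too
        and all(row[f] in ("false", "not_supported") for f in DEAD_FIELDS)
    ]
```

The same report drives the later passes: filter on `password_last_used` and `access_key_N_last_used_date` instead.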
Sleuthing for Users
At this point, hopefully you’ve cleaned out a significant portion of users who had access to the account. To handle the remaining ones, it’s time to do some sleuthing.
- Start with users who have both passwords and access keys. If you recognize them, send an email asking them what the keys are being used for and whether they can be disabled. Chances are they left a script running somewhere.
- If the username resembles an AIM screen name from your college days and you don’t recognize the user, we’ll need to get creative. I don’t necessarily recommend locking the account immediately, but if they have excessive permissions, it might be necessary. Just be careful not to disable the in-use access keys at this stage. Hopefully the users know where to find you if their access is revoked, so make sure to climb out of the server room and introduce yourself to the team. If you don’t hear anything in 90 days, chances are the user didn’t need that access and it can be permanently revoked.
- Repeat this process for users who have only passwords. These will be easier, since they can be locked and then safely deleted after a waiting period.
We’ll now be left with a more manageable set of users who have either password or access key access to AWS (but ideally not both at the same time). From this list, I recommend placing them into three categories:
- Humans who need console access for legitimate business purposes.
- Machines using access keys outside of AWS (e.g. Jenkins running in a closet).
- Machines using access keys inside of AWS (e.g. EC2 servers, Lambda, etc.).
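A rough sketch of that triage as code. The group numbers and the `keys_used_inside_aws` flag are my own framing; in practice you'd derive that flag from CloudTrail source IPs (EC2/VPC addresses vs. external ones):

```python
def categorize_user(has_password, has_access_keys, keys_used_inside_aws=False):
    """Sort a surviving IAM user into one of the three groups above.

    Returns 1 (human, console), 2 (machine outside AWS),
    3 (machine inside AWS), or None if the user needs more sleuthing.
    """
    if has_password and not has_access_keys:
        return 1  # human with console access
    if has_access_keys and not has_password:
        return 3 if keys_used_inside_aws else 2
    # Both or neither: go back to the sleuthing steps above.
    return None
```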
Preparing Account Policies
It won’t do much good to have users in Group 1 reset their passwords if they’re allowed to change the password to something simple. Be sure to first check the IAM Password Policy for the account and check all the applicable boxes per your organization’s password policy.
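One way to audit this is to diff the live policy against your requirements. The field names below match what `aws iam get-account-password-policy` returns; the minimum values themselves are illustrative assumptions, so substitute your organization's actual policy:

```python
# Illustrative minimums -- replace with your org's real password policy.
REQUIRED = {
    "MinimumPasswordLength": 14,
    "RequireSymbols": True,
    "RequireNumbers": True,
    "RequireUppercaseCharacters": True,
    "RequireLowercaseCharacters": True,
}

def policy_gaps(policy):
    """Compare the "PasswordPolicy" dict returned by
    `aws iam get-account-password-policy` against REQUIRED.
    Returns {field: current value} for every requirement not met."""
    gaps = {}
    for key, want in REQUIRED.items():
        have = policy.get(key)
        if isinstance(want, bool):
            if have is not True:
                gaps[key] = have
        elif have is None or have < want:
            gaps[key] = have
    return gaps
```

Anything the function reports can then be fixed in one shot with `aws iam update-account-password-policy`.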
For our Group 1 users, work with them to ensure:
- Passwords that have not been reset within the expiration period are reset.
- MFA is enabled for their account.
- Their attached IAM policies are necessary for their job function. Use groups to manage this access where possible.
Tracking Down Access Keys
For Group 2 users, the hard part will be tracking down where the scripts are running. Fortunately, CloudTrail contains a wealth of information, including origin IP address, user agent headers, and other details that can be used to locate the user. When all else fails, you can always try doing a search of your organization’s GitHub installation in ̶h̶o̶p̶e̶s̶ fear the key has been committed there.
For Group 3 users, the goal is to transition them to using IAM roles, deprecate the access keys, and delete the users. This may be easier said than done, especially if these are legacy applications with no automated deployment process.
When all else fails, if the keys cannot be deleted, the next best option is to scope their policies to just the services they need access to. Again, this isn’t an easy task, but there are tools that can help:
- Use the “Access Advisor” tool in IAM to see if the policies being granted to the user are actually being used.
- Use CloudTrail to see specific API calls, source data, and other details to determine if all permissions are necessary.
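Access Advisor data is also available via the API (`aws iam generate-service-last-accessed-details`, then `aws iam get-service-last-accessed-details`), which makes it easy to spot permissions that are granted but never exercised. A sketch, with a 90-day window as an assumed default:

```python
from datetime import datetime, timedelta, timezone

def unused_services(services_last_accessed, older_than_days=90, now=None):
    """Given the "ServicesLastAccessed" list from
    get-service-last-accessed-details, return namespaces of services
    the entity can reach but hasn't touched recently (or ever).
    These are candidates for removal from its policies."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=older_than_days)
    return [
        s["ServiceNamespace"] for s in services_last_accessed
        # "LastAuthenticated" is absent when the service was never used.
        if "LastAuthenticated" not in s or s["LastAuthenticated"] < cutoff
    ]
```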
By now, you should be left with a more organized IAM environment, much more tightly-scoped IAM policies, and a properly configured account password policy so that humans can log in (with passwords and MFA) and machines can access the necessary APIs (with access keys).
Note: Many organizations use Single Sign On internally, which is a more ideal method of configuring AWS access than password-based login for a variety of reasons, including user provisioning and deprecation. If SSO can be used, I recommend setting that up and transitioning your IAM users if possible.
Step 6: Locate Exposed Services
Aside from improperly-configured IAM users, your biggest security risk at this stage is likely to be services that are improperly configured to allow traffic from public endpoints. This includes:
- S3 Buckets set to allow public access
- EC2 and RDS instances and ELB/ALB/NLBs in public subnets with security groups allowing traffic from 0.0.0.0/0.
- ElastiCache instances configured with public access enabled, especially if a password is not set.
- EBS volumes, RDS backups, AMIs, and other storage backups that are shared with large numbers of accounts.
- KMS keys, SNS topics, SQS queues, and other services configured with global or cross-account access.
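To get a quick list of world-open security group rules, you can filter the output of `aws ec2 describe-security-groups`. A minimal sketch (ports are reported as "all" when the rule covers every protocol, since `FromPort`/`ToPort` are absent in that case):

```python
def open_ingress_rules(security_groups):
    """Given the "SecurityGroups" list from
    `aws ec2 describe-security-groups`, return
    (group id, from port, to port) tuples for ingress rules
    open to 0.0.0.0/0."""
    findings = []
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            if any(r.get("CidrIp") == "0.0.0.0/0"
                   for r in perm.get("IpRanges", [])):
                findings.append((
                    sg["GroupId"],
                    perm.get("FromPort", "all"),
                    perm.get("ToPort", "all"),
                ))
    return findings
```

Cross-reference the results with VPC Flow Logs before closing anything, per the note below.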
There isn’t enough storage space on Medium to walk through the detailed steps of fixing all of these issues, but the goal at this point is to plug the most egregious gaps. There are a number of open source auditing tools that can be used to quickly discover at-risk resources, but your biggest objectives should be:
- Closing ports and security group rules that are exposed publicly. You can use VPC Flow Logs (be careful, they can get expensive) to determine usage prior to closing ports.
- Locating S3 buckets that have insecure ACLs and/or bucket policies that allow public or global access. This will keep you out of the news; it’s important to do this quickly. Determining whether the bucket should have this access set will require you to consult with project owners and utilize S3 bucket access logs or CloudTrail S3 Object Logging to evaluate current usage requirements. You can also look into Amazon Macie, but be prepared to take out a reverse mortgage on your company’s fancy new office in SoMa.
- Removing wildcards in access policies for AMIs, EBS backups, and other objects. This is a medium-risk activity; public access is almost certainly not required for a production application, but cross-account access can be a valid use case, so turning a “*” into an account ID may prove difficult.
- In places where making changes could introduce downtime or you have a gut feeling that wildcard in an SNS policy is all that’s keeping your company’s multi-million dollar ERP system from biting the dust, the second-best option is to configure CloudWatch metrics based on CloudTrail logs to monitor for unintended access. Over time, you should get a better sense of what’s required and what can be removed.
Step 7: Lock Down Your Domains
Domains are the lifeblood of your organization’s applications and brand. If someone transfers that domain out of your Route53, a bad time is going to be had by everyone. In this step, your goals are to:
- Configure transfer locks on all of your supported domains.
- Remove domains that may be pointing to non-existent resources.
- Update technical details and contacts.
- Configure domains to auto-renew.
Enabling transfer locks will be an easy and non-destructive process. You can do this quickly via the Route53 console. The same is true for enabling auto-renewal.
Changing the technical and administrative contacts will be more time-consuming but is also a non-breaking change. Just be sure to use an email you have access to and that can receive email from outside sources so you can confirm the ownership.
If the domain is registered outside of Route53, you’ll need to track down the registrar and apply the changes there. If you’re up for a challenge, you can transfer the domains into Route53, but that is much more likely to lead to downtime if a mistake is made.
Domain Takeover via Unclaimed Resources
For domains that have records in Route53 pointing to S3 buckets, it is very important that you audit these records to ensure the bucket actually still exists. There is a very clever attack known as subdomain takeover, in which an attacker can take advantage of the global namespace in which S3 buckets operate to point your subdomain to a bucket they own.
You should take this opportunity to audit all domain records to ensure they are still in use and pointing to valid resources or endpoints.
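A sketch of that audit for the S3 website alias case, where the bucket name must match the record name (which is exactly what makes the takeover possible). The function and its scope are illustrative, not a complete check; CNAMEs and other record types need their own handling:

```python
def dangling_s3_records(record_sets, existing_buckets):
    """Given "ResourceRecordSets" from
    `aws route53 list-resource-record-sets` and the set of bucket
    names you own (`aws s3api list-buckets`), flag alias records
    that point at an S3 website endpoint for a bucket that no
    longer exists -- i.e. a name an attacker could claim."""
    dangling = []
    for rs in record_sets:
        target = rs.get("AliasTarget", {}).get("DNSName", "")
        if "s3-website" in target:
            # For S3 website aliases, the bucket name is the record name.
            bucket = rs["Name"].rstrip(".")
            if bucket not in existing_buckets:
                dangling.append(rs["Name"])
    return dangling
```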
Step 8: Find Expiring Certificates
AWS hides TLS certificates in two places:
- AWS ACM — a managed certificate service with its own dashboard in which certificates can be provisioned, renewed, and monitored.
- AWS IAM — an identity service with no UI option for locating available certificates.
Your challenge is to locate, rotate, and associate:
- Locate all certificates that are currently in use. I recommend using the APIs, including the list-server-certificates API call.
- Rotate expiring certificates.
- Associate the new certificate with the correct EC2 instance, CloudFront distribution, AWS API Gateway, ELB, or other resource fronting the endpoint.
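For the IAM-hosted certificates, the "locate" step can be scripted against `aws iam list-server-certificates`. A sketch assuming the `Expiration` fields have already been parsed into datetimes (boto3 does this for you):

```python
from datetime import datetime, timedelta, timezone

def expiring_certificates(cert_metadata_list, within_days=30, now=None):
    """Given "ServerCertificateMetadataList" from
    `aws iam list-server-certificates`, return the names of
    certificates expiring within the given window (or already expired)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now + timedelta(days=within_days)
    return [
        c["ServerCertificateName"]
        for c in cert_metadata_list
        if c["Expiration"] <= cutoff
    ]
```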
Step 9: Untangle The Web of Services
At this stage, we’ve avoided breaking things for as long as possible and done almost all we can without getting our hands too dirty. It’s time to start mapping existing running applications, shutting down unused services, and untangling the web of servers with names like “donotdeleteever.” Mistakes may be made.
There is really no ideal way to go about this process, but I generally like to do the following:
- Check every region for usage. Sometimes developers like to play cruel games of hide-and-seek by launching a c5d.24xlarge EC2 instance that costs $4.608 per hour in unused regions. If you discover resources like this, use CloudTrail, VPC Flow Logs, and CloudWatch metrics to determine whether they are in use. Once you’re confident, temporarily disable the resource by, for example, blocking network traffic to it. This gives you a good way to quickly restore access if you see a developer across the office immediately stand up and flip a desk.
- Use open source tools to map relationships between VPCs, security groups, NACLs, and other networking resources. If you are able to clean out a VPC, delete it and its sub-resources (e.g. default security group) to avoid future use.
- Start adding tags to resources to help you identify relationships in the future.
- Develop (or adopt) a naming convention for resources and rename ones you can.
- Check for potentially-compromised secrets. These include:
- CloudFormation parameter defaults
- Unencrypted Lambda environment variables
- EC2 instance user data scripts with hardcoded secrets
- ECS task definitions with exposed environment variables
- Sensitive files on S3
- GitHub/code repositories at your organization that may contain committed access keys belonging to the AWS account.
- Locate potentially-compromised resources, such as EC2 instances, by looking at usage patterns. If that Windows Server 2008 box is sitting at 98% CPU utilization for 24 hours a day, chances are it could be mining cryptocurrency.
- Take inventory of everything. This is time consuming, but if you don’t know what’s running on a normal day, how will you know what shouldn’t be running the next? I’m a fan of NCC Group’s AWS Inventory tool.
- Stop all new development, if possible. If developers continue to deploy new services before a proper security policy is in place, it’s a recipe for disaster. Set them up with a properly configured new account and shift development there. Use VPC peering and other cross-account functionality if they need access to services in the existing account.
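The secret sweep described above lends itself to a crude scanner you can point at user data scripts, task definitions, and templates. The patterns here are illustrative assumptions (the real AWS access key ID prefix plus a keyword heuristic), not a substitute for a dedicated tool like git-secrets or truffleHog:

```python
import re

# Illustrative patterns, not exhaustive: AWS access key IDs start
# with "AKIA" followed by 16 uppercase alphanumerics; the second
# pattern is a crude keyword-assignment heuristic.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "possible_secret": re.compile(
        r"(?i)(secret|password|token)[\"']?\s*[=:]\s*\S+"),
}

def scan_for_secrets(name, text):
    """Scan a blob of text (user data script, task definition JSON,
    CloudFormation template) for secret-looking strings.
    Returns (source name, pattern label) pairs for each hit."""
    return [(name, label) for label, pat in PATTERNS.items()
            if pat.search(text)]
```

Anything it flags should be moved into a proper secret store (Secrets Manager, Parameter Store, Vault) and the exposed credential rotated.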
Step 10: Monitor and Migrate
It’s important to recognize that you may never get this account into a “perfect” state. As I mentioned at the beginning of this article, there is no substitute for a brand new AWS account, provisioned from scratch to adhere to your organization’s security policies. Your goal should now be to migrate or deprecate services in this account as quickly as possible, with the eventual goal of full termination. This could be a multi-year effort.
For services that need to remain, monitoring will be key. If you can shift a majority of users and services to new accounts, this will reduce the attack surface and help protect your data. CloudTrail, with proper alerts, will help ensure that any unintended activity is quickly detected.
Being told you are now responsible for an account full of hundreds of legacy applications can be incredibly daunting. But hopefully, using the steps outlined here, you can begin to isolate and correct the worst security risks while containing and monitoring the rest. It’s not a substitute for an account that has been properly configured from the ground up, but what’s the alternative? Nuking the account and walking into the sunset?