How to Manage Cloud Security Productively
Reduce unnecessary overheads and security compliance issues across projects via controls and management in these four areas (with examples)
As more technology projects are migrated to or built in the cloud, someone has to manage the cloud infrastructure. I was part of an infrastructure team that managed a couple of AWS accounts.
We were one of the first teams in our organization to manage a cloud platform, so there were few existing processes to turn to for guidance. We learnt a lot along the way, often through hard mistakes, such as spending many hours downgrading an application that had caused breaking changes in other applications.
We also dealt with quite a few security issues, e.g. a compromised and locked database, stolen keys and secrets, and a locked account. Online security guidance definitely helped us understand how to configure security in the cloud, but many other factors influence security, e.g. internal processes and human negligence, and these can cause unexpected issues.
In this article, I'll share some of the lessons learnt and mitigation measures from my experience managing cloud infrastructure accounts and components across multiple environments, along with maintenance and recommended security practices, all from an AWS cloud administrator's perspective.
Every team will have its own practices; what follows is what I've learnt from mine.
I’ve segmented the key learnings into the following four broad categories:
- Access control policies, groupings and accounts
- Automation of backups, encryption and patching
- Network security via jump boxes, private networks and VPN
- Management of resources and projects via tagging
Access control policies, groupings and accounts
When starting with a small team of fewer than ten people, there’s a tendency to provide everyone with privileged and programmatic access to create EC2 instances and other resources in the interest of agility for rapid prototyping purposes.
As the team grows, this shouldn't remain the status quo. The more privileged users there are with programmatic access (keys and secrets), the greater the chance that negligence leads to high costs.
A sample case study: AWS keys and secrets with privileged access were committed to the codebase and pushed to a public code repository. As a result, a large number of costly EC2 instances were created across various regions, with outbound traffic to cryptocurrency mining resources, and accounts were removed and locked out.
Some valuable lessons learnt and mitigation measures:
- Use AWS IAM groups to segment users: create groups based on project roles, each with its own access policies for the resources tied to that role. Users can then be added to or removed from groups (and gain or lose the corresponding access) easily as they move across projects.
- Implement IAM policies to restrict users' access to resources: for this to work, resources must first follow a strict naming convention, e.g. <project name>-<env>-<service>-<resource>, which allows fine-grained access control via resource ARNs and tags in the policies. Restrict the creation of resources such as EC2 instances to specific roles, e.g. System/Project Admin, and follow the principle of least privilege. An elegant approach I've seen is to manage AWS policies via CloudFormation templates kept in a centrally managed Git repository, which mitigates unauthorized changes.
- Limit programmatic access: disable AWS keys and secrets for all users, and issue them only to DevOps systems and apps where required. For local serverless development and testing, consider using a VPN for private network access coupled with AWS SAM.
- Enable AWS CloudTrail: it provides visibility into how AWS access keys and APIs are used, so you can determine which keys are compromised or misused and disable them quickly.
- Enable budget limit notifications: an obvious sign of compromised keys and secrets is a spike in costs from a large number of newly created resources. As a system admin, you'll want to be notified, e.g. via email, so you can investigate further.
- Avoid using the root account: if all your users get wiped or locked out, the root account can be used to re-create them, so protect it heavily, e.g. with MFA, and avoid using it for day-to-day work.
- Rotate keys and passwords regularly: rotate AWS keys, secrets and passwords every 90 days (as a guideline); this is also a chance to clean up unnecessary key access and user accounts.
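To make the naming-convention and least-privilege points more concrete, here is a minimal Python sketch that generates an IAM policy document for one project and environment. The function name `build_project_policy`, the chosen actions, and the use of `AssetType` and `Env` tags as policy conditions are illustrative assumptions, not the exact policies we ran; adapt them to your own services and conventions.

```python
import json


def build_project_policy(project: str, env: str) -> dict:
    """Build an IAM policy document scoped to one project and environment.

    Assumes resources are named <project>-<env>-<service>-<resource> and that
    EC2 instances carry AssetType and Env tags (illustrative choices).
    """
    prefix = f"{project}-{env}-"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object access only in buckets whose names start with the prefix.
                "Sid": "ProjectBucketObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{prefix}*/*",
            },
            {
                # Start/stop only instances tagged for this project and environment.
                "Sid": "ProjectInstances",
                "Effect": "Allow",
                "Action": ["ec2:StartInstances", "ec2:StopInstances"],
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceTag/AssetType": project,
                        "aws:ResourceTag/Env": env,
                    }
                },
            },
            {
                # Explicit deny: holders of this policy cannot launch new instances.
                "Sid": "NoInstanceCreation",
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # "payments" and "dev" are made-up example values.
    print(json.dumps(build_project_policy("payments", "dev"), indent=2))
```

The resulting JSON could be attached to an IAM group, or embedded in a CloudFormation template kept in version control so that changes go through review.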
Automation of backups, encryption and patching
As the number of applications in your infrastructure increases, the overheads can get overwhelming, especially in environments with strict security compliance requirements, e.g. PCI.
Managing security requirements such as application and security patches while ensuring that all applications keep running as expected is challenging, because security and convenience often pull in opposite directions.
A sample case study: Because security requirements called for all applications to be patched continuously and kept up to date, automated scripts were placed on the machines to update all applications and systems periodically. One day, an upstream system stopped working. Further investigation traced the root cause to a breaking change in an underlying system that had been updated. Service disruptions lasted a couple of hours, and the team had to firefight while rushing to deliver another project.
Some important lessons learnt and mitigation measures:
- Set up automatic backups: using Amazon Data Lifecycle Manager, you can configure a snapshot schedule, e.g. two to four times a day depending on your environment. When something critical happens, you can restore a volume from the latest working snapshot and swap it in to keep services running while you investigate the issue.
- Automate patching selectively: use AWS Systems Manager Run Command and Patch Manager, or system cron jobs (e.g. daily or weekly scripts), to patch your applications and systems. However, watch out for application-level patches that cause breaking changes. It's safer to automate security patches and leave application patching to a scheduled window, so that a proper impact assessment can be done first.
- Set up jobs to monitor and remediate security issues: a simple script can monitor changes to your infra settings (if you're not already using AWS Config for governance). Occasionally, a project lead or admin will make a change, e.g. open a network port, and forget to revert it afterwards. An automated job running on AWS Lambda (via the AWS SDKs) and scheduled with Amazon CloudWatch can detect and remediate such changes and notify your team before the security team flags them.
- Enable default encryption whenever possible: default encryption can be enabled for resources such as EBS volumes and S3 buckets. Doing so can save you a lot of trouble with the security team, especially for EC2 instances created by auto scaling.
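As a rough illustration of the monitoring job described above, the sketch below scans security group data for rules that are open to the whole internet on anything other than the web ports. It works on plain dictionaries shaped like the `SecurityGroups` list returned by the EC2 `describe_security_groups` call, so it runs without AWS credentials. The function name `find_open_rules`, the allowed-port set and the sample data are assumptions for illustration only.

```python
ALLOWED_PUBLIC_PORTS = {80, 443}      # assumption: only web ports may be public
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}    # "anyone" for IPv4 and IPv6


def find_open_rules(security_groups):
    """Return (group_id, from_port, to_port, cidr) for risky world-open rules.

    Any world-open rule other than a single allowed port is reported,
    including ICMP and all-protocol rules.
    """
    findings = []
    for group in security_groups:
        for perm in group.get("IpPermissions", []):
            cidrs = [r["CidrIp"] for r in perm.get("IpRanges", [])]
            cidrs += [r["CidrIpv6"] for r in perm.get("Ipv6Ranges", [])]
            # All-protocol rules ("-1") carry no port fields: treat as all ports.
            from_port = perm.get("FromPort", 0)
            to_port = perm.get("ToPort", 65535)
            ports_ok = from_port == to_port and from_port in ALLOWED_PUBLIC_PORTS
            for cidr in cidrs:
                if cidr in OPEN_CIDRS and not ports_ok:
                    findings.append((group["GroupId"], from_port, to_port, cidr))
    return findings


if __name__ == "__main__":
    # Made-up sample data in the shape of describe_security_groups output.
    sample = [
        {"GroupId": "sg-web", "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
        {"GroupId": "sg-db", "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 27017, "ToPort": 27017,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    ]
    for finding in find_open_rules(sample):
        print("open to the world:", finding)
```

In a scheduled Lambda function, the input would come from the AWS SDK instead of the sample list, and each finding could trigger a notification or a rule revocation, depending on how much you trust automatic remediation.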
Network security via jump boxes, private networks and VPN
Having a well-designed architecture is essential to scaling your applications and projects. A good architecture consists of multiple network layers and subnets to support high availability and scalability, like the example shown below.
However, such a setup can get quite costly, and its value is difficult to justify when the team is small and applications are few. In such scenarios, teams often implement a basic setup without proper network boundaries or appropriate security settings, e.g. without private subnets and NAT gateways, in the interest of speed, cost and simplicity.
A sample case study: In the interest of cost and speed, a MongoDB instance was set up in a public subnet for prototyping, with the intention of moving it to a private subnet later. One day, the MongoDB data was wiped, leaving a single document that read: “Your DB is backed up at our servers, to restore your database, send 0.1 BTC to the Bitcoin Address XXX, then send an email with your server IP to XXX.” The investigation found that one of the developers had forgotten to update the port settings after testing and left the port open to the public. It turns out that threats like this occur every day.
Some important lessons learnt and mitigation measures:
- Secure your networks: open only the required ports, and only to specific IP addresses, e.g. your VPN's, and close the rest. A good practice for public applications is to open only ports 80 and 443 to the public (All), inbound and outbound, and to restrict every other port to specific IP address ranges.
- Place databases in private networks: database instances usually don't require outbound internet connectivity. Assuming the instance is used solely to host the database, with no other applications on it needing outbound internet access, it can be placed in a private subnet without being exposed to the public internet.
- Use a VPN: especially useful for development, because it gives you a static IP address to whitelist and private access to your networks. You can set up a VPN server on AWS using the open-source OpenVPN; it's generally free to use if your requirements aren't too complicated. You'll just have to pay for hosting a small instance, and you get full admin management features.
- Use a bastion host: serves almost the same purpose as a VPN, providing access to resources in the internal network via SSH tunnelling. Additional benefits include logging, central management and an extra layer of security.
Either a VPN or a bastion host will usually suffice. Whether you use one or both depends largely on factors such as your organizational structure and requirements, budget and IT expertise.
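The port rules above (web ports open to everyone, everything else reachable only from the VPN or bastion address) can be written down as data. The sketch below builds a list of ingress rules in the `IpPermissions` shape that the EC2 security group APIs accept. The function name `build_ingress_rules`, the default port choices and the example address are assumptions for illustration.

```python
import ipaddress


def build_ingress_rules(vpn_cidr, admin_ports=(22,), public_ports=(80, 443)):
    """Build ingress rules: public web ports for all, admin ports for the VPN only."""
    # Raises ValueError if vpn_cidr is not a valid IPv4 network.
    network = ipaddress.IPv4Network(vpn_cidr)
    if network.prefixlen == 0:
        raise ValueError("VPN CIDR must not cover the whole internet")

    rules = []
    for port in public_ports:
        rules.append({
            "IpProtocol": "tcp", "FromPort": port, "ToPort": port,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "public web"}],
        })
    for port in admin_ports:
        rules.append({
            "IpProtocol": "tcp", "FromPort": port, "ToPort": port,
            "IpRanges": [{"CidrIp": str(network), "Description": "VPN only"}],
        })
    return rules


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-range address standing in for a VPN's static IP.
    for rule in build_ingress_rules("203.0.113.10/32"):
        print(rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```

Keeping the rules in one function like this makes it harder for a temporary "open it to everyone while I test" change to slip in unnoticed, because every rule has to pass through the same check.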
Management of resources and projects via tagging
Managing resources isn't a problem you worry about until projects and applications scale. Applications and resources can pile up very quickly, especially with automated DevOps and the use of serverless frameworks.
Things can get messy and disorganized when multiple projects deploy resources into your account across environments, for various purposes and on different occasions. When there is an issue with any of the resources, e.g. a security compliance violation, and your resources are not managed properly, you'll have a hard time figuring out whom to contact, and from which project, to resolve it.
A sample case study: Teams across projects deployed applications and resources into the cloud infrastructure for prototyping via the Serverless Framework and CloudFormation templates. One day, the security team flagged a zero-day attack on a resource that required immediate attention. The infrastructure team had to spend time and effort finding the owner of the resource to get enough context to resolve the issue.
Some important lessons learnt and measures to be taken:
- Tag all resources and follow naming standards/conventions: use the AWS resource tagging feature to apply tags, e.g. AdminName (owner), AssetType (project name), Env (dev/prod), Name (asset name) and Accessibility (public/private), to help you manage your resources better in areas such as billing and security fixes.
- Incorporate resource tagging into process governance: make resource tagging a requirement in your DevOps processes, Terraform/CloudFormation scripts, etc. to ensure that all deployed resources are adequately tagged.
- Communicate standards across projects: establish and propose naming conventions and standards for tags and resource names to ease billing, access control and resource management. Communicate the recommendations to project managers and architects so that the standards are applied across projects, which makes the system administrator's life easier. You can refer to the sample cloud resource naming convention by Harvard University IT when creating your own for distribution.
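One way to enforce tagging in a deployment pipeline is a small check that fails the build when required tags are missing. The sketch below uses the tag names from the list above; the function names and the allowed values for `Env` and `Accessibility` are assumptions taken from the examples in parentheses, so extend them to match your own standard.

```python
REQUIRED_TAGS = ("AdminName", "AssetType", "Env", "Name", "Accessibility")
# Assumed allowed values, based on the examples given for each tag.
ALLOWED_VALUES = {
    "Env": {"dev", "prod"},
    "Accessibility": {"public", "private"},
}


def tags_from_aws(tag_list):
    """Convert the [{'Key': ..., 'Value': ...}] shape used by many AWS APIs to a dict."""
    return {tag["Key"]: tag["Value"] for tag in tag_list}


def validate_tags(tags):
    """Return a list of problems; an empty list means the tags pass."""
    problems = []
    for key in REQUIRED_TAGS:
        value = tags.get(key, "").strip()
        if not value:
            problems.append(f"missing tag: {key}")
        elif key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


if __name__ == "__main__":
    # Made-up resource tags; AdminName and Accessibility are deliberately absent.
    resource_tags = tags_from_aws([
        {"Key": "Name", "Value": "payments-dev-api-lambda"},
        {"Key": "AssetType", "Value": "payments"},
        {"Key": "Env", "Value": "staging"},
    ])
    for problem in validate_tags(resource_tags):
        print(problem)
```

Run as a pipeline step before deployment, a check like this means an untagged resource is caught by its owner at deploy time rather than by the infrastructure team during an incident.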
Thanks for reading, and I hope this will help you better manage your cloud infrastructure and save you time and effort in the long run. :)