Security at Scale — Rapid AWS Asset Tracking, Configuration Monitoring, and Governance Enforcement using AWS Native Services
In lieu of attending AWS re:Invent 2018 (currently wrapping up a sabbatical 😎), I figure I can feel less torn about missing my favorite conference of the year by finishing a series of articles I started after last year’s re:Invent. The series covers the tooling Asha Chakrabarty (Principal Product Architect, AWS R&D Innovation Team) and I presented in SID 333, “Security at Scale: How Autodesk Leverages Native AWS Technologies to Provide Uniformly Scalable Security Capabilities” (re:Invent 2017). I’ll start this series with the third tool in that presentation, as it has generated the most interest based on past presentations, LinkedIn messages, and general requests for more information.
It is important to note that when I brought the prototypes and concepts for each of these tools to Autodesk, collaborating and building them out with talented engineers, architects, and TPMs improved the implementation details tenfold. However, those details are not mine alone to share, so this article will focus on the design concepts which inspired what we ended up deploying rather than the proprietary implementation details. These concepts are enough for an organization or team to take, optimize, and implement in an architecture that best fits their unique requirements, in the same way we did.
The first tool I’ll cover will be a design focused on increasing AWS resource visibility and automated governance enforcement capabilities.
Asset/resource management is a major pain point for many organizations that are built on AWS and more so for those that have numerous AWS accounts and tens of thousands of assets. Dynamic environments add complexity to the challenge and it can sometimes be difficult to keep a reliable inventory of assets alone without even validating that the configuration of each one is in line with your organizational policies and standards.
The AWS SDKs offer the ability to query and list all resources in your environment to show you the what and the where (What is the resource ID? Is it configured in line with our policies/standards? Which Region/Account/Subnet/etc is it in?). While this is helpful, it requires automation to be run continuously to pick up new assets and update the stored state of existing assets, all the while chasing yet never catching up with ephemeral infrastructure which is, for many use cases, the reason we all moved to the cloud in the first place. It also provides no audit history showing who made a change or when.
Enter CloudTrail. CloudTrail is an AWS service that records the API calls made to your AWS accounts, including automation using AWS SDKs like an application written in Boto3, human users clicking through the AWS console, and calls made using the AWS CLI. Through CloudTrail, almost every action performed against an AWS resource is recorded along with who made the request. AWS adds features and new services continuously and CloudTrail itself is constantly being updated to keep up; you can keep track of unsupported services at https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-unsupported-aws-services.html.
In addition to collecting the what and where, CloudTrail can also tell us the who, when, and how (Which user or service account made this change? How long has this misconfiguration been in place? Did they use the console, CLI, or SDK and were they in the office or using a strange foreign IP where our organization has no presence?). Since CloudTrail is constantly monitoring and reporting, changes are detected when they occur, so there is no longer a risk of missing assets or changes in between the scheduled queries of your infrastructure that would be required using just the AWS SDKs.
A brief breakdown of a CloudTrail event: The userIdentity object is included with all CloudTrail events and gives the session information of the user/role/account that made the request. Other standard fields include eventTime, eventName, eventSource, etc. The requestParameters and responseElements objects change based on the resource type and call being made and often contain the most useful asset configuration information.
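The sample event this breakdown refers to is not reproduced in this text, so here is a minimal Python sketch of the same idea. The record is a hypothetical, heavily trimmed example (every account ID, ARN, IP, and instance ID is a placeholder), and `summarize` is an illustrative helper name, not part of any AWS library.

```python
import json

# Hypothetical, heavily trimmed CloudTrail record; all values are placeholders.
SAMPLE_EVENT = json.loads("""
{
  "eventTime": "2018-11-26T18:04:11Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.10",
  "userAgent": "aws-cli/1.16.0",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::111122223333:user/example-user",
    "accountId": "111122223333"
  },
  "requestParameters": {"instanceType": "t2.micro"},
  "responseElements": {
    "instancesSet": {"items": [{"instanceId": "i-0123456789abcdef0"}]}
  }
}
""")


def summarize(event):
    """Pull the who/what/when/where/how out of a single CloudTrail record."""
    identity = event.get("userIdentity") or {}
    return {
        "who": identity.get("arn"),
        "what": event.get("eventName"),
        "when": event.get("eventTime"),
        "where": event.get("awsRegion"),
        "how": event.get("userAgent"),
        "from_ip": event.get("sourceIPAddress"),
    }


summary = summarize(SAMPLE_EVENT)
```

Real records carry many more fields than this, so treat the shape above as a starting point and compare it against events from your own trails.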
As you can imagine, the capabilities an organization can build by storing, analyzing, and responding to the information that can be extracted from CloudTrail are highly desirable:
- Incident Response can pinpoint the owner of a compromised EC2 instance within seconds as well as similar calls made to every EC2 instance in your entire environment from an outside source of interest to reduce time to containment and MTTR of an incident
- Forensics can take a look at the request history made to an orphaned and forgotten EBS volume of interest without getting lost in Slack threads and email chains chasing old teams and ex employees
- Threat Intel can begin to baseline the various interactions your teams have with AWS services as a first step to detecting abnormalities and also have the holy grail of comprehensive operational data to correlate outside IOCs and intel lists against
- Security Operations can begin to automate corrective enforcements to reduce the time they spend correcting simple snafus like open security groups or overly permissive IAM permissions which frees up their time to investigate higher level (and more challenging/rewarding) findings
- Security Engineering can identify the most commonly used AWS services and associated anti-patterns to ensure technical security strategy properly aligns with solving for the most critical cultural/technical gaps. They can also translate compliance controls into logical checks and responses that catch resources as soon as they drift out of compliance, or keep them from going out of compliance in the first place
- Security Architecture has the larger picture visibility to better understand how proposed designs and implementations would fit into existing environments, reducing blind spots and greatly increasing the understanding of risks and benefits of proposals
- Security leadership can sleep better knowing that there are fewer visibility gaps that could result in a major incident and that their strategy is based on the real state of the current environment rather than assumptions, best guesses, or industry trends
To add to this list, there are countless uses for non security teams:
- Network Engineering can see every VPC, Subnet, Security Group, Route Table, NACL, EIP, ENI, Dx, Peer, etc configuration and monitor all changes, leading to quicker troubleshooting and less guesswork
- Infrastructure Engineering can enforce asset tagging and resource management requirements by quickly responding to resources that are launched with incorrect configurations and by continuously reporting the current state of tagging per account, team, env, etc
- Operations can take advantage of the fact that CloudTrail, for better or for worse, does catch authorization errors and failed calls which can be a valuable, near real time indicator of a small issue that can be addressed before creating cascading failures through your environment and growing into a larger incident
- Finance can configure notifications based on thresholds unique to your environment that will trigger on potential cost spikes, before an actual AWS billing alert could
- Insert machine learning plug here — Your data team can probably generate some useful insight/predictions when given free access to the thousands, tens of thousands, or millions of calls your organization makes to AWS each day. Better predict and refine your autoscaling needs, arm finance with custom AWS service use predictions before they negotiate your next Enterprise deal with AWS, etc.
Basic Use Case:
This takes us to the fundamental, bare bones design. Let’s say your organization is interested in addressing low hanging fruit: Security groups that allow open SSH access, instance-profiles being assigned the AWS Administrator role, EC2 instances launched in forbidden regions, and other actions that can be reduced to matching a “parameter” against a “value” (for example, eventName = RunInstances together with awsRegion = us-east-1). Let’s call these low dimensional: a simple match on one or more parameter/value pairs decides whether the CloudTrail event describes an action that created a resource with an invalid configuration or not. There is no counter, time interval, or threshold involved, no correlation is performed against other input sources, and input data is processed and discarded.
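To make “low dimensional” concrete, here is a minimal Python sketch of a check expressed as data: one event name, one field, and a set of accepted values. The rule name, the approved regions, and the `evaluate` helper are all hypothetical and only there for illustration.

```python
# Hypothetical rule table: each rule is an event name, a top-level event
# field to inspect, and the set of values your policy accepts for it.
RULES = [
    {
        "name": "forbidden-region",
        "eventName": "RunInstances",
        "field": "awsRegion",
        "accepted": {"us-west-2", "eu-west-1"},
    },
]


def evaluate(event, rules=RULES):
    """Return the names of every rule the event violates.

    Stateless by design: one event in, a list of findings out, with no
    counters, time windows, or lookups against other data sources.
    """
    findings = []
    for rule in rules:
        if event.get("eventName") != rule["eventName"]:
            continue  # this rule does not apply to this event type
        if event.get(rule["field"]) not in rule["accepted"]:
            findings.append(rule["name"])
    return findings
```

For example, `evaluate({"eventName": "RunInstances", "awsRegion": "us-east-1"})` would return `["forbidden-region"]` with the table above, while the same call with an approved region returns an empty list.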
For cost and capacity planning, your organization may prefer to take a phased approach. If so, this fundamental design is a good first step that will allow you to see how many events you receive, how many resources you are dealing with, and what kind of ingestion stream routing architecture and logic layer code structure works best for your intake and use cases. This allows you to experiment a bit before fully scaling and gives you better numbers to use for estimating the cost of building out the full active cache of your environment state.
Finally, this design assumes your organization is ok with the initial response streams reacting in minutes rather than seconds.
Basic Architecture Overview:
This solution has three components: Ingestion streams, a Logic Layer, and Response Streams. The Ingestion Streams centralize all CloudTrail events from multiple source accounts to a single bucket in a central account, use an initial lambda function to parse each event, and then route any matching events to dedicated SNS topics. These SNS topics feed directly into the Logic Layer which is a collection of lambda functions which are compartmentalized based on the SNS topic architecture so that each lambda is only launched when an event type that matches its check comes in. Finally, if a finding is detected in the Logic Layer, the event and associated information is passed on to one or more appropriate Response Streams. The input and output of each stream and the Logic Layer are standardized to allow for a modular design that lends itself to scale, operational simplicity, and extensibility by allowing simple additions and modifications to the platform for additional processing logic, responses, and inputs.
Basic Architecture Configuration, Ingestion Streams
When enabling CloudTrail, select the option to send all CloudTrail logs to a central S3 bucket. This will work across accounts and offers a central place for your organization to collect all actions. Of course, your organization might want to set up a bucket for Dev, Stage, Prod, separate by products or locality, or keep as one central bucket (which is the example we will continue with for simplicity). This solution can easily be adapted to whichever CloudTrail + S3 architecture you go with.
When configured this way, CloudTrail does not stream each event to S3 individually. Instead, AWS collects the events captured over a period of time, or up to a certain limit, and delivers them to your central bucket as a compressed (gzip) log file. This is where the time factor comes in: delivery can take anywhere from seconds to 10–15 minutes. I’ll dive into a real time tweak that might work better for your organization in Upgrade #1 later on.
After the S3 bucket is created and source CloudTrail trails are configured, create the “Ingestion Lambda” and configure it to trigger when any object is added to your central CloudTrail S3 bucket.
This ingestion lambda will be triggered when the compressed log file above is delivered and will perform the following actions:
1. Download and decompress the file using the object key information provided in the event triggering the lambda.
2. Parse each CloudTrail event in the file and see if the event matches an AWS service, resource, or event name that you have configuration validation logic for (e.g. eventSource = ec2.amazonaws.com, eventName = RunInstances).
3. Route matching events to dedicated SNS topics. You could assign these SNS topics to AWS services (EC2, IAM, RDS), jump down a level and make them resource based (ELB, Instance, Security Group), or, for highly frequent events like EC2 RunInstances, assign an SNS topic just for that event. Make this decision after looking at your use cases and the content of your CloudTrail trails.
4. Drop the events that do not match the criteria you are checking against. This does not mean these events are deleted. They still exist in your CloudTrail bucket for the lifecycle policy you have assigned and can be available for Athena processing, IR/Forensics, or as saved logs/artifacts in accordance with the retention policies of your organization.
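The four steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the implementation we deployed: the `ROUTES` table, the topic ARNs, and the helper names are hypothetical, and the sketch assumes each delivered object is a gzip-compressed JSON document with a top-level `Records` list.

```python
import gzip
import json
from urllib.parse import unquote_plus

# Hypothetical routing table: (eventSource, eventName or None) -> SNS topic ARN.
# A None event name means "every event from this service".
ROUTES = {
    ("ec2.amazonaws.com", "RunInstances"):
        "arn:aws:sns:us-west-2:111122223333:ec2-run-instances",
    ("iam.amazonaws.com", None):
        "arn:aws:sns:us-west-2:111122223333:iam-events",
}


def parse_log_file(gz_bytes):
    """Steps 1-2: decompress one delivered log file and return its events."""
    return json.loads(gzip.decompress(gz_bytes)).get("Records", [])


def pick_topic(event, routes=ROUTES):
    """Steps 3-4: return the topic ARN for an event, or None to drop it."""
    source, name = event.get("eventSource"), event.get("eventName")
    # An exact (source, name) route wins over a service-wide route.
    return routes.get((source, name)) or routes.get((source, None))


def handler(s3_event, context):
    """Lambda entry point, triggered by object creation in the central bucket."""
    import boto3  # imported lazily so the helpers above run without AWS access

    s3 = boto3.client("s3")
    sns = boto3.client("sns")
    for record in s3_event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for event in parse_log_file(body):
            topic = pick_topic(event)
            if topic:
                sns.publish(TopicArn=topic, Message=json.dumps(event))
```

Keeping the routing decision in a plain lookup table means a new stream is a one-line change. One caveat worth checking for your environment: SNS enforces a maximum message size, so unusually large events may need to be trimmed or passed by reference.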
Completing the ingestion layer are the routed SNS topics which carry specific event types for targeted processing. As mentioned above, you could choose to route by Service, Resource, EventType, or something else. The point of this routing step is to group all like events of interest into similar streams so that they only trigger logic layer lambdas that pertain to their event type.
Basic Architecture Configuration, Logic Layer
The logic layer is composed of a series of lambdas that contain the core logic of the platform. There are a lot of decisions to be made here based on your unique needs that will determine how you organize your lambdas, how your routed SNS topics from the ingestion layer should be structured, and what you are looking for in the first place. As mentioned above, most AWS standards/benchmarks/policies can be codified into parameter = accepted value pairs whether these rules are pulled from a public list like the CIS Benchmarks or internal corporate policies. We’ll continue with the EC2 RunInstances example from above.
In this scenario, your organization sees a large number of EC2 RunInstances calls and decided that volume warrants a custom ingestion stream. So, your ingestion lambda sends all RunInstances events to a dedicated SNS topic.
You want to verify that all instances are launched in an approved region, with an approved instance type/size, and with tags. You codify these checks into lambdas and set them each to trigger off of the RunInstances SNS topic.
Now, a RunInstances event detected at the Ingestion lambda will be routed to its dedicated topic which will then trigger each logic layer lambda (in parallel) to ensure the instance meets requirements for approved region, instance type/size, and tags. A note on this logic layer architecture: there are benefits to dedicating a lambda to each policy/standard check while there are also benefits to grouping each like check into one larger lambda. Your organizational guidelines, use cases, and bandwidth should be used to determine which is right.
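As an illustration of the grouped variant, here is a minimal Python sketch of a single logic layer lambda that runs all three checks against events arriving from the dedicated SNS topic. The policy values are hypothetical, and the `requestParameters` field paths (`instanceType`, `tagSpecificationSet`) are assumptions based on the typical shape of these events, so verify them against your own trails.

```python
import json

# Hypothetical policy values for illustration only.
APPROVED_REGIONS = {"us-west-2"}
APPROVED_TYPES = {"t3.micro", "t3.small"}
REQUIRED_TAGS = {"owner", "env"}


def launch_tags(event):
    """Collect the tag keys supplied at launch (field path is an assumption)."""
    params = event.get("requestParameters") or {}
    specs = params.get("tagSpecificationSet") or {}
    keys = set()
    for spec in specs.get("items", []):
        for tag in spec.get("tags", []):
            keys.add(tag.get("key"))
    return keys


def check_run_instances(event):
    """Return a list of finding names for one launch event."""
    params = event.get("requestParameters") or {}
    findings = []
    if event.get("awsRegion") not in APPROVED_REGIONS:
        findings.append("unapproved-region")
    if params.get("instanceType") not in APPROVED_TYPES:
        findings.append("unapproved-instance-type")
    missing = REQUIRED_TAGS - launch_tags(event)
    if missing:
        findings.append("missing-tags:" + ",".join(sorted(missing)))
    return findings


def handler(sns_event, context):
    """Lambda entry point: SNS wraps each CloudTrail event in Records[].Sns.Message."""
    results = []
    for record in sns_event["Records"]:
        event = json.loads(record["Sns"]["Message"])
        results.extend(check_run_instances(event))
    return results
```

Splitting this into one lambda per check would mean three handlers that each call a single check. The grouped form is shorter to deploy, while the split form isolates failures and permissions per policy.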
At this point, you are collecting all events, parsing for the events that are interesting, be it a specific event, resource, or service type, then separating matching events into routed topics which trigger lambdas that run your logic to validate approved configurations.
Basic Architecture Configuration, Response Streams
Continuing the SNS -> Lambda pattern, a final stream is deployed. This time, an SNS topic for each potential response is created.
For example, if your organization uses a ticket tracking system like Jira or ServiceNow, you can create a stream that will create tickets and assign them to owners (with security as viewers =)) for misconfigurations to fix. This stream would trigger a lambda that is just a method which takes standard parameters such as the resource id/name, UserIdentity of the user who performed the action, and a link to an internal wiki where you have details on why this is an issue, what the approved configurations are, and instructions/commands to remediate, and creates a ticket with these details assigned to the responsible team/person.
Alternatively, a finding can be sent to an enforcement SNS topic which will trigger an enforcement action to automatically remediate the issue. I’m working on a dedicated article regarding automated governance enforcement: getting the authority/signoff, integrating with change management processes, how to test/promote new enforcements within CI/CD, how to securely configure the permissions for these highly powerful actions, etc. There is a lot to this step and many program maturity milestones required, so for the sake of this article I will leave it at that for now.
Here you have the basic outline: Ingestion Streams for collecting events and doing the initial routing, a Logic Layer to run your governance logic against events to search for violations, and Response Streams to take action if warranted.
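A minimal Python sketch of that “just a method” idea follows. It only builds the ticket fields from a standardized finding and stops short of calling any ticketing API, since that part depends entirely on your system. The finding shape (`check`, `resource_id`, `event`), the wiki URL, and the output field names are all hypothetical.

```python
# Hypothetical internal wiki where remediation guidance lives.
WIKI_BASE = "https://wiki.example.com/security/findings/"


def build_ticket(finding):
    """Turn a standardized finding into the fields a ticket would need.

    `finding` is assumed to carry the check name, the affected resource ID,
    and the original CloudTrail event that triggered the check.
    """
    event = finding.get("event") or {}
    identity = event.get("userIdentity") or {}
    actor = identity.get("arn", "unknown")
    description = "\n".join([
        "Resource: " + finding["resource_id"],
        "Performed by: " + actor,
        "When: " + str(event.get("eventTime")),
        "Why this matters and how to fix it: " + WIKI_BASE + finding["check"],
    ])
    return {
        "summary": "[{}] {}".format(finding["check"], finding["resource_id"]),
        "assignee_hint": actor,      # map this to a team/person in your system
        "watchers": ["security"],    # security follows along as a viewer
        "description": description,
    }
```

Because every response stream takes the same finding shape, an enforcement stream or a chat notification stream could reuse the same input without changes to the logic layer.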
Upgrade #1 Use Case:
The first improvement we can make to this design is to address the delay in CloudTrail -> central CloudTrail S3 bucket that occurs with the basic configuration. This solution requires a slightly higher AWS Infrastructure maturity level in that it moves some steps in the ingestion layer from a central account to each source account. The complexity this incurs directly relates to how many AWS accounts and regions your organization uses as well as the ability (technically and politically) your team has to deploy and manage these additional resources in AWS accounts owned by other teams in your organization.
Upgrade #1 Configuration:
Instead of using CloudTrail to compress and send all events to a central S3 bucket, a CloudWatch Events rule can be triggered in near real time as each CloudTrail event occurs and filter for events of interest, similar to what the ingestion lambda does in the basic architecture. This filtering no longer occurs in the central account, but in the source account itself. When an event matches, it is forwarded to a CloudWatch event bus in the central account, which receives events from all of your source accounts. The Ingestion Lambda is configured to trigger off events from this central event bus instead of the CloudTrail S3 bucket and the rest of the architecture can remain in place.
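As a rough Python sketch of the source account side, the snippet below builds an event pattern for API calls recorded by CloudTrail and attaches the central event bus as the rule target. The rule name, target ID, and bus ARN are hypothetical. Cross-account delivery also needs the central bus to grant permission to each source account, and depending on your setup the target may need an IAM role, which the optional `role_arn` parameter allows for.

```python
import json


def api_call_pattern(event_source, event_names):
    """Event pattern matching CloudTrail-recorded API calls for one service."""
    return {
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": [event_source],
            "eventName": list(event_names),
        },
    }


def deploy_rule(rule_name, pattern, central_bus_arn, role_arn=None):
    """Create the rule in the current (source) account and point it at the central bus."""
    import boto3  # imported lazily so the pattern builder runs without AWS access

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    target = {"Id": "central-bus", "Arn": central_bus_arn}
    if role_arn:
        target["RoleArn"] = role_arn
    events.put_targets(Rule=rule_name, Targets=[target])


# Example pattern for the RunInstances stream used throughout this article.
RUN_INSTANCES_PATTERN = api_call_pattern("ec2.amazonaws.com", ["RunInstances"])
```

With this pattern, the CloudTrail record arrives nested under the `detail` key of the delivered event, so the central Ingestion Lambda would read `event["detail"]` rather than downloading and parsing a log file.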
If you are already thinking about how to add more source types, I covered this pattern in the re:Invent 2017 GuardDuty announcement presentation as a way to add GuardDuty as an input source to this platform in addition to CloudTrail. https://www.youtube.com/watch?v=Imjbh0WPSR4&t=2400s.
Upgrade #2 Use Case:
A major improvement that can be made to the basic architecture, whether Upgrade #1 is in place or not, is the ability to save the configuration of each of your assets as a change is made. This creates a reliable, near real time cache of the current state of every single AWS asset in your organization which, as I’m sure you can imagine, becomes a valuable organizational data lake for a variety of use cases.
In the specific scope of how this cache expands the capabilities of this platform, you can now deploy higher dimensional checks, such as time based thresholds (the number of large instances launched within a 5 minute interval) or checks that validate against other input sources, like an exception table for approved users/accounts/tags etc that are exempt from a response.
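Here is a minimal Python sketch of one such higher dimensional check, combining a time window, a threshold, and an exception list. The instance types, threshold, exempt ARN, and the row shape (`eventTime`, `instanceType`, `userArn`) are hypothetical stand-ins for whatever your cache actually stores.

```python
from datetime import datetime, timedelta

# Hypothetical policy values for illustration only.
LARGE_TYPES = {"m5.24xlarge", "p3.16xlarge"}
EXEMPT_USERS = {"arn:aws:iam::111122223333:role/batch-runner"}
WINDOW = timedelta(minutes=5)
THRESHOLD = 3


def parse_time(timestamp):
    """Parse a CloudTrail style UTC timestamp such as 2018-11-26T18:04:11Z."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")


def large_launch_burst(cached_launches, now):
    """Return True if enough non-exempt large launches fall inside the window.

    `cached_launches` is a list of rows read from the cache, each a dict
    with eventTime, instanceType, and userArn keys.
    """
    recent = [
        row for row in cached_launches
        if row["instanceType"] in LARGE_TYPES
        and row["userArn"] not in EXEMPT_USERS
        and timedelta(0) <= now - parse_time(row["eventTime"]) <= WINDOW
    ]
    return len(recent) >= THRESHOLD
```

Neither the counting nor the exemption lookup is possible in the basic design, where each event is processed and discarded. Both become simple queries once launches are stored.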
Upgrade #2 Configuration:
This is yet another change to the ingestion layer and it does not matter whether you are using the CloudTrail to S3 or CloudTrail to CloudWatch architecture. Instead of routing events of interest to Routed SNS Topics, you can write them directly to DynamoDB tables, noted here as Resource Tables, and then use DynamoDB Streams to feed the Routed SNS Topics from above. This creates a queryable cache for anyone in your organization to access while maintaining the event driven processing and resulting automation created before.
For example, if an instance is launched, the instance information in the RunInstances event is written to an EC2 instance table with all associated information: who launched it, in what account, what region, which instance profile, etc. Similarly, if the instance is stopped, tagged, or modified, that update is written to its item in DynamoDB to reflect this change. Again, how you organize these tables is up to you; the example above dedicates a table per AWS Resource Type.
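A minimal Python sketch of the write path for that example follows. The table name and item attribute names are hypothetical, and the event field paths (`responseElements.instancesSet.items`, `recipientAccountId`) are assumptions about the usual shape of these events that you should confirm against your own data.

```python
def instance_items(event):
    """Turn one RunInstances event into one cache item per launched instance."""
    identity = event.get("userIdentity") or {}
    params = event.get("requestParameters") or {}
    # Failed calls carry no response payload, so fall back to an empty list.
    response = event.get("responseElements") or {}
    launched = (response.get("instancesSet") or {}).get("items", [])
    return [
        {
            "instance_id": instance["instanceId"],   # table partition key
            "account_id": event.get("recipientAccountId"),
            "region": event.get("awsRegion"),
            "instance_type": params.get("instanceType"),
            "launched_by": identity.get("arn"),
            "launched_at": event.get("eventTime"),
            "state": "launched",
        }
        for instance in launched
    ]


def write_items(items, table_name="ec2-instances"):
    """Persist items to a hypothetical per-resource-type DynamoDB table."""
    import boto3  # imported lazily so instance_items runs without AWS access

    table = boto3.resource("dynamodb").Table(table_name)
    for item in items:
        table.put_item(Item=item)
```

Later events for the same instance (stop, tag, modify) would update the existing item rather than insert a new one, so each item always reflects the latest known state.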
The Logic Layer and Response Streams created in the basic architecture can be plugged in directly here and you now have both event driven as well as on demand processing capabilities.
If you’ve made it this far, thank you for reading my first article! This was written in cramped car rides, overheated planes, and in between long sections of no internet connectivity while I traversed 7 countries and 4 continents over 9 weeks of sabbatical, so I know it’s a bit rough around the edges and would appreciate any feedback. One of my sabbatical goals was to get at least one article out before I return next week and this is the one that I’m pushing in its current state. Please provide candid feedback if you have any, whether it be edits, questions, or comments.
These are the most common and/or best questions I’ve gotten when presenting this material over the past two years:
Will you be open sourcing this?!?
At this point I cannot comment on when/if Autodesk’s implementation (Pulsar) will be open sourced, but over this past summer I met Michael Grima (Senior Security Engineer, Netflix) and we discovered we’re working on different implementations of a similar concept. The good news is that Netflix’s implementation is open sourced, well written, well designed, AND Mike is presenting it at re:Invent 2018 in SEC391, “Inventory, Track, and Respond to AWS Asset Changes within Seconds at Scale”. The tool is already available to the public, but I’m avoiding using the name or linking to it so that you can get the details from Mike himself during his talk 😉 It’s sure to be a great presentation, I’m very bummed I’m missing this one!
Why not put all logic into the ingestion lambda and/or consolidate some of these steps? It seems like there are a lot of unnecessary components, especially with so many Lambdas going to SNS Topics which trigger more Lambdas — You know it is possible to trigger a lambda from another lambda?
Separating routing and logic in this way allows for far more extensibility down the line. Other teams can add custom checks to streams you already have without needing to update your ingestion, routing, and logic layer code. It is also much easier to test new deployments and maintain code and data standardization. The examples in this article are simple; as more advanced use cases are deployed, keeping these functions separate scales much more easily.
You mention streams — why not Kinesis?
For the use cases presented here, Kinesis is not required. However, I recommend checking out Airbnb’s StreamAlert as a way to use Kinesis to build a similar platform that has the benefit of ingesting multiple input sources: https://github.com/airbnb/streamalert
There are a number of products on the market that offer this service — why build when you can buy?
There are indeed many 3rd party products and services that offer asset/resource management, real time compliance checks, policy enforcement, and many variations of the benefits mentioned above. The decision to buy vs build should be based on your engineering resources, budget, capabilities, and culture.
These products tend to be pricey and the truth is that AWS gives an organization all that it needs to build these services on its own for a much lower operating cost. As an added bonus, this type of project is a great way to get your engineers comfortable with building and supporting a serverless (excuse me, FaaS) platform on AWS. Second bonus: this platform allows you to own and have unrestrained access to your own operational data. Query limits
However, it is valid to point out that most of the hard work starts when your team begins to look into fixing the configuration issues this platform detects rather than actually detecting them. Purchasing a tool that skips your team right to the remediation aspect certainly has the benefit of focusing your team on remediation quicker.
What if CloudTrail misses events? How about resources that exist before you turn on CloudTrail?
There are a few ways to get around this. For example, you can use the SDKs to pull in all account information every day, week, month, or when onboarding a new account, to update Resource Tables with current state and monitor for any discrepancies.
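One way to picture that reconciliation pass is the Python sketch below: list what actually exists through the SDK, compare it with what the Resource Table believes exists, and report the difference in both directions. It covers EC2 instances in a single region only, and the function names and result keys are hypothetical.

```python
def diff_inventory(live_ids, cached_ids):
    """Compare SDK-reported resource IDs against the IDs held in the cache."""
    live, cached = set(live_ids), set(cached_ids)
    return {
        # Exists in AWS but never seen by the platform (pre-existing or missed).
        "missing_from_cache": sorted(live - cached),
        # Still in the cache but no longer reported by AWS.
        "stale_in_cache": sorted(cached - live),
    }


def list_live_instance_ids(region):
    """Page through describe_instances for one region and collect instance IDs."""
    import boto3  # imported lazily so diff_inventory runs without AWS access

    ec2 = boto3.client("ec2", region_name=region)
    ids = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                ids.append(instance["InstanceId"])
    return ids
```

Anything under `missing_from_cache` is a candidate for backfilling into the Resource Table, and a steady stream of such entries is itself a useful signal that events are being dropped somewhere upstream.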
You can solve a lot of these issues using IAM permissions. Why put effort into enforcement when you can simply prevent?
Many organizations do not have perfect IAM in place and the process for updating IAM across organizations can be a large, slow undertaking. This tool is for companies who are in a situation where having a notification and reaction platform will yield results quicker than overhauling their existing IAM program or see value in having a notification and reaction platform in place while they move towards “perfect” IAM and prevention capabilities.