Part 1: AWS Continuous Monitoring

Uber Privacy & Security
Jun 15, 2020


Ashish Kurmi, Kaibo Ma, and Ankit Kumar, Security Engineering

Editor’s note: Part 2 of this series, including case studies, can be found here.

Uber uses a multi-cloud environment where public clouds supplement our on-premise infrastructure and enable service teams to rapidly innovate on top of a resilient, performant, and scalable foundation. But multi-cloud environments increase complexity, which can challenge even the best security teams to develop unified security frameworks and enforce consistent security controls. We’re therefore sharing details about our general-purpose multi-cloud security monitoring platform to help other teams working through their approach. Below, we explain how we monitor our AWS environment using this platform; we’ll share details about our work with GCP in a future post.

Cloud Native Vs Cloud Agnostic

Cloud native applications can harness all capabilities of the underlying cloud platform by leveraging vendor-specific proprietary technologies. Cloud agnostic applications, on the other hand, can deliver consistent performance wherever they are deployed because of their portability. We explored a variety of approaches and decided on a hybrid that takes the best of both worlds. Because vulnerability discovery and lifecycle management are very specific to the cloud service provider, these features are hosted inside our cloud native application. However, we want our response to be consistent across all cloud environments, so the response features live inside a cloud agnostic, on-premise service called Cloud Monitoring (CMON).

At a high level, CMON acts like a server while its cloud counterparts act as clients. CMON can act on notifications from any cloud provider, such as AWS or GCP, as long as they satisfy certain interfaces. In our case, a cloud native solution in AWS discovers and manages cloud vulnerabilities, then passes updates to CMON via an SQS queue. CMON picks up these SQS notifications and acts accordingly. Essentially, the cloud native solution is the brain that drives security findings, while CMON maintains consistency.
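As a rough illustration of what "satisfying certain interfaces" could look like, the sketch below validates provider-agnostic notifications and drains them from a queue. The `Finding` fields, `parse_notification`, and `drain` are hypothetical names chosen for this example, and an in-memory `queue.Queue` stands in for SQS; the post does not describe CMON's actual message schema.

```python
import json
import queue
from dataclasses import dataclass
from typing import List

# Hypothetical message schema; the real CMON interface is not described in the post.
REQUIRED_FIELDS = ("provider", "account_id", "resource_id", "issue_type", "status")


@dataclass(frozen=True)
class Finding:
    provider: str      # e.g. "aws" or "gcp"
    account_id: str
    resource_id: str
    issue_type: str    # e.g. "s3_public_acl"
    status: str        # "open" or "resolved"


def parse_notification(body: str) -> Finding:
    """Check a JSON notification against the expected fields and build a Finding."""
    payload = json.loads(body)
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"notification missing fields: {missing}")
    return Finding(**{name: payload[name] for name in REQUIRED_FIELDS})


def drain(notifications: "queue.Queue[str]") -> List[Finding]:
    """Consume every queued notification, whichever cloud produced it."""
    findings: List[Finding] = []
    while True:
        try:
            body = notifications.get_nowait()
        except queue.Empty:
            return findings
        findings.append(parse_notification(body))
```

Because the consumer only depends on the message shape, a second cloud provider could in principle be added by writing a new producer, without changing the consuming side.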

Cloudy with a chance of Hammer

The cloud native monitoring tool we use at Uber is Hammer, an open source project built by the Dow Jones team for the AWS platform. Written in Python, this tool is a collection of Lambda functions that act as configuration violation checkers across member AWS accounts and aggregate all the findings in an easy-to-manage set of DynamoDB tables in a centralized account.

In our experience, Hammer is robust, scalable, fault tolerant, and heavily customizable. Uber is an active contributor to this open source initiative and has a custom deployment running in our environment to suit our specific needs. Apart from the rich and ever-growing rule set, we added a number of features such as real-time monitoring. We keep pushing these new functionalities and bug fixes upstream to the public repository to make them available to the community. Thanks to Hammer’s modular nature, we were able to couple it with our cloud agnostic service for ticket and notification management.

Hammer is our single point of integration for consuming all AWS security findings into CMON. From a security perspective, it gives us a correlatable view of all our findings, enabling visibility across the larger organization, member accounts, and resources. This architecture also allows us not only to deploy Hammer-specific rules, but also to integrate AWS native solutions such as Trusted Advisor and third-party solutions seamlessly.

Sunny outlook through CMON

CMON is an Uber on-premise Go service that acts on Hammer updates, including newly discovered issues as well as existing issues that have been resolved. Using bug details from Hammer, CMON calculates vulnerability ratings to help prioritize cloud security findings. CMON also integrates with other security services at Uber, for example by generating customized and actionable Jira tickets for resource owners. Below is a sample jira-ification of a security finding. Note how it has been hydrated with additional context such as risk level, solutions, and commentary.
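To make the idea of hydrating a finding concrete, here is a minimal sketch that derives a risk level and builds a ticket-shaped dictionary. Everything in it is an assumption for illustration: the severity table, the scoring thresholds, the field names, and the runbook path are invented, and the real ratings come from Uber's internal vulnerability framework, which the post does not detail.

```python
from typing import Dict

# Hypothetical base severities per issue type on a 0-10 scale.
BASE_SEVERITY: Dict[str, int] = {
    "s3_public_acl": 8,
    "sg_open_to_world": 7,
    "iam_key_unrotated": 4,
}


def risk_level(issue_type: str, internet_facing: bool) -> str:
    """Map an issue type plus exposure to a coarse risk level (illustrative thresholds)."""
    score = BASE_SEVERITY.get(issue_type, 5)  # unknown issue types default to 5
    if internet_facing:
        score += 2
    score = min(score, 10)
    if score >= 9:
        return "critical"
    if score >= 7:
        return "high"
    if score >= 4:
        return "medium"
    return "low"


def build_ticket(finding: Dict[str, object]) -> Dict[str, str]:
    """Turn a raw finding into ticket fields enriched with risk and a runbook pointer."""
    issue = str(finding["issue_type"])
    level = risk_level(issue, bool(finding.get("internet_facing", False)))
    return {
        "summary": f"[{level.upper()}] {issue} on {finding['resource_id']}",
        "assignee": str(finding["owner"]),
        "risk_level": level,
        "runbook": f"runbooks/{issue}",  # placeholder path, not a real URL
    }
```

The resulting dictionary could then be handed to whatever ticketing client is in use; that integration is deliberately left out here.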

CMON automatically processes security issue lifecycle events such as opening, resolving, and whitelisting. It also leverages our Engineering Security team’s vulnerability framework to calculate risk score and urgency. It provides constant visibility through Jira dashboards.
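The lifecycle handling described here can be pictured as a small state machine. The sketch below is one possible shape, not CMON's actual logic: the `State` names, the event strings, and the transition table (including the choice that a whitelisted finding stays whitelisted when detected again) are assumptions made for this example.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class State(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    WHITELISTED = "whitelisted"


# Hypothetical transition table keyed by (current state, event).
TRANSITIONS: Dict[Tuple[Optional[State], str], State] = {
    (None, "detected"): State.OPEN,                        # first sighting opens a ticket
    (State.OPEN, "fixed"): State.RESOLVED,                 # verified fix closes it
    (State.OPEN, "whitelisted"): State.WHITELISTED,        # owner accepts the finding
    (State.RESOLVED, "detected"): State.OPEN,              # regression reopens it
    (State.WHITELISTED, "detected"): State.WHITELISTED,    # assumed: stays suppressed
}


def next_state(current: Optional[State], event: str) -> Optional[State]:
    """Return the new state, or leave the state unchanged for unknown transitions."""
    return TRANSITIONS.get((current, event), current)
```

Keeping the transitions in a table makes it easy to review which events are allowed to reopen or suppress a ticket.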

Why take a hybrid approach?

When we looked at existing cloud native, open source, and commercial third-party solutions, we realized that these solutions individually cannot achieve our cloud monitoring goals. Our hybrid platform takes advantage of all these solutions and delivers a great experience for our customers and stakeholders without reinventing the wheel.

Repurpose

Uber’s Engineering Security organization maintains a portfolio of security services and we wanted to leverage these services to also help manage security risk in the cloud. So we built CMON to integrate with our ever-growing services portfolio.

Automate

Hammer has built-in capabilities to discover cloud security issues and verify their fixes. CMON has several automated features such as periodic follow-ups/escalations, risk visibility, and incident management.

Empower

Our internal customers are empowered to triage cloud security findings based on risk, mitigate security issues, and mark findings as false positives, all by themselves. Every finding links to an issue-specific runbook that provides multiple remediation options along with practical tips for choosing the option that suits their requirements.

Scalable & cost effective

As our AWS environment grows, we want the solution to grow on demand without incurring significant cost changes. Hammer is a serverless platform that abstracts away complexities related to provisioning/scaling compute/database capacity. It allows us to pay only for the AWS resources we actually use.

Extensible

We wanted to build a platform that could monitor Uber-specific policies. The platform should also automate the processing of findings from third-party monitoring solutions. Hammer is highly extensible: each monitoring rule uses its own set of Lambda functions and a DynamoDB table. Hammer is also resilient and fault tolerant: each monitoring rule is isolated, so a faulty monitoring rule cannot bring down the whole system.
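The isolation property can be illustrated in miniature. In Hammer the isolation comes from running each rule as separate Lambda functions; the sketch below is only an in-process analogy in which each rule runs inside its own error boundary. The `run_rules` function and the sample rules are hypothetical and are not part of Hammer's API.

```python
from typing import Callable, Dict, List, Tuple

# A rule takes an account snapshot and returns the IDs of violating resources.
Rule = Callable[[dict], List[str]]


def run_rules(
    rules: Dict[str, Rule], account: dict
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Run every rule independently; a failing rule is recorded, not propagated."""
    findings: Dict[str, List[str]] = {}
    errors: Dict[str, str] = {}
    for name, rule in rules.items():
        try:
            findings[name] = rule(account)
        except Exception as exc:  # isolate the faulty rule from the rest
            errors[name] = repr(exc)
    return findings, errors


def public_buckets(account: dict) -> List[str]:
    """Sample rule: flag buckets marked public in the snapshot."""
    return [b["name"] for b in account["buckets"] if b["public"]]


def broken_rule(account: dict) -> List[str]:
    """Sample faulty rule: reads a key the snapshot does not contain."""
    return account["does_not_exist"]
```

With an account snapshot containing one public bucket, `run_rules` still reports that bucket under `public_buckets` while the failure of `broken_rule` is captured in the error map.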

Operations made slightly less hard

Having built this system, we realize it is similar to pedaling a bicycle: it increases our speed, but it still needs a human operator. As anyone who does long-distance biking knows, building the bike is the easy part. Actually riding the bike is hard. Having worked to act on the security findings, we want to share some high-level principles about mitigating security risks.

Shut down the Risk Factory

It can be demoralizing when you work on high-risk tickets only to see new, identical tickets appearing. Because the work never ends, it can feel like you are running in place. Where possible, it is worth implementing security measures that address the root cause of recurring security issues. That way your team feels they are working towards an endgame. CMON associates a risk rating with every finding. These ratings enable resource owners to go after the riskiest findings first.

One to Zero

On the same note, one to zero is a powerful motivation. Focusing on specific issues is more effective than generalized attention on all issues. Reducing open risk items from 728 to 492 may not feel like a win, but permanently closing out all tickets for a specific issue is a significant victory. Think of how you can break up large goals into one vs zero metrics that inspire and motivate your team.
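One way to track a one-vs-zero metric is to count open findings per issue type and report the types that have dropped to zero between two snapshots. The sketch below assumes findings are dictionaries with `issue_type` and `status` keys; those field names and both helper functions are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def open_by_issue(findings: Iterable[Dict[str, str]]) -> Counter:
    """Count open findings per issue type."""
    return Counter(f["issue_type"] for f in findings if f["status"] == "open")


def closed_out(before: Counter, after: Counter) -> List[str]:
    """Issue types that had open findings before and have none now."""
    return sorted(t for t in before if after.get(t, 0) == 0)
```

Reporting `closed_out` alongside the total open count highlights the fully eliminated issue classes rather than just the aggregate number.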

Organic > Mechanical Outreach

We built CMON to bring errors and alerts directly to the customer because of how important a role they play. Good judgement on customer outreach is quite important. Outreach comes in two types: organic and mechanical. As a general rule, customers respond much better to organic, personal outreach. Take some time to craft your tickets, messages, and even your automated emails so that you receive quick and productive responses from your customers.

Read more about our implementation with specific case studies in part two of this series here.

Acknowledgments

We’d like to thank the following:

  • Dow Jones Product Security for all the collaboration with Hammer
  • AWS Enterprise Account and Support Team for providing guidance throughout the process
