Detection Engineering for Cloud-Native Security

Published in

The Startup

9 min readMar 19, 2020

by Jamie Lewis, Venture Partner

The rapid move to cloud-native architectures is having a profound impact on enterprise security posture and operations. In the world of containers, microservices, and orchestration frameworks, the notions of an “application” running on a “machine” in a persistent “state” are obsolete. The application, or service, is now a distributed system, consisting of multiple components running on a highly variable number of nodes, in a nearly constant state of change. Traditional security controls that rely on machine isolation and a predictable system state are ineffective. Security policies that are blind to service-to-service communications and controls that lack horizontal scalability simply cannot keep pace with today’s microservice applications.

To protect business assets in cloud-native environments, organizations must bring security practices and technologies into architectural alignment with the systems they are meant to protect. Just as DevOps enables continuous development and deployment pipelines, “DevSecOps” must enable continuous security pipelines. That means establishing new methods, capabilities, and instrumentation, ensuring that systems designed to protect cloud-native systems embody these basic characteristics:

Extensive, real-time visibility: Partial or after-the-fact visibility will not suffice. Both the infrastructure layer and applications, wherever they are, must be visible.
Rapid, iterative feedback loops: Feedback loops allow security measures to adapt continually to rapidly changing environments.
An engineering approach to solving security problems: Automation, continuous measurement, and controlled experimentation will be the predominant method of solving security problems across the enterprise, replacing manual analysis and control updates.

Detection engineering is a pivotal aspect of this alignment. Detection engineering uses automation and leverages the cloud-native stack to discover threats and orchestrate responses before they can do significant damage. As part of a move to DevSecOps, detection engineering can improve security posture. Consequently, organizations making the transition to cloud-native architectures should consider how and when to incorporate detection into their security programs.

Detection Engineering

Detection engineering is the continuous process of deploying, tuning, and operating automated infrastructure for finding active threats in systems and orchestrating responses. Indeed, both the terms “detection” and “engineering” carry important connotations when it comes to the new approaches to security we’re discussing.

Detection: The debate over preventative versus detective controls isn’t new. Conventional wisdom holds that the key is finding the right balance between the two given an organization’s risk profile. But most enterprises have invested significantly more in prevention than they have in detection. In fact, much of the cybersecurity industry has focused on the creation, marketing, and sale of prevention technologies and products. But those products are failing. Insider threats, social engineering, zero-day attacks, determined (often state-sponsored) attackers, and many other factors have made an over-reliance on prevention a losing bet. It’s simply smarter, and more effective, for security managers to focus on detection rather than attempting to build impenetrable systems.

Engineering: The goal of any successful alerting system is to separate signal from noise, distilling meaningful and actionable alerts from the collection of event information, moving them up the chain for remediation and response. In a typical security operations center (SOC), analysts process those alerts, determining their severity and whether to escalate them to a higher-level analyst or incident response team. Processing alerts involves compiling contextual data (who, what system, how it’s used, what roles, and so on), filtering according to some or all of that contextual data, comparing events with threat feeds, and assembling a coherent picture of what happened — all before deciding what to do about the alert.

Given the sheer volume of alerts and event logs complex systems can create, this is a staggering task at best. Overwhelmed by mountains of busywork, security programs suffer from analyst burnout and alert fatigue. As analysts become desensitized to the staggering data load, real problems slip through their fingers. As Ryan McGeehan said in a recent post, “when a human being is needed to manually receive an alert, contextualize it, investigate it, and mitigate it . . . it is a declaration of failure.”

Consequently, organizations such as Netflix, Lyft, and Square have started treating threat detection as an engineering problem, using automation to avoid these pitfalls and make security teams more effective. They are also avoiding the silos that separate detection, response, and development teams, following the DevOps mindset when building detection mechanisms and integrating them with response orchestrations.

Detection Engineering Infrastructure

In practice, implementing detection engineering requires an integrated infrastructure that consists of the following components:

Data sources
Event pipelines
The correlation engine
Response orchestrations
Testing and feedback loops

Our definitions and examples of detection engineering infrastructure components rely heavily on the work of Alex Maestretti, formerly the engineering manager of Netflix’s detection and response team, and Ryan McGeehan. We thank them for sharing their insights with us and allowing us to reference them here.

Data Sources

An effective detection system must start with data about the applications and infrastructure that it is protecting. But getting the right (and good) data is not always easy. Some of the key aspects to consider are:

Cloud-native visibility: The cloud-native architecture has introduced new layers and components such as containers, orchestrators, service mesh, and others. Collecting data from these systems will likely require new tools and instrumentation.
Align instrumentation and log consumption: Instead of creating a central team that deals with logs written by others, security engineers should work directly with DevOps to instrument data collection. The security organization should also hold those engineers accountable for the results. Netflix, for instance, holds the philosophy that the person who logs the data should be the person who consumes that data. This brings a new level of rigor and discipline to the process of creating useful logs.
Find the best sources: Different systems support different levels of instrumentation and use different data schemas and models. Security managers need to find and understand the different logs and formats and decide which are relevant to threat detection.
Decentralizing detection The traditional detection architecture focuses on gathering a massive amount of data into a centralized data lake and run detection algorithm on it. This means detection can hardly be in real time as the amount of analysis needed is enormous. Utilizing a distributed architecture that performs detection closer to major sources of data allows for much faster detection and higher quality of data to be gathered.

Event Pipelines

Event pipelines gather event data, streamline them via automated mechanisms, and prepare them for further examination before a human ever has to see them. To ensure effective detective controls, security teams must consider these issues when building event pipelines:

Stream vs. Batch: Today, many SaaS products and most legacy systems can’t push log and instrumentation data to the security system, making batch processes a necessity. But organizations should instrument services to stream data in real time to event pipelines when practical and possible.
Normalization vs. Workflows: Security teams should avoid the difficult work of normalizing event data by creating pipelines for each data source, basing its workflow on the data in that system. Templates and reusable modules to streamline work on common data types can streamline these efforts.
Alert Frameworks: A rigorous framework for creating events and alerts, and the rules that drive them, is essential. The process should follow the same engineering standards that govern software projects, making alerts subject to peer review and using a version-controlled repository. And the person who writes an alert should be accountable for its results. Palantir’s incident response team posted an excellent write up on such a framework.
Enrichment: The low quality of security alerts is a major contributor to alert fatigue. Automation should enrich event data, adding critical information, eliminating manual labor, and improving alert quality. In general, less-expensive enrichments — such as a health and schema check or lookups for specific values, such as the geo-location of an IP address — occur in the event pipeline.
Machine Learning: Proven machine learning techniques increase the ability of the system to detect active threats and conditions that warrant further investigation, reducing false positives. Algorithms for feature extraction run on the data, building models of actions, looking for anomalies and conditions, generating model-based events that drive event triggers.
Forensic Data Storage: Detection engineering infrastructure should store and archive event data, which can be crucial when it comes to forensic activity.

The Correlation Engine

The event pipeline passes only the events warranting further inspection onto the correlation engine, which will ultimately determine automated responses to alerts, including notifying humans. Key factors include:

Choice of Platform: The correlation engine is typically a data analytics platform. Commercial detection engineering products such as Capsule8 include their own data analytics engine. Some organizations rely on Splunk while others use Elasticsearch. (Disclosure: Rain Capital has an investment in Capsule8.)
Further Enrichment: The correlation engine uses rules and more expensive enrichment to determine whether the system triggers an alert sent to a human or invokes automated response mechanisms. Security teams will need additional tools and services to gather and include data such as the user accounts involved in an event, their security classification, who they work for, and their contact information. Context-specific information can include screenshots of what the user saw (in the case of a phishing attack, for example), how a given operation was launched (manually vs. automatically), what privilege levels were used, and any privilege escalation that occurred. Automated communication mechanisms, such as Slackbots, can reach out to get confirmations of activity or more information from users.

Response Orchestration

Automation should do as much as possible to mitigate events as part of response orchestration, before an alert reaches a human. When it does reach a human, a properly enriched alert should contain a reasonable set of response actions. The right set of response actions will depend, of course, on the application and the type of event. In general, however, these are some of the factors to consider:

Automate common fixes: If a common temporary fix involves re-deploying a cluster in Kubernetes, for example, a workflow could re-deploy it automatically, triggered by a rule. Alternatively, an alert could include re-deployment as an option, allowing the human receiving the alert to do so with just the click of a mouse, saving a great deal of time and effort.
Build in communication: Effective response and escalation orchestration should include mechanisms that can quickly bring users, developers, and other relevant players into the communication loop. Integration with Slackbots is often a key component of detection engineering. Dropbox recently released Securitybot, a communication mechanism designed specifically for integration with detection engineering systems, under an open source license.

Testing and Feedback Loops

Given a set of response options, the security team should capture feedback on how alerts are working for alert tuning and improvement. These feedback loops should include end-to-end integration testing. This includes working with red team and other testing efforts. McGeehan says security teams should “treat detection the same way you’d treat a build pipeline supported by CI/CD platforms like Jenkins.” He recommends simple scenario tests and writing direct attacks on the detection mechanisms. Canary testing is another useful technique for testing changes to controls. More often than not, the team will gain valuable insight into how the system works, driving improvements in both existing and future detection controls.

Conclusion

As organizations move to cloud-native systems, security must evolve, gaining higher degrees of alignment in terms of the technology stack and DevOps mindset. That means creating continuous security pipelines that accompany the continuous development pipelines inherent to the cloud-native ecosystem. Detection engineering is one way organizations can accomplish that goal.

This post is an excerpt from a paper originally published at raincapital.vc