Detection Engineering for Cloud-Native Security

Rain Capital
Mar 19, 2020 · 9 min read

by Jamie Lewis, Venture Partner

The rapid move to cloud-native architectures is having a profound impact on enterprise security posture and operations. In the world of containers, microservices, and orchestration frameworks, the notions of an “application” running on a “machine” in a persistent “state” are obsolete. The application, or service, is now a distributed system, consisting of multiple components running on a highly variable number of nodes, in a nearly constant state of change. Traditional security controls that rely on machine isolation and a predictable system state are ineffective. Security policies that are blind to service-to-service communications and controls that lack horizontal scalability simply cannot keep pace with today’s microservice applications.

To protect business assets in cloud-native environments, organizations must bring security practices and technologies into architectural alignment with the systems they are meant to protect. Just as DevOps enables continuous development and deployment pipelines, “DevSecOps” must enable continuous security pipelines. That means establishing new methods, capabilities, and instrumentation, ensuring that systems designed to protect cloud-native systems embody these basic characteristics:

  • Extensive, real-time visibility: Partial or after-the-fact visibility will not suffice. Both the infrastructure layer and applications, wherever they are, must be visible.
  • Rapid, iterative feedback loops: Feedback loops allow security measures to adapt continually to rapidly changing environments.
  • An engineering approach to solving security problems: Automation, continuous measurement, and controlled experimentation will be the predominant method of solving security problems across the enterprise, replacing manual analysis and control updates.

Detection engineering is a pivotal aspect of this alignment. Detection engineering uses automation and leverages the cloud-native stack to discover threats and orchestrate responses before they can do significant damage. As part of a move to DevSecOps, detection engineering can improve security posture. Consequently, organizations making the transition to cloud-native architectures should consider how and when to incorporate detection into their security programs.

Detection Engineering

Detection: The debate over preventative versus detective controls isn’t new. Conventional wisdom holds that the key is finding the right balance between the two given an organization’s risk profile. But most enterprises have invested significantly more in prevention than they have in detection. In fact, much of the cybersecurity industry has focused on the creation, marketing, and sale of prevention technologies and products. But those products are failing. Insider threats, social engineering, zero-day attacks, determined (often state-sponsored) attackers, and many other factors have made an over-reliance on prevention a losing bet. It’s simply smarter, and more effective, for security managers to focus on detection rather than attempting to build impenetrable systems.

Engineering: The goal of any successful alerting system is to separate signal from noise, distilling meaningful and actionable alerts from the collection of event information, moving them up the chain for remediation and response. In a typical security operations center (SOC), analysts process those alerts, determining their severity and whether to escalate them to a higher-level analyst or incident response team. Processing alerts involves compiling contextual data (who, what system, how it’s used, what roles, and so on), filtering according to some or all of that contextual data, comparing events with threat feeds, and assembling a coherent picture of what happened — all before deciding what to do about the alert.

Given the sheer volume of alerts and event logs complex systems can create, this is a staggering task at best. Overwhelmed by mountains of busywork, security programs suffer from analyst burnout and alert fatigue. As analysts become desensitized to the staggering data load, real problems slip through their fingers. As Ryan McGeehan said in a recent post, “when a human being is needed to manually receive an alert, contextualize it, investigate it, and mitigate it . . . it is a declaration of failure.”

Consequently, organizations such as Netflix, Lyft, and Square have started treating threat detection as an engineering problem, using automation to avoid these pitfalls and make security teams more effective. They are also avoiding the silos that separate detection, response, and development teams, following the DevOps mindset when building detection mechanisms and integrating them with response orchestrations.

Detection Engineering Infrastructure

  • Data sources
  • Event pipelines
  • The correlation engine
  • Response orchestrations
  • Testing and feedback loops

Our definitions and examples of detection engineering infrastructure components rely heavily on the work of Alex Maestretti, formerly the engineering manager of Netflix’s detection and response team, and Ryan McGeehan. We thank them for sharing their insights with us and allowing us to reference them here.

Data Sources

  • Cloud-native visibility: The cloud-native architecture has introduced new layers and components such as containers, orchestrators, service mesh, and others. Collecting data from these systems will likely require new tools and instrumentation.
  • Align instrumentation and log consumption: Instead of creating a central team that deals with logs written by others, security engineers should work directly with DevOps to instrument data collection. The security organization should also hold those engineers accountable for the results. Netflix, for instance, holds the philosophy that the person who logs the data should be the person who consumes that data. This brings a new level of rigor and discipline to the process of creating useful logs.
  • Find the best sources: Different systems support different levels of instrumentation and use different data schemas and models. Security managers need to find and understand the different logs and formats and decide which are relevant to threat detection.
  • Decentralizing detection The traditional detection architecture focuses on gathering a massive amount of data into a centralized data lake and run detection algorithm on it. This means detection can hardly be in real time as the amount of analysis needed is enormous. Utilizing a distributed architecture that performs detection closer to major sources of data allows for much faster detection and higher quality of data to be gathered.

Event Pipelines

  • Stream vs. Batch: Today, many SaaS products and most legacy systems can’t push log and instrumentation data to the security system, making batch processes a necessity. But organizations should instrument services to stream data in real time to event pipelines when practical and possible.
  • Normalization vs. Workflows: Security teams should avoid the difficult work of normalizing event data by creating pipelines for each data source, basing its workflow on the data in that system. Templates and reusable modules to streamline work on common data types can streamline these efforts.
  • Alert Frameworks: A rigorous framework for creating events and alerts, and the rules that drive them, is essential. The process should follow the same engineering standards that govern software projects, making alerts subject to peer review and using a version-controlled repository. And the person who writes an alert should be accountable for its results. Palantir’s incident response team posted an excellent write up on such a framework.
  • Enrichment: The low quality of security alerts is a major contributor to alert fatigue. Automation should enrich event data, adding critical information, eliminating manual labor, and improving alert quality. In general, less-expensive enrichments — such as a health and schema check or lookups for specific values, such as the geo-location of an IP address — occur in the event pipeline.
  • Machine Learning: Proven machine learning techniques increase the ability of the system to detect active threats and conditions that warrant further investigation, reducing false positives. Algorithms for feature extraction run on the data, building models of actions, looking for anomalies and conditions, generating model-based events that drive event triggers.
  • Forensic Data Storage: Detection engineering infrastructure should store and archive event data, which can be crucial when it comes to forensic activity.

The Correlation Engine

  • Choice of Platform: The correlation engine is typically a data analytics platform. Commercial detection engineering products such as Capsule8 include their own data analytics engine. Some organizations rely on Splunk while others use Elasticsearch. (Disclosure: Rain Capital has an investment in Capsule8.)
  • Further Enrichment: The correlation engine uses rules and more expensive enrichment to determine whether the system triggers an alert sent to a human or invokes automated response mechanisms. Security teams will need additional tools and services to gather and include data such as the user accounts involved in an event, their security classification, who they work for, and their contact information. Context-specific information can include screenshots of what the user saw (in the case of a phishing attack, for example), how a given operation was launched (manually vs. automatically), what privilege levels were used, and any privilege escalation that occurred. Automated communication mechanisms, such as Slackbots, can reach out to get confirmations of activity or more information from users.

Response Orchestration

  • Automate common fixes: If a common temporary fix involves re-deploying a cluster in Kubernetes, for example, a workflow could re-deploy it automatically, triggered by a rule. Alternatively, an alert could include re-deployment as an option, allowing the human receiving the alert to do so with just the click of a mouse, saving a great deal of time and effort.
  • Build in communication: Effective response and escalation orchestration should include mechanisms that can quickly bring users, developers, and other relevant players into the communication loop. Integration with Slackbots is often a key component of detection engineering. Dropbox recently released Securitybot, a communication mechanism designed specifically for integration with detection engineering systems, under an open source license.

Testing and Feedback Loops

Conclusion

This post is an excerpt from a paper originally published at raincapital.vc

Originally published at https://www.raincapital.vc on March 17, 2020.

The Startup

Get smarter at building your thing. Join The Startup’s +724K followers.

Rain Capital

Written by

Rain Capital is a cybersecurity venture fund based in the San Francisco bay area. A women-led and -managed fund, Rain invests in disruptive security companies.

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +724K followers.

Rain Capital

Written by

Rain Capital is a cybersecurity venture fund based in the San Francisco bay area. A women-led and -managed fund, Rain invests in disruptive security companies.

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +724K followers.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store