Shifting from Detection 1.0 to 2.0

As the threat landscape continues to evolve, Threat Detection teams are doing their best to give their organizations enough detection coverage and depth while keeping the noise and overall alert volume those detections generate in check. The backlog of detections that need to be built keeps growing, and alert fatigue continues to be a problem. Reviewing our internal situation, we found that one of the primary reasons for alert fatigue was that all of our detections were atomic: each one surfaced an alert on its own every time it triggered, with no context on the asset or identity tied to the alert. While certain detections will always need to remain atomic, that should not be the default. In addition, many organizations are still operating in the traditional Security Information and Event Management (SIEM) model, where security data is segregated from business data, and lack a detection development process.

How can teams start to break away from this and overcome these challenges? At Snowflake, the Threat Detection team has an evergreen goal of reducing alert noise and improving detection fidelity. As part of our internal transformation, we have grouped these challenges into what we call Detection 1.0. Our shift to Detection 2.0 is highlighted by the following initiatives:

  1. Adopting a security data lake architecture: all activity, asset, and identity data lives in a single analytics platform, with the ability to join individual records together for context
  2. Developing a detection development lifecycle: a well-defined process to build and maintain detections
  3. Implementing Asset and Identity Prioritization + Risk Based Alerting: applying the context of the affected assets and identities to dynamically adjust detection severity, and grouping related detections so that alerts only surface when a given user or asset has breached the defined risk score

With Detection 2.0, we’re reducing the overall alert volume, spending less time triaging false positives, and responding to threats more efficiently based on the risk they pose to our company.

Traditional SIEM Model → Security Data Lake

The traditional model, where the SIEM is a separate data store from the enterprise data, has failed security teams (Figure 1). Most organizations operating a traditional SIEM are not ingesting business data from systems such as Workday, Salesforce, and ServiceNow. This data siloing can make investigations difficult for security practitioners. I have worked on numerous incidents where we could not begin our investigation because the logs we needed were not in the SIEM; they were stored in another logging platform the security team could not access. Furthermore, the volume of log data that security teams need to analyze has increased exponentially. Some teams are forced to pick a subset of critical log sources to ingest into their SIEM, or to reduce retention, because the sheer volume of data from flow logs or Endpoint Detection and Response (EDR) solutions alone can destroy a SIEM license.

Figure 1: Traditional SIEM Model

At Snowflake, with our security data lake architecture, these challenges and gaps have been non-existent, as all data is ingested into a single platform with finely controlled access (Figure 2). I cannot stress enough how invaluable this has been for our security teams. The Threat Detection team is able to build complex correlations that lead to higher-fidelity alerts by joining employee data from Workday, application data from Salesforce, or ticket data from ServiceNow with traditional security log sources. Oftentimes, the first step of an investigation for some alerts would otherwise be to pivot to one of these platforms and pull the data needed, which can delay analysis and case closure. Our security teams have all of the data needed to build complex detections, hunt, conduct analysis and response, and build analytics and metrics in a single, unified cloud data platform. I would like to further highlight that our security data lake architecture enabled us to seamlessly build our Asset and Identity Prioritization framework, which I will discuss later in this post.

Figure 2: Snowflake Security Data Lake
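
To make the idea concrete, here is a minimal sketch of the kind of correlation a security data lake enables: joining authentication logs with Workday employee records in a single query. The table names, columns, and connection details below are hypothetical placeholders, not our actual schema.

```python
import snowflake.connector

# Hypothetical query: successful logins by employees Workday says are terminated.
QUERY = """
SELECT
    auth.event_time,
    auth.user_email,
    auth.source_ip,
    hr.department,
    hr.termination_date
FROM security.okta_auth_logs AS auth
JOIN business.workday_employees AS hr
  ON auth.user_email = hr.work_email
WHERE auth.outcome = 'SUCCESS'
  AND hr.employment_status = 'TERMINATED'
  AND auth.event_time > hr.termination_date
"""

# Hypothetical connection parameters for illustration only.
conn = snowflake.connector.connect(
    account="my_account",
    user="detection_svc",
    authenticator="externalbrowser",
)

for row in conn.cursor().execute(QUERY):
    print(row)
```

Because the HR data and the authentication logs live in the same platform, this is one query rather than a manual pivot between two tools.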

Ad Hoc Detection Development → Detection Development Lifecycle

When detections are built with no formal development process, alert quality suffers and detection tech debt accumulates. Establishing a well-defined Detection Development Lifecycle enhances the quality of the detections built, provides robust documentation, helps with scaling the team, and serves as the foundation for program metrics. Developing this lifecycle was a critical first step in our internal transformation from Detection 1.0 to 2.0, as it enabled us to better understand the challenges and pain points of our detections through metrics and insights from the Monitoring Phase via Detection Improvement Requests (DIRs), Detection Decommission Requests, and Detection Reviews. These mechanisms let us understand and track the quality and performance of our detections, making a strong case for implementing Asset and Identity Prioritization and Risk Based Alerting.
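As a rough illustration of how lifecycle artifacts can feed program metrics, the sketch below shows the kind of per-detection metadata a team might track; the fields and statuses are made up for the example and are not our actual schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DetectionRecord:
    """Illustrative lifecycle metadata tracked for a single detection."""
    detection_id: str
    severity: str                      # informational | low | medium | high | critical
    status: str = "production"         # development | testing | production | decommissioned
    last_reviewed: date | None = None
    open_improvement_requests: list[str] = field(default_factory=list)  # DIR ticket IDs
    true_positives: int = 0
    false_positives: int = 0

    def fidelity(self) -> float:
        """Share of alerts that were true positives: a simple quality metric."""
        total = self.true_positives + self.false_positives
        return self.true_positives / total if total else 0.0
```

Metrics like fidelity, open DIR counts, and time since last review are what turn "ad hoc detection development" into something a team can measure and improve.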

Lack of Alert Prioritization → Asset and Identity Prioritization

For anyone who has ever worked in a Security Operations Center (SOC) with an alert queue, I’m sure you’ve dealt with alert fatigue and the dilemma of which alert to work first. The obvious answer is to work the highest-severity alert, the “criticals” or “highs,” but what do you do when there are multiple alerts of the highest severity in the queue? The solution is for Threat Detection teams to dynamically adjust alert severity based on the criticality of the user or asset tied to the alert: Asset and Identity Prioritization. For example, if two medium alerts have triggered, but one affects a user with admin privileges and the other does not, then it should be clear to the analyst which alert is truly the more critical of the two.

At Snowflake, we started off by building an Asset and Identity Prioritization framework that captures all of the access users have in environments such as Okta, Cloud Service Providers (CSPs), Workday, GitHub, etc. We also added entity information from our EDR solution and CSPs to include general host information, environment, data sensitivity levels, and more. It is important to note that some of the data used is not traditional security data, but rather the roles and permissions a user has for a system. By combining this business and security data, we were able to build a comprehensive framework that calculates the true risk score for an asset or identity and dynamically adjusts detection severity based on that information.
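Here is a minimal sketch of what scoring an identity from combined business and security context might look like; the attributes, weights, and thresholds are illustrative only and are not our actual scoring model.

```python
SEVERITIES = ["informational", "low", "medium", "high", "critical"]

def identity_priority(identity: dict) -> int:
    """Score an identity from attributes gathered across Okta, CSPs, Workday, GitHub, etc."""
    score = 0
    if identity.get("is_admin"):
        score += 40
    if identity.get("csp_privileged_roles"):
        score += 30
    if identity.get("handles_sensitive_data"):
        score += 20
    if identity.get("is_contractor"):
        score += 10
    return score

def adjusted_severity(base_severity: str, priority: int) -> str:
    """Bump detection severity one level when the identity is high priority."""
    idx = SEVERITIES.index(base_severity)
    if priority >= 50:
        idx = min(idx + 1, len(SEVERITIES) - 1)
    return SEVERITIES[idx]

# The same medium detection firing for an admin vs. a standard user:
admin = {"is_admin": True, "csp_privileged_roles": ["AdministratorAccess"]}
standard = {"is_contractor": False}
print(adjusted_severity("medium", identity_priority(admin)))     # high
print(adjusted_severity("medium", identity_priority(standard)))  # medium
```

The point is not the specific weights but that identity context, much of it business data, is what drives the severity adjustment.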

Alert Fatigue + Lack of Correlation → Risk Based Alerting

To build on this further, the next question Threat Detection teams need to start asking is whether every detection needs to be atomic. The answer is no. When all detections are atomic, it almost always leads to alert fatigue for the SOC and IR teams, especially as the number of detections grows. It also puts added pressure on Threat Detection teams building alert content to produce high-fidelity detections, and because of this they will oftentimes forgo building a detection altogether because they can’t get the volume down to an acceptable level. This has been a never-ending cycle of pain for many security operations functions. In my previous experience working as an analyst, I would also find myself manually correlating cases tied to the user or asset of the alert I was currently working. Alert correlation is critical; if it is not being done, especially with multiple analysts working across multiple shifts, it presents a serious gap, and malicious activity can be missed.

At Snowflake, we have been working towards a model where we amalgamate risk scores across entities and build risk-breached detections. All existing detections build risk based on their severity (low, medium, high, critical), and an alert is triggered only when an entity’s accumulated risk score breaches a defined threshold within a certain window of time (Figure 3). I have been part of security teams where we collectively decided to disable all low- and informational-severity alerts because the volume was too much to handle. This model has allowed us to keep all of our informational alerts active without them firing atomically into our alert queue and overwhelming our response team. Informational alerts are given a score of zero, so they will never help to breach risk, but they provide contextual information if they trigger and are tied to an entity that has breached risk. Furthermore, critical-severity alerts will always ensure that risk is breached for a given entity, preserving the atomic nature of our highest severity. As an example, if a user triggers three informational alerts and then later in the day triggers a critical alert, those four alerts will surface as one risk-breached alert for that user when the critical alert triggers.

Figure 3: Risk Based Alert in Panther with two detections for the same system surfaced as one alert
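
A rough sketch of the accumulation logic follows; the scores, threshold, and window are made up for the example and are not our production values.

```python
from datetime import datetime, timedelta

# Illustrative scores: informational adds nothing, critical alone breaches the threshold.
RISK_SCORES = {"informational": 0, "low": 10, "medium": 30, "high": 60, "critical": 100}
RISK_THRESHOLD = 100
WINDOW = timedelta(hours=24)

def risk_breached(events: list[dict], now: datetime) -> tuple[bool, list[dict]]:
    """Return whether the entity breached risk, plus all detections in the window as context."""
    recent = [e for e in events if now - e["time"] <= WINDOW]
    total = sum(RISK_SCORES[e["severity"]] for e in recent)
    return total >= RISK_THRESHOLD, recent

# Three informational detections followed by a critical detection for the same user
# surface as a single risk-breached alert carrying all four as context.
events = [
    {"severity": "informational", "time": datetime(2024, 1, 1, 8)},
    {"severity": "informational", "time": datetime(2024, 1, 1, 9)},
    {"severity": "informational", "time": datetime(2024, 1, 1, 10)},
    {"severity": "critical", "time": datetime(2024, 1, 1, 15)},
]
breached, context = risk_breached(events, now=datetime(2024, 1, 1, 15))
print(breached, len(context))  # True 4
```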

When building this framework, we found that we needed a normalized way to group the assets and identities tied to our detections. For example, some of our detections reference a username, while others reference a user’s email address. To be able to group these detections, we dedicated a significant amount of time to adding an entity array to all of our detections. If there are gaps in the entity array, there is a risk that alerts will not be grouped, leading to misses. Adopting this approach to alerting is not without challenges. The engine is complex, and it becomes difficult to predict what will surface as an alert because engineers give up some degree of control over what gets surfaced. There has to be a significant amount of trust in the detections built and in the scoring model implemented. By adopting a Risk Based Alerting framework, Threat Detection teams gain flexibility in the detections they build, reduce the overall volume of atomic alerts worked without decreasing the number of detections, and get correlation of related alerts.
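As a sketch of the normalization step, the example below resolves differing identifiers, such as a username and an email address, to one canonical entity before grouping; the field names and directory are hypothetical.

```python
# Hypothetical identity directory built from sources like Workday and Okta.
IDENTITY_DIRECTORY = {
    "jdoe": "jane.doe@example.com",
    "jane.doe@example.com": "jane.doe@example.com",
}

def normalize_entities(alert: dict) -> list[str]:
    """Map every identifier in the alert's entity array to its canonical form."""
    return [IDENTITY_DIRECTORY.get(e.lower(), e.lower()) for e in alert.get("entities", [])]

alert_a = {"rule": "okta_mfa_reset", "entities": ["jdoe"]}
alert_b = {"rule": "aws_console_login", "entities": ["jane.doe@example.com"]}
# Same canonical entity, so the two detections group into one risk picture.
print(normalize_entities(alert_a) == normalize_entities(alert_b))  # True
```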

Asset and Identity Prioritization + Risk Based Alerting

As previously mentioned, we are still working towards building some of these solutions, and I am sure there are questions like: Which one should I implement first? Can they be built together? I believe you can choose to combine these efforts or implement them separately. Efforts like normalizing entities will be required for both Asset and Identity Prioritization and Risk Based Alerting. Given that we had an atomic queue at Snowflake, it made more sense to focus on Asset and Identity Prioritization first to better prioritize the alerts already in the queue. The framework built for Asset and Identity Prioritization also naturally feeds into Risk Based Alerting. When Risk Based Alerting is adopted without Asset and Identity Prioritization, a standard user and an admin build risk at the same pace. As an example, let’s assume that both of these users have a risk bucket made up of the alerts they’ve triggered. If both users trigger the same alert, the risk added to their respective buckets will be the same because the context of their identity is missing (Figure 4).

Figure 4: Risk Building without Asset and Identity Prioritization

When Risk Based Alerting is implemented with Asset and Identity Prioritization, the admin will fill their risk bucket at a faster pace than the standard user since there is context around their identity (Figure 5).

Figure 5: Risk Building with Asset and Identity Prioritization
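
A short sketch of the difference the identity context makes is below; the scores and multiplier are made up for illustration.

```python
RISK_SCORES = {"low": 10, "medium": 30, "high": 60}

def accumulated_risk(severities: list[str], identity_multiplier: float = 1.0) -> float:
    """Sum risk for a set of triggered detections, weighted by identity context."""
    return sum(RISK_SCORES[s] * identity_multiplier for s in severities)

detections = ["low", "medium", "medium"]
print(accumulated_risk(detections))                           # 70  -> standard user, no identity context
print(accumulated_risk(detections, identity_multiplier=1.5))  # 105 -> admin crosses a threshold of 100 sooner
```

The same three detections push the admin over the threshold while the standard user stays below it, which is exactly the behavior Figures 4 and 5 contrast.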

Conclusion

This post highlighted our internal transformation from Detection 1.0 to 2.0, which has been, and continues to be, a key focus for the Threat Detection team at Snowflake. The solutions presented are part of our strategy for aggressively tackling our goal of reducing alert noise and improving detection fidelity. One key area of the transformation that I did not cover is automation and SOAR solutions; that tooling is handled by our Incident Response Team at Snowflake. Please let me know what detection efforts your team is working towards, or whether you think we are headed in the right direction at Snowflake. As always, feel free to connect with me on LinkedIn, and I’d be happy to chat.
