These common data silos are driving security data lake adoption

Omer Singer
Published in Geek Culture
5 min read · Jun 28, 2021

Here’s a term that cybersecurity practitioners should adopt from the analytics industry: “data silo”. A data silo is any place where data is left behind rather than collected into a central source of truth. Silos make it hard to mine data for insights, and busting them is a big part of data engineering. Increasingly, security teams are facing data silos because cloud migration and work-from-home are outpacing traditional Security Information and Event Management (SIEM) solutions.

Why are SIEM solutions responsible for creating data silos? Many SIEM solutions charge by daily ingest volume, a model that doesn’t align with the massive amount of logs generated by public cloud infrastructure. While security budgets increase linearly at best, AWS and Azure record exponentially more telemetry than on-prem data centers. Other security data sources are exploding as well, forcing security teams to accept gaps in their source of truth. In some cases, valuable log sources are shut off entirely to avoid exceeding daily ingest limits imposed by the SIEM vendor.

Another way that many SIEM products cause siloing is through multi-tier architectures. As they scramble to keep up with data volumes on one hand and regulatory requirements on the other, legacy solutions copy data out to various archive tiers. Restoring or “rehydrating” cold data is a process that can take days or weeks.

Even SIEMs that don’t impose daily collection quotas or retention limits can create silos by keeping security data away from business data and the enterprise analytics team. Segregating security from the rest of the enterprise data stack makes it hard to apply helpful business context to security logic, and forces cyber to solve data ingestion and visualization challenges without support from the company’s ETL and BI tools, processes and experts.

The harmful impact of data silos should not be underestimated. When a security data source is not centralized, it cannot be used in correlation with other data for threat detection. Incident responders struggle to separately search through disparate data silos when investigating a possible breach. And when data silos are prevalent, it’s easy to miss visibility gaps until they’re involved in an incident.

Certain data sources are frequently cited as drivers for security data lake initiatives. The following are high volume security datasets that should be centralized for effective threat detection and response:

  • CrowdStrike raw event data: Unlike the antivirus agents of the past, Endpoint Detection and Response (EDR) agents track nearly every action that takes place on the endpoint. Solutions like CrowdStrike act as a “flight recorder” that creates a record of every file created, process started and connection established. These logs are immensely valuable in detecting and responding to threats, especially for remote workers who don’t operate behind a corporate firewall. CrowdStrike EDR is often the first line of defense, but the sheer volume of telemetry prevents it from being collected into the SIEM. Instead, many organizations collect only the alert records (less than 0.1% of the total data) while leaving the forensic details siloed in CrowdStrike’s environment. CrowdStrike holds the enriched sensor data for only a short while (7–90 days depending on your plan) and in fact recommends creating an “offline replica of enriched sensor data for use in local data warehouse or data lake, and correlation against logs collected from other systems.” Not combining endpoint data with other security and business datasets significantly limits EDR’s contribution to the organization’s overall threat detection effort, while less than a year of retention jeopardizes breach investigations.
  • AWS S3 access logs: Many breaches in recent years involved sensitive files leaking from cloud storage buckets. To mitigate the impact of these incidents, it’s critical to know what files were downloaded from affected buckets and by which users. Many security organizations don’t take into account that this logging option is disabled by default. Storing S3 access logs outside of the SIEM, or not enabling this logging in the first place, is a common data silo situation that hampers cloud security investigations.
  • AWS VPC flow logs: Like S3 access logs, VPC flow logs are disabled by default in AWS. This isn’t because they’re not valuable. Spotting an attacker in cloud infrastructure where systems are constantly popping in and out of existence requires network visibility, in addition to threat intelligence and anomaly detection. Failing to enable or collect flow logs represents a serious gap that is often caused by the prohibitive cost of SIEM ingestion.
  • Windows PowerShell activity: Ransomware and other attack tools often use Windows PowerShell scripting to run on Windows servers and avoid detection while they encrypt their victim’s files. Unfortunately, most IT organizations don’t track PowerShell activity due to the volume of legitimate PowerShell events taking place. This leaves security teams unable to determine the extent or “blast radius” of a breach where PowerShell-based malware was used.
  • Salesforce monitoring events: Salesforce provides a real-time event stream for security visibility. This data is helpful for detecting insider threats, especially when combined with HR termination records, but is often left siloed within the CRM. Out of sight, out of mind.
  • Snowflake access history: Access history is a new view in Snowflake that helps customers produce granular, column-level reports to satisfy compliance audits. It can be analyzed to identify insider threats or compromised user accounts. If a corporate laptop is compromised, investigators should confirm that the attackers did not abuse the owner’s Snowflake access to download sensitive information.
  • SharePoint audit reports: Microsoft SharePoint supports detailed audit reports on document access. Breach investigators should have easy access to user activity in SharePoint and, in case a trojan file is being distributed, security should have visibility into who downloaded it.
  • Palo Alto GlobalProtect VPN activity: The infrastructure that enables remote logins to corporate networks is increasingly targeted by threat actors. This makes VPN activity an important data source to analyze: not just who is entering, but what they are doing once they’re on the network. A single-dimensional approach where VPN data is stored in a silo makes it hard to spot compromised VPN users. Using multiple dimensions of analytics, by joining VPN activity with endpoint telemetry and HR records, can highlight when a remote user is acting without intention or permission.
  • Jira ticket updates: While many engineering organizations have embraced DevOps to “shift left”, most security teams haven’t taken the steps necessary to keep up. This means that frequent deployments within cloud infrastructure translate to noisy alerts and burnout for security analysts. What if Jira ticket information could be factored into alert logic? A change to an Azure security group might not need to ring alarm bells if it was appropriately authorized in a corresponding Jira ticket. Conversely, a seemingly minor change in IAM policy should trigger a security investigation if it wasn’t first approved in a ticket.
  • Tenable vulnerability findings: Server patch levels are highly relevant to threat detection and response. Any detection on an unpatched system may need to be quickly escalated to contain the incident, even if the alert wouldn’t otherwise be considered urgent. Especially in fast-changing cloud environments, incident responders should know if a system was briefly exposed in an unpatched state. Siloing vulnerability data in the vulnerability management solution prevents these valuable insights.
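
To make the Jira example above concrete, here is a minimal sketch of what "factoring ticket information into alert logic" could look like once both datasets live in the same queryable store. It uses an in-memory SQLite database as a stand-in for the data lake. The table names (config_changes, jira_tickets), their columns, the 'Approved' status value and the sample rows are all hypothetical, invented purely for illustration; they are not the schema of Jira, Azure or any particular product.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical, made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical table of infrastructure change events
        CREATE TABLE config_changes (
            change_id  INTEGER PRIMARY KEY,
            resource   TEXT,
            actor      TEXT,
            ticket_key TEXT   -- ticket referenced by the change, if any
        );
        -- Hypothetical table of ticket records
        CREATE TABLE jira_tickets (
            ticket_key TEXT PRIMARY KEY,
            status     TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO config_changes VALUES (?, ?, ?, ?)",
        [
            (1, "azure-nsg/web-tier", "alice", "OPS-101"),    # approved ticket
            (2, "iam-policy/billing-read", "bob", None),      # no ticket at all
            (3, "azure-nsg/db-tier", "carol", "OPS-102"),     # ticket not approved
        ],
    )
    conn.executemany(
        "INSERT INTO jira_tickets VALUES (?, ?)",
        [("OPS-101", "Approved"), ("OPS-102", "Open")],
    )
    return conn


def find_unapproved_changes(conn):
    """Return changes that are not backed by an approved ticket.

    The LEFT JOIN keeps every change; rows where no approved ticket
    matched have NULL ticket columns and are the ones worth alerting on.
    """
    query = """
        SELECT c.change_id, c.resource, c.actor
        FROM config_changes AS c
        LEFT JOIN jira_tickets AS t
               ON t.ticket_key = c.ticket_key
              AND t.status = 'Approved'
        WHERE t.ticket_key IS NULL
        ORDER BY c.change_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for change in find_unapproved_changes(build_demo_db()):
        print("needs review:", change)
```

In this sketch the approved security group change stays quiet, while the IAM policy change with no ticket and the change tied to a still-open ticket are surfaced for review. The same join pattern would apply to the other pairings mentioned in the list, such as VPN activity with HR records, but only when the datasets sit side by side rather than in separate silos.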

This is far from being a complete list of commonly siloed security datasets. The important takeaway is that limitations in the security data stack, with the SIEM as the data platform, have caused dangerous silos that are becoming more severe and impactful. If you recognize these in your environment, it might be time to consider a security data lake architecture.

I believe that better data is the key to better security. These are personal posts that don’t represent Snowflake.