Building a Security Data Lake for the Future: Sygnia’s Migration from Elasticsearch to Snowflake

Since transitioning from Elasticsearch to Snowflake, Sygnia has seen a dramatic transformation in operations, eliminated storage constraints, lowered data platform costs by more than 35%, significantly reduced the number of support tickets, and empowered our analysts to run more complex queries with ease. These improvements have revolutionized how we operate, amplifying our capacity to address a broader client base with unparalleled efficiency.

Authors

Yuval Yogev, Chief Architect at Sygnia
Shira Maximov, R&D Team Lead at Sygnia

What We Do

Sygnia is a top-tier cyber technology and services company. We work with leading organizations worldwide, including Fortune 100 companies, to proactively build their cyber resilience and to respond to and defeat attacks within their networks.

Sygnia offers clients a full range of cyber readiness and response services to support every phase of an organization’s cyber resilience journey.
In this blog post, we will focus on two of these services that are most connected to the migration:

  • Managed Extended Detection and Response (MXDR) — A fully-managed service, operating 24/7, that provides threat monitoring, detection, investigation, and response, to bolster clients’ internal proactive security capabilities.
  • Incident Response (IR) — A service that provides clients with rapid breach containment, eradication, and digital forensics, to minimize the business disruption and damage resulting from a cyber attack.

Both our IR and MXDR services are powered by Velocity, Sygnia’s state-of-the-art, unified cyber platform for detection and response (XDR).

How We Do It

Comprehensive Data Collection: The Foundation of Cyber Security Intelligence

At Sygnia, every cyber investigation begins with extensive data collection — which is essential for creating a comprehensive view of the organization’s cyber security environment.

Using Velocity, we collect vast amounts of information from a variety of sources, including network traffic, logs from security devices, servers, endpoints, applications, and cloud-based resources — as well as forensic artifacts from almost every operating system.

Velocity ingests more than 50 TB per day, from hundreds of different data sources. It parses and enriches the data using various tools and techniques to form a unified, easily searchable security data lake, empowering our analysts to gain meaningful insights into the complete scope of any attack.

Data-Driven Analyst Empowerment: Accelerating Detection

Velocity collects, parses, and enriches data from a multitude of sources. It also uses detection scenarios to monitor security logs emanating from the network, applications, and endpoints.

Velocity provides organizations’ security teams with visibility, clarity, and a deeper understanding of their assets, in addition to details about threats across the entire organization, thus enhancing detection, monitoring, triage, investigation, and threat hunting.

Following is a high-level overview of our architecture:

Velocity High Level Architecture

Background

Existing Data Infrastructure with Elasticsearch

Prior to transitioning to Snowflake, our data infrastructure was built around Elasticsearch, a powerful, open-source search and analytics engine designed for horizontal scalability and reliability.

  • Data format: Elasticsearch stores each document in a JSON format. This flexibility allowed us to handle a wide variety of data types coming from multiple sources, which was ideal for our diverse data needs.
  • Search capabilities: Elasticsearch’s reputation as a very fast search engine was a significant advantage for us. Its powerful, full-text search capabilities, combined with the ability to execute complex queries and aggregations in real time, supported our analysts in identifying and investigating potential threats.

Challenges and Limitations with Elasticsearch

  1. Cost: Finding the optimal balance between storing large volumes of data and ensuring cost-effectiveness in Elasticsearch proved to be quite challenging. During cyber investigations, our aim is to collect and store all relevant data for the investigation. To support that aim, we often had to create additional data nodes, or add more resources to existing nodes. However, when feasible, we reverted to smaller data nodes to maintain cost-effectiveness. This iterative process was time-consuming and prone to mistakes.
    Despite numerous optimization efforts, the expenses associated with storing petabytes of data in Elasticsearch clusters remained substantial — primarily due to the need for increased computing resources.
  2. Retention policy: To keep costs and performance under control, we had to apply retention strategies, which meant moving old data to cold storage, and only keeping data for a limited number of days in Elasticsearch. This was a serious disadvantage when analyzing larger amounts of data, or looking up historical events.
  3. Indexing vs. querying: In Elasticsearch, both storing new data (indexing) and searching through existing data (querying) rely on the same compute resources. This means that the execution of large queries slows down the indexing process, and vice versa, so during periods of high incoming data volume, query runtimes suffer.
  4. Learning curve for new analysts:
    To overcome the difficulties that new analysts had when searching through data, we created a Python DSL (Domain Specific Language) in Velocity that wraps the Elasticsearch query language. This saved time and effort for new analysts, especially during their onboarding period.

Finding ways to solve these problems at scale was essential as our client base continued to grow. After extensive research, we chose Snowflake to replace Elasticsearch as our new data platform.

Why Snowflake?

  • Separation of compute and storage: Snowflake’s architecture allows for independent scaling of compute and storage. This allows us to store huge amounts of data with no retention policy, and to apply separate strategies for ingestion and for query compute.
  • Cost efficiency: By enabling precise control over computing resources and offering less expensive storage options, Snowflake reduces our overall costs — which is especially important when managing large volumes of data.
  • Scalability: Snowflake provides the flexibility to scale up or down easily, accommodating varying demands without requiring extensive infrastructure adjustments. This enables us to actively decide to pay more when we need results quickly, and pay less when we don’t.
  • The power of SQL: Our analysts can use a popular, standard query language, build and debug queries faster, and use complex operations like SQL-joins that were not supported in Elasticsearch.

First Steps in the Migration Process

Schema Strategy

The first challenge we had to tackle was our data model.

In Elasticsearch, we stored data directly in JSON format, without setting up a specific structure first. This method was useful, because we dealt with many different data sources, and often, we didn’t know what kind of fields we would get. JSON is flexible enough to handle this because it allows us to save any data — even if we’re not sure which fields are in the source.

For example:

{
  "timestamp": "2024-03-10T14:48:00",
  "sourceIP": "192.168.1.1",
  "destinationIP": "10.20.30.40",
  "action": {"status": "block", "reason": "SQL injection attempt"}
}

When we moved to Snowflake, we continued to manage data that didn’t fit into a strict format, by using a semi-structured data model. This approach allows us to keep our data organized, without needing a detailed blueprint for every piece of information we collect. In practice, we use various Snowflake-supported data types, and specifically the VARIANT type, which is very useful for storing these flexible structures while maintaining high query performance.

timestamp           | source_ip   | _data (Variant)
--------------------|-------------|------------------------------------
2024-03-10T14:48:00 | 192.168.1.1 | {
                    |             |   "sourceIP": "192.168.1.1",
                    |             |   "destinationIP": "10.20.30.40",
                    |             |   "action": {
                    |             |     "status": "block",
                    |             |     "reason": "SQL injection attempt"
                    |             |   }
                    |             | }
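
For illustration, a minimal table definition for this model might look like the following sketch (the table and column names are hypothetical, not our actual schema):

-- A few promoted columns for common filters, plus a VARIANT column
-- holding the full semi-structured document.
CREATE TABLE firewall_source (
    _id       STRING,          -- document ID, used later for upserts
    timestamp TIMESTAMP_NTZ,   -- promoted for clustering and time filters
    source_ip VARCHAR,         -- promoted for frequent lookups
    _data     VARIANT          -- the raw document, stored as-is
);

-- Fields inside the VARIANT column can still be queried directly:
SELECT _data:action:status::STRING AS status
FROM firewall_source
WHERE source_ip = '192.168.1.1';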

Ingesting the Data into Snowflake

The process of ingesting data differs significantly between Snowflake and Elasticsearch — particularly in the following two specific use cases that we had to consider:

  1. Batch ingestion from multiple sources: To index into Elasticsearch, we used an Elasticsearch REST API request that allows ingesting a batch of documents into multiple indices at once. This is especially useful in cyber security use cases, because the number of sources tends to be large.
  2. Document overriding: When ingesting a document with the same ID, Elasticsearch overrides the document.

To accommodate these unique features in Snowflake, we considered various ingestion techniques, including Snowpipe, tasks and streams, native SQL, and more.
After researching, we decided that Snowpipe, Snowflake’s service for continuously loading data from staged files, would provide the highest performance, with almost no maintenance.

However, given that one Snowpipe can only write to a single destination table, we had to maintain a Snowpipe instance for each of our data sources. This means that, across all of our customers, we would need to maintain hundreds or thousands of Snowpipes.
Writing automation for Snowpipe configurations turned out to be key to allowing for smooth changes in the future. In addition, it meant that when uploading S3 files for Snowpipe, our pipeline had to intelligently split the data according to the source type, while also batching enough records to produce large enough files. Snowflake’s best practice is to write files between 100 MB and 250 MB.

We tried a few configurations and found that file size has a significant effect on Snowpipe credit consumption; after a few tweaks, we found a sweet spot between latency and cost.
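
To give a sense of the setup, here is a minimal sketch of what one of these per-source pipes could look like (the stage, table, and pipe names are hypothetical; the real definitions are generated by our automation):

-- External stage over the S3 prefix that the pipeline writes for this source
-- (credentials / storage integration omitted for brevity)
CREATE STAGE firewall_stage
  URL = 's3://velocity-ingest/firewall/'
  FILE_FORMAT = (TYPE = JSON);

-- One pipe per source: auto-ingest newly uploaded files into that source's table
CREATE PIPE firewall_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO firewall_source (_id, timestamp, source_ip, _data)
  FROM (
    SELECT $1:_id::STRING,               -- document ID added upstream by the pipeline
           $1:timestamp::TIMESTAMP_NTZ,
           $1:sourceIP::STRING,
           $1
    FROM @firewall_stage
  );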

This architecture allowed us to handle incoming data at a large scale with minimal maintenance overhead, but it was an append-only architecture — unlike the ‘upsert’ mechanism provided by Elasticsearch, which allowed us to override a document with the same ID.

To support upserts, we added a dedicated pipeline, with its own Snowpipe; this pipeline in turn writes to a table that has a stream, and a task configured on it. The task periodically checks for updates waiting in that table and updates the relevant records in the destination tables. Updates in Snowflake are heavy because they rewrite the entire micro-partition behind the scenes.

Given this fact, it was important to both reduce the number of updates to a minimum and also optimize the query that the MERGE command is running, by applying different filters to find the affected rows quickly.
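
A simplified sketch of this pattern, with hypothetical names and a single destination table, might look like this:

-- Staging table fed by its own Snowpipe, plus a stream tracking new rows
CREATE TABLE upsert_staging (_id STRING, timestamp TIMESTAMP_NTZ, _data VARIANT);
CREATE STREAM upsert_stream ON TABLE upsert_staging;

-- Task that periodically merges pending updates into the destination table
CREATE TASK apply_upserts
  WAREHOUSE = ingest_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('upsert_stream')
AS
  MERGE INTO firewall_source AS dst
  USING (SELECT _id, timestamp, _data FROM upsert_stream) AS src
    ON dst._id = src._id
   AND dst.timestamp = src.timestamp   -- extra filter to prune micro-partitions,
                                       -- assuming the update keeps the original timestamp
  WHEN MATCHED THEN UPDATE SET dst._data = src._data
  WHEN NOT MATCHED THEN INSERT (_id, timestamp, _data)
    VALUES (src._id, src.timestamp, src._data);

-- Tasks are created suspended; enable the task once everything is in place
ALTER TASK apply_upserts RESUME;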

Here is a detailed diagram of our system architecture, showing how data is ingested using Snowflake pipelines:

Velocity Ingestion Architecture Using Snowpipes

Querying the Data: from Python to SQL

The query language that we used previously was a Python DSL, which allowed analysts to build queries using Python. Together with a hosted notebook editor, this created a very powerful querying environment. We wanted to keep the flexibility of Python, but use a more native SQL approach, now that our data is in Snowflake.

The result is a Python library that manages to balance both needs: analysts can write queries in pure SQL, and also use Python abstractions to automate and extend the query functionality as required.

To help the analysts write complex queries, while keeping them as short and concise as possible, we used two strategies:

  1. User-defined functions (UDFs): UDFs are custom functions created by users to extend Snowflake’s SQL capabilities in ways that are not available through standard SQL functions.
    For example, our v_is_private_ip UDF checks whether an IP address within a dataset is a private IP. Here’s how it might be used in a query:
SELECT _id, source_ip
FROM firewall_source
WHERE v_is_private_ip(source_ip)
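
A UDF like this might be defined along the following lines (a simplified sketch covering the RFC 1918 ranges, not necessarily our production implementation):

-- SQL UDF: returns TRUE if the IP falls in one of the private (RFC 1918) ranges
CREATE OR REPLACE FUNCTION v_is_private_ip(ip STRING)
RETURNS BOOLEAN
AS
$$
    ip LIKE '10.%'
    OR ip LIKE '192.168.%'
    OR (ip LIKE '172.%'
        AND TO_NUMBER(SPLIT_PART(ip, '.', 2)) BETWEEN 16 AND 31)
$$;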

  2. Transformations: We added a translation layer in Velocity that transforms the query before it is sent to Snowflake, helping our analysts write complex queries.

A key example of this is our v_sources() transformation. In Elasticsearch, querying multiple indices is straightforward, but in Snowflake, querying multiple sources typically requires performing UNIONs across various tables.
The v_sources() transformation simplifies this process by allowing analysts to query multiple tables as if they were a single source.

For instance, the query:

SELECT _data 
FROM v_sources('source_1', 'source_2')

will be transformed into this:

SELECT "_data", "_data_source" 
FROM
(
SELECT
_data[ '_data_source' ] AS "_data_source",
_data AS "_data",
FROM
source1
UNION ALL
SELECT
_data[ '_data_source' ] AS "_data_source",
_data AS "_data",
FROM
source2
)

The Python library allows us to inject defaults, like time filtering, sorting, and limits, and also to add logs, metrics, and guardrails to large queries.
As a final touch, we even added native SQL query highlighting and autocomplete inside our Python library, which makes the overall experience very smooth.

This approach turned out to be highly successful, allowing for powerful automations and knowledge management — all while using the wildly popular SQL language.

Optimizations

After implementing the ingestion, and a Python package for querying the data, we started running our security detection scenarios — which are SQL queries.
The performance was good, but not great; we had to apply some optimizations to make sure all of the different use cases met their SLAs.

Snowflake’s team played a crucial role in this process, working closely with us to find the best solutions for our needs, and developing specific enhancements to meet our requirements.

Auto-Clustering

When querying cyber security logs, analysts often query on the last ‘X’ days — meaning that
the timestamp field plays an important role when querying the data. As such, we set our tables to automatically cluster by timestamp.
This setup, known as auto-clustering, helps organize the data efficiently — with the trade-off that it uses more compute.

Auto-clustering allows for every query that uses a timestamp filter to prune partitions very effectively and skip expensive data scans. Since our queries make extensive use of timestamp filters, the trade-off was worth it, both to allow fast querying, and simultaneously to reduce costs on the query side.
Auto-clustering also enhances top-k pruning, which we found to be one of the most critical optimizations for cyber security use cases.
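
Enabling this is a single table-level setting; for example, with a hypothetical table name:

-- Cluster the table by the timestamp column so that micro-partitions cover
-- narrow time ranges and time filters can prune most of the table.
ALTER TABLE firewall_source CLUSTER BY (timestamp);

-- Inspect how well the table is clustered on that key
SELECT SYSTEM$CLUSTERING_INFORMATION('firewall_source', '(timestamp)');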

For example, if a query filters on a timeframe between 14:00 and 15:00, it will skip scanning partition 2 and scan only partition 1.

Top-k Queries

Analysts often need quick access to recent events, such as viewing the last 100 connections to a domain controller, or the last 10 executions of binaries matching specific Indicators of Compromise (IOCs).

Here’s a typical query that analysts might use:

SELECT timestamp, file_path 
FROM process_data
WHERE status = 'EXECUTED' AND sha1 = '2aae6c35c94fcfb415dbe95f408b9ce91ee846ed'
ORDER BY timestamp DESC
LIMIT 10;

Queries with only a LIMIT clause can be made faster by simply stopping execution as soon as enough rows have been seen, but when an ORDER BY clause is included, Snowflake’s engine might have to scan the entire table in order to find the rows that qualify for the result.
For example, in the query above, the engine might have to scan the entire table in order to retrieve only the 10 entries with the largest timestamp.

Since we use auto-clustering by timestamp, we can benefit from the top-k feature that allows us to skip fetching and scanning entire sections of tables that do not qualify for the result.

As Sygnia analysts’ primary requirement is generally to see the last ‘X’ records, we used our translation layer to add ORDER BY and LIMIT to every query, ensuring that every query will benefit from the top-k feature.

While initial query runtimes were already impressive, the Snowflake team released a second version of the top-k optimization that dramatically increased efficiency, cutting query times from minutes to milliseconds.

For example, our translation layer adds ORDER BY and LIMIT by default, so this query:

SELECT *
FROM source1;

will result in:

SELECT * 
FROM source1
WHERE timestamp BETWEEN DATEADD(day, -1, CURRENT_TIMESTAMP()) AND CURRENT_TIMESTAMP()
ORDER BY timestamp DESC
LIMIT 10;

As shown above, we also add a filter on the last day by default, to optimize the queries analysts typically run, while still giving them the freedom to change those defaults.

Search Optimization

When filtering by other fields that are not clustered, Snowflake sometimes uses column metadata to filter out data, or additional indexing to speed up querying on specific fields. For example, if users tend to filter out Windows Event Logs by the ‘event_id’ field, it can be a good idea to enable search optimization on that column.

The search optimization service can significantly reduce the number of micro-partitions scanned for some queries, leading to remarkable improvements in performance.

As search optimization can consume a significant number of credits, it should be used when the cost and performance trade-off is worth the benefit to operational needs.
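
Turning this on for a specific column is a single statement; for example, on a hypothetical table of Windows Event Logs:

-- Enable search optimization only for equality lookups on event_id,
-- rather than for the whole table, to keep credit consumption under control.
ALTER TABLE windows_event_logs
  ADD SEARCH OPTIMIZATION ON EQUALITY(event_id);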

Full Text Search

One specific search pattern that our analysts used in Elasticsearch was full text search. This is a very powerful capability: analysts often search for specific IOCs such as IPs, hashes, and URLs, but they would prefer to search across all the fields in a document rather than a specific one.

The Snowflake team is working hard on supplying this functionality, using a new Search Optimization type that will speed up full text search queries and will unlock a significant use case for analysts, allowing for dynamic text searches across our semi-structured data, without costly full table scans.

The Migration Process

After the development phase, we carried out the migration from Elasticsearch to Snowflake with great care, as our goal was to avoid downtime in critical Incident Response and MXDR services.

Here’s how we did it:

  1. Starting in the development environment:
  • We started the transition in our development environment, where we could try things out without affecting our primary operations.
  • Running Elasticsearch and Snowflake side-by-side in this test environment allowed for direct comparison, allowing us to catch any errors at an early stage of migration.

  2. Getting analysts involved early:

  • We made sure our data analysts were part of the process from the start. They know our data well, so they could help spot anything that didn’t look right.
  • This also helped the analysts get used to Snowflake gradually, and provide feedback on how to make it easier for them to use.

  3. Monitoring:

  • We kept a close eye on how things were going in the dev environment, fixing any problems we found, and gathered feedback from our users.
  • We also built dedicated automation to help us make sure every event is accounted for, and that detections are running as expected.

  4. Scaling up:

  • After everything worked well in the dev environment, we tried it in a larger setting to make sure our plan would hold up with a lot more data, still indexing to both Snowflake and Elasticsearch in parallel.
  • We conducted the same steps on this larger stage: running both systems, listening to feedback, and making adjustments.

  5. Migrating production environments:

  • Migration was done in chunks of six to seven tenants per week, to ensure that the transition would be seamless:
  • We enabled Snowflake in the tenants, and validated that it was running as expected while Elasticsearch was still running.
  • We replaced all the security alerts running on Elasticsearch with new alerts that we re-wrote to run in Snowflake.
  • After receiving approval from the security teams, we shut down Elasticsearch in the environment.

The Impact

  1. Cost reduction: More than a 35% reduction in costs — with further decreases expected, as we plan to work on more optimizations in the future, such as a streaming pipeline.
  2. Limitless data: Velocity is now able to save all the data, from all sources, across all time periods.
  3. Enhanced scalability: Auto-scaling of compute and storage, without manual intervention.
  4. Operational efficiency: Fewer support tickets opened to our R&D team and less downtime.
  5. Better user experience: Heavy queries no longer affect indexing performance, and vice versa. Moreover, the ability to search all the data, with no storage limitations, further enriches the user experience.

Lessons Learned

We invested a lot of effort in making sure that query and ingestion operations would run quickly, while keeping costs and maintenance low.
This was a challenging task, and in the course of this migration we learned that:

  1. Ingestion with Snowpipes is the fastest, cheapest option available that accommodates our needs.
  2. Good auto-clustering can make a major difference to the query runtimes, so it’s important to ensure that micro-partitions are optimally sized.
  3. Top-k can be activated by default in each query, to minimize query runtimes.
  4. Enabling search optimization can improve performance significantly, but could be cost-ineffective, depending on the type of data being searched.
  5. Although UDFs are very effective in multiple use cases, they can sometimes be slower than raw SQL; this means that applying a transformation to the query instead of using a UDF may result in faster execution.

Final Thoughts

Migrating our core system from Elasticsearch to Snowflake was not an easy undertaking. It demanded changes across our product, including changes to our data model — all of which had to be achieved without customer downtime. Unlike a simple lift and shift migration, we had to get all the details right. This involved implementing top-k queries, search optimizations, assessing UDFs versus transformations, creating an efficient warehouse management strategy, and designing a scalable ingestion architecture.

The effort invested was substantial, and absolutely worth it. Now, we can store even larger volumes of data, scale efficiently, reduce infrastructure costs, and provide our analysts with robust detection and analysis tools. Building this data lake allows us to continue to protect our customers, with a solid foundation that is ready to scale for the future.
