Subtlety in synonyms: not my strong suit.

The Expat Guide to Search Optimization Service

Search Optimization Service (or SOS) is a long-standing feature of the Snowflake platform that improves the response time of highly selective queries. It can both reduce cost and increase performance, particularly since new SOS capabilities became Generally Available in April 2023. However, many Snowflake practitioners, even those who have been using the platform for years, don’t know what Search Optimization is, much less how it can help them speed up common query patterns and enable new day-to-day uses of their data.

This guide is designed to help you understand Search Optimization Service and how to use it to speed up your Snowflake queries. It’s broken down into two sections: In Section 1, I Want to Know What SOS Is, you will learn what SOS is, how it works, why you haven’t used it before, and which new features will make SOS more useful for you than ever before. In Section 2, I Want You To Show Me, you will learn the more concrete aspects of using SOS: what to evaluate before rolling it out, how to implement it, how to take advantage of it in your queries, and finally, what to keep in mind as you roll out SOS in your enterprise.

In particular, you should read this article if you have queries that search for:

  1. A very small number of items that will occur infrequently in a much larger dataset that is clustered on a different field, like a unique object UUID in a fact table
  2. Substrings within larger (potentially unstructured) text based fields, like looking for keywords in a notes field
  3. Uncommon values within specific subfields in much larger variants, like a specific IP address in web logs
  4. An event (or a set of events) that occurred within a specific area

Effective application of Search Optimization Service can reduce query times by 95% or more, for very little cost. Panther Labs sped up their point lookups by over 100x. Let’s get started on doing the same!

I Want to Know What SOS Is

Let’s talk about Search

What is Search Optimization, and how does it work?

SOS is one of three serverless storage optimizations available in Snowflake; the other two are Automatic Clustering and Materialized Views. The goal of SOS is to speed up ‘needle in a haystack’ style access to data. Unsurprisingly, this is the kind of query most often used in search use cases.

SOS works by building a secondary data structure called a ‘Search Access Path’. The Search Access Path keeps track of which values can possibly appear in a given micro-partition, and which ones definitely cannot. This gives Snowflake an additional way to exclude micro-partitions that cannot possibly contribute to a query before they’re scanned, on top of the usual pruning based on each micro-partition’s minimum and maximum values for filter and join predicates. Just as without SOS, micro-partitions that remain after this additional pruning are still scanned as normal; SOS only affects the process of selecting which ones should be scanned.

Only this final pruning step is added by SOS, but it can make your queries much faster!

The build process that needs to occur before SOS can start to optimize your queries is managed by Snowflake behind the scenes. This is just like Automatic Clustering, Materialized Views or Dynamic Tables. Also similar to Automatic Clustering is the zero-touch ongoing maintenance of the search path structure, which is managed by Snowflake with no additional input. Both of these processes have a cost, however, which we will discuss in the I Want You to Show Me section below. There is also a cost to storing the search path structure, which is charged similarly to table storage¹.

Once the build process is complete, Snowflake uses a combination of cost-based and heuristic indicators at compile time to determine whether using the search access path is likely to be beneficial. Some outcomes of this process are obvious, such as not using SOS when we have already pruned down to a very small number of micro-partitions. Others are less obvious, such as disqualifying a query from SOS due to a large list of values in an IN predicate. Once the compiler determines that the access path is likely to reduce the query time, the execution layer (virtual warehouse) adds a step before the relevant scan(s): it downloads the files containing the access path, then uses the access path to evaluate which micro-partitions may contain the desired values and which definitely do not. Once the potentially relevant micro-partitions have been identified, they are scanned, filtered, transformed and aggregated as for any other query.

SOS is most effective when it can reduce a very large number of potentially responding micro-partitions to a very small number. This is what we mean by ‘needle in a haystack’ queries. SOS works best for unique values in a large dataset, but it is also particularly effective where clustering on the relevant field is not feasible or useful, such as UUID-based keys in a fact table or variable-length substrings in a text field.

In my life, there’s been heartache and pain

If it’s so great, why doesn’t everyone use Search Optimization already?

In the past, Search Optimization was an ‘all or nothing’ feature when applied to a table: if SOS was enabled for a table, it would create search paths for every column in that table. This could be quite expensive, and it also did not match the pattern of use in most organizations; there’s generally very little need for SOS over metric columns, for example. In addition, previous editions of Search Optimization required full-text matching, so it was not compatible with substring matching or with the variant type. This made it a relatively niche feature, suitable primarily for tables and values built specifically to make use of it. However, new developments have both reduced the cost of the tool and expanded its functionality. Today’s SOS makes it more viable than ever to bring large search use cases into Snowflake.

Through the clouds I see SOS shine

What is this fantastic new functionality?

There are two key ways that Search Optimization Service was improved in early 2023: Cost and Functionality. We’ll include a summary here, but for the full details, check out the announcement posts.

Cost

When it comes to cost, the primary improvement was the ability to select the columns and functions that are relevant for SOS on the table at hand. This massively improved the cost profile for organizations such as Deliveroo, which was already using SOS and was able to reduce its maintenance cost by 90%. In addition, Snowflake released the ability to estimate the costs to build, store, and maintain the search access paths using the SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS function. This ensures that costs are well understood even before the build process begins. Combined, these two additions make Search Optimization viable even in organizations where cost is a primary concern.

Functionality

When it comes to functionality, there have been several roll-outs as Snowflake continually adds value to the Search Optimization feature-set. In April 2023, Snowflake rolled out:

  1. The ability to leverage SOS for searches where the target string was only a substring of the contents of the target field
  2. The ability to leverage SOS to speed up searches over Geospatial data stored in the GEOGRAPHY column type
  3. The ability to leverage SOS to speed up searches for specific values in semi-structured data (VARIANT, ARRAY, and OBJECT data types)

Later in 2023, Snowflake also rolled out the ability to leverage SOS to speed up searches for substrings in semi-structured data.

These new features enable a suite of use cases directly on top of data that many organizations already have stored in Snowflake for other purposes, including:

  • Searching for customers, deliveries, or events in certain geographical areas
  • Finding key words in notes fields on customer records and other unstructured free-text fields
  • Finding specific values in variable-length arrays from data integrations supplied by third-party applications, or confirming that those values never appear

SOS is also a key element in enabling organizations to expand the effectiveness of their Snowflake investment by bringing in data that previously required complicated and expensive third-party tools. For example, many organizations that have outgrown their ELK (Elasticsearch, Logstash, Kibana) toolset have been successful in migrating much of that data and its associated operational analytics to Snowflake, achieving both cost savings and additional value from their data investments. Others have been successful in the cybersecurity sphere, ingesting significant security-focused logs and leveraging SOS for point queries on values such as IP addresses and file hashes.

I Want You to Show Me

Now that we know a little more about what the Search Optimization Service is, and how it works, it’s time to look at how you can implement it. We’ll discuss how to evaluate whether a table is suitable for Search Optimization, estimate the cost of the build and maintenance process for SOS, and start the build process. Then we’ll demonstrate how to write queries that leverage the search path for greater speed and efficiency. Finally, we’ll discuss some of the benefits and caveats of SOS that you need to consider before rolling it out.

A little time to think things over

Planning and implementing SOS

There are two things to consider when looking at whether to apply Search Optimization to a table or not — workload and cost.

Workload

Does the table have a high number of queries that should benefit from Search Optimization? If it doesn’t today, are there use cases for that in the future?

First, look for tables where the primary key, or a frequently looked-up secondary key, is commonly used to find records, but is not automatically well clustered by virtue of the clustering key or ingestion order of the table. This will often show up as very large numbers of partitions scanned for very low numbers of rows returned. For example, consider a UUID key field in a table that’s generally clustered by creation date. The ordering of the UUIDs will have very little to do with date order, but at the same time each UUID will only appear in a single micro-partition.

Second, consider non-key columns whose values appear infrequently throughout the set, but be aware that the more widely distributed the search values are, the less effective Search Optimization will be. For example, consider two software companies with equally sized datasets. One is a B2C organization with millions of customers, while the other is a B2B organization with 10 customers. When searching for all transactions by a single customer, Search Optimization is highly likely to be very useful for the first company, but not for the second, because each customer’s records are spread throughout the dataset. These opportunities can be harder to spot than queries that return a single record, as the number of rows returned might still be reasonably high, but you can still find queries with high numbers of partitions scanned and analyze them from there. A sketch of that analysis follows below.
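If you need a starting point for this hunt, the QUERY_HISTORY view in the ACCOUNT_USAGE schema records partition counts for every query. Here is a minimal sketch; the thresholds are illustrative assumptions to tune for your environment, not canonical values:

-- Surface poorly pruned, highly selective queries from the last 30 days.
SELECT query_id,
       query_text,
       partitions_scanned,
       partitions_total,
       rows_produced,
       total_elapsed_time / 1000 AS elapsed_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
  AND partitions_total > 1000                      -- large tables only
  AND partitions_scanned > 0.8 * partitions_total  -- poor pruning
  AND rows_produced < 1000                         -- highly selective results
ORDER BY partitions_scanned DESC
LIMIT 100;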

Third, consider tables which make heavy use of VARIANT or other semi-structured columns. Snowflake does collect some metadata for these, but it is somewhat heuristic in nature, and frequently doesn’t align with the way the tables are clustered. As a result, pruning without Search Optimization can be limited. Subject to the effects of data distribution discussed above, applying Search Optimization to the fields within the semi-structured data that you frequently query will help Snowflake minimize the scan-set for those queries.

Finally, consider opportunities to leverage geospatial and substring support in Search Optimization. Sometimes these queries won’t appear very frequently in your historic query patterns, because users may have tried them once or twice and given up if they’re very slow. This can change with Search Optimization! You may need to engage with the power users and data owners in your organization to discuss whether they would benefit from faster substring searches and geospatial searches. If so, you can now roll those use cases out very quickly!

Data distribution is key to predicting whether Search Optimization Service will work well for a given predicate.

For more detail on identifying queries that could benefit from Search Optimization, check out the documentation.

Cost

What will the cost be to implement Search Optimization on this table?

It’s important to understand the cost of Search Optimization before you enable it, and once enabled, to continually monitor the ongoing maintenance and storage costs and balance them against the benefits. Snowflake offers several methods for this. To estimate the cost of Search Optimization before starting the build process, use the SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS function, providing the columns or semi-structured paths and the specific Search Optimization type (equality, substring or geo). If the table has been around for longer than a week, the function will return three cost positions: BuildCost, MaintenanceCost and StorageCost. The BuildCost is the one-time cost to build the search path, specified in credits. The MaintenanceCost is the ongoing monthly charge to keep the search path up to date, also specified in credits. Finally, the StorageCost is the amount of space the search path will take up, specified in TB and charged each month. Note that the function provides a best-effort estimate, but in general it’s a very good guide to what the cost will end up being.
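As a concrete sketch, a call for the UUID fact table example from the Workload section might look like this (the table and column names are the illustrative ones used throughout this article):

-- Estimate build, maintenance and storage costs before enabling SOS.
-- Returns a JSON document containing the three cost positions.
SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS(
  'fact_transactions',
  'EQUALITY(uuid_key_column)'
);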

After building SOS out for some tables, you should track the ongoing cost of the feature to ensure that it matches what you estimated during the planning process. For that, use the SEARCH_OPTIMIZATION_HISTORY view in the ACCOUNT_USAGE schema, which tracks both the build costs and the ongoing maintenance costs of Search Optimization. Remember that build costs are usually higher than maintenance costs, so consider the cost over time rather than only the up-front spike of the initial build.
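A simple roll-up of that view by table and month might look like the following sketch:

-- Track SOS build and maintenance spend by table and month.
SELECT database_name,
       schema_name,
       table_name,
       DATE_TRUNC('MONTH', start_time) AS usage_month,
       SUM(credits_used)               AS credits_used
FROM snowflake.account_usage.search_optimization_history
GROUP BY 1, 2, 3, 4
ORDER BY usage_month, credits_used DESC;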

Build Process

Once you’ve selected the tables for Search Optimization and estimated the cost, it’s time to start the build process! To control costs, ensure SOS is only built for certain columns on the table using the ON clause in the ALTER TABLE statement. This will also set the specific type of Search Optimization. For example, if you want to search for exact matches against a key column as discussed above, you would use this statement:

ALTER TABLE fact_transactions ADD SEARCH OPTIMIZATION ON equality(uuid_key_column);

Meanwhile, if you want to search for exact IP address matches or substring matches for user submitted content in semi-structured weblogs, you would use²:

ALTER TABLE weblogs ADD SEARCH OPTIMIZATION ON equality(log:network.ip.addr), substring(log:user.submitted);

Note that we can apply SOS to two columns at once!

Finally, leveraging SOS for geospatial searches like finding all the events in a certain area is as simple as:

ALTER TABLE events_geo ADD SEARCH OPTIMIZATION ON geo(event_loc);

Note that the event_loc column must be a GEOGRAPHY column.

Once you’ve enabled SOS and Snowflake has started building the search paths, monitor the progress using SHOW TABLES. The output includes a column labeled search_optimization_progress. Once this displays 100, the search paths are built, and SOS is ready for use!
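For example, using the weblogs table from above (the RESULT_SCAN step simply turns the SHOW output into a queryable result):

SHOW TABLES LIKE 'weblogs';

-- SOS is ready once search_optimization_progress reaches 100.
SELECT "name", "search_optimization", "search_optimization_progress"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));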

Can’t stop now, I’ve traveled so far

Let’s use what we’ve built

Now that SOS is set up, it’s time to leverage it. Write some queries that use your brand new search paths!

There are several types of Search Optimization that you can apply to a column, and they will all trigger on different types of search.

Equality SOS

For equality SOS, Snowflake will automatically look for opportunities to apply the search paths when a query contains equality predicates over a column with the equality search paths built. Examples include column1 = 1234, or IN predicates such as column1 IN (1234, 1235, 1266). There are a couple of important things to understand here. First, the more values in the IN (or OR) clause, the less effective SOS will be, as the wider search will return more micro-partitions. The compiler accounts for this when evaluating whether to leverage SOS, so searches for a large range of values can occasionally drop the use of the feature. It’s also important to keep the search predicates reasonably simple: CASE statements, subqueries and formulas can all prevent Search Optimization from being triggered. Casting can also be fraught: casting the search value is fine, but casting the column is not.
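To make these rules concrete, here is a sketch of predicates that should and should not qualify. The created_at column is a hypothetical addition to the fact table example, assumed to have its own equality search path:

-- Eligible: simple equality and small IN lists over the optimized column.
SELECT * FROM fact_transactions WHERE uuid_key_column = 'a1b2-c3d4';
SELECT * FROM fact_transactions WHERE uuid_key_column IN ('a1b2-c3d4', 'e5f6-a7b8');

-- Eligible: casting the search value is fine (created_at is hypothetical).
SELECT * FROM fact_transactions WHERE created_at = '2023-04-01'::DATE;

-- Not eligible: casting the column itself blocks the search path.
SELECT * FROM fact_transactions WHERE created_at::VARCHAR = '2023-04-01';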

Substring SOS

For substring SOS, Snowflake will automatically look for opportunities to apply the search paths when you use a substring search predicate, such as LIKE, ILIKE, RLIKE, CONTAINS, or REGEXP. This is especially useful because there is no other way to speed up substring searches when the substring isn’t always at the beginning of the string. For example, use an ILIKE pattern to find any mention of fraud in call transcripts: transcript ILIKE '%fraud%', or certain emails in notes: email RLIKE $$.*email=[\w\.]+@snowflake\.com.*$$. One important caveat is that the search term must be at least five characters long; searches for shorter substrings will not use the search path. Similarly, regular expressions must include a string literal that is at least five characters long.

Geo SOS

For GEOGRAPHY columns with Geo SOS, Snowflake will automatically look for opportunities to apply the search paths when a geospatial search predicate, such as ST_INTERSECTS, ST_CONTAINS, ST_WITHIN, ST_DWITHIN, ST_COVERS or ST_COVEREDBY, is used over a GEOGRAPHY column. This is especially useful as it can help improve performance over an inherently multi-dimensional field, which is otherwise difficult to optimize. The primary challenge for geospatial searches is to properly understand the likely distribution of data throughout the table to ensure that SOS is still useful. In some instances, it can be beneficial to also cluster by a geocode of some description, such as Geohash, to keep the responding micro-partitions for a given search relatively tightly packed. A more complete discussion of geospatial query optimization is the topic of an upcoming post on this blog, so make sure to sign up for notifications on new articles!
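As a sketch of a qualifying predicate, using the events_geo table from the build example (event_id and the coordinates are illustrative):

-- Find all events within 1 km of a point of interest (longitude, latitude).
SELECT event_id, event_loc
FROM events_geo
WHERE ST_DWITHIN(event_loc, ST_MAKEPOINT(151.2093, -33.8688), 1000);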

Variants and Semi-Structured Data Types

For variant and other semi-structured column types, Snowflake will look for opportunities to apply the search paths based on whether you specify an equality search path, a substring search path, or both:

  • An equality search path in a variant column will support searches for straightforward equality predicates, such as my_variant:my_field = 'constant'. It will also support far broader equality searches, using functions such as ARRAY_CONTAINS and ARRAYS_OVERLAP, or checking whether a field in the variant even exists with my_variant:my_field IS NOT NULL or IS_NULL_VALUE(my_variant:my_field). Note, though, that IS NULL only supports checking the column itself, not a path, while IS_NULL_VALUE refers to JSON nulls, not SQL nulls. If these differences are important, make sure to double-check the documentation!
  • A substring search path over a variant column performs a similar function to substring SOS over a non-variant field, enabling substring searches on the field itself and on any subfield within the variant. Both kinds are sketched below.
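Continuing the weblogs example from the build section, queries like these should be able to take advantage of the variant search paths (a sketch, not an exhaustive list):

-- Equality search path: exact match on a path inside the variant.
SELECT * FROM weblogs WHERE log:network.ip.addr = '203.0.113.7';

-- Equality search path: existence check on a subfield.
SELECT * FROM weblogs WHERE log:network.ip.addr IS NOT NULL;

-- Substring search path: pattern match on a subfield (the literal is at least five characters).
SELECT * FROM weblogs WHERE log:user.submitted ILIKE '%promo code%';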

Verifying Usage of SOS

To understand whether Search Optimization has been used for a query, look at the query profile: you should see a Search Optimization Access operator in place of the normal TableScan operator. The query may also be much faster!
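If you’d rather verify this programmatically, the GET_QUERY_OPERATOR_STATS table function exposes the same operator tree as the profile. A sketch; look for the search optimization operator in the operator_type column of the output:

-- Inspect the operator tree of the most recent query in this session.
SELECT operator_id, operator_type, operator_statistics
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));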

For more information and details on each item in this section, it’s worth reading through the documentation.

Looks like SOS has finally found me

What sort of benefits can I expect from Search Optimization Service?

The largest benefit of SOS is the ability to massively reduce scans and scanning time. Queries that would previously have been large table scans can frequently be reduced to scanning just a few micro-partitions. Often, this means a significantly smaller warehouse can be used, saving a substantial amount in credit costs.

As an example of the power of the equality search, the TPCH_SF1000 Lineitem table is approximately 160GB and contains 6 billion records. Searching for all the records associated with a single orderkey requires almost a full table scan, scanning 10,314 micro-partitions out of a total of 10,336. This takes 1 minute and 1 second on an XS warehouse. However, with Search Optimization applied, only 7 micro-partitions are scanned — reducing the total time of the same query on the same size warehouse down to 3.7s.
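You can approximate this experiment yourself. The shared sample database is read-only, so copy the table into your own schema before enabling SOS (a sketch; the order key value is arbitrary):

-- Copy the sample table so SOS can be enabled on it.
CREATE TABLE lineitem AS
SELECT * FROM snowflake_sample_data.tpch_sf1000.lineitem;

ALTER TABLE lineitem ADD SEARCH OPTIMIZATION ON equality(l_orderkey);

-- Once the build completes, this point lookup should scan only a handful of micro-partitions.
SELECT * FROM lineitem WHERE l_orderkey = 4509866;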

For an example of a substring search, consider the Wikidata dataset, which you can access as part of the Search Optimization Quickstart linked at the end of this article. In querying this dataset, we might be interested in any item whose description mentions the word ‘shave’. Without Search Optimization, this is (almost) guaranteed to be a full table scan, no matter how many micro-partitions are in the table, because Snowflake is generally unable to prune micro-partitions based on range-based metadata for a substring. (It is, however, possible to prune out micro-partitions where all values are NULL or have a length below 5!) With Search Optimization for substring searches applied, this query scans only 260 of the 5,884 micro-partitions in the table, reducing the time to run the query on an XS warehouse from 24s all the way down to 5.1s.

These effects get larger as your data gets larger. For examples of customers with very large datasets who have used Search Optimization to improve their query times, take a look at this blog post from Snowflake’s Inside The Data Cloud.

(And you know, you just can’t hide)

What do I need to keep in mind while using Search Optimization Service?

Although Search Optimization Service is designed to minimize surprises, there are still some situations where behavior might differ from what you expect. It’s important to keep the following factors in mind to leverage the feature to its full potential.

Constant Operation

The Search Optimization Service takes an asynchronous, serverless approach to maintaining search paths. Inevitably, this means that some queries will arrive after new data has been added to the table but before the search path for that data has been built. Snowflake keeps results complete by scanning the new micro-partitions as if Search Optimization weren’t enabled, while still pruning the older micro-partitions using their search paths. However, this can result in slower query times if a significant number of new micro-partitions have to be scanned.

Substrings of at least five characters

The substring function of Search Optimization Service only works for substring searches of at least five characters. It is not considered for searches where there is no substring literal of five characters or more. Consider the following examples:

-- Example 1
SELECT * FROM table WHERE c1 LIKE '%TEST%';

-- Example 2
SELECT * FROM table WHERE c1 RLIKE '.*tel=[0-9]{3}-?[0-9]{3}-?[0-9]{4}.*';

-- Example 3
SELECT * FROM table WHERE c1 RLIKE $$.*email=[\w\.]+@snowflake\.com.*$$;

-- Example 4
SELECT * FROM table WHERE c1 LIKE '%SEARCH%IS%OPTIMIZED%';

-- Example 5
SELECT * FROM table WHERE c1 RLIKE '.*st=(CA|AZ|NV).*';

Of these, examples 1 and 2 will not utilize SOS, as they do not contain a substring literal of at least five characters. Examples 3 and 4 may utilize SOS, as they contain substrings of five or more characters. Example 5 might appear at first glance to be ineligible, as there is no five-character substring literal. However, Snowflake will automatically expand the alternation into the three potential values st=CA, st=AZ, and st=NV, which are each five characters long and therefore eligible.

Geospatial Queries and Clustering

The nature of many geospatial datasets and queries is that certain areas are much more likely to be common throughout the dataset and to be commonly searched. This is particularly the case with datasets that track human events, which generally happen more frequently near population centers. A dataset naturally clustered by ingestion time will spread events from the most populous areas across many micro-partitions. This causes poor pruning on the search access paths for searches over those areas, leading to negligible benefit from SOS. It is therefore wise to consider combining SOS with a cluster key that minimizes the spread of micro-partitions containing data for a given area. Unfortunately, it isn’t possible to cluster directly on geospatial column types; however, a geocode such as H3 or Geohash can be used instead.
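One possible pattern, sketched below, is to materialize a coarse geohash and cluster on it. The precision value here is an assumption to tune, and populating the column at ingestion time instead of via UPDATE avoids the churn costs discussed in the next section:

-- Materialize a coarse geohash for each event, then cluster on it.
ALTER TABLE events_geo ADD COLUMN geohash_5 VARCHAR;
UPDATE events_geo SET geohash_5 = ST_GEOHASH(event_loc, 5);
ALTER TABLE events_geo CLUSTER BY (geohash_5);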

Search Path Overhead

The process of downloading and scanning the search path has a time cost that is generally recovered by scanning significantly fewer micro-partitions during the table scan. However, if the search path does not prune many micro-partitions, the query can end up slower than it would have been had Snowflake not consulted the search path at all. Snowflake uses several techniques to avoid this, which generally show up as SOS not triggering for queries where you might expect it. The most common reason is a large IN list, where several keys are searched for at once. If this is a common use case, SOS can be combined with Query Acceleration Service to speed up the queries where SOS is judged to be ineffective, reducing the runtime variability of queries that do not utilize SOS.
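Query Acceleration Service is enabled per warehouse. A minimal sketch, with an illustrative warehouse name and scale factor:

-- Allow this warehouse to offload eligible scans to the acceleration service.
ALTER WAREHOUSE search_wh SET
  ENABLE_QUERY_ACCELERATION = TRUE
  QUERY_ACCELERATION_MAX_SCALE_FACTOR = 8;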

Maintenance on tables with high churn

Changes to a table with SOS, including changes triggered by Automatic Clustering, will cause the search paths to require maintenance in proportion to the magnitude and frequency of the changes. As a result, frequent UPDATEs, MERGEs and DELETEs on a table can drive up the maintenance cost. Append-only data models can therefore be significantly more cost-effective than data models that require this kind of DML.

Reducing Cost with SOS

Once SOS usage is embedded in your organization, it’s time to look at whether your warehouse sizing for search use cases is still appropriate. Often, a warehouse that was previously perfectly sized can now be reduced, potentially resulting in significant savings, particularly when the warehouse is under consistent use.

Next track

Search Optimization Service is a great tool to use to reduce query response times and decrease the total cost of your Snowflake search workloads. Now that you’ve learned more about it, you probably want to give it a try! Check out the quickstart, or if you’d like to read more, the Snowflake documentation has a guide on implementing Search Optimization Service.

I want to hear about how you’re using Search Optimization in your Snowflake environment! If you’re using Search Optimization Service, put a comment below to let me know what you’re using it for, and whether you’re seeing it make a difference. If you’re about to try it for the first time, let me know what you think your first SOS use case will be. If you’re not using SOS even though you just read 4,000 words on the subject, let me know in the comments which Snowflake performance subject you’d like me to cover next!

[1]: For more information on how Snowflake charges for Search Optimization, please see the Pricing Guide and Consumption Table.

[2]: Note that as of October 2023, substring predicates on variant/semi-structured data are in Public Preview.
