Faster Snowflake UDFs and policies with ‘memoizable’

Let’s find out what the new `memoizable’ SQL UDFs are, their effects, and limitations. Then let’s put them to work with an example that shows how to improve the performance of complex security policies.

Image generated by AI

The effects of `memoizable`

Memoizable UDFs behave like smart session variables that remember the results of the UDF even beyond a session. These UDFs will only recompute the results if the underlying data changes, leading to improved performance as results are not computed repeatedly.

For example, consider the following two UDFs:

create or replace function whale_customers()
returns array
as $$
select array_agg(o_custkey)
from (
select o_custkey, sum(o_totalprice) total
from snowflake_sample_data.tpch_sf100.orders
group by 1
having total>6900000
)
$$;

create or replace function whale_customers()
returns array
memoizable
as $$
select array_agg(o_custkey)
from (
select o_custkey, sum(o_totalprice) total
from snowflake_sample_data.tpch_sf100.orders
group by 1
having total>6900000
)
$$;

The two UDFs are identical, except for the keyword “memoizable” in the 2nd definition. The 1st UDF returns an array of integers identifying customers who have spent more than $6.9 million, while the 2nd UDF returns an array of strings identifying the same set of customers. The 1st UDF typically takes 3.2 seconds to run (on a small warehouse), while the 2nd UDF takes only 0.2 seconds, except for the first time it’s run (or when the cache expires).

The difference in performance can be significant, especially if you plan to use the same results multiple times.

Limitations

Note thatmemoizable UDFs are currently in public preview, and they come with certain limitations:

  • It only works on SQL UDFs.
  • It only works for UDFs without parameters.
  • It only works for UDFs that return scalars or arrays.
  • When returning arrays, the UDF returns an array of strings (Public Preview current behavior).
  • Each memoizable function has a 10 KB limit for its results. It’s a good fit for computations that are expensive to run, but produce small results.

Check out the docs for more details:

Note

There are also immutable UDF — which have similar performance improvements, but not quite the same: Theimmutable performance improvements only work within one query, but immutable UDFs can accept parameters and support more languages than just SQL UDFs.

Memoizable in action: Efficient Governance and Policies

Let’s consider a simple scenario to demonstrate how memoizable UDFs can be useful. We’ll imagine an account with two roles — accountadmin and sysadmin. The accountadmin can see everything, but we want the sysadmin to only see data from our “whale” customers. We can easily enforce this restriction using a row-level access policy:

use role accountadmin;

create or replace row access policy customer_policy_row
as (c_custkey integer)
returns boolean ->
true
;

alter row access policy customer_policy_row set body ->
'ACCOUNTADMIN' = current_role()
or exists (
select o_custkey, sum(o_totalprice) total
from snowflake_sample_data.tpch_sf100.orders
where c_custkey = o_custkey
group by 1
having total>6900000
)
;

Let’s apply this policy to a table (here a copy of the sample TPCH customer table):

-- create or replace table customer
-- as
-- select *
-- from snowflake_sample_data.tpch_sf10.customer
-- order by c_custkey;

-- grant select on table customer to role sysadmin;

alter table customer
add row access policy customer_policy_row on (c_custkey)
;

That was easy, and a good time to note that:

  • Row-level, column-level, and data-masking policies are easy to define, test, and maintain in Snowflake. You can write them once, apply them to many tables, and keep updating them in a single place. Compare this with your other favorite tools, if you’d like to see the difference.
  • Row-level, column-level, and data-masking policies are powerful and flexible in Snowflake. Here we are defining a lookup that depends on the results of another table.
  • Row-level, column-level, and data-masking policies require Snowflake enterprise edition and above account. If you are playing with these examples, make sure you’re not a standard edition account.
  • You shouldn’t be playing around with ACCOUNTADMIN. Try these examples only in a playground enterprise edition account, or talk to your account admin to enable good policies and delegation to the securityadmin.

Let’s try our fresh row-level policy:

use role accountadmin;
select count(*), current_time()
from customer;
-- 1,500,000, 0.1s

use role sysadmin;
select count(*), current_time()
from customer;
-- 4, 3.4s
The sysadmin role can only see 4 rows now, given our security policy
  • An accountadmin can see the 1.5m rows, and the query returns in 0.1s.
  • This is good — the row level policy returns as soon as it knows that anaccountadmin is querying, so it doesn’t need to compute the slow lookup.
  • A sysadmin can only see 4 rows, and the query returns in 3.4s.
  • This is good, because the row-level policy is working — but we gave it a policy designed to be slow. 3.4s is pretty slow compared to 0.1s.
  • The question then becomes how to get the row-level policy framework to act as fast as it normally is — while having it handle a complex lookup (like the one I designed here on purpose, that adds >3s when applied).
Image generated by AI

Making it fast with memoizable

Let’s redefine our row-level policy, so it can leverage a memoizable UDF. First let’s create a memoizable UDF like the one we had above:

use role accountadmin;
create or replace function whale_customers()
returns array
memoizable
as $$
select array_agg(o_custkey)
from (
select o_custkey, sum(o_totalprice) total
from snowflake_sample_data.tpch_sf100.orders
group by 1
having total>6900000
)
$$;

Which allows us to simplify our row-level policy:

alter row access policy customer_policy_row set body -> 
'ACCOUNTADMIN' = current_role()
or array_contains(c_custkey::string::variant, whale_customers())
;

-- try
-- or array_contains(c_custkey::string::variant, whale_customers())

Let’s check the querying performance again:

use role accountadmin;
select count(*), current_time()
from customer;
-- 1,500,000, 0.1s

use role sysadmin;
select count(*), current_time()
from customer;
-- 4, 0.1s

See? Now the results are always fast. We get the desired number of rows for each role, but it only takes 0.1s to get them — for both roles. Except for the first time we try this, as the UDF will be computed once until the underlying data changes or the cache expires.

Tip: A better syntax for testing policies

Snowflake also supports execute using policy_context() that you can use to script test for policies:

execute using policy_context(current_role => 'PUBLIC') 
as
select * from empl_info;

Caching notes

I added current_time() to the benchmarking queries to make sure the result cache is not being used. You might be tempted to alter session set use_cached_result = false; instead, but that will also disable the memoizable cache — rendering these tests moot.

Next steps

  • Play with memoizable, and share your results with us.
  • Note how with memoizable you don’t need to worry about consistency or cache invalidation. If the underlying data changes, Snowflake automatically recomputes the function the next time it is used.
  • Try this with column-level security and data masking. Snowflake offers a unified design for all these features.
  • If you are evaluating Snowflake against other alternatives, try to see if they have this kind of flexibility, power, and performance that can satisfy your governability requirements.
Image generated by AI

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.

--

--

Felipe Hoffa
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.