Snowflake Supports Higher-order Functions
Snowflake’s support for higher-order functions provides a new way to handle semi-structured data. The newly supported TRANSFORM
and FILTER
functions offer a more concise, readable, and efficient approach for performing data manipulation and advanced analysis. In this blog, we will dive deeper into the benefits of these capabilities, explore how they work, and discuss their current limitations.
Why Higher-order Functions?
A higher-order function is a type of function that takes another function (lambda expression) as its input. There is a common need to loop over an array and perform some actions for each value in the array, which is precisely the problemTRANSFORM
and FILTER
address. These higher-order functions come with the following benefits:
Advanced Analytics: By simplifying the iteration over array elements, TRANSFORM
and FILTER
facilitate the implementation of custom logic for data transformation and filtering, streamlining analytical processes. Previously, this type of manipulation could only be done using LATERAL FLATTEN
operations or user defined functions (UDF).
Enhanced Developer Experience: Higher-order functions encapsulate the manipulation logic in lambda expressions, enabling more readable and maintainable SQL statements, and a more declarative and expressive approach of handling data transformations.
Avoiding Unnecessary UDFs: With higher-order functions, there is less need to create, maintain, and manage access to UDFs for ad-hoc array manipulation logic. This reduces overhead and simplifies the data manipulation processes.
How Do Higher-order Functions Work?
Imagine you have a table orders
with columns order_id, order_date, order_detail
, where the order_detail
column is an array of the line items, their purchase quantity and subtotal. The sample data below shows two rows of data:
CREATE OR REPLACE TABLE orders AS
SELECT 1 AS order_id, '2024-01-01' AS order_date, [{'item':'UHD Monitor', 'quantity':3, 'subtotal':1500}, {'item':'Business Printer', 'quantity':1, 'subtotal':1200}] AS order_detail
UNION SELECT 2 AS order_id, '2024-01-02' AS order_date, [{'item':'Laptop', 'quantity':5, 'subtotal':7500}, {'item':'Noise-cancelling Headphones', 'quantity':5, 'subtotal':1000}];
SELECT * FROM orders;
+----------+------------+-----------------------------------------------------------------------------------------------------------------------+
| ORDER_ID | ORDER_DATE | ORDER_DETAIL |
+----------+------------+-----------------------------------------------------------------------------------------------------------------------+
| 1 | 2024-01-01 | [{"item":"UHD Monitor","quantity":3,"subtotal":1500},{"item":"Business Printer","quantity":1,"subtotal":1200}] |
| 2 | 2024-01-02 | [{"item":"Laptop","quantity":5,"subtotal":7500},{"item":"Noise-cancelling Headphones","quantity":5,"subtotal":1000}] |
+----------+------------+-----------------------------------------------------------------------------------------------------------------------+
Now for each order, I want to fetch only the items whose subtotal is at least $1,500. Without the support of higher-order functions, we’ll usually rely on one of the two approaches:
Lateral Flatten:
SELECT order_id, order_date, ARRAY_AGG(value) WITHIN GROUP (ORDER BY seq)
FROM orders o,
LATERAL FLATTEN(input => o.order_detail)
WHERE value:subtotal >= 1500
GROUP BY ALL;
+----------+------------+-------------------------------------------------------+
| ORDER_ID | ORDER_DATE | ORDER_DETAIL |
+----------+------------+-------------------------------------------------------+
| 1 | 2024-01-01 | [{"item":"UHD Monitor","quantity":3,"subtotal":1500}] |
| 2 | 2024-01-02 | [{"item":"Laptop","quantity":5,"subtotal":7500}] |
+----------+------------+-------------------------------------------------------+
UDF:
CREATE OR REPLACE FUNCTION order_filter(ORDER_DETAIL ARRAY, ITEM_SUBTOTAL float)
RETURNS ARRAY
LANGUAGE JAVASCRIPT
AS
$$
return ORDER_DETAIL.filter((ITEM) => ITEM.subtotal >= ITEM_SUBTOTAL);
$$;
SELECT order_id, order_date, order_filter(order_detail, 1500)
FROM orders o;
Higher-order Functions:
Using higher-order functions, this is one simple function call, making the query much easier to read and save all the effort of re-assembling the arrays or having to define named UDFs.
SELECT order_id, order_date, FILTER(o.order_detail, i -> i:subtotal >= 1500)
FROM orders o;
There are, of course, other usages you can creatively apply higher-order functions on. Just to name a few:
-- Insert a new key "unit_price" to each item using the orders table above
SELECT order_id, order_date, TRANSFORM(o.order_detail, i -> OBJECT_INSERT(i, 'unit_price', (i:subtotal / i:quantity)::NUMERIC(10,2)))
FROM orders o;
+----------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| ORDER_ID | ORDER_DATE | ORDER_DETAIL |
+----------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1 | 2024-01-01 | [{"item":"UHD Monitor","quantity":3,"subtotal":1500,"unit_price":500},{"item":"Business Printer","quantity":1,"subtotal":1200,"unit_price":1200}] |
| 2 | 2024-01-02 | [{"item":"Laptop","quantity":5,"subtotal":7500,"unit_price":1500},{"item":"Noise-cancelling Headphones","quantity":5,"subtotal":1000,"unit_price":200}] |
+----------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
-- Delete the key “quantity” from each order
SELECT order_id, order_date, TRANSFORM(o.order_detail, i -> OBJECT_DELETE(i, 'quantity'))
FROM orders o;
+----------+------------+---------------------------------------------------------------------------------------------+
| ORDER_ID | ORDER_DATE | ORDER_DETAIL |
+----------+------------+---------------------------------------------------------------------------------------------+
| 1 | 2024-01-01 | [{"item":"UHD Monitor","subtotal":1500},{"item":"Business Printer","subtotal":1200}] |
| 2 | 2024-01-02 | [{"item":"Laptop","subtotal":7500},{"item":"Noise-cancelling Headphones","subtotal":1000}] |
+----------+------------+---------------------------------------------------------------------------------------------+
-- Find out the item whose purchase subtotal matches the largest purchase subtotal
SELECT order_id, order_date, FILTER(order_detail, i -> i:subtotal = (SELECT MAX(ARRAY_MAX(TRANSFORM(order_detail, i -> i:subtotal))) FROM orders)) AS largest_item
FROM orders
WHERE ARRAY_SIZE(largest_item) > 0;
+----------+------------+---------------------------------------------------+
| ORDER_ID | ORDER_DATE | ORDER_DETAIL |
+----------+------------+---------------------------------------------------+
| 2 | 2024-01-02 | [{"item":"Laptop","quantity":5,"subtotal":7500}] |
+----------+------------+---------------------------------------------------+
Please note that TRANSFORM
and FILTER
support both semi-structured arrays and structured arrays. Our documentation has more details about typing.
Summary
We believe that higher-order functions can be beneficial for any data engineers and data scientists who would like to process semi-structured data. For more details on this feature, please refer to our documentation for TRANSFORM and FILTER.