Unlocking Efficiency with Semi-structured Data: Leveraging Snowflake’s Higher-Order Functions
In today's digital landscape, data holds the key to business success. From a simple analytics widget displaying quarterly performance in a CRM application to cutting-edge Gen AI-based applications developed at a pace unheard of a decade ago, businesses globally store and depend more than ever on the wealth of insights in the data they possess, whether structured, semi-structured, or unstructured. Semi-structured data in particular is less rigid than relational data and offers flexibility. That flexibility allows organizational properties such as semantic tags and metadata to be embedded alongside the data, making it easier to search by hierarchy and category. Formats such as JSON, Avro, and XML are common examples of semi-structured data.
Snowflake’s semi-structured data support has unlocked countless possibilities for users. This post delves into how higher-order functions can empower Snowflake users to perform advanced data transformations and analysis easily, providing a powerful tool for working with semi-structured data in the cloud data platform.
So what is a higher-order function? In general mathematics and computer science (paraphrasing Wikipedia), higher-order functions either take other functions as input or return other functions as output (or both). Higher-order functions are also known as functionals, and programming with them can therefore be called functional programming.
In Snowflake, a higher-order function accepts an array, dictates how the array is processed, and specifies the outcome of the computation. It defers the task of processing each item in the array to a lambda function.
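For instance, here is a minimal sketch (my own illustration, assuming an account where higher-order functions are available) in which TRANSFORM applies a lambda to every element of an array:

-- Apply the lambda x -> x * 10 to each element of the array.
-- Expected result: [10, 20, 30]
SELECT TRANSFORM([1, 2, 3], x -> x * 10) AS scaled;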
Snowflake supports storing arrays and data structures like dictionaries or maps within a single VARIANT column, offering flexibility in data representation. JSON (JavaScript Object Notation) in particular serves as a pivotal format across domains, offering a lightweight, readable medium for data exchange. Widely employed in web development, APIs, and configuration files, JSON facilitates seamless transmission and storage of structured data. Its versatility extends to serialization, logging, and configuration management, making it a ubiquitous choice for data interchange in modern applications and systems.
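As a quick illustration (my own snippet, not from a specific pipeline), a single VARIANT value can hold a whole JSON document, and nested fields are reached with path notation and a cast:

-- Parse a JSON document and read a nested field with path notation.
-- Expected result: 21.5
SELECT PARSE_JSON('{"sensor": {"id": "s1", "temperature": 21.5}}'):sensor.temperature::FLOAT AS temperature;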
In a typical data engineering process, we gather sensor data and store it in a primary table. We first run data cleansing procedures to guarantee accuracy and uniformity. Next, we apply a set of business rules to enrich the data, making it ready for additional processing and analysis. This enhanced data can then be integrated seamlessly with other data sets and transferred to the presentation layer, where it powers dynamic dashboards, reports, and other applications that turn it into actionable insights.
In the integration layer, complex transformational logic is added using a variety of techniques. We’ll take a look at how this can be achieved with higher-order functions on semi-structured data. Plus, we will highlight the clear contrast between the conventional approach and the new streamlined function offered by Snowflake.
Problem Statement: Let’s take a hypothetical manufacturing customer who wants to monitor and transform parameters like temperature, humidity, and pressure streaming from IoT sensors. These sensors collect data, creating a continuous stream of JSON-formatted readings.
Goal: Detect irregularities in the sensor data, such as sudden temperature changes or pressure variations, which could signal potential issues in the manufacturing process.
For the above problem, one can take advantage of Snowflake's higher-order functions, which allow the use of anonymous functions to filter and modify both semi-structured and structured data. Each is invoked with a lambda expression, an anonymous function that operates on every element of the specified array.
The higher-order functions available include:
- TRANSFORM
- FILTER
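Both take an array as the first argument and a lambda as the second. Here is a minimal sketch of FILTER (my own example), which keeps only the elements for which the predicate returns TRUE:

-- Keep only elements greater than 2.
-- Expected result: [3, 4]
SELECT FILTER([1, 2, 3, 4], x -> x > 2) AS filtered;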
TRANSFORM in Action:
Step 1 — Imagine you have a table called sensor_readings that stores readings from different sensors in a factory as JSON documents in a single VARIANT column. I want to query this data and detect anomalies in it.
-- Create a table to store sensor readings
CREATE TABLE sensor_readings (
sensor_readings VARIANT
);
-- Insert the JSON data into the table
INSERT INTO sensor_readings (sensor_readings)
SELECT PARSE_JSON('{
    "readings": [
        {"device_id": "sensor_001", "temperature": 25.3, "humidity": 60.1, "pressure": 1013.2},
        {"device_id": "sensor_002", "temperature": 28.7, "humidity": 65.8, "pressure": 1009.5},
        {"device_id": "sensor_003", "temperature": 32.1, "humidity": 70.5, "pressure": 1005.8}
    ]
}');
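An optional sanity check (my own addition) confirms the document landed as expected, with three objects under the readings key:

-- Inspect the stored document.
SELECT
    ARRAY_SIZE(sensor_readings:readings) AS reading_count,         -- 3
    sensor_readings:readings[0]:device_id::STRING AS first_device  -- sensor_001
FROM sensor_readings;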
Step 2 — Now I would like to check whether there are any anomalies in the sensor readings. With LATERAL FLATTEN, I have to use a query like the one below to get the output. As you can see, this takes real effort to get the data types and path notation right, and it is a complicated routine to repeat for every new schema:
-- Thresholds for each sensor parameter
WITH thresholds AS (
SELECT 27.0 AS temperature_threshold,
65.0 AS humidity_threshold,
1010.0 AS pressure_threshold
)
-- Query to flag anomalies in sensor readings using LATERAL FLATTEN
SELECT
readings.value:device_id::STRING AS device_id,
readings.value:temperature::FLOAT AS temperature,
readings.value:humidity::FLOAT AS humidity,
readings.value:pressure::FLOAT AS pressure,
CASE WHEN readings.value:temperature::FLOAT > (SELECT temperature_threshold FROM thresholds) THEN 'High' ELSE 'Normal' END AS temperature_anomaly,
CASE WHEN readings.value:humidity::FLOAT > (SELECT humidity_threshold FROM thresholds) THEN 'High' ELSE 'Normal' END AS humidity_anomaly,
CASE WHEN readings.value:pressure::FLOAT < (SELECT pressure_threshold FROM thresholds) THEN 'Low' ELSE 'Normal' END AS pressure_anomaly
FROM
sensor_readings,
LATERAL FLATTEN(input => sensor_readings.sensor_readings:readings) AS readings;
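For the three sample readings and the thresholds above, the query should return rows along these lines (a hand-derived sketch of the output):

-- DEVICE_ID   TEMPERATURE  HUMIDITY  PRESSURE  TEMPERATURE_ANOMALY  HUMIDITY_ANOMALY  PRESSURE_ANOMALY
-- sensor_001  25.3         60.1      1013.2    Normal               Normal            Normal
-- sensor_002  28.7         65.8      1009.5    High                 High              Low
-- sensor_003  32.1         70.5      1005.8    High                 High              Low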
With the TRANSFORM higher-order function, there are no longer any hassles around data types and notation. See the magic below:
--Thresholds for each sensor parameter
WITH thresholds AS (
SELECT 27.0 AS temperature_threshold,
65.0 AS humidity_threshold,
1010.0 AS pressure_threshold
)
-- Flag anomalies in sensor readings using TRANSFORM
SELECT TRANSFORM(
sensor_readings.sensor_readings:readings,
reading -> OBJECT_INSERT(
OBJECT_INSERT(
OBJECT_INSERT(
reading,
'temperature_anomaly',
CASE WHEN reading:temperature::FLOAT > (SELECT temperature_threshold FROM thresholds) THEN 'High' ELSE 'Normal' END
),
'humidity_anomaly',
CASE WHEN reading:humidity::FLOAT > (SELECT humidity_threshold FROM thresholds) THEN 'High' ELSE 'Normal' END
),
'pressure_anomaly',
CASE WHEN reading:pressure::FLOAT < (SELECT pressure_threshold FROM thresholds) THEN 'Low' ELSE 'Normal' END
)
) AS flagged_readings
FROM sensor_readings;
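Each element of flagged_readings is the original reading object with three anomaly keys appended. For the sample data, the first element should look roughly like this (Snowflake typically displays object keys alphabetically):

-- {
--   "device_id": "sensor_001",
--   "humidity": 60.1,
--   "humidity_anomaly": "Normal",
--   "pressure": 1013.2,
--   "pressure_anomaly": "Normal",
--   "temperature": 25.3,
--   "temperature_anomaly": "Normal"
-- }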
By providing a higher level of abstraction, the TRANSFORM function is easier to read, understand, and maintain than LATERAL FLATTEN operations, which involve more manual steps.
FILTER in Action:
Let us leverage the same sensor readings table that contains readings from different sensors in a manufacturing facility.
Step 1 — I would like to filter sensor readings that cross a certain threshold and consider only "good" reading values for training a machine learning model. For this I can leverage the FILTER function to sieve out array elements greater than a given value. FILTER can also be used for data cleansing operations such as removing NULL values, as sketched after Step 2 below.
SELECT
FILTER(sensor_readings.sensor_readings:readings, reading -> reading:temperature::FLOAT > 30.0) AS abnormal_temperature_readings
FROM
sensor_readings;
Step 2 — The FILTER function has fetched the abnormal readings; inverting the predicate just as easily yields the good data for downstream tasks.
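A sketch of that inversion (my own variant): flip the predicate to keep readings at or below the threshold, and use IS_NULL_VALUE to drop entries whose temperature is a JSON null while we are at it:

-- Keep only "good" readings: non-null temperatures at or below 30.0.
-- Elements where the predicate is not TRUE (including NULLs) are dropped.
SELECT
    FILTER(
        sensor_readings.sensor_readings:readings,
        reading -> NOT IS_NULL_VALUE(reading:temperature)
                   AND reading:temperature::FLOAT <= 30.0
    ) AS good_temperature_readings
FROM
    sensor_readings;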
As seen in the two cases above, implementing custom logic for data transformation and filtering is easier than before: the lambda expression defines the logic once, and it is applied to each array element.
Summary
In our ever-evolving world, it is vital for manufacturers and all industries to optimize efficiency and uphold data quality, and to do so in a consistent and modular way. With the rise of IoT devices, manufacturers can collect massive amounts of sensor data, giving a detailed view into production processes. But making sense of this data quickly can be a tough task.
As a result, the need for sophisticated and robust techniques for managing this kind of data has become more critical than ever. In this blog post, we have delved into Snowflake's higher-order functions for JSON transformation and contrasted them with traditional methods such as the FLATTEN function. TRANSFORM and FILTER support both semi-structured and structured arrays.
Keep innovating!