Optimize Your Hierarchical Data with Snowflake, Part Three

Saša Mitrović
6 min readFeb 15, 2022

Using Snowflake to Maximize the Value of Data in Hierarchies

The Snowflake Data Cloud offers virtually limitless storage and compute capacity. Combined with native support for extremely performant storage and processing of hierarchical data structures, these features enable rapid and reliable root cause and impact analysis across complex data models and processes.

This is Part Three of a three-part blog post that includes the following:

Example: Cost Center Hierarchy

This blog post shows an example of how Snowflake enables you to analyze cost center data. The example focuses on the data for Logistics in the cost center hierarchy shown in the following figure:

A cost center hierarchy, we’ll focus on S02000 — Logistics

For this example, we’d like to find out the details and the sum of all logistics costs. The data operation has two steps:

  1. Recursively iterate through the hierarchy to discover all cost centers under a specific ancestor.
  2. Join the discovered cost centers with cost information.

Let’s tackle both of these in Snowflake.

Recursive queries

This is a slightly simplified representation of how the source system models and stores the hierarchy data:

Relational representation of the cost center hierarchy

This model is good for storing and displaying the hierarchy, but legacy systems are not designed to efficiently query this type of nested data. With Snowflake’s recursive common table expressions (CTE), however, we can query this data as shown here:

WITH RECURSIVE cc AS
(
-— anchor clause (we’re starting with parent we’re interested in, using the id to identify it)
SELECT id, name
FROM staging.costcenter_raw
WHERE id = ‘S02000’
UNION ALL
-— the recursive clause will traverse all the branches leading down from the parent specified in the anchor clauseSELECT costcenter_raw.id, costcenter_raw.name
FROM staging.costcenter_raw, cc
WHERE staging.costcenter_raw.parentid = cc.id
)
SELECT id, name
FROM cc;

This recursive query traverses all existing paths between the “Logistics” cost center as defined in the anchor clause (in this case where id = ‘S02000’), down to each single leaf (a cost center that is not a parent to any other cost centers).

With this query we effectively find all children of “Logistics”, regardless of how deep these children cost centers are nested. This is what the result looks like:

Table with all children cost center of “Logistics”

For more information on using recursive CTEs, see the Snowflake documentation.

Joining with cost information

Finally, we can join this result to find all logistics costs:

WITH RECURSIVE cc AS
(
-- anchor clause (we’re starting with parent we’re interested in, using the id to identify it)
SELECT id, name
FROM staging.costcenter_raw
WHERE id = 'S02000'
UNION ALL-- the recursive clause will traverse all the branches leading down from the parent specified in the anchor clauseSELECT costcenter_raw.id, costcenter_raw.name
FROM staging.costcenter_raw, cc
WHERE staging.costcenter_raw.parentid = cc.id
)
SELECT final.cost.*
FROM cc, final.cost -- joining with a "cost" table in schema "final"
WHERE final.cost.costcenterid= cc.id;

Example: Traversing Millions of Hierarchy Paths

With a large-sized warehouse, Snowflake can process 400 million hierarchy paths in 60 seconds.

That’s incredibly fast! However, in a global company which has more than one-hundred thousand cost centers and many financial controllers analyzing this data, the costs of running these recursive queries each time could still be high.

Snowflake’s first-class support for semi-structured data, such as JSON, makes this analysis even faster and more cost-effective. It is able to use semi-structured data, for instance, to join with structured data in columns.

This example modifies the recursive query above to capture additional data:

WITH RECURSIVE create_paths AS
(
—- anchor (top-most parent, we’re using the id to identify it)
SELECT ‘“‘ || parentid || ‘“‘ AS parents, id, name
FROM staging.costcenter_raw
WHERE id = ‘S01000’
UNION ALL—- recursive clause will keep concatenating parent ids from the temp view “create_paths”SELECT concat( ifnull(create_paths.parents || ‘,’,’’) , ‘“‘ || costcenter_raw.parentid || ‘“‘), costcenter_raw.id, costcenter_raw.name
FROM staging.costcenter_raw, create_paths
WHERE staging.costcenter_raw.parentid = create_paths.id
)
SELECT id, name, parse_json( ‘[‘ || ifnull(parents || ‘,’,’’) || ‘“‘ || id || ‘“]’) AS hierarchy
FROM create_paths;

As it traverses the paths, the recursive clause will build a string containing all the ancestors of each leaf. The final select will convert that string containing ancestors separated by commas into a JSON array.

In the final step, we can wrap this query with:

CREATE OR REPLACE TABLE final.costcenter as (

)

So the final query looks like this:

create or replace table final.costcenter as (
WITH RECURSIVE create_paths AS
(
-— anchor (top-most parent, we’re using the id to identify it)
SELECT ‘“‘ || parentid || ‘“‘ as parents, id, name
FROM staging.costcenter_raw
WHERE id = ‘S01000’
UNION ALL-- recursive clause will keep concatenating parent ids from the temp view “create_paths”
SELECT CONCAT( IFNULL(create_paths.parents || ‘,’,’’) , ‘“‘ || costcenter_raw.parentid || ‘“‘), costcenter_raw.id, costcenter_raw.name
FROM staging.costcenter_raw, create_paths
WHERE staging.costcenter_raw.parentid = create_paths.id
)
SELECT id, name, parse_json( ‘[‘ || ifnull(parents || ‘,’,’’) || ‘“‘ || id || ‘“]’) AS hierarchy
FROM create_paths
);

With this query, we calculated and materialized the paths for each cost center, starting from the top parent. This enables us to easily find out all cost centers under a given ancestor, regardless of how many levels (or tiers) exist, using an extremely time- and cost-efficient query in Snowflake.

This is now the content (partial) of our cost center table:

Table with materialized paths for each node in the hierarchy

We can now use this simple query to show costs associated with logistics:

SELECT final.cost.*
FROM final.costcenter,
final.cost,
LATERAL FLATTEN(input => hierarchy)
WHERE value =’S02000'
AND final.cost.costcenterid = final.costcenter.id;

This query uses Snowflake’s function flatten() to produce an inline view that contains a row for each JSON array member. The lateral modifier enables us to use that view in our join or where predicates.

In this case, we’re looking for any cost center that has value S02000 in the JSON array. Any cost center that meets this criteria is a descendant of S02000 or S02000 itself.

Finally, we’re joining these cost centers with the cost table to get a list of all costs that belong to the hierarchy of S02000 — “Logistics”. It’s hard to believe that this simple query is actually querying such a complex data structure as a hierarchy.

Conclusion

Hierarchies are everywhere — in the nature, in human societies, and in our data. Understanding how to analyze the data in a hierarchy can be critical to a company’s success. With the built-in power of Snowflake’s recursive function and its optimization of semi-structured JSON, you can analyze hierarchies and improve your company’s analytic reporting.

Wait, there’s more!

In the next post I’ll show how we can use BI tools, such as Tableau, to work with hierarchies stored in Snowflake. Stay tuned!

Views and opinions expressed in this article are my own and do not represent that of my place of work. I expressly disclaim any liability or loss incurred by any person who acts on the information, ideas or strategies discussed in my stories on Medium.com. While I make every effort to ensure that the information I’m sharing is accurate, I welcome any comments, suggestions, or correction of errors.

--

--