Lean Analytics with Snowflake Cortex and dbt Cloud — the OpenSpace.ai way

Authors: Ripu Jain (Snowflake), Jonathan Paquette (OpenSpace.ai)

Thank you for stopping by to read this, but let us save you some time and tell you what this blog is NOT about:

  • It’s not a write-up on how to solve Netflix-scale data engineering and analytics problems
  • It’s not a best practices document or a write-up on data architecture patterns

Instead, we hope that this blog conveys that:

  • It’s possible for a lean data team of 3 to serve the analytical needs of a ~200-employee, Series D, ~$1B-valuation company, without the need for expensive data architects or consultants
  • Working with the right tech stack, even if it is perceived as expensive, can bring joy to your work and empower you to focus on business-impacting analyses and decision making, rather than spending time maintaining your own infrastructure.
  • Keeping it simple should be the goal. Be cautious about introducing new tech or features that optimize for a single dimension, such as automation or observability, if you are striving to remain a lean team.

Great! So what is OpenSpace.ai? And what problems is the data team addressing?

OpenSpace.ai is Reality Capture software. It’s most similar to Google Maps Street View; the main difference is that OpenSpace.ai works indoors, and you control where the 360° imagery is captured.

I know what you must be thinking — OpenSpace must generate insane amounts of spatial and geometric data! And while that is true, and OpenSpace does have phenomenal AI and ENG teams working on that, the data team focuses primarily on solving the needs of our business units, which involves:

  • Generating standardized and up-to-date facts and metrics used by the entire company — from execs to the field level
  • Analyzing product usage data and determining ways to improve. What features are popular? What are common usage patterns? Who are the outliers?
  • Delivering sales and marketing analyses, like pipeline and efficiency reporting
  • Quantifying sales enablement — propensity to expand, propensity to churn, gross margin, etc.

So, what are the primary data sources that address the above needs?

  • PostgreSQL production database, hosted on AWS, generating about 15 million rows per month
  • Salesforce for CRM
  • Marketo for Marketing
  • Amplitude for engagement tracking — clickstream, engagement metrics, user segmentation
  • Jira for internal issue tracking
  • Sage Intacct for financial data

For context, it’s good to mention the constraints the data team works under:

  • The data team comprises 3 individuals, without deep backgrounds in core data engineering, cloud computing, or big data technologies
  • The team wants to spend minimal time maintaining, integrating, and upgrading the data stack. We want the tech to just work, disappear behind the scenes, and be an enabler and a decision-making accelerator for our company

At a high level, this is what the OpenSpace data landscape looks like:

With this context in mind…

Enter Snowflake — the analytics accelerator

Stealing a quote from Colin, the CEO of Omni — [Snowflake makes] life easy for data people.

Snowflake’s ease of setup, use, and administration let us address our business needs and goals right away — all we needed was our data to work with. Snowflake became our organization’s trusted, governed, and secure source of truth.

Here are some of the specific Snowflake features that have tremendously helped our team:

  1. SELECT * EXCLUDE
    This one is small but mighty, and it shows up in almost every single query and dbt model that we write. We have embraced denormalization, which means that we generally include all columns from upstream models in their downstream dependencies. This has massively improved our ability to generate analytics fast, and it would not be possible without SELECT * EXCLUDE.
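    As a minimal sketch (the excluded column names here are hypothetical — typical loader audit columns), pulling every column from a wide upstream model except the ones you don’t want looks like:

```sql
-- Carry every column from the upstream staging model downstream,
-- minus a couple of internal audit columns (hypothetical names).
select
    * exclude (_fivetran_synced, _loaded_at)
from {{ ref("stg_salesforce__account") }}
```

    Without EXCLUDE, you would have to type out every surviving column by hand — and keep that list in sync every time the upstream model changes.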
  2. Recursive CTE:
    As a B2B business, our customers come in all shapes and sizes. Some of them can be defined by one record in a table, and others have complex hierarchical structures. The traditional SQL for unraveling complex hierarchies is notoriously complicated, but Snowflake lets us do it with a single recursive CTE.
    Here’s how we use it:
with
accounts as (
    select
        account_id,
        account_parent_id
    from {{ ref("stg_salesforce__account") }}
),

accounts_cte as (
    -- Anchor: top-level accounts with no parent
    select
        account_id,
        account_parent_id,
        account_id as root_account_id,
        1 as level
    from accounts
    where account_parent_id is null

    union all

    -- Recursive step: walk down the hierarchy one level at a time
    select
        a.account_id,
        a.account_parent_id,
        cte.root_account_id as root_account_id,
        cte.level + 1 as level
    from accounts as a
    inner join accounts_cte as cte on a.account_parent_id = cte.account_id
),

final as (
    select
        account_id as sfdc_account_id,
        account_parent_id as sfdc_parent_account_id,
        root_account_id as sfdc_root_account_id,
        max(level) over (partition by root_account_id) - level + 1 as sfdc_account_hierarchy_level,
        max(level) over (partition by root_account_id) as sfdc_root_account_hierarchy_level
    from accounts_cte
)

select * from final

And here’s the output. You can see how we can immediately determine the parent and root accounts for any child account — super helpful!

3. Lateral flatten:
JSON and arrays are great for some things, but they aren’t ideal for analytics. Before Snowflake, the SQL to extract information from them was a bit of a nightmare. Snowflake’s lateral flatten join gave us a simple, scalable way to manage JSON, and it has allowed us to move faster.
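As a sketch (the table and column names here are hypothetical), unpacking a JSON array column into one row per element looks like:

```sql
-- Explode a VARIANT array column into one row per element
-- (hypothetical events table with a feature_flags JSON array).
select
    e.event_id,
    f.value::varchar as feature_flag
from events as e,
    lateral flatten(input => e.feature_flags) as f
```

Each array element comes back as its own row with a typed value, so the result drops straight into ordinary group-bys and joins.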

4. Snowflake Cortex:
Snowflake Cortex, now in Public Preview, gives us instant access to industry-leading large language models (LLMs) like Mistral and Llama 2.

We just deployed our first Cortex-powered dbt model! (And coincidentally, as of this writing, dbt Labs just published their own use case of using Cortex in an analytics workflow.)

Our first use case is to generate a summary report of trends in Customer Support tickets. This allows us to keep a finger on the pulse of our customers with very little effort. Reading 500 cases is very hard for a human, but very easy for an AI. Reading the AI summary is a great way to quickly understand high-level trends in customer complaints.

{{ config(materialized='table') }}

with
support_cases as (
    select * from {{ ref('sfdc_case_lookup') }}
),

base as (
    select
        row_number() over (order by SFDC_CASE_CLOSED_DATE desc) as row_number,
        -- Bin cases into groups of ~100 (0, 100, 200, ...) for batched summarization
        truncate(row_number, -2)::varchar as bin,
        coalesce(SFDC_CASE_ORIGIN, 'NULL') as CASE_ORIGIN,
        coalesce(SFDC_CASE_SUBJECT, 'NULL') as CASE_SUBJECT,
        coalesce(SFDC_CASE_SUPPORT_COMPONENT, 'NULL') as CASE_SUPPORT_COMPONENT,
        coalesce(left(SFDC_CASE_DESCRIPTION, 3000), 'NULL') as TRUNCATED_CASE_DESCRIPTION,
        coalesce(left(SFDC_CASE_CLOSURE_COMMENTS, 500), 'NULL') as TRUNCATED_CASE_CLOSURE_COMMENTS,
        concat(
            'Read the following customer support case and generate a clear, concise summary in 50 words that highlights the main issue and resolution (if applicable). ',
            '<case_origin>', CASE_ORIGIN, '</case_origin>',
            '<case_subject>', CASE_SUBJECT, '</case_subject>',
            '<case_support_component>', CASE_SUPPORT_COMPONENT, '</case_support_component>',
            '<truncated_case_description>', TRUNCATED_CASE_DESCRIPTION, '</truncated_case_description>',
            '<case_closure_comments>', TRUNCATED_CASE_CLOSURE_COMMENTS, '</case_closure_comments>'
        ) as CORTEX_PROMPT,
        snowflake.cortex.complete(
            'mixtral-8x7b',
            CORTEX_PROMPT
        ) as cortex_summary
    from support_cases
    where SFDC_CASE_CLOSED_DATE is not null
        and CASE_SUPPORT_COMPONENT not in ('Spam/Other Dept')
    order by SFDC_CASE_CLOSED_DATE desc
    limit 499
),

array_of_case_summaries as (
    select
        bin,
        array_agg(cortex_summary) as array_of_case_summaries
    from base
    group by 1
),

summary_of_summaries as (
    select
        bin,
        snowflake.cortex.complete(
            'mixtral-8x7b',
            concat(
                'Below are summaries of 100 customer support cases. Please read and provide a comprehensive summary that captures the prevalent issues and notable patterns observed in these cases. ',
                '<array_of_case_summaries>', array_of_case_summaries::varchar, '</array_of_case_summaries>'
            )
        ) as summary_of_summaries
    from array_of_case_summaries
),

array_of_summary_of_summaries as (
    select
        array_agg(summary_of_summaries) as array_of_summary_of_summaries
    from summary_of_summaries
),

summary_of_summary_of_summaries as (
    select
        snowflake.cortex.complete(
            'mixtral-8x7b',
            concat(
                'Here is a collection of summaries that represent overarching themes and patterns from 500 customer support cases. Based on these, generate a final, detailed summary that distills the core insights and prevalent customer issues. ',
                '<array_of_summary_of_summaries>', array_of_summary_of_summaries::varchar, '</array_of_summary_of_summaries>'
            )
        ) as summary_of_summary_of_summaries
    from array_of_summary_of_summaries
)

select
    base.*,
    array_of_case_summaries.* exclude bin,
    summary_of_summaries.* exclude bin,
    array_of_summary_of_summaries.*,
    summary_of_summary_of_summaries.*
from base
left join array_of_case_summaries on base.bin = array_of_case_summaries.bin
left join summary_of_summaries on base.bin = summary_of_summaries.bin
-- these two CTEs return a single row, so attach them to every row
cross join array_of_summary_of_summaries
cross join summary_of_summary_of_summaries

Report Output:

dbt Cloud — the scale multiplier

As mentioned in the beginning of the post, our lean team did not have extensive experience with traditional software development practices. We were used to writing SQL in a browser tab and Python in a Jupyter notebook (usually a Jupyter notebook that was also in a browser tab).

dbt Cloud allowed us to get started, and we have continued to use it because of its ease of use and seamless job orchestration. Their best practices documentation is an amazing set of guides that got us started the right way. Since we started using dbt last year, our codebase (~450 dbt models) has grown to prepare data for every corner of the business. You can see the contribution history in our GitHub here:

… in conclusion…

We hope this write-up gives you some confidence in our philosophy: the right data stack should disappear into the background and empower the data team to focus on solving problems that help the business grow. For us, the combination of Fivetran, Snowflake, and dbt Cloud is that right stack, allowing our team of 3 to serve an entire company’s analytics and decision-making needs.
