<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Jonathan Talmi on Medium]]></title>
        <description><![CDATA[Stories by Jonathan Talmi on Medium]]></description>
        <link>https://medium.com/@jtalmi?source=rss-614576d16221------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*fmAvUcBN3b1hU3ws</url>
            <title>Stories by Jonathan Talmi on Medium</title>
            <link>https://medium.com/@jtalmi?source=rss-614576d16221------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 25 May 2026 22:39:30 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@jtalmi/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[dbt at Super Part 3: Observability]]></title>
            <link>https://medium.com/super/dbt-at-super-part-3-observability-c8755109901f?source=rss-614576d16221------2</link>
            <guid isPermaLink="false">https://medium.com/p/c8755109901f</guid>
            <category><![CDATA[data-observability]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[analytics-engineering]]></category>
            <category><![CDATA[dbt]]></category>
            <dc:creator><![CDATA[Jonathan Talmi]]></dc:creator>
            <pubDate>Tue, 31 Jan 2023 15:45:42 GMT</pubDate>
            <atom:updated>2023-02-01T23:12:22.189Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*e3hsBYKlvL78qZTG.jpeg" /></figure><p>In previous posts, we wrote about how we <a href="https://medium.com/super/dbt-at-snapcommerce-part-1-orchestration-964c9a87b072">orchestrate</a> and run <a href="https://medium.com/super/dbt-at-snapcommerce-part-2-continuous-integration-260d4e782eba">continuous integration</a> pipelines on our dbt project at <a href="http://super.com">Super</a>. In our post about orchestration, we explained how the needs of our business required us to invoke dbt in response to specific triggers or events, often within complex data pipelines, and on a schedule. In our post on CI, we laid out the checks and balances in our dbt project repo that ensure all new models are thoroughly tested, documented, and conform to our style guide and data model.</p><p>With Coalesce 2022 happening in October 2022, I want to highlight a few projects and talks that touch on these two subjects:</p><ul><li>Dylan Hughes from Prefect gave a <a href="https://www.youtube.com/watch?v=VLHZQ5NI360">talk</a> about how to set up event-driven orchestration for dbt using Prefect</li><li>Kshitij Aranke from Vouch (now dbt Labs) gave a <a href="https://www.youtube.com/watch?v=FC0DuGY1DvM&amp;list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&amp;index=14">talk</a> on Data Change Management, which covered topics like Slim CI, linting, blue/green deployment</li><li>The Montreal Analytics team took ownership of the <a href="https://github.com/Montreal-Analytics/dbt-gloss">dbt-gloss</a> repo, which has a plethora of pre-commit hooks, like test and documentation coverage</li></ul><p>Today’s post is the spiritual sequel to the <a href="https://www.youtube.com/watch?v=LNY0K6mSEEI&amp;t=4s">Coalesce 2021 talk</a> by Kevin Chan and Jonathan Talmi, which has a a companion Discourse post <a href="https://discourse.getdbt.com/t/how-to-add-observability-to-your-dbt-deployment/3451">here</a>. In that talk, we showed how we gained observability over dbt and the underlying data by logging rich metadata from dbt runs into the warehouse and creating alerts and visualizations for individual model and test failures, cost swings, and overall job performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gTZzHxov5bp_EWRZ" /><figcaption>Our dbt alerts originally ran in Looker and tagged Slack user groups linked to business domains</figcaption></figure><p>Over the last two years, our observability system evolved substantially to meet the needs of a fast-growing company. This post will cover these changes and attempt to provide an overview of how we think about observability for dbt.</p><h3>Motivation</h3><p>Since 2021, Super launched two new verticals — <a href="https://www.super.com/shop">SuperShop</a> and <a href="https://www.super.com/cash">SuperCash</a> — and company headcount grew to support them. The number of people interfacing with the data platform as developers or consumers grew by 3x as a result. With more pipelines, models, and tests across the stack, the number of alerts grew. Important alerts were sometimes drowned out by noisy ones, and engineers frequently cited “alert fatigue” as a pain point. Furthermore, teams began organizing themselves by domain <em>and</em> vertical, which necessitated a reimagination of our domain-based notification system (sending alerts to “growth” or “finance” was no longer enough!). Finally, we saw an opportunity to both address these shortcomings and improve observability with the help of a few new open source projects</p><p>Bearing this in mind, we embarked on a project to revamp alerting for the data team. We set up new Slack channels for team and vertical combinations (e.g. data-ecommerce-alerts), and began funneling alerts from across the stack into them. We added new features like individual user tagging, threading for repetitive alerts, and daily <a href="https://github.com/teamclairvoyant/airflow-maintenance-dags/blob/teamclairvoyant/sla-miss-report/sla-miss-report/airflow-sla-miss-report.py">SLA reporting</a> for most data pipelines. This post will be about the updates we made to dbt.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/502/0*XPWaZGcWOMCEN7L9" /><figcaption>Repeat alerts from Airflow DAGs are added in-thread on Slack if they occur on the same day</figcaption></figure><h3>What is “dbt observability”?</h3><p>dbt at its core enables data teams to define and manage data resources (tables, views, etc.) and execute observability workloads (tests, source freshness etc.) on top of those resources.</p><p>dbt commands like run, build, and test intelligently string together resource management and observability workloads into pipelines, which we call jobs.</p><p>There is a distinction between observability for data and observability for dbt. Numerous tools, including dbt itself, improve data observability by helping you validate the expectations you have over the data in your warehouse. <a href="https://docs.getdbt.com/docs/build/tests">dbt tests</a> do this by definition, and numerous <a href="https://github.com/calogica/dbt-expectations">open</a> <a href="https://hub.getdbt.com/dbt-labs/dbt_utils/latest/">source</a> <a href="https://github.com/elementary-data/dbt-data-reliability">packages</a> have extended the core test suite to support more complex data testing like anomaly detection. This is beyond the scope of this post, but we recommend “<a href="https://www.metaplane.dev/blog/the-four-pillars-of-data-observability">The Four Pillars of Data Observability</a>” by Kevin Hu of Metaplane for a primer on data observability.</p><p>Observability for dbt, by contrast, involves monitoring, alerting, and visibility over dbt jobs themselves. Engineers, analysts, and even business stakeholders need to know when data quality tests fail, models break in production, jobs hit bottlenecks, or costs begin to skyrocket.</p><p>We break this down into two rough categories: Job outcomes and job performance. Job outcomes are the results of jobs, like failures in tests, models, snapshots, and source freshness checks. Job performance refers to metrics like execution time and cost, at both the job and individual resource levels. This post will show how we used dbt artifacts, warehouse logs, and open-source packages to gain visibility into both.</p><h3>Job outcomes</h3><p>The most important concern for dbt observability is ensuring the right people get notified when things go wrong. When the number of models in your dbt project is small, you can live with a single notification per job and a pair of watchful eyes, but as your project grows to thousands of models and tests, it becomes hard for any single individual to supervise. At that point, it may become desirable to get more granular, and trigger an alert for every model or test failure within a job rather than a single one for the whole job. This way, alerts can be distributed to specific subscribers.</p><p>Granular alerting can be accomplished by storing the <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">run results</a> and <a href="https://docs.getdbt.com/reference/artifacts/manifest-json">manifest</a> artifacts in your warehouse and querying them using an alerting library or BI tool. The run results artifact includes detailed execution metadata for all models, tests, and snapshots run in a dbt job (source freshness results are only available in the <a href="https://docs.getdbt.com/reference/artifacts/sources-json">sources</a> artifact, <a href="https://github.com/dbt-labs/dbt-core/issues/5609">for now</a>). The manifest includes metadata for each resource in your project, like our custom-defined model owners in the <a href="https://docs.getdbt.com/reference/resource-configs/meta">meta</a> config:</p><pre>models:<br>- name: dim_travel_user<br>  description: &#39;{{ doc(&quot;dim_travel_user&quot;) }}&#39;<br>  meta:<br>    owner: &quot;@jonathan&quot;<br>    channel: &#39;{{ var(&quot;data_travel&quot;) }}&#39;</pre><p>These artifacts are available through dbt’s <a href="https://docs.getdbt.com/reference/dbt-jinja-functions/on-run-end-context#results">Jinja context</a> and can be uploaded to your warehouse using a <a href="https://docs.getdbt.com/reference/resource-configs/pre-hook-post-hook">post-hook</a> at the end of a dbt job. There are a few open source packages like Elementary’s <a href="https://github.com/elementary-data/dbt-data-reliability/blob/master/macros/edr/dbt_artifacts/upload_run_results.sql">dbt_data_reliability</a> and Brooklyn Data’s <a href="https://github.com/brooklyn-data/dbt_artifacts/blob/main/macros/upload_results.sql">dbt_artifacts</a> which have this functionality (we use Elementary).</p><p>After we upload artifacts to the warehouse, we use Elementary’s open source python <a href="https://github.com/elementary-data/elementary">package</a> to send alerts to specific Slack channels tagging model owners. The Elementary alerts contain queries, sample data, and relevant metadata — much richer information than we had before. Here are a few examples:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/660/0*lDFm1FZJf9f3Tj4Q" /><figcaption>Source freshness failure alert triggered by Elementary</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/606/0*Jf2HkdNxklcYPhqk" /><figcaption>Test failures alert triggered by Elementary</figcaption></figure><h3>Performance</h3><p>It’s also important for the data team to collect data on dbt job executions so that we can visualize performance over time. In our original post, we showcased the visualizations we created in Looker that let us investigate a model’s execution over time and drill-down into specific jobs.</p><p>The model execution history helped us identify when models need their warehouses resized, or if we should explore new clustering strategies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*lcmGVZHdtKUErEwD" /><figcaption>Model execution history</figcaption></figure><p>The job execution view helped us identify pipeline bottlenecks in long-running jobs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/712/0*zRVyKw72kugjwhtU" /><figcaption>Pipeline bottleneck visualizations for hourly and nightly dbt runs</figcaption></figure><p>With Elementary, we gained access to a rich report with model and test execution history that we host on a static site in AWS. We use this report to investigate ongoing issues at the model and test level.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GTuM0mEmxsD7JxmR" /><figcaption>Model history view in the Elementary observability report</figcaption></figure><h3>Cost per model</h3><p>Back in 2021, we created a fairly primitive cost-per-model metric to help us identify our most expensive dbt models. There is now a much better version available in the open source dbt package <a href="https://github.com/get-select/dbt-snowflake-monitoring">dbt-snowflake-monitoring</a> by <a href="https://select.dev/">Select</a>. This package uses the <a href="https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html">Snowflake query history</a> and <a href="https://docs.snowflake.com/en/sql-reference/account-usage/warehouse_metering_history.html">warehouse metering history</a> views to construct an accurate picture of cost-per-model and cost-per-workload, which we’ve used to build Looker dashboards like the one below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*roNH9lSAULgeqt11" /><figcaption>Most expensive dbt models using the dbt-snowflake-monitoring package</figcaption></figure><h3>Conclusion</h3><p>When you’re running dbt at scale, it is vital to gain observability over your jobs so you can identify and fix data issues before they become large-scale problems. Observability gives your team powerful tools respond to ongoing data quality issues and optimize jobs. We achieved granular observability over dbt using dbt artifacts and open source packages, which allows us drill down to the level of an individual model or test. This fit our needs as a centralized data team at a multi-vertical company with hundreds of models and thousands of tests in a monorepo. Other teams may see more success with different approaches, like a multi-project setup with job-level alerting, which will become substantially <a href="https://github.com/dbt-labs/dbt-core/discussions/5244">easier</a> in the near future. No matter what model you adopt, adding observability to dbt has never been easier, and your team will gain valuable visibility into your project as it grows in complexity.</p><p>If you would like to learn more about employement opportunities at Super — check our <a href="https://www.super.com/careers?utm_source=Medium&amp;utm_medium=dbtatsuper3_blogarticle&amp;utm_content=Jonathan">careers page</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c8755109901f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/super/dbt-at-super-part-3-observability-c8755109901f">dbt at Super Part 3: Observability</a> was originally published in <a href="https://medium.com/super">Super.com</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[dbt at Snapcommerce Part 2: Continuous Integration]]></title>
            <link>https://medium.com/super/dbt-at-snapcommerce-part-2-continuous-integration-260d4e782eba?source=rss-614576d16221------2</link>
            <guid isPermaLink="false">https://medium.com/p/260d4e782eba</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[dbt]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[startup]]></category>
            <dc:creator><![CDATA[Jonathan Talmi]]></dc:creator>
            <pubDate>Wed, 12 Oct 2022 14:10:29 GMT</pubDate>
            <atom:updated>2022-11-29T18:47:32.853Z</atom:updated>
            <content:encoded><![CDATA[<h3><a href="https://medium.com/snaptravel/dbt-at-snapcommerce-part-1-orchestration-964c9a87b072?source=user_profile---------0----------------------------">dbt at Super Part 2: </a>Continuous Integration</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*J5F0k48x0RUGGlW80lVjrA.jpeg" /></figure><h3>Introduction</h3><p>The <a href="https://medium.com/snaptravel/dbt-at-snapcommerce-part-1-orchestration-964c9a87b072">first post</a> in this series covered orchestration for dbt projects, including the concepts of proactive and reactive orchestration. I explained our use cases for both orchestration modes and how we do deploy both as efficiently as possible. It’s been exciting to see orchestration become a conversation topic in the data community over the past few months. Just to highlight a few examples:</p><ul><li>Dagster introduced <a href="https://dagster.io/blog/software-defined-assets">Software-Defined Assets</a> and reconciliation-based orchestration, which facilitates the rematerialization of assets like dbt models based on observed changes in the asset’s metadata and upstream landscape</li><li>dbt introduced the <a href="https://docs.getdbt.com/reference/node-selection/methods#the-source_status-method">source_status</a> selection method to dbt Core, which enables users to trigger proactive dbt pipelines based on status of upstream data</li><li>Benn Stancil wrote a <a href="https://benn.substack.com/p/down-with-the-dag">blog post</a> about how data assets, and their desired freshness, should drive orchestration — and <a href="https://twitter.com/j_schottenstein/status/1558486184686366720?s=20&amp;t=uedcTAPnjucxmFa72Censg">not the other way around</a></li></ul><p>It’s increasingly looking like many of the complexities and redundancies I covered in the first post are being actively worked on and may not be long for this world — which I welcome with open arms!</p><p>This post will “<a href="https://en.wikipedia.org/wiki/Shift-left_testing">shift-left</a>” to the time before the data asset — in this case, a dbt model — gets materialized in production. It will cover Continuous Integration (CI), code linting, and dbt project organization, showcasing the system built at Super (FKA Snapcommerce) to manage the complexity of our dbt project as we scaled to hundreds of models and dozens of users.</p><h3>What is continuous integration?</h3><p>If you’re reading this blog post, there’s a good chance you’ve encountered Continuous Integration (CI) before even if you don’t know it. CI encapsulates a set of practices engineering teams use to facilitate collaboration and rapid iteration for software projects. These practices promote frequent contributions to a shared code repository with automated testing in place to detect errors before they enter production.</p><p>In the context of dbt projects, many standard elements of CI have already been widely adopted by the dbt community, like maintaining a single source code repository, and automated builds using virtual environments and Docker. CI can also be used to automatically run tests and linting checks whenever dbt code is pushed to the repository and before it gets merged into main.</p><p>For example, let’s say your company wants to start incorporating revenue from commissions into its net revenue calculation. You make a change in the mart_financial_metricsdbt model, which shows up on Github like this:</p><pre>- sale_price - supplier_cost                        AS net_revenue,<br>+ sale_price - supplier_cost + (0.05 * sales_price) AS net_revenue,</pre><p>Your code change is reviewed but you don’t have a CI process set up, so you merge it in after approval. The new codebase gets picked up automatically by an orchestrator like dbt Cloud, which starts running your updated model. Suddenly, you see alerts firing in the #dbt-alerts channel. You check the logs, and see the mart_financial_metrics model failing due to a compilation error. You double-check the code and discover that you accidentally wrote sales_price instead of sale_price when adding commission revenue, leading to the error. You quickly submit a fix.</p><p>This scenario, and the associated stress, can be prevented with a relatively simple CI process that runs and/or tests the changed dbt model. Other small misconfigurations, like a YAML input error, can be detected even earlier by simply compiling the dbt project. Finally, more advanced tests, tools, and techniques can be used to assess <a href="https://en.wikipedia.org/wiki/Software_regression">regressions</a> which are harder to detect, like downstream impacts of a relatively innocuous change in business</p><h3>Continuous Integration for dbt projects</h3><p>What are we trying to do in CI for dbt projects? The answer is not as obvious as you might think. I’ve seen a multitude of approaches across to running and testing dbt code before it enters production across the industry. Here are a few examples:</p><ul><li>Some simply <a href="https://docs.getdbt.com/reference/commands/compile">compile</a> their dbt project, and if successful, deploy it to a staging environment for manual testing and QA.</li><li>Some build and test modified models into a temporary environment using an approach called <a href="https://docs.getdbt.com/guides/legacy/best-practices#run-only-modified-models-to-test-changes-slim-ci">slim CI</a>.</li><li>Some (like the Shopify data team), don’t test <em>data</em> in CI, but test the model logic alone through <a href="https://shopify.engineering/build-production-grade-workflow-sql-modelling#:~:text=data%20discovery%20workflow.-,More%20Tests,-dbt%20has%20native">unit testing</a>.</li></ul><p>For this post, I will focus on the dbt jobs, python scripts, and data warehouse configurations we use to accomplish our version of a robust CI process. I will first talk about testing, and cover the different types of tests we use in our dbt project (data tests and unit tests) and the ones we enforce on every model. I’ll discuss linting, and how to enforce a SQL style guide in your repo. Finally, I’ll discuss project level checks which ensure our data model and project structure kept intact with every new addition.</p><h3>Compile</h3><p>Even the most simplistic CI process must fulfill this basic requirement: Ensure the dbt project compiles! Projects fail to compile when there is misconfigured YAML or Jinja, like references to a model, source, or macro that doesn’t exist. If your project fails to compile, it won’t run in your orchestrator, so it’s imperative you ensure compilation before any new code gets merged in. This can be checked using:</p><pre>dbt ls &gt; /dev/null      # option 1<br>dbt --warn-error parse  # option 2</pre><p>The <a href="https://docs.getdbt.com/reference/global-configs#warnings-as-errors">warn-error</a> config is optional, and will raise an error anytime dbt would normally warn. Warnings are raised in when there are YAML configs for nodes that doesn’t exist, so including this flag helps keep your dbt project nice and tidy .</p><p>Compilation errors are surfaced during run, test, or build,but we run the above commands early in our CI pipelines to fail fast when there is a simple compilation issue. Specifically we run dbt —-warn-error parsein our lint stage, and the following in our test stage:</p><p>dbt --warn-error ls --select state:modified+ --resource-type model</p><p>This helps developers quickly identify the models that will be built in this CI pipeline in case they forgot to include new models in their commits.</p><h3>Run and test dbt models</h3><p>A central tenet of continuous integration is tests should run in a clone of the production environment. This ensures all possible bugs that <em>could</em> occur in production are uncovered and fixed in the test environment. Luckily for us, cloud data warehouses make it easy to replicate production into an infinite number of CI pipelines, and dbt makes it even easier with Slim CI.</p><p>The most important part of your CI process after checking compilation is building and testing your new and modified models in a test environment in your warehouse. Building models ensure that whatever SQL you’ve added or augmented can compile and is free of syntax errors. Testing these models ensures that your new and modified models all satisfy the data quality or business logic constraints you’ve set for them. We run the following dbt job in CI:</p><p>dbt build -s state:modified+ --full-refresh --defer</p><p>This command builds and tests all modified seeds, models and snapshots in order of their position in the dbt DAG, which has the nice feature of failing the pipeline early when there is a test failure.</p><p>When a model has a lot of <a href="https://docs.getdbt.com/reference/node-selection/graph-operators#the-plus-operator">children</a>, CI pipelines can take a long time. We use a few <a href="https://discourse.getdbt.com/t/how-we-treat-big-data-models-in-our-dbt-setup/704">techniques</a> to speed up development pipelines, but some organizations <a href="https://discourse.getdbt.com/t/how-we-sped-up-our-ci-runs-by-10x-using-slim-ci/2603#where-to-from-here-6">limit the scope</a> of the state:modified+ selector to only include immediate children.</p><h4>Test coverage</h4><p>Testing for dbt projects can be roughly bucketed into two categories:</p><ul><li><strong>Data testing</strong>: Tests on top of dbt models built using production data, typically investigating data quality like uniqueness, nullability, and consistency.</li><li><strong>Unit testing</strong>: Tests evaluating model logic by running mock input data through the model and comparing it against expected output.</li></ul><p>Both of these types of tests leverage dbt’s built-in <a href="https://docs.getdbt.com/docs/building-a-dbt-project/tests">dbt testing</a> framework, but only data testing is natively supported by dbt-core, <a href="https://github.com/dbt-labs/dbt-core/discussions/4455">for now</a>.</p><p>Over the last year or so, there has been an explosion in <a href="https://github.com/josephmachado/unit_test_dbt">unit</a> <a href="https://github.com/EqualExperts/dbt-unit-testing">testing</a> <a href="https://github.com/dm03514/dbt-model-tests">frameworks</a> for dbt, signalling the desire of the community to adopt a more conventional and robust approach to testing dbt models. At Super, we adopted unit testing using the <a href="https://github.com/mjirv/dbt-datamocktool">dbt-datamocktool</a> package by Michael Irvine when our models became complex enough to render manual testing tedious and error-prone.</p><p>To ensure the models in our dbt project are well-tested, we require each model to possess at least one unique test and one not null test, but developers are encouraged to add more tests, including unit tests, to fit their needs. We enforce this using a custom python script, but can also be accomplished using the <a href="https://github.com/dbt-labs/dbt-project-evaluator">dbt-project-evaluator</a> package, which will be covered later.</p><h4>Model configs</h4><p>When making updates to lots of models, or sometimes even full directories, it can be difficult to ensure that all configs remain the same. We run a small python script to print out the diffs in configs for changed models, which makes it easy to identify when materializations, tags, or other configs have been changed errantly:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/719/1*j19PhSDijnK118NiKVxIvg.png" /></figure><h3>Code linting</h3><p><a href="https://en.wikipedia.org/wiki/Lint_(software)">Linting</a> for software projects is a process that uses static program analysis to flag stylistic errors, improper conventions, and bugs. For dbt projects, it’s used to ensure your SQL (and now <a href="https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models">python</a>) code follows a consistent style guide. We use <a href="https://www.sqlfluff.com/">SQLFluff</a> and a custom ruleset to enforce common conventions across our SQL codebase (and in the future, we may adopt something like <a href="https://github.com/psf/black">black</a> for Python models). We execute SQLfluff on new models only, using the following <a href="https://docs.sqlfluff.com/en/stable/production.html#adding-diff-quality-to-your-builds">command</a>:</p><pre>diff-quality --violations sqlfluff --compare-branch=main --fail-under=100</pre><h3>Documentation and YAML</h3><p>CI processes can also be used to apply dbt project standards as defined by the project owners. At Super, we value robust documentation, so we use CI to ensure that all new models, snapshots, sources, and columns are documented</p><p>All new models, sources, and snapshots must have the following:</p><ul><li>Node description</li><li>Column descriptions</li><li>Unique and not null tests (models and snapshots only)</li><li>Freshness condition (sources only)</li><li>Owner (slack handle)</li><li>Channel (slack channel)</li></ul><p>The last two are added as meta fields and are used by our observability system to distribute alerts:</p><pre>meta:<br>    owner: &quot;@jonathan&quot;<br>    channel: &quot;ops-alerts</pre><h3>Project and DAG linting</h3><p>Many organizations that have implemented dbt without strongly enforcing their project structure have come to know the pain of the “spaghetti DAG”:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*cVDHLkEtm5wKR861.jpg" /><figcaption>Image taken from dbt Labs’ “<a href="https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique/">Data modeling techniques for more modularity</a>”</figcaption></figure><p>In the spaghetti DAG, models are layered on top of each other without consideration or consolidation, leading to extremely wide DAGs with incoherent model relationships. Not only is this bad for analytics engineers’ sanity, but it’s bad for performance and consistency, as the same logic is duplicated across the project leading to unnecessary compute and business logic drift. A great primer on how to adopt a modular approach to your dbt project can be found <a href="https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique/">here</a> and <a href="https://courses.getdbt.com/courses/refactoring-sql-for-modularity">here</a>.</p><p>CI can play an important role in preventing this spaghetti DAG from forming in the first place. Rules can be defined and applied in CI to block violations from making their way into the codebase, keeping project structure intact. For example, it’s <a href="https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging#staging-models">best practice</a> to create a single staging model for every source — a script can see if this relationship holds by parsing the <a href="https://docs.getdbt.com/reference/artifacts/manifest-json">manifest</a> associated with the proposed code change in CI.</p><p>At Super, we run a few relationship checks on every commit to make sure the pull request conforms to our project standards. These include:</p><ul><li>1:1 relationship of source to stg model</li><li>fct/dim models only depend on stg models</li><li>mart models only depend on fct/dim models</li></ul><p>While a few <a href="https://github.com/autotraderuk/oliver-twist">early</a> <a href="https://github.com/offbi/pre-commit-dbt">projects</a> attempted to create tools for this, dbt Labs’ recently released <a href="https://github.com/dbt-labs/dbt-project-evaluator">dbt-project-evaluator</a> is the most advanced open source package yet, with a plethora of rules to cover even complex use cases like rejoins and model fanouts.</p><h3>Auditing and validation</h3><p>The above checks help ensure that code is clean, readable, well-tested and well-documented before it gets merged in, but the developer still needs to validate the raw business logic they are adding or modifying. QA can be a tedious process especially when an existing model is being changed. Unit tests can be added to affected models, but they are fairly cumbersome to write and don’t capture downstream implications!</p><p>A thread recently cropped up on Twitter detailing this exact problem, which led to a bunch of interesting responses:</p><h3>pedram growth hacker on Twitter: &quot;What checks do you do after building/changing a model? Apart from your standard dbt PK / not null tests, how do you validate what you built? We cant all just be manually going through the table and scrolling like hmm...yes...looks right to me..can we? / Twitter&quot;</h3><p>What checks do you do after building/changing a model? Apart from your standard dbt PK / not null tests, how do you validate what you built? We cant all just be manually going through the table and scrolling like hmm...yes...looks right to me..can we?</p><p>Many repliers talked about auditing the results with varying levels of sophistication. Some check primary keys and row counts, while <a href="https://sarahsnewsletter.substack.com/">Sarah Krasnik</a> wrote:</p><blockquote>Any column that’s changed, I do before/after stats if it’s numeric like mean/median/max/min and make sure the difference is warranted. If it’s a string but has &lt;10 unique values, do counts for all the values and again make sure the change is warranted. Document all in the PR!</blockquote><p>At Super, we use a paid tool called <a href="https://datafold.com/">Datafold</a> to automate much of the QA process. Datafold runs a detailed diff for every dbt model (and child model) touched in a pull request. It appends a merge comment to the PR with a preview of the full results, like the one seen below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WsguaNnbEGirb4ur" /></figure><p>Datafold makes it easy for developers to investigate the final output of their models and metrics, and can be a powerful tool for change management since it compares new versions of models with the existing production ones. However, if you are looking for free or open-source options, the <a href="https://github.com/dbt-labs/dbt-audit-helper">dbt audit helper package</a> is a great alternative. This package contains macros which perform diffs between models and tables, and it can be run manually or hacked into CI using some creativity.</p><h3>Lineage</h3><p>One of the most important aspects of change management for dbt is managing all the downstream implications outside of dbt. When someone makes a pull request to our dbt repo, they should be able to easily identify all the downstream assets that might be affected, like dashboards, visualizations, and reverse ETL syncs. <a href="https://docs.getdbt.com/docs/building-a-dbt-project/exposures">Exposures</a> were introduced into dbt for this exact reason, but creating and maintaining them requires a significant amount of manual work.</p><p>At Super, we don’t have an automated solution for this. We maintain a list of repos and services developers should check so they don’t break anything with a dbt model update. There are signs vendors are beginning to take note, as shown by the recent <a href="https://www.datafold.com/blog/hightouch-integration">partnership</a> between Datafold and Hightouch, which surfaces downstream Hightouch syncs in Datafold’s pull request comments — but we would welcome an open source solution to this problem too.</p><h3>Conclusion</h3><p>In this post, we broke down our CI process for dbt projects. In order to keep our codebase clean and bug-free, we run and test all new or modified models, enforce test and documentation coverage, lint the SQL code, check that the new code is consistent with our project structure, and finally surface data quality checks and diffs to facilitate QA.</p><p>If you want to join the Super team — check out our<strong> </strong><a href="https://www.snapcommerce.com/careers?utm_source=Medium&amp;utm_medium=data_blogarticle&amp;utm_content=Jonathan"><strong>careers page</strong></a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=260d4e782eba" width="1" height="1" alt=""><hr><p><a href="https://medium.com/super/dbt-at-snapcommerce-part-2-continuous-integration-260d4e782eba">dbt at Snapcommerce Part 2: Continuous Integration</a> was originally published in <a href="https://medium.com/super">Super.com</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[dbt at Super Part 1: Orchestration]]></title>
            <link>https://medium.com/super/dbt-at-snapcommerce-part-1-orchestration-964c9a87b072?source=rss-614576d16221------2</link>
            <guid isPermaLink="false">https://medium.com/p/964c9a87b072</guid>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[dbt]]></category>
            <category><![CDATA[orchestration]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[engineering]]></category>
            <dc:creator><![CDATA[Jonathan Talmi]]></dc:creator>
            <pubDate>Wed, 20 Apr 2022 19:05:47 GMT</pubDate>
            <atom:updated>2023-01-23T22:40:38.474Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*J5F0k48x0RUGGlW80lVjrA.jpeg" /></figure><h3>Introduction</h3><p>dbt has been an integral part of the data platform at <a href="https://www.super.com/">Super</a> since August 2018. In that time, roughly 500 models, 2500 tests, and 200 sources have been added to our dbt project, mostly by the data team, although folks outside the data team, like software engineers and technical account managers, routinely contribute as well. The goal of the data platform team, which maintains dbt as a service, is to make it as easy as possible to develop and deploy dbt models that get refreshed at the desired frequency.</p><h3>Orchestration for dbt</h3><p><a href="https://www.astronomer.io/blog/what-is-data-orchestration/">Data orchestration</a> is an “automated process that takes data from multiple storage locations and programmatically allows you to author, schedule, and monitor data pipelines.” In the context of dbt, orchestration refers to the system or process that automatically builds, snapshots, and tests dbt objects like sources and models. Orchestration for dbt typically occurs in one of two ways: <strong>proactive</strong> and <strong>reactive</strong> (thanks to <a href="https://benn.substack.com/p/the-end-of-big-data?s=r#:~:text=reactive%20version%20of%20dbt">Benn Stancil</a> for inspiring this terminology).</p><h3>Proactive orchestration</h3><p>Proactive dbt jobs are scheduled at fixed intervals, like once an hour or every night at 2:00 AM. You will use proactive jobs when you want models and snapshots to update at predictable times regardless of the state of the upstream data. Tools like dbt Cloud and Airflow facilitate the creation of proactive dbt jobs by allowing users to define schedules in UI or in code.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*v08V56IJuE_VRm2D" /></figure><p>Proactive orchestration is simple and effective. Most models do not need to be run more than once per day, and those that do (to include same-day data) can be included in hourly (or every X hour) runs. Occasionally, however, you may want to build models outside of a fixed schedule, which leads one to reactive orchestration.</p><h3>Reactive orchestration</h3><p>Reactive dbt orchestration refers to when dbt jobs are triggered immediately after new data gets loaded into the warehouse. This may be required when a dbt job is part of a larger data pipeline, or when stakeholders want to consume the latest data available.</p><p>Let’s look at an example. Let’s say there is a marketing report from an external partner that gets updated once per day. The report gets copied from source and loaded into a staging area in the data warehouse in some ELT process. dbt then plays a vital role in transforming this marketing report into something usable before it gets consumed by business stakeholders or ML models. Timeliness is important, as both stakeholders and algorithms both perform best with the latest data available.</p><p>In the above case, it is desirable to run downstream dbt models as soon as data gets loaded into the warehouse. The dbt job is part of a larger <a href="https://www.mozartdata.com/post/etl-vs-elt-what-s-the-difference">ELT(L)</a> pipeline where data is ingested, transformed, and then immediately <a href="https://hightouch.io/blog/what-is-data-activation/">activated</a> at the end of the pipeline (e.g. using reverse ETL). Proactive orchestration, in this case, isn’t very appealing. The marketing report dbt models could be part of the hourly dbt job, but that would necessitate building them 24 times a day even though new data only arrives once. Furthermore, schedules would have to be synced across tools (e.g. the data loader, reverse ETL process, etc.) to minimize the lag between receiving data and activating it.</p><p>Reactive orchestration is possible using both dbt Core and <a href="https://registry.astronomer.io/providers/dbt-cloud">dbt Cloud</a>. In both cases, EL pipelines can be paired with dbt jobs using orchestration tools like Airflow, Prefect, or Dagster, through API calls to dbt Cloud, and in the future, through an API call directly to a self-hosted <a href="https://www.youtube.com/watch?v=MdSMSbQxnO0">dbt server</a>. Fivetran’s <a href="https://fivetran.com/docs/transformations/dbt">Transformations</a> support reactive orchestration natively. Fivetran lets users define key models to keep fresh, and these models are rebuilt (and optionally tested) whenever an upstream data source loads new data (although this i limited to Fivetran-enabled data sources).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/601/0*Kht1gcGccUuGdwr3" /></figure><p>In sum, there are two main modes of dbt orchestration:</p><p><strong>Proactive</strong>: Run a set of models at a fixed interval, like once an hour, e.g. :</p><pre>dbt run -s tag:hourly</pre><p><strong>Reactive</strong>: Run a set of models after a data source is updated, e.g.:</p><pre>dbt run -s source:mysource.table+</pre><h3>Hybrid approaches</h3><p>Many organizations combine proactive and reactive modes of orchestration, and increasingly, dbt Core and dbt Cloud are adding new features to support hybrid orchestration.</p><p>After initially posting this article, Austin Lin Gibbons replied on the dbt slack community with a use case for combining proactive and reactive jobs for specific models. Organizations may want their daily and intraday dbt jobs to be reactive to include the latest data available, while running proactive full refreshes on a weekly or monthly cadence to account for changing historical data. Austin used the example of GDPR compliance, which may require companies to purge data from derivative tables at least once every 30 days. Frequent proactive full refreshes of incremental models also address the problem of <a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/late-arriving-fact/#:~:text=A%20fact%20row%20is%20late,late%20arriving%20measurement%20event%20occurred.">late-arriving facts</a> or <a href="https://medium.com/dataseries/resolving-late-arriving-dimensions-c0ebc9f818c3">dimensions</a>.</p><p>Furthermore, dbt Core is adding a <a href="https://docs.getdbt.com/docs/guides/best-practices/#run-only-modified-models-to-test-changes-slim-ci">new feature</a> in 1.1 which allows users to create proactive jobs that are intrinsically reactive. Using state comparison and a brand new selector, dbt will only run models downstream of <em>fresh</em> data, rather than running models repeatedly using stale data:</p><pre>dbt source freshness<br>dbt build --select source_status:fresher+ --state path/to/prod/artifacts</pre><p>This is supported <a href="https://docs.getdbt.com/docs/dbt-cloud/using-dbt-cloud/cloud-enabling-continuous-integration#fresh-rebuilds">natively in dbt Cloud</a>, which automatically handles state comparison and artifacts storage.</p><h3>Early dbt orchestration at Super</h3><p>In the early days of dbt at Super, we did not require advanced orchestration, so all our models, tests, and snapshots were run using hourly or nightly dbt jobs. We used dbt tags to delineate hourly and nightly models, and we ran checks in CI to ensure every model has only one of these tags. We set up jobs in dbt Cloud like the following:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QHrYevexy91eSVGA" /></figure><p>Let’s look at growth analytics for <a href="https://www.snaptravel.com/">Snaptravel</a>, our hotels product, as an example. At any given point in the company’s recent history, we’ve used 5–10 marketing channels to drive customer acquisition. Granular data on hotel and itinerary-level marketing performance is made available to us by our partners daily at different times depending on the channel. Originally, we had 10–15 pipelines defined in Airflow that fetched data, copied it into S3 and loaded it into our data warehouse, Snowflake. These pipelines looked something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zmEs5GghDrLlDnoC" /></figure><p>Before dbt, this data was combined with our internal transactional data directly in the BI layer to calculate important metrics like ROAS. After dbt, these reports were modelled and consolidated into an aggregated fact table. Even though each model only needed to be updated once per day, we included them in our hourly dbt runs since the exact timing was often unpredictable — reports could come in any time between 4–7 AM, for example — and the size of the data at the time was small enough to not substantially elongate our dbt runs.</p><p>Most organizations, like Super in the early days, can probably survive using proactive orchestration and small set of scheduled jobs like hourly and nightly runs. However, as the use cases for data pile up and the data itself grows in size and surface area, it’s almost inevitable that you will have to consider alternative approaches.</p><h3>Adding reactive orchestration</h3><p>As head count at Super grew, our growth analytics function became more sophisticated. The team’s responsibilities grew to include:</p><ul><li><strong>Reporting and analytics</strong>. Tracking ROAS at the channel level on a daily basis. There was significant drop-off for some of our properties — let’s find out why!</li><li><strong>Automated bidding</strong>. Updating bids and bid multipliers for channels using the latest available performance data. We’re getting outbid on long-term stays (which are high AOV) in New York because our CPCs are too low — let’s raise them!</li><li><strong>Dynamic pricing</strong>: Raising or lowering prices based on our price competitiveness compared to other OTAs. Our price for weekend stays at the Bellagio in Las Vegas is 5% higher than competitors — let’s lower it!</li></ul><p>All three require combining multiple data sources together and defining advanced metrics dbt. Timeliness became particularly important especially in the case where data finds its way back into the product through automated bidding or dynamic pricing.</p><p>You can probably tell where this is going. Reactive dbt orchestration was a natural fit for activating marketing data as soon as it arrived. Triggering dbt jobs inside the data pipelines that manage extraction, loading, and reverse ETL, meant that new bids, prices, and reports could be published immediately upon ingesting new data. A data pipeline might look something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*iqgJssOqydfwBYIr" /></figure><p>The dbt models selected in steps 3 and 4 are the downstream models from that source:</p><pre>source:mysource.table+</pre><p>We also exclude models tagged hourly or nightly to avoid race conditions or data issues, such as in the case where incremental models are not properly defined. We added the “external” deployment tag for models exclusively managed through reactive dbt jobs. We then limited node selection to the <a href="https://docs.getdbt.com/reference/node-selection/set-operators#intersections">intersection</a> between the source’s children and the external tag, like so:</p><pre>dbt run -s source:mysource.table+,tag:external</pre><p>Now, analytics engineers and developers have a few orchestration options to choose from when adding new dbt models. The simplest option is using proactive orchestration, where they tag their models hourly or nightly, depending on how important it is to look at same-day data. For more complex use cases, they use reactive orchestration, by tagging models as external and adding dbt tasks directly into an Airflow DAG, like so:</p><pre>dbt_run = KubernetesPodOperator(<br>    **super_defaults,<br>    **dbt_defaults,<br>    task_id=’dbt_run_task’,<br>    name=”dbt-run-task”,<br>    arguments=[“dbt run -s source:mysource+,tag:external”]<br>    )</pre><h3>Conclusion</h3><p>In summary, Super uses both proactive and reactive orchestration in its dbt deployment. We used proactive orchestration exclusively at first since all our requirements could be satisfied by hourly and nightly dbt jobs. As the data grew in size and surface area, and operationalizing data became an important use case, we introduced reactive orchestration which permitted timely refreshes of dbt models before loading the results to new destinations. Contributors to our data platform now have several options for orchestrating dbt, and we ensure there is no overlap between proactive and reactive dbt models by checking for tag collisions in CI.</p><p>If you want to read more — check out and follow our <a href="https://medium.com/super">Super publication</a> here on Medium.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=964c9a87b072" width="1" height="1" alt=""><hr><p><a href="https://medium.com/super/dbt-at-snapcommerce-part-1-orchestration-964c9a87b072">dbt at Super Part 1: Orchestration</a> was originally published in <a href="https://medium.com/super">Super.com</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>