Building a semantic layer in Preset (Superset) with dbt

Pierre Vanacker · Published in Plum-living · Feb 15, 2023

Plum is an interior design & decoration start-up. At Plum, we unleash our customers’ creative potential by providing them with meaningful and inspirational DIY and interior design content, top-of-the-line products, and expert advice and guidance. Ultimately, we want to be the creative partner of choice when it comes to interior design.

To do that, our team needs a deep understanding of our consumers’ and community’s behavior, of our suppliers’ data, and more. This means synchronizing data from different tools and defining the data & metrics we want to expose to each team.

To kick off the creation of our Data Platform, we benchmarked data visualization solutions. Preset (a fully-managed service for Apache Superset) was a good match, striking a balance between simplicity and versatility.

Amongst all the features offered by Preset, its dbt synchronization grabbed our attention.

Another reason we favored Preset was its open-source foundation, which avoided locking Plum into a proprietary system. The synchronization with dbt would take us one step further, allowing us to define our entire exposed data model not only as code, but also with a solution totally independent of the visualization platform.

This article shows how we implemented this project technically, and the advantages it brought Plum.

What is a metric definition?

A metric is defined by dbt as “an aggregation over a table that supports zero or more dimensions”. More generally, it’s a representation of one indicator over a dataset, linked to one or many company KPIs.

Creating metrics in the modelling layer is useful as it gives us a single definition of an indicator, understandable across business teams. We could say it’s a first step towards building a semantic data layer.

For example, we could provide a metric that computes the revenue, taking into account all necessary fields:

metrics:
  - name: total_revenue_vat_included
    label: Total revenue VAT included
    model: ref('m_orders')
    description: "The total revenue (VAT included). It filters out cancelled and closed orders"

    calculation_method: sum
    expression: total_after_discount_vat_included

    timestamp: transaction_date
    time_grains: [day, week, month, quarter, year, all_time]

    filters:
      - field: status
        operator: '!='
        value: "'cancelled'"
      # exclude closed orders too, as stated in the description
      - field: status
        operator: '!='
        value: "'closed'"

How it works

Here are the two main functionalities of this system:

  • Whenever a developer pushes a new metric definition in dbt, Preset is updated with this metric definition
  • Whenever a developer updates a table schema, Preset dataset definitions are updated

This allows us to define metrics & schemas as code and keep Preset permanently in sync with the dbt configuration. It also allows us to expose only selected datasets to users, always in their freshest state, while hiding raw & cumbersome utility tables under the hood.

We use a dbt-core project for this, hosted in a GitHub repository.

Here is the structure of the synchronization from GitHub to Preset:

We rely on GitHub Actions to push the updated metrics & schemas to the appropriate workspace every time a developer pushes to specific branches.

Basically:

  • pushing to the development branch updates the development workspace of our Preset account
  • pushing to the main branch updates the production workspaces of our Preset account.

This diagram sums up the architecture behind Preset workspaces:

Focus: CI configuration

The CI configuration was largely inspired by this article from Preset: https://preset.io/blog/dbt-superset-integration-preset/

It relies on the Preset CLI (built on top of the Superset API), which provides the synchronization utility: https://github.com/preset-io/backend-sdk

The necessary steps:

  • Retrieve the Preset CLI repository (you can also host it internally and/or package it) and install dbt
  • Execute a dbt compile command to generate the manifest.json file: the Preset CLI has to parse the dbt manifest to retrieve the structure of your models
  • Publish the resources you need with the CLI!
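
Outside of CI, the process boils down to a few commands. Here is a minimal sketch (the workspace URL is a placeholder and the package versions are illustrative):

# Install the Preset CLI from its repository, plus dbt
pip install "git+https://github.com/preset-io/backend-sdk.git" dbt-core==1.3.1 dbt-bigquery==1.3.0

# Compile the project to generate target/manifest.json
dbt deps && dbt compile

# Publish the models & metrics to a Preset workspace
preset-cli --workspaces=https://12345678.us1a.app.preset.io \
  superset sync dbt ./target/manifest.json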

The idea is to select only the models/tables you want to expose to your users (set raw tables aside, for example, and only show data marts). Metrics defined in dbt will automatically be accessible in Preset as metrics linked to your tables. In our case, the selection is driven by dbt tags (see the -s tag:refined tag:datamart and -x tag:ephemeral flags in the workflow below).
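
As a sketch, such tags could be assigned per folder in dbt_project.yml; the folder names here are illustrative:

# dbt_project.yml (sketch): tag model folders so the CI selection can pick them up
models:
  plum_core_data:
    refined:
      +tags: ['refined']
    datamarts:
      +tags: ['datamart']
    utilities:
      +tags: ['ephemeral']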

Here is a GitHub Actions file example for this process:

name: 'Publish dbt resources to Preset'

env:
  DBT_GOOGLE_PROJECT: xx

on:
  push:
    branches:
      - dev
      - main

jobs:
  publish:
    name: 'Publish resources'
    runs-on: ubuntu-latest

    defaults:
      run:
        shell: bash

    steps:
      # Fetch the Preset CLI sources into a dedicated folder
      - name: 'Checkout CLI'
        uses: actions/checkout@v3
        with:
          repository: preset-io/backend-sdk
          path: preset-cli

      - name: 'Install CLI'
        run: |
          cd preset-cli && pip install . && cd ..

      # Checkout our own dbt repository
      - name: 'Checkout'
        uses: actions/checkout@v3

      - name: 'Install dbt'
        run: |
          pip install dbt-core==1.3.1 dbt-bigquery==1.3.0

      # main targets the production workspaces; any other branch targets dev
      # (workspace IDs yy and zz are placeholders)
      - name: 'Define variables based on branch'
        run: |
          if [[ $GITHUB_REF == 'refs/heads/main' ]]; then
            echo "WORKSPACE=yy,zz" >> "$GITHUB_ENV"
            echo "DBT_TARGET=prod" >> "$GITHUB_ENV"
          else
            echo "WORKSPACE=zz" >> "$GITHUB_ENV"
            echo "DBT_TARGET=dev" >> "$GITHUB_ENV"
          fi

      # Generate target/manifest.json for the Preset CLI to parse
      - name: 'Compile manifest'
        run: |
          dbt deps && dbt compile --target=${DBT_TARGET}
        env:
          GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.DBT_GOOGLE_BIGQUERY_KEYPATH }}
        working-directory: dbt-models/plum_core_data

      # Sync the selected (tagged) models & metrics to the Preset workspace(s)
      - name: 'Publish refined resources'
        env:
          PRESET_API_TOKEN: ${{ secrets.PRESET_API_TOKEN }}
          PRESET_API_SECRET: ${{ secrets.PRESET_API_SECRET }}
        run: |
          preset-cli --workspaces=${WORKSPACE} superset sync dbt \
            ./target/manifest.json \
            --profiles=./profiles-ci.yml \
            --project=plum_core_data --target=${DBT_TARGET} --import-db \
            -s tag:refined tag:datamart -x tag:ephemeral
        if: github.event_name != 'pull_request'
        working-directory: dbt-models/plum_core_data

Note that the synchronization tool is also able to export dashboards created in Preset as dbt exposures. Implementing this (not necessarily in the CI for this part) would greatly help back up dashboards and/or expose them to a data catalog, for example.
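
As a sketch, assuming the CLI’s --exposures option, the sync command can write those dashboards into a YAML file inside the dbt project:

# Sketch: export Preset dashboards back as dbt exposures (path is illustrative)
preset-cli --workspaces=${WORKSPACE} superset sync dbt \
  ./target/manifest.json \
  --exposures=./models/exposures.yml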

Conclusion

This Preset feature allows us to define the data & metrics we want to expose to our users “as code” with dbt, without being tied to a closed solution or a language specific to the visualization tool.

Something is still missing, though: a way to collaborate with business users on the definition of the metrics. Editing configuration files in a GitHub repository is not ideal for end users. I’m quite confident this part will improve very soon.

In a future article, I’ll detail the architecture of the Data Platform we started building four months ago. Stay tuned!
