Understanding the Semantic Layer

Leveraging the Syntax of dbt Labs’s MetricFlow

Denzel S. Williams
The Data Driven Diaries
12 min readSep 6, 2023

--

Trying to Define the Semantic Layer

Engineers, mathematicians, and scientists are usually good at naming things. They often include the meaning of something in its name. To get the idea of the Semantic Layer, we should first understand what ‘Semantics’ means.

A Compilation of Definitions across the Web

  • Semantics is the study of reference, meaning, or truth. Relating to meaning in language or logic. Of, relating to, or arising from the different meanings of words or other symbols.
  • Semantics is a branch of linguistics concerned with meaning: what are the meanings of different words? How are those meanings constructed and how do they change? These are the kinds of questions that linguists seek to answer.
  • An example of semantics in everyday life might be someone who says that they’ve bought a new car, only for the car to turn out to be second-hand. However, the person feels that the car is new for them, creating a semantic ambiguity.

Mashing All of Those Definitions into One

The Semantic Layer pertains to the reference, meaning, or truth of data. It seeks to understand how different data items are defined, how their meanings are constructed, and how they might evolve. One of its primary objectives is to avoid semantic ambiguities. For instance, when referencing 1000 new subscriptions in May, the Semantic Layer would clarify whether these are first-time subscriptions or reactivations.

What’s the Point of a Semantic Layer (from dbt Labs)

Excerpt from: Why we need a Universal Semantic Layer

To really make the most out of data, we need to have a consistent, unified, and understandable view of it, one that is easily accessible to both technical and non-technical stakeholders. More often than not, businesses face hurdles with this.

Data consumers who don’t have context on the underlying data need to put in a lot of work to understand the nuances of data sets, and finding this information is rarely easy. The interfaces (SQL, BI Tools) can be prohibitively complicated to interact with for users without modeling experience. This creates a new requirement (and a bottleneck) for meaningful analysis to be done: a data producer to act as a translation layer.

It’s also no surprise that different teams across a company use different tools, and when critical metric calculations lead to different answers depending on where they are, it’s easy to get caught up debating whose version of reality is right. These inconsistencies ultimately lead to distrust in data.

What’s the Point of a Semantic Layer (from Benn Stancil)

Excerpt from: The missing piece of the modern data stack

…self-serve is a misunderstood (or, at least, misrepresented) problem. Because the most common question people have is “How often did this thing happen?,” effective self-serve is less about complex analysis and more about metric extraction. People “want to choose from a list of understood KPIs, apply it to a filtered set of records, and aggregate it by a particular dimension. It’s analytical Mad Libs — show me average order size for orders that used gift cards by month.

Today’s current stack makes it easy to answer this question, but really hard to answer it consistently. The core problem is that there’s no central repository for defining a metric. Without that, metric formulas are scattered across tools, buried in hidden dashboards, and recreated, rewritten, and reused with no oversight or guidance…

So, if our goal is to define a set of metrics for the business to reuse, how do we achieve that? We’ll tackle this using dbt Labs’s MetricFlow. But remember, we’re using MetricFlow mainly to provide a clear syntax that illustrates a broader concept.

Entities, Dimensions, and Measures

The Semantic Layer is built on three basic concepts for creating a metric:

  • Entity: This is what the data is about (Noun). For example, a student.
  • Dimension: This describes or categorizes the data (Adjective). Like a student’s major.
  • Measure: This is the data or metric you’re looking at (Verb). Such as the average GPA of students.

In this section, I’ll explore semantic modeling using a university database example. Below is a table named fct_class_enrollment created in dbt.


+---------------+------------+-----------+-------------+-----------------------------+-------------------+--------------+-------------------------+
| enrollment_id | student_id | course_id | term | course_name | department | student_year | student_major |
+---------------+------------+-----------+-------------+-----------------------------+-------------------+--------------+-------------------------+
| 1 | 1001 | 501 | Fall 2023 | Calculus I | Mathematics | Freshman | Engineering |
| 2 | 1002 | 502 | Fall 2023 | English Literature | Arts & Humanities | Sophomore | English |
| 3 | 1003 | 501 | Fall 2023 | Calculus I | Mathematics | Junior | Mathematics |
| 4 | 1004 | 503 | Fall 2023 | Introduction to Biology | Sciences | Senior | Biology |
| 5 | 1005 | 504 | Spring 2024 | History of Europe | History | Freshman | History |
| 6 | 1006 | 501 | Spring 2024 | Calculus I | Mathematics | Senior | Engineering |
| 7 | 1007 | 505 | Fall 2023 | Philosophy 101 | Arts & Humanities | Freshman | Philosophy |
| 8 | 1008 | 506 | Spring 2024 | Chemistry Basics | Sciences | Junior | Chemistry |
| 9 | 1009 | 507 | Fall 2023 | Introduction to Programming | Computer Science | Sophomore | Computer Science |
| 10 | 1010 | 508 | Fall 2023 | Business Ethics | Business | Freshman | Business Administration |
| 11 | 1011 | 503 | Fall 2023 | Introduction to Biology | Sciences | Freshman | Biology |
| 12 | 1012 | 503 | Fall 2023 | Introduction to Biology | Sciences | Junior | Mathematics |
+---------------+------------+-----------+-------------+-----------------------------+-------------------+--------------+-------------------------+

Entities

An entity is a thing that exists independently, and which can be identified in a particular context. In database terms, entities typically correspond to tables. They are the main subjects of our data — the things we are collecting data about. Entities are real-world concepts in a business such as customers, transactions, and ad campaigns. We represent entities in our semantic models using id columns that serve as join keys to other semantic models in your semantic graph.

We use entities to help the semantic engine recognize how different tables or sets of data are related. This way, the engine understands how data links together. When someone asks a question about a particular entity, the engine knows where to get the relevant information.

For a university database, potential entities might be “Student”, “Course”, “Instructor”, and “Department”. Keep in mind that a single table can contain more than one entity. Here’s an outline of the entities for our table in this model:

semantic_models:
- name: class_enrollment
model: ref('fct_class_enrollment') # A model created with dbt
description: >
Class Enrollment fact table at the course level.
This table contains one row for each student-course enrollment
and includes the academic term.

entities:
- name: enrollment
type: primary
expr: enrollment_id # Each enrollment is identified by the id
- name: student
type: foreign
expr: student_id # This is the foreign key to the students table
- name: course
type: foreign
expr: course_id # This is the foregin key to the courses table

Dimensions

Dimensions give context to measures. They’re often categories and help break down data in different ways during an analysis. Dimensions typically describe the characteristics associated with other elements in the model.

We set up dimensions to let users sort and analyze their data from different angles. The semantic engine uses these dimensions to modify queries based on what the user wants. We phrase these dimensions in everyday language so users don’t need to think about database terms.

In our university example, we noted that student_id links to the “Students” table. We did the same for the “Courses” table. The Students table could have many dimensions, such as:

  • The student’s chosen major
  • The student’s year level (like Freshman, Sophomore)
  • The student’s gender
  • Any sports the student is involved in

The Courses table might also have various dimensions, like:

  • Which department offers the course
  • How many credits the course gives
  • Who teaches the course
  • The course name
  • Which semester it’s taught in

We’ve incorporated some of these dimensions into the fct_class_enrollment model, and we’ll detail them further:

semantic_models:
- name: class_enrollment
model: ref('fct_class_enrollment')
description: >
Class Enrollment fact table at the course level.
This table contains one row for each student-course enrollment
and includes the academic term.

entities:
...

dimensions:
- name: term
type: time
type_params:
time_granularity: semester # Selected from the Courses table

- name: course_name
type: categorical
expr: course_name # Selected from the Courses table

- name: department
type: categorical
expr: department # Selected from the Courses table

- name: student_year
type: categorical
expr: student_year # Selected from the Students table

- name: student_major
type: categorical
expr: student_major # Selected from the Students table

Measures

Measures are the quantifiable data points that are usually the subject of analysis. They represent the metrics we want to study. Measures can be aggregated, and often, a primary function of a BI tool is to aggregate these measures across various dimensions.

We define measures so that calculations are consistent across all queries and reports removing any semantic ambiguity.

Using our university example, suppose the school wants to count student enrollments. Here’s an SQL query for that:

SELECT COUNT(enrollment_id) as total_enrollments
FROM fact_class_enrollment

If they feel that this number is too general, and they want to see enrollments for “Introduction to Biology” in Fall 2023, they’d adjust the query:

SELECT COUNT(enrollment_id) as total_enrollments
FROM fact_class_enrollment
WHERE course_name = 'Introduction to Biology' AND term = 'Fall 2023'

And if they want to narrow it down further, to see how many of those were Biology majors, the query would look like:

SELECT COUNT(enrollment_id) as total_enrollments
FROM fact_class_enrollment
WHERE course_name = 'Introduction to Biology'
AND term = 'Fall 2023'
AND student_major = 'Biology'

In all these queries, they’re looking at one thing: the number of enrollments. What’s changing is the context or the angle of their query. This shows that you can have one single measure (like counting enrollments) and use metrics to view it from various perspectives.

semantic_models:
- name: class_enrollment
model: ref('fct_class_enrollment')
description: >
Class Enrollment fact table at the course level.
This table contains one row for each student-course enrollment
and includes the academic term.

entities:
...

dimensions:
...

measures:
- name: total_enrollments
description: "The total number of enrollments for each course."
agg: count # The aggregation that we want to use
expr: enrollment_id # The topic of the aggregation

Typical Structure of a Analytical SQL Query

SELECT
ENTITIES,
MEASURE
FROM table
[WHERE DIMENSION] # Optional
[GROUP BY ENTITIES] # Optional

The Full Semantic Model

semantic_models:
- name: class_enrollment
model: ref('fct_class_enrollment')
description: >
Class Enrollment fact table at the course level.
This table contains one row for each student-course enrollment
and includes the academic term.

entities:
- name: enrollment
type: primary
expr: enrollment_id # Each enrollment is identified by the id
- name: student
type: foreign
expr: student_id # This is the foreign key to the students table
- name: course
type: foreign
expr: course_id # This is the foregin key to the courses table

dimensions:
- name: term
type: time
type_params:
time_granularity: semester # Selected from the Courses table

- name: course_name
type: categorical
expr: course_name # Selected from the Courses table

- name: department
type: categorical
expr: department # Selected from the Courses table

- name: student_year
type: categorical
expr: student_year # Selected from the Students table

- name: student_major
type: categorical
expr: student_major # Selected from the Students table

measures:
- name: total_enrollments
description: "The total number of enrollments for each course."
agg: count
expr: enrollment_id

The Semantic Engine — MetricFlow (dbt Labs)

All the definitions from the previous section need a “Semantic Engine” to make them functional. This engine interprets the data and crafts analytical queries based on those definitions. For instance, even after defining all the aspects of class enrollments, we still require an engine to parse that semantic model and generate a query to count how many Biology majors were in “Introduction to Biology” in Fall 2023. In the context of dbt, MetricFlow serves this purpose.

The “Semantic Engine” is no different than the “Documentation Engine”. When you draft a yml file for a model, on its own, it doesn’t do much. However, dbt’s “Documentation Engine” transforms this data into practical tools like a documentation website, dbt tests, alerts, data contracts, and more. Similarly, MetricFlow acts as dbt’s “Semantic Engine”, interpreting semantic data to produce useful outputs, such as standardized and reusable analytical queries.

Note: While dbt comes with a built-in “Documentation Engine”, anyone or any company can design an alternative version. This alternative would interpret those model yml files differently. The same goes for MetricFlow; it’s dbt Labs’s chosen “Semantic Engine”, but there’s room for others to design their unique versions that read and process those semantic model yml files differently.

To use MetricFlow for generating analytical queries, you first need to establish metrics based on the semantic model you’ve developed. Let’s focus on creating metrics related to total enrollments:

  1. Total Enrollments — This represents all students who have ever registered at the university. This metric directly employs the measure without any adjustments, effectively acting as a “straightforward” metric.
metrics:
- name: enrollments_all_time
description: >
The count of enrollments in the history of the Unviersity
type: SIMPLE # Simple metrics use the measure as is
type_params:
measure: total_enrollments
label: "All Enrollments"

To get MetricFlow working on this metric, use the CLI command mf query --metric enrollments_all_time. MetricFlow will interpret this definition alongside the measure’s definition (outlined in the semantic model) to produce the following query:

SELECT COUNT(enrollment_id) as total_enrollments
FROM fact_class_enrollment

2. Biology Fall 2023 Enrollments — This counts the students who registered for Biology in the Fall 2023 term. A filter is applied to the initial measure for this metric.

metrics:
- name: biology_fall_2023_enrollments
description: "The count of enrollments in the Fall 2023 Biology course"
type: SIMPLE
type_params:
measure: total_enrollments
filter: |
{{ Dimension('course__course_name') }} = 'Introduction to Biology'
AND {{ Dimension('course__course_name') }} = 'Fall 2023'

Initiate MetricFlow for this metric using the command mf query --metric biology_fall_2023_enrollments. Note: An alternative way to achieve the same result is by using the initial metric with a “where” filter, like so: mf query --metric enrollments_all_time --where "{{ Dimension('course__course_name') }} = 'Introduction to Biology' AND {{ Dimension('course__course_name') }} = 'Fall 2023'. After reading the metric definition and its corresponding measure from the semantic model, MetricFlow will provide the required query.

SELECT COUNT(enrollment_id) as total_enrollments
FROM fact_class_enrollment
WHERE course_name = 'Introduction to Biology' AND term = 'Fall 2023'

While the mf query command lets you see a metric on your local CLI, it’s not particularly helpful for wider organizational use. Thankfully, dbt provides APIs that let you trigger these metrics externally, though the specifics are yet to be explored.

This API is pivotal to the Semantic Layer. Everything defined within it should be readily available to business users, regardless of their location. A Semantic Layer, aiming to serve the whole organization, isn’t truly complete without this accessibility. Whether someone is using a python script, a Jupyter notebook, or a BI tool, they need a way to invoke your metric and obtain the results. For instance, AtScale, another tool, allows for this functionality. Suppose I’m working in Python and require data from a certain metric:

import pandas as pd
from atscale import AtScale

semantic_model = AtScale(metric_information = 'enrollments_all_time')
df = semantic_model.get_data()
print(df)

# data gets printed

Someone set up the metric using the interface or syntax required by AtScale’s “semantic engine”. Once that’s done, I can access the metric whenever and wherever I need it.

With a Semantic Layer you’ve successfully built a level of abstraction on top of your data. No matter the changes made to the university’s data, if someone wants to know the total enrollments, they can simply refer to the enrollments_all_time metric. This allows them to further break down the enrollment data as desired.

Note: While numerous other aggregations, measures, and metrics can be developed, the best resource for understanding their implementation and syntax remains the official dbt documentation, which is top-notch.

Benefits of Implementing a Semantic Layer

  • Consistency and Accuracy: A well-defined Semantic Layer ensures that data retrieval and calculations are consistent and accurate across all reports and analyses. The business logic is centralized and validated in the semantic layer reducing the chance of discrepancies or errors. Everyone across an organization refers to data points in the same way.

When someone uses mf query --metric enrollments_all_time, they’ll receive the same result as anyone else using that command. If the definition is updated, everyone will see the same updated result.

  • Quick Access & Natural Language Queries: Users can quickly access the data they need with everyday language without wading through intricate database structures or writing complex queries.

You don’t need to delve into the tables and logic behind total enrollments unless you want to. You can quickly access the data and proceed with your analysis.

  • Scalability, Maintenance, & Adaptability: As the database grows or changes, having a semantic layer means you can make adjustments in one place, rather than updating multiple reports or queries individually.

When the definition of “enrollment” changes, you simply update the enrollments_all_time in the semantic model, and the adjustment is reflected throughout the organization.

Questions, Comments, Concerns, & Thoughts

  • Who is in charge of defining the metrics and overseeing the Semantic Layer?
  • Perhaps semantic modeling strengthens the bond between Analytics Engineers and Data Analysts. Once the AEs set up the marts in dbt, the DAs carry out their desired analytics. After that, they collaborate to determine if they can establish those analytics as standard metrics for the whole business. *Engaging both technical and non-technical stakeholders in the design and evolution of the Semantic Layer can lead to more comprehensive and user-friendly results.*
  • What’s the best approach for long-term and scalable maintenance?
  • Apart from MetricFlow, are there other Semantic Layer options available?
  • Regular audits of the Semantic Layer might be beneficial to ensure accuracy and relevance. How do you define accuracy and relevance and how do you ensure data quality? How would you audit it?

A3I Written (Accelerated & Augmented w/ Artificial Intelligence)

--

--