Why data teams need a semantic layer?

Blosher Brar
5 min readMay 22, 2024

--

Introduction:

In the 2010s cloud computing became a driving force that allowed organizations to re-imagine how they manage their data. Rather than viewing data teams as an expense to meet compliance and complete meet basic reporting requirements, data-driven organizations used data to optimize operations and gain a competitive advantage.

Image is from the altexsoft website

A modern data stack grew as a response to the technical possibility made available with cloud computing, and used by organizations to analyze data effectively by using cloud-based data warehouses like Snowflake, data integration tools like Stitch, and data visualization software like Power BI. In contrast, traditional data warehouses were based a monolithic architecture, did not integrate well with other software vendors, were unable to scale with increased data or compute requirements. Modern data warehouses are designed to be modular, integrate seamlessly with other tools, and have effectively limitless storage capacity and compute power.

One of the key components of modern data warehouses is the “democratization” of data. Rather than have a centralized data team curate and publish reports & dashboards, organizations have found immense value from creating self-serve reports & dashboards that users across an organization can query and access the data they need to make data-driven decisions.

Below is a typical architecture for a modern data stack using a medallion data warehouse architecture with snowflake and dbt:

Idealized modern data stack

Pain points with the Modern Data Stack (MDS):

However, many organizations have found that in trying to democratize their data, it has caused another set of data problems. Many end users find that the data they need is NOT in the correct format for a visualization or they need to make further calculations to get the results they need to create a report or dashboard.

Moreover, analysts may have different names for essentially the same data record. The marketing team, for example, may refer to a user as a “prospect,” while the sales team might call that same business a “client,” while the finance team calls this user entity a “counter party.” However, a machine learning model or analytics team might want to analyze all this data and relate the data back to a single user.

This often leads analysts to create their own metrics within their local environment or business intelligence tool as the following diagram demonstrates:

The realities of the modern data stack without a semantic layer

However, this can create major problems:

  • localized metrics have no oversight and can be inaccurate.
  • changes in data structure upstream can cause these metrics to break.
  • the work required to calculate metrics is duplicated.
  • metrics created in one BI tool cannot be integrated with other BI tools or used by the data scientist or data engineering team for automation.

What is a Semantic layer?

A semantic layer is a business representation of data and offers a unified and consolidated view of data across an organization. Its important to note that the semantic layer does not hold or store the actual data, it is a metadata and abstraction layer built on the source data

This abstraction over the data warehouse maps different data definitions from various data sources into a unified, consistent, and single view of data for analytics and other business purposes. Semantic layers create pre-defined views of processed data that abstract complexity and apply business-oriented definitions. Metrics like revenue or cost are defined.

Hence, the need for centralized metrics to give users a single source of truth and semantic layer. Organizations can codify their metrics in a centralized place and have confidence that they’re getting the same number and context around that number wherever they consume data.

Modern data stack with dbt’s semantic layer

Benefits of the dbt’s semantic layer include:

  • centralized metrics defined once and used across the organization ensuring accuracy, clarity, and molecularity.
  • metrics can be used in downstream applications including machine learning models, automation tools, reverse-etl tools, and even spreadsheets.
  • increased ease in calculating metrics (dbt uses YAML files).
  • automated documentation to provide metric definition and context.

When NOT to use a Semantic Layer:

Although the semantic layer can help data teams creating functional self-serving reports and dashboards, it is really only needed when the data infrastructure is relatively mature and there are lots of data users using multiple BI tools. The data warehouse used to build metrics must be relatively clean and formatted correctly into appropriate dimension and fact tables.

If you are a data engineer in a startup, building a semantic layer may not be the best use of your time since building a data infrastructure to support a semantic layer will require significant effort.

However, the semantic layer can be a great value add for organizations using dbt that want to standardize metric definitions. If you would like to delve deeper into dbt’s semantic layer, please check out their blog. In my next blog post, I will show how you can use the dbt semantics layer in practice in dbt-Cloud.

Conclusion:

In conclusion, a semantic layer is an innovative approach to solving the metric decentralization problem associated with self-serving reporting & dashboards. It provides a centralized control mechanism to ensure metrics calculations are consistent & accurate while allowing users the flexibility to use data to improve decision making in their day-to-day operations.

Hope you enjoyed reading my article and feel free to reach out to me on LinkedIn to connect and learn more about everything data & technology. Please follow me if you are interested in learning more about the modern data stack and associated tools such as dbt.

--

--