Demystifying Semantic Layer / Headless BI

Tak Watanabe
9 min readDec 17, 2023

--

This article was originally written by me in Japanese and published here.

What is this article about?

In 2023, behind the buzz around Large Language Models (LLMs), the data industry also has high hopes for semantic layers and headless BI. This article describes how these technologies could shape our future, based on the author’s personal interpretation and expectations.

What is the semantic layer?

Imagine a helpful assistant standing between a complex database and the people who need its insights. That’s essentially what the semantic layer is. Airbyte, a prominent data integration platform provider, characterizes the semantic layer as follows:

A semantic layer is a translation layer that sits between your data and your business users. The Semantic layer converts complex data into understandable business concepts.
Quoted from The Rise of the Semantic Layer, Airbyte

Traditionally, leveraging data in a database to gain insight has required users to understand the data schema and write precise SQL queries for each desired result.

However, it is unrealistic to expect all data users to have these technical skills, as data analysts and data engineers are no longer the only data users.

The semantic layer helps address this challenge. With the semantic layer, data consumers can get the data they need by simply specifying familiar internal terms, such as “order date” for the dimension* or “sales” for the measure*. This is possible because the semantic layer generates SQL queries based on the specified terms.

*Dimension: An attribute within an aggregation result, such as order date or customer name. In SQL, dimensions are specified with the GROUP BY clause to organize data into groups.

*Measure: An aggregated and usually numeric value within an aggregation result, such as total sales or profit. In SQL, measures are calculated using aggregation functions such as SUM, AVG, MIN, MAX, or COUNT.

In a technology context, the term “semantic” typically describes technologies that interpret the meaning of data or words and convert them into formats that are understood by integrated systems, as seen in “Semantic Search”. Similarly, in a semantic layer, the term “semantic” refers to the layer’s ability to interpret specified business terms and generate the appropriate SQL to interact with databases.

How does the Semantic Layer generate SQL queries?

The way in which the semantic layer generates SQL queries from the provided terms can be understood by knowing the information that the semantic layer holds.

Typically, a semantic layer is built by predefining information such as:

Basic information about the databases and tables

  • Information necessary for database connection (host name, credentials, etc.)
  • Names of tables and columns in the databases
  • Relationships between tables

Definitions for each business term (e.g., sales, date).

  • Correspondence with the columns and calculation definitions
  • Classifications of dimensions and measures for each term

Once the semantic layer is built, data users can aggregate data according to three conditions: dimension, measure, and filter.

Let’s say you specify the following conditions:

  • Dimension: Order date
  • Measure: Sales, Number of Customers
  • Filter: Year 2023

Based on this condition, the semantic layer would execute the following SQL query against the database:

SELECT 
order_date AS 'Order Date',
SUM(sales) AS 'Sales',
COUNT(DISTINCT customer_id) AS 'Number of Customers'
FROM orders
WHERE YEAR(order_date) = 2023
GROUP BY order_date;

The data obtained by the above SQL is delivered to the data consumer, who can then visualize and analyze the data. In this way, the semantic layer achieves data aggregation by the consumer without the need to write SQL directly.

The Semantic Layer: A Familiar Technology

The semantic layer is not a new concept. This technology, which makes it easier for business users to retrieve data without having to write SQL directly, has been present in the BI/DW domain since it was patented by SAP in 1991.

It is also an integral part of our popular BI products. You may know that when you drag and drop fields (or columns) such as “Sales” and “Order Date” in BI tools, many SQL queries are issued to the databases. This is usually made possible by a semantic layer embedded in the BI tool.

A Challenge with the Traditional Semantic Layer

One challenge with semantic layers in traditional BI tools, however, is that their scope is often quite limited.

In particular, in the “self-service BI” paradigm that has been popular since the 2010s, the semantic layer is often defined independently by BI users within each workbook or dashboard.

While this paradigm has undoubtedly helped many organizations make better use of data, it has also created problems, such as different interpretations of business logic, resulting in different numbers for the same metric. For example, one department’s dashboard might include returns in “sales,” while another department’s dashboard might not.

In addition, data analysis tools other than BI may be connected to the data warehouse. For example, someone might perform advanced analysis from Jupyter notebooks, or someone might create embedded reports using various APIs.

In such a situation, the definition of the business logic for each business term can become dispersed. Definitions can become even more dispersed as more data is used within an organization.

The Universal Semantic Layer and Headless BI

To address this challenge, attention has begun to focus on the Universal Semantic Layer.

The universal semantic layer can be connected by many services, including BI tools, notebooks, and custom applications, to retrieve data under a consistent and managed semantic model.

The leading service that provides a universal semantic layer is Cube. Cube provides connectivity to a wide range of applications and is compatible with various data warehouses.

Image sourced from: https://github.com/cube-js/cube

Cube provides a single semantic layer (center of the image) to which a wide variety of applications (BI tools, embedded analytics, notebooks, and other applications) can connect. It also supports connections to various data warehouses such as Snowflake and BigQuery.

Because the definition of the semantic layer is centralized, this provides a mechanism that gives each data user confidence in the definition of the data being displayed, while still allowing them to use the BI tools or other services they are familiar with.

In addition to managing logic, Cube provides services that control access to data, accelerate performance, and support a wide range of APIs, including SQL, REST, and GraphQL.

These services, such as Cube, which address the limitations of logic management in traditional BI tools while providing additional functionality that is valuable to data users, are also referred to as “ Headless BI”.

Headless means having no front end, and in the context of data, it means having no graph or dashboard visualization capabilities. From this point forward, this article will consistently refer to Headless BI, including the universal semantic layer.

Benefits of Headless BI

By implementing Headless BI, organizations can expect to gain several key benefits, including:

1. Increase confidence in data
2. Improve performance and reduce data warehouse costs
3. Accelerate data democratization

1. Increase confidence in data

The primary benefit is improved data reliability through integrated management of business logic.

The consolidation of logic in Headless BI is expected to reduce the dispersion of business logic and streamline version control and modification for it.

Although existing ETL tools can manage the definitions of tables and views in a data warehouse, they have limited control over the SQL logic that is issued to those tables and views. With Headless BI, you can also control the logic of queries issued from many applications.

2. Improve performance and reduce data warehouse costs

Practical benefits include improved query performance.

For example, Cube provides two methods for query acceleration: in-memory caching (request cache) and pre-aggregation data (pre-aggregation store).

Image from: https://cube.dev/docs/product/caching

Pre-aggregated data (Pre-aggregation store) enables the materialization of calculation results on commonly used dimensions and measures within Cube or your data warehouse.

This feature is expected to be a cost-saving measure for data warehouses because it can reduce the number of times a query is made to the data warehouse and improve the performance of individual queries.

3. Accelerate data democratization

The use of headless BI will also contribute significantly to the acceleration of data democratization.

Understanding the data model, knowing SQL, and building the semantic layer with self-service BI tools requires a certain level of knowledge and technical skills.

With Headless BI, however, data users can leverage a pre-built semantic layer that allows them to visualize and analyze aggregated data using only familiar terms without acquiring such technical skills.

As a result, the technical barriers to data use are lower than before, and it is expected that more people will be able to use the data.

Integration with LLM

Integration of Headless BI and LLM is another area of focus.

In 2023, leading BI tool vendors have announced LLM integration capabilities and initiatives.

These LLM integration features have enabled users to perform not only data aggregation queries such as “How much was the sales yesterday?” or “How much was the sales for each customer in the APAC region last month?” but also more complex analysis queries such as “Why are the sales lower this month than last month?” or “What are the characteristics of customers who are likely to churn?

But even though SQL queries can be easily written using natural language, ensuring high data quality of the output is still critical. Headless BI will definitely help us trust the data displayed by LLM applications.

What types of organizations are best suited for Headless BI?

Let’s explore the types of organizations that will benefit the most from adopting Headless BI.

I believe that organizations with the following characteristics will benefit significantly from adopting Headless BI:

  • Data-driven decision making is widespread throughout the organization.
  • Data users have different levels of data literacy.
  • Diverse applications connect to the data warehouse

The Kimball Group, founded by BI/DW maestro Ralph Kimball, describes the only scenario in which building a semantic layer could be considered “unnecessary”:

There’s only one scenario where I might buy the argument that you can get away without a semantic layer. If the doors to your DW/BI system are closed to all ad hoc users, and all access is mediated by professional report developers, you can make it work without a BI semantic layer.
Quote from:
Design Tip #158 Making Sense of Semantic Layer

For those who want to learn more about Headless BI and Semantic Layer

There are wonderful Headless BI articles from people around the world that I referenced in writing this article.

The Rise of Semantic Layer: Metrics On-The-Fly
This is a great article from Airbyte that explains Semantic Layer. It covers comprehensive topics on the semantic layer, from the history of the semantic layer to how it differs from OLAP cubes and data catalogs.

Headless BI vs Self-Service BI
Using Power BI’s shared datasets as an example, this article explains that headless BI is not always necessary if you leverage the existing capabilities of self-service BI; the same can be said for Tableau’s published data sources, etc. If BI tools are not in disarray in your organization, this article is worth a look before implementing headless BI.

--

--