Data modelling and the semantic layer

Lewis Charles Baker · Rittman Analytics Blog · May 23, 2022

Over the last decade the analytics space has been restructured: the large, rigid BI suites have been broken down into the collection of tools we now call the modern data stack, which makes cloud data warehousing, replication, transformation and business intelligence easy to access and use. These new tools make engineering technical solutions cheap, approachable and straightforward. With a robust process and well-defined business questions, much of the complexity around engineering has now been mitigated. However, something does appear to have been left behind, and that’s data modelling and its relation to the semantics of a warehouse.

Data modelling and its implicit semantics are among the most important factors when building out an analytics function. A decent data model makes querying effortless and data discoverable, and allows an engineer to structure and control how data is consumed. A poor data model is terrible to query and impossible to navigate. Furthermore, data modelling is time-consuming, requires experience and is something of a dark art. These compounding factors mean it can be the weak link within many analytics functions. So why, despite data modelling being so important, is it lagging behind? It’s this question we’ll look to answer.

The problem

Data modelling and the semantic layer appear to have two problems, the theoretical and the technical. The theoretical problem is that mainstream data modelling hasn’t really progressed since dimensional modelling emerged.¹ The technical problem is that a semantic layer is coupled with a data model and is one of the only areas of the analytics workflow that doesn’t sit squarely within a tool or platform. Function normally follows form, so it could be that the lack of theoretical development within orthodox data modelling is limiting the development of appropriate tooling for the semantics of a warehouse.

I would argue that these considerations make data modelling one of the best candidates for future development and present a great problem to be solved. To approach that problem, we’ll take the theoretical and the technical in turn. For the theoretical, we’ll explore ontologies, taxonomies and then linguistics. Analytics engineers might question why philosophers like Aristotle, Heidegger and Wittgenstein are drawn upon here. I have two justifications:

  1. We’re looking to further a discipline, and the western philosophical tradition is a rich one, so there are likely lessons to be learnt from it
  2. Data modelling deals with the metaphysical, as does philosophy

For the technical, we’ll be considering an open-source engineering toolkit called droughty. It uses the theoretical arguments explored throughout this article as a grounding for its development. It aims to unify data modelling and the semantics of a warehouse.

The theoretical

Dimensional modelling

Dimensional modelling is still the most popular methodology in warehouse data modelling, often now reduced to ‘Kimball Lite’ to fit the modern data warehouse. The concept of facts and dimensions emerged in the 1960s, and Ralph Kimball’s ‘The Data Warehouse Toolkit’, the reference most now turn to, was released in 1996. The book echoes Fukuyama in that nothing has really come along to replace it since. When held against the developments within the rest of analytics, this seems unusual and raises the question: why isn’t there an alternative?

Ontologies and Taxonomies

It’s not that an alternative hasn’t been sought. Some within the industry have turned to ontologies and taxonomies as candidate subject areas for advancing data modelling.² Could we take the pinball machine of decision-making that is data modelling and make it more trainable? Could data modelling be productised using these subject areas? Could Aristotle help me do this? Perhaps.

Ontologies in data modelling

Aristotle seems to have contributed to almost everything; you could even argue data modelling. In Categories he set out his ten categories of existing things: substance, quantity, quality, relation, place, time, position, doing, having, and being affected. By defining these ten categories, along with his notions of primary and predicate substance and final causes, he actually provided a pretty good framework for classifying substances within a data model.

How can we use Aristotle’s categories to solve issues in data modelling?

If you take Aristotle’s ten categories, map out entities against them and then define the hierarchical relationships, you end up with something quite interesting and workable. It can also be useful to frame analytic questions as final causes. What I’m going to explore below is how Aristotle’s categories can be useful when approaching a new set of enquiries and new entities.

The example above shows a web analytics model broken down into a set of primary and predicate substances. This approach is useful for identifying the grain of tables and their cardinality in relation to other tables. The primary entity of a table should be the fixed grain of that table. Any predicate properties of that primary substance should then be placed within that table. An example of this is the event region and the session first_region. Both are regions but both have different primary entities that they belong to. This simple mapping exercise can be really useful when building out conceptual and logical data models but is less applicable to the physical.
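
To make this concrete, here is a minimal sketch in Python of what such a mapping might look like. The entity and column names are hypothetical, chosen only to mirror the web analytics example; the point is that the primary substance fixes the grain of each table, while predicate properties are placed with the substance they belong to.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: a hypothetical mapping of web-analytics entities
# to primary and predicate substances. The primary substance fixes the grain
# of the table; predicates are properties that belong to that substance,
# each tagged with the Aristotelian category it falls under.

@dataclass
class Entity:
    primary_substance: str                           # the fixed grain of the table
    predicates: dict = field(default_factory=dict)   # property -> category

events = Entity(
    primary_substance="event",
    predicates={
        "event_timestamp": "time",
        "region": "place",          # the place at which the event occurred
        "event_type": "doing",
    },
)

sessions = Entity(
    primary_substance="session",
    predicates={
        "first_region": "place",    # the same 'region' concept, predicated of a session
        "session_duration": "quantity",
    },
)

# 'region' and 'first_region' are both places, but they belong to different
# primary substances, so they live in different tables at different grains.
```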

The ambitions of ontologies

With this as an approach, you could even posit that the need for data marts disappears and that entities and their relationships could perhaps be scaled infinitely. If we apply Heidegger’s concept of Entschlossenheit, we have an exciting proposition: an ontological mapping of entities and their relationships with others that discloses facts about the world to us.

The proposition above seems attractive. Could we build an objective data model that systematically maps out entities against predefined categories, recording primary and predicate substances and their relative hierarchies? It certainly lends itself to the engineer’s paradigm.

Wittgenstein and Ordinary Language

What I’m going to explore below is the possibility that ontologies and taxonomies are perhaps not the best approaches for data modelling beyond the conceptual and logical level. The reason for this is that a physical model is really a matter of linguistics.

Ontology relates to being. Primary substance. Sein. Being is not an abstraction of entities or of records of entities. It is my argument that data sits more closely within the subject of linguistics than within that of ontology. Wittgenstein provides a useful framework within which to make this argument.

Let’s think about the nature of data and the purpose of recording it. It tells us that something happened, when it happened, and perhaps how it happened. It’s a semantic entry that is recorded, stored and, in turn, conveys meaning. We then use that semantic entry as sense data, perhaps to make decisions. Data that has been recorded becomes useful when it sits within an operative context; Wittgenstein would argue the same of language.

Language Games. Are ontologies/taxonomies and linguistics mutually exclusive?

Wittgenstein puts forward the concept of language games. He argues that language is neither diachronic nor synchronic, and that it’s important to recognise that the meaning of a word is not a mental process or a referent that accompanies an utterance. Instead, he argues that the meaning of a word is its contribution to a sentence: to be precise, the sentence’s use and what the word does in the pragmatic context of human activity. This context of activity is what Wittgenstein calls a language game, a context of activity that contributes to a form of life.

“To understand a sentence means to understand a language and to understand a language means to be a master of a technique”

What this means in the context of data modelling is that the activity and discourse in which data is used define its meaning. The business that records it, the analyst that reads it and the action that it leads to all represent and mutate the meaning of the data. This casts doubt on the notion that a universally objective data model could ever exist. For example, revenue can mean one thing to sales and something else to finance. A company could be a customer or a provider. A customer is still a person, but depending upon the language game it is used within, the mapping of that entity is different. This is why data modelling is so time-consuming: the data model is a conceptual, logical and physical representation of a language game played within that business.
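
As a rough illustration (the table and column names below are invented for the example), the same underlying orders data can carry two perfectly valid but different definitions of revenue, one per language game:

```python
# Hedged illustration: the same hypothetical orders table read through two
# different "language games". Neither definition is 'the' objective revenue;
# each is correct within the discourse that uses it.

revenue_definitions = {
    # Sales counts revenue when the deal is signed.
    "sales.revenue": """
        select sum(contract_value)
        from orders
        where status = 'signed'
    """,
    # Finance only counts revenue once it has been invoiced and recognised.
    "finance.revenue": """
        select sum(recognised_amount)
        from orders
        where invoice_status = 'recognised'
    """,
}
```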

Data marts and discourses

Another interesting parallel between Wittgenstein and the discipline of data modelling is the concept of discourses. A word can mean a number of different things depending on the discourse in which it is referenced; language mutates depending on the setting, such as school, work or socialising with friends. Data marts play the same role within a warehouse: each mart scopes entities and measures to the discourse of the team that consumes them, so the same word can legitimately resolve to different definitions in different marts.

The technical

How can we use an understanding of linguistics to solve issues in data modelling?

By using linguistics as a framework, what can we automate and infer? Quite a lot, as it turns out, using the semantics already present in a warehouse. droughty uses properties of stored data and defined naming conventions to infer facticity, relationships, measures and qualitative information about data. Rather than trying to ontologically map entities, it relies on signs and semiotic structures to infer connections and properties. It then parses those inferences into a number of different upstream and downstream syntaxes, using grammar to extend semantic properties that are a byproduct of the analytics workflow. What we will explore now are the capabilities this provides.
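
As a rough sketch of the idea, and not droughty’s actual implementation (the suffix conventions below are assumptions for illustration; the real conventions live in droughty’s docs), convention-based inference might look something like this:

```python
# Minimal sketch of convention-based inference from column names and types.
# The _pk / _fk suffixes and BigQuery-style type names are assumptions made
# for this example, not droughty's actual grammar.

def infer_semantics(columns: dict[str, str]) -> dict:
    """Infer keys, relationships and measures from column names and types."""
    inferred = {"primary_keys": [], "foreign_keys": [], "measures": []}
    for name, dtype in columns.items():
        if name.endswith("_pk"):
            inferred["primary_keys"].append(name)
        elif name.endswith("_fk"):
            # A foreign key is a sign pointing at another entity's primary key.
            inferred["foreign_keys"].append(
                {"column": name, "references": name.replace("_fk", "_pk")}
            )
        elif dtype in ("INT64", "NUMERIC", "FLOAT64"):
            # Numeric columns are candidates for derived measures.
            inferred["measures"].append({"column": name, "aggregation": "sum"})
    return inferred

# Example: a sessions table described only by its column names and types.
print(infer_semantics({
    "session_pk": "STRING",
    "user_fk": "STRING",
    "first_region": "STRING",
    "session_duration": "INT64",
}))
```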

Technical capabilities from semantics

  • Referential inference: droughty uses grammar to detect a referent. This grammar allows droughty to identify primary-key and foreign-key relationships; it can then extend this to identify the cardinality of those relationships. This allows droughty to infer relationships between primary and predicate substances, which documentation, business intelligence tools or the metric layer can then use to infer joins.
  • Qualitative properties: metadata is parsed to pass qualitative properties from the warehouse, creating tests for the warehouse or information for a semantic layer.
  • Quantitative properties: metadata is parsed to identify quantitative properties, deriving measures for downstream consumption.

As droughty is a semantic toolkit, it can parse the outputs above for consumption in a number of tools such as dbt, dbdocs, Looker and Cube.js.
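
To show the general shape of pushing inferred semantics downstream (and not the exact files droughty emits), an inferred primary key could, for instance, become a dbt-style schema entry with uniqueness and not-null tests:

```python
import yaml  # pyyaml

# Hedged sketch: rendering the inferred properties above as a dbt-style
# schema entry. Model and column names are the hypothetical ones from the
# earlier example; the output format is illustrative only.

schema = {
    "version": 2,
    "models": [{
        "name": "sessions",
        "columns": [
            {"name": "session_pk", "tests": ["unique", "not_null"]},
            {"name": "user_fk", "tests": ["not_null"]},
        ],
    }],
}

print(yaml.dump(schema, sort_keys=False))
```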

Conclusion

The purpose of this article was to establish and explore the problem around data modelling and the semantic layer. It aimed to offer theoretical and technical approaches that could advance these underdeveloped areas of modern analytics. It argues that we do not need to discard dimensional modelling, but rather go beyond it towards a more mature theoretical framework for data modelling. By applying ontologies and Wittgensteinian linguistics to the subject of data modelling, it turns out we can build conceptual frameworks and tooling that unify data modelling and the semantics of a warehouse, improving the analytics workflow. For those who are interested in seeing this in action, please refer to https://droughty.readthedocs.io/en/latest/ or https://github.com/LewisCharlesBaker/droughty.

What’s next

This article has explored data modelling within the modern data stack, focusing on theoretical frameworks we could perhaps use to advance it. We also took a brief look at how droughty as a tool aims to use these frameworks to make modelling within data warehouses better.

Subsequent articles will focus on practical use-cases of semantic modelling and using droughty to improve the analytics workflow.

¹ I would like to caveat that my statements in this article apply to mainstream data modelling; if you dig deep enough, there are many sources on ontological and semantic data modelling, e.g. https://www.ontotext.com/blog/knowledge-graph-with-semantic-data-modeling/ and https://www.actian.com/semantic-data-model/

² https://www.w3.org/TR/owl-features/, https://futuremodel.io/
