Yet Another Data Modeling Approach for the Data Lakehouse

Carlos Costa
12 min read · Jul 22, 2023

A few years ago, I embarked on my PhD journey. At that time, when it came to my professional career, I did not know much other than that I wanted to work in the data field, and academia sounded like a nice bet. I loved it! It was the fairly early days of widespread Big Data adoption, although Hadoop had been around for several years and Big Data engines like Spark had popped up a couple of years before. It seemed to me, at that time, that the “art” of Data Modeling had vanished in favor of the gigantic amounts of structured, semi-structured and unstructured data swimming around in our precious Data Lakes. Data Warehouses were around, of course, as they still are today, but Data Lakes were the new kid on the block… the new shiny toy. In there, things could go crazy pretty fast, and my PhD work was mostly focused on the concept of “Big Data Warehouses”, namely how we could bring together the best of both worlds.

Right now, I sit on my porch watching and experiencing the rise of the Data Lakehouse, and I’m thinking: “it is kind of what I proposed back then… and it’s so cool to see it happening”. But even today, I think we are all still trying to figure it out…

Should we use dimensional modeling in the Lakehouse [1]? Should we use flat (fully denormalized) tables as if there were no tomorrow? Should we use something else?

Today, I want to attempt to further spread what I feel can be that “something else”, the fruit of part of the work I proposed in my PhD thesis [2a] and other research works [2b]. By making it much leaner in this post and updating it to fit the Lakehouse world, I believe it can be very useful for practitioners to consider when modeling data in the Lakehouse… well, or at least to generate some ideas and discussions within the data community ☺

Without further ado, in a nutshell, this is what it looks like…

The Concept of Analytical Object

An Analytical Object represents any business entity with analytical potential. These can include events generated in a business process - AKA transactional data (e.g., a sale, a purchase, a return, a complaint, a review, a watch, a like…).

An analytical object is seen as a self-contained and fully denormalized object, composed of 3 types of attributes:

  • Descriptive (your dimensions).
  • Factual (your facts/metrics).
  • Predictive (Machine Learning [ML] generated metrics - predictions or recommendations).

These attributes can be of any type supported by the Lakehouse storage technology of choice (e.g., Delta, Hudi, Iceberg…), and they can be either simple types (e.g., integer, string, float, …) or nested types (e.g., array, struct, …).

Just a note: don’t overdo it with nested structures. They are not meant to store a huge number of records inside them as you would with a fact table… play it conservative there.
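To make the composition a bit more tangible, here is a minimal sketch, assuming PySpark with Delta Lake and an active `spark` session; the “sales” object and all table and column names are hypothetical, not a prescribed layout.

```python
from pyspark.sql.types import (
    ArrayType, DateType, DoubleType, IntegerType,
    StringType, StructField, StructType,
)

# Hypothetical "sales" Analytical Object: a single, self-contained, fully
# denormalized table. All names are illustrative.
sales_schema = StructType([
    # Granularity Key: uniquely identifies each record (a sale line).
    StructField("sale_id", StringType(), nullable=False),
    # Descriptive Attributes (your dimensions), denormalized into the object.
    StructField("sale_date", DateType()),
    StructField("customer_id", StringType()),
    StructField("customer_name", StringType()),
    StructField("product_id", StringType()),
    StructField("store_city", StringType()),
    # Nested Descriptive Attribute: keep it small, never fact-table sized.
    StructField("applied_promotions", ArrayType(StringType())),
    # Factual Attributes (your facts/metrics).
    StructField("quantity", IntegerType()),
    StructField("sales_amount", DoubleType()),
    # Predictive Attribute (ML-generated metric).
    StructField("return_probability", DoubleType()),
])

# The physical layout (partitioning, clustering) is applied at write time,
# as sketched further below in the section on keys.
sales_df = spark.createDataFrame([], sales_schema)
```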

Analytical Objects’ Attributes can be arranged into Families. This is a logical concept, as the storage technology does not need to physically implement them. Descriptive Attributes can be grouped into Descriptive Families, and Factual and Predictive Attributes can be grouped into Analytical Families. Families help us group the attributes logically, as if they belonged to specific contexts involved within the object (e.g., all attributes of the store, the customer, the warehouse…). However, families are not mandatory: if they generate more confusion in your context than they help, just think of Analytical Objects and their attributes and skip families completely.

Moreover, there are 3 types of keys that might play a role when modeling an Analytical Object:

  • Granularity Key - let’s not make this fancier than it needs to be: it’s just a logical primary key :P that is, it uniquely identifies a record in a table. It’s named Granularity Key here because Primary Keys were not a focus in Data Lakes at all, so to avoid confusing readers we gave it another name, one more indicative of the records’ grain. However, Primary Keys might somehow come back into fashion in the Lakehouse [3].
  • Partition Key - it represents the attribute(s) chosen to implement table partitioning.
  • Clustering Key - it represents the column(s) chosen to implement table bucketing, or other interesting storage-specific clustering techniques like Z-ordering [4] or Liquid Clustering [5] (a sketch of how these keys can materialize follows this list).
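Below is a hedged sketch of how the three keys could materialize for the hypothetical “sales” object defined earlier. Note that `OPTIMIZE … ZORDER BY` is Delta-specific SQL, and bucketing or Liquid Clustering would be configured differently; adapt to your storage technology of choice.

```python
# Physical layout of the hypothetical "sales" object, driven by the Partition
# and Clustering Keys (assumes "sales_df" from the earlier sketch and an
# active SparkSession "spark" with Delta Lake available).
(sales_df
    .write.format("delta")
    .partitionBy("sale_date")   # Partition Key: coarse-grained pruning by date
    .mode("overwrite")
    .saveAsTable("lakehouse.sales"))

# Clustering Key: co-locate data on frequently filtered/joined columns.
# OPTIMIZE ... ZORDER BY is Delta-specific SQL; other engines expose bucketing
# or Liquid Clustering through different syntax.
spark.sql("OPTIMIZE lakehouse.sales ZORDER BY (customer_id, product_id)")

# The Granularity Key (sale_id) stays a logical construct: nothing needs to be
# physically enforced, although some engines now support declaring
# informational primary keys [3].
```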

The Concept of Complementary Analytical Object

The Complementary Analytical Object can emerge from two scenarios:

  • The modeling of a core participant in several events generated across several business processes, i.e., something crucial that provides context to multiple business processes - AKA master data (e.g., a user/customer, a product, a warehouse, a store, a supplier, an external regulatory body, …).
  • Just an additional logical modeling label put on an Analytical Object which happens to be interesting to join with another Analytical Object, in a scenario we typically referred to in the Dimensional World as a “Fact-to-Fact join”. This might be interesting, for example, when two Analytical Objects share some Descriptive Attributes which allow them to be joined at their current, rawest grain (to be avoided whenever possible) or after aggregating them separately to a common grain (preferable; see the sketch after this list). A perfect example of such a scenario might be an Analytical Object “sales” and an Analytical Object “returns”. The “sales” object becomes a Complementary Analytical Object in our data model, as the sales_id (Granularity Key of the object) is also present in the Descriptive Attributes of the “returns” object. Does this mean they always need to be queried together? No, but it means that in certain queries “sales” can complement “returns”, hence the concept of Complementary Analytical Object.
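As a rough illustration of the “aggregate to a common grain, then join” guidance, here is a hedged PySpark sketch. The “returns” object and its columns (return_date, refund_amount), as well as the choice of product + day as the common grain, are assumptions for the example.

```python
from pyspark.sql import functions as F

sales = spark.table("lakehouse.sales")
returns = spark.table("lakehouse.returns")   # hypothetical "returns" object

# Aggregate each Analytical Object separately to a common grain (product + day)...
daily_sales = (sales
    .groupBy("product_id", F.col("sale_date").alias("event_date"))
    .agg(F.sum("sales_amount").alias("sales_amount"),
         F.sum("quantity").alias("units_sold")))

daily_returns = (returns
    .groupBy("product_id", F.col("return_date").alias("event_date"))
    .agg(F.sum("refund_amount").alias("refund_amount"),
         F.sum("quantity").alias("units_returned")))

# ...and only then join them, rather than joining at each object's rawest grain.
sales_vs_returns = daily_sales.join(
    daily_returns, on=["product_id", "event_date"], how="left")
```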

At this point you might be wondering why we are treating both transactional and master data with the same type of modeling object (the Analytical Object), and giving some of them, in the special circumstances mentioned above, the title of Complementary Analytical Object. Long story short, this concept is just a highlight that an Analytical Object might be complementing another one. It behaves, and has exactly the same composition, as a regular Analytical Object, but its Granularity Key is present in other Analytical Objects that are complemented by it. Specific examples are:

  • “Customer” object complements “Sales”, “Returns”, “Views”...
  • “Store” object complements “Sales”, “Inventory”
  • “Product” object complements “Sales”, “Inventory”, “Reviews”, “Returns”…

“But… please explain to me why it is not just named a Conformed Dimension?”

Well, I could, and believe me, in my PhD thesis I went through all of that, but I believe Maxime Beauchemin recently did it much better than I could [6] ;) His take on Entity-Centric Modeling, especially the focus on storing aggregated metrics and more nested analytical structures (e.g., arrays with metrics) inside Conformed Dimensions, is one of the biggest motivating factors for Complementary Analytical Objects. Kimball also briefly mentions the concept of “aggregated facts as dimension attributes” in his book [7]. It’s curious how the ideas have converged after a while in this Big Data world :)

So, long story short… at the time, we named these objects Complementary Analytical Objects because, when they represent master data (or a conformed dimension), they not only provide descriptive context to other objects in the data model, but also have analytical potential when analyzed individually, without the need to join them with other objects.

Therefore, for these cases, creating analytical attributes representing aggregated metrics from the several other Analytical Objects they complement can be an awesome idea. For example, why not have analytical attributes that tell you “the average sales per day for a customer”, or “whether the customer is predicted to churn or not”, or the “last 10 posts read by a user”? This strategy is purely optional, but it can provide a simpler and really powerful metrics layer for the business, without having to create many small aggregated and materialized Analytical Objects flying around everywhere (a sketch of this enrichment follows below). There goes the idea, and we should pick our poison given the particularities of the situation in front of us :)
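To ground this idea, here is a hedged sketch of enriching a hypothetical “customer” Complementary Analytical Object with aggregated and nested metrics derived from the “sales” object it complements; all table and column names are assumptions for the example.

```python
from pyspark.sql import functions as F

sales = spark.table("lakehouse.sales")
customers = spark.table("lakehouse.customer")   # hypothetical "customer" object

# Aggregated Factual Attributes derived from the "sales" object that the
# "customer" object complements.
customer_metrics = (sales
    .groupBy("customer_id")
    .agg(
        F.sum("sales_amount").alias("lifetime_sales_amount"),
        (F.sum("sales_amount") / F.countDistinct("sale_date"))
            .alias("avg_sales_per_active_day"),
        # Nested analytical structure: a small array with the 10 most recent purchases.
        F.slice(
            F.sort_array(
                F.collect_list(F.struct("sale_date", "product_id", "sales_amount")),
                asc=False),
            1, 10).alias("last_10_purchases"),
    ))

# Persist the enriched Complementary Analytical Object.
(customers
    .join(customer_metrics, on="customer_id", how="left")
    .write.format("delta").mode("overwrite")
    .saveAsTable("lakehouse.customer_enriched"))
```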

“When should I denormalize descriptive attributes inside Analytical Objects vs. creating Complementary Analytical Objects that normalize those attributes?”

Check these criteria when creating a Complementary Analytical Object:

  1. The attributes associated with that object are used in more than one Analytical Object.
  2. and those attributes will form a low-cardinality Complementary Analytical Object (generally and preferably no more than a few tens of millions of records).
  3. and this Complementary Analytical Object being created can provide analytical value on its own, without the need for it to be frequently joined with other Analytical Objects (refer to the above comments on storing metrics inside these objects).

If these criteria are not met, stick with denormalization for that case: keep denormalized Descriptive Attributes inside the Analytical Objects for faster performance.

The Other Objects in the Framework

Apart from Analytical Objects and Complementary Analytical Objects, the remaining ones don’t require as much explanation. The Date Object is our typical standardized Date Dimension table (information down to the “day” level of the time hierarchy), as it contains date-related attributes that make sense for the specific business context. The Time Object is similar to the Date Object, but it contains more granular time information, like hours, minutes and seconds. The reason they are separated is to avoid the explosion of date/time records (10 years × 365 days × 24 hours × 60 minutes = 5,256,000 records), which are constantly joined with records of other objects. Finally, the Spatial Object is a type of object that simply represents space-related information (e.g., cities, countries…). For objects of this type, avoid going too granular (e.g., including latitude and longitude): to ensure they can be efficiently joined with other Analytical Objects, they cannot span hundreds of millions of records, and one is better off keeping such granular information as regular Descriptive Attributes in the Analytical Objects.
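For completeness, here is a minimal sketch of how such a Date Object could be generated with Spark SQL; the date range, table name and attribute list are illustrative choices, not part of the framework itself.

```python
# Minimal sketch of a Date Object: one row per day, with descriptive date
# attributes only (hours/minutes/seconds would live in a separate Time Object).
date_object = spark.sql("""
    SELECT
        d                      AS date_key,      -- Granularity Key
        year(d)                AS year,
        quarter(d)             AS quarter,
        month(d)               AS month,
        date_format(d, 'MMMM') AS month_name,
        dayofweek(d)           AS day_of_week,
        date_format(d, 'EEEE') AS day_name,
        weekofyear(d)          AS week_of_year
    FROM (
        SELECT explode(sequence(to_date('2015-01-01'),
                                to_date('2024-12-31'),
                                interval 1 day)) AS d
    ) AS days
""")

date_object.write.format("delta").mode("overwrite").saveAsTable("lakehouse.date")
```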

Are these 3 types of objects (Date, Time, and Spatial) mandatory?

No. They are completely optional, and if one finds that significant performance boosts come from including Date, Time or Spatial information as regular Descriptive Attributes inside an Analytical Object (e.g., as timestamps, strings...), then that direction is recommended. Specific time-related attributes, for example, can always be derived at query runtime from those Descriptive Attributes in a performant manner (e.g., name of the month, extended name of the day, …).
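As a quick illustration, assuming a timestamp Descriptive Attribute inside the Analytical Object (a hypothetical `sale_ts` column), such attributes can be derived on the fly:

```python
from pyspark.sql import functions as F

# If the Date/Time objects are skipped, date parts can be derived at query time
# from a timestamp Descriptive Attribute (hypothetical "sale_ts" column).
sales = spark.table("lakehouse.sales")

monthly_sales = (sales
    .withColumn("month_name", F.date_format("sale_ts", "MMMM"))
    .withColumn("day_name", F.date_format("sale_ts", "EEEE"))
    .groupBy("month_name")
    .agg(F.sum("sales_amount").alias("sales_amount")))
```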

What can I do to not treat them as just regular Dimension-like tables?

Use the constructs and ideas discussed above for Complementary Analytical Objects ;) namely the idea of storing aggregated metrics or more nested analytical structures (e.g., arrays of metrics) inside these objects.

Joining the Objects

When the data model has several objects (analytical, complementary analytical, date, time, and spatial), it’s natural that the need for join operations may arise. If that happens, just adequately leverage the Granularity Keys in the join condition (which can even be stored as metadata in some cases and leveraged for query optimization [3]), and preferably aggregate the objects to a common grain first and then join them.

However, since this modeling strategy heavily relies on denormalization, the need for joins is significantly lower than in the dimensional modeling world, which overall may contribute to faster query execution times. I know how unsupported this claim might seem, and believe me, I have several benchmarks in my PhD thesis corroborating it for several scenarios, but results vary depending on the chosen technologies, optimizations, sizes of dimension and fact tables, etc. So it’s basically impossible to state definitively that flat tables are more performant than dimensional models, or vice versa. However, generally speaking, out of practice and previous research benchmarks, flat tables are perceived as a good way to achieve performance in Data Lakes/Lakehouses.

Example of a Data Model using the Framework

To illustrate the result of applying the modeling framework, the figure below shows the quite common fictional AdventureWorks example, in this case modeled using the concept of Analytical Objects.

Figure: the AdventureWorks sample modeled with Analytical Objects. Source: Adapted from [2a].

Considering the topics discussed above, the key takeaways of the example are as follows:

  1. We have the equivalent of 6 stars condensed into just 6 tables, in this case 6 Analytical Objects. This produces a model that is simple to understand and, generally, a data model that can answer big data queries faster.
  2. We have the 4 optional objects in use: date, time, territory (spatial), and city (spatial). They could have been left out of the model, as explained above.
  3. We have 3 Complementary Analytical Objects, resembling conformed dimensions that are key in the modeled business context. We may ask why “employee” is not a Complementary Analytical Object complementing “employee_history”. In this particular business context, “employee” was not considered an entity that would be reused heavily in many analytical contexts by many other Analytical Objects, hence the employee information was fully denormalized into “employee_history”. However, in other contexts, given different analytical requirements involving employee information, “employee” could have been a perfectly valid Complementary Analytical Object.
  4. We see in each Complementary Analytical Object some analytical attributes that perfectly illustrate our “whenever suitable, put facts inside dimensions” guideline. In fact, the “vendor” object even has a predictive analytical attribute, based on the output of a Machine Learning model predicting the probability of delay associated with each vendor.
  5. We see some of the analytical attributes being nested structures, which is another relevant guideline in the framework, whenever applicable to the context. For example, “product” has an array of structs containing the history of the previous costs associated with the product. With this approach, when analyzing a specific product, the analyst has immediate and easy access to the product’s cost history (see the sketch after this list).
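As a hedged sketch of takeaway 5, assuming the “product” object stores its cost history as an array of structs with hypothetical fields `valid_from` and `cost`, the nested attribute can be flattened on demand:

```python
from pyspark.sql import functions as F

# Hypothetical "product" object carrying its cost history as an array of structs
# (assumed fields: valid_from, cost); the nested attribute is exploded on demand.
products = spark.table("lakehouse.product")

cost_history = (products
    .select(
        "product_id",
        "product_name",
        F.explode("cost_history").alias("cost_entry"))
    .select(
        "product_id",
        "product_name",
        F.col("cost_entry.valid_from").alias("valid_from"),
        F.col("cost_entry.cost").alias("cost")))
```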

Wrap up

In this post, we have gone through yet another way of doing data modeling in the Big Data era. Is it better than the ones already known? Is it worse? Who knows… to me it felt natural when I proposed it, and today, after years of practice, I see it unintentionally applied in a lot of cases, in this world where flat tables are everywhere, and at some point they need to be joined with some critical “master data” tables. Then, why not give it some sort of framework, right? That was my thought for my PhD, my book, and this post.

In general, I see the following advantages with this framework:

  • Generally better query execution performance, due to less frequent joins, as significant data denormalization is heavily promoted (even more than in star schemas).
  • Focus on simplicity and understandability of the model. No need to worry about too many constructs like surrogate keys, several relationships, foreign keys, several types of dimensions (mini, shrunken, bridge tables…), slowly changing dimensions, etc.
  • Works both for immutable storage (e.g., Parquet Data Lakes) and mutable storage (e.g., Delta Lake and new ACID storage formats in Data Lakehouses).
  • Promotes nested structures as first-class citizens, when applicable for the use case, which can severely reduce the complexity of the data model by decreasing the number of tables present in the Data Lakehouse, and can also give instant analytical power within the row currently being analyzed.

In contrast, the framework may present the following disadvantages:

  • Higher denormalization typically incurs higher storage costs. Nowadays, with cheap object storage systems and columnar formats, this might not be much of a concern, but it should nevertheless be pointed out.
  • If your source system, providing the data to the Analytical Objects, is highly relational, then your ETL workloads might be a bit heavier, with the need for more joins to form a fully denormalized flat-table structure. However, in Big Data contexts, the variety of sources usually includes many other systems like REST APIs, NoSQL databases, JSON files, Kafka topics, among others, in which the data is not typically found in a highly relational format. Moreover, even if you do have highly relational data as a source, the framework continues to work accordingly; you may just need to leverage a bit more distributed computing power and a slightly more complex DAG to get the job done.

I hope you enjoyed the read! For me, it was a pleasure putting together this post for you, and I hope to have contributed somehow to your ammo for discussion :) Till the next one! Appreciated, and happy Lakehouse surfing!!!

References

[1] https://www.databricks.com/blog/data-modeling-best-practices-implementation-modern-lakehouse

[2a] http://repositorium.sdum.uminho.pt/handle/1822/65253?mode=full&locale=en

[2b] https://www.amazon.com/Big-Data-Warehousing-Publishers-Information/dp/8770221847

[3] https://www.databricks.com/blog/2022/08/08/identity-columns-to-generate-surrogate-keys-are-now-available-in-a-lakehouse-near-you.html

[4] https://delta.io/blog/2023-06-03-delta-lake-z-order/

[5] https://medium.com/@tsiciliani/liquid-clustering-with-databricks-delta-lake-57dc251d7870

[6] https://preset.io/blog/introducing-entity-centric-data-modeling-for-analytics/

[7] Kimball, R., & Ross, M. (2013). The data warehouse toolkit: The definitive guide to dimensional modeling (3rd ed.). John Wiley & Sons.


Carlos Costa

Data & Analytics Director, Invited Professor, Book Author, Open Source Contributor & Data Aficionado