Data Modeling: Dimensions vs Facts

Bruno Abbad
Poatek
Nov 22, 2023


Many issues a data team faces trace back to a step that happens before a single line of code is written. To many people, data modeling looks easy and unimportant. Still, as data professionals, we must recognize that these decisions can affect cost and delivery time in dramatic ways.


Knowing your consumer

The starting point of data modeling is answering one question before the project begins: "Who will consume this data?" As data engineers, we generally deal with four archetypes of consumers, and each one changes how we deliver our products:

  1. Data Analysts/Data Scientists: They usually want "useful things," which means easy-to-query datasets with few complex data structures, so they don't waste time decompressing and exploding arrays instead of doing analytical work.
  2. Data Engineers: Other data engineers are SQL pros, so it is safe to deliver more compact datasets to them. These will probably be harder to query, but the trade-off is better performance.
  3. ML Models: It will depend on the model and how it is trained, so sometimes we may need to turn the dataset into features and then into a feature store.
  4. C-Level/Executives/Managers: The rule of thumb is that we don't deliver datasets to them; they want easy-to-interpret charts they can use to make decisions.

How we model our data depends on the kind of consumers we have. Some of the biggest problems in data engineering occur when data is modeled for the wrong consumer.

Dimensional Data Modeling

What is a dimension?

A dimension is an attribute of an entity (e.g., a user's birthday, a user's favorite food).

  • Some dimensions can identify the entity (e.g., a user’s id).
  • Others are just attributes that complement the information of the entity.

Snapshot dimensions design

This is the "standard" pattern for designing a dimension table: the table grows by appending a production snapshot for every time period. If we want to track, for example, the daily activity of a user, the table holds one row per user per day.
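
As a minimal sketch of what such a table could look like (Spark/Hive-style SQL; the table and column names are illustrative assumptions, not from the original article):

    -- One row per user per daily partition
    CREATE TABLE user_snapshot (
        user_id       BIGINT,
        is_active     BOOLEAN,
        favorite_food STRING
    )
    PARTITIONED BY (snapshot_date DATE);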

This data is extremely easy to query, but there are drawbacks: imagine how this would look for a platform like Facebook or TikTok, where the number of rows would be massive and expensive to store.

Cumulative table design

The cumulative table design enters the scene to avoid needing one row per day to track daily activity. We should use this design when we want to monitor history.

To create a cumulative table design, we need a few core components (a SQL sketch follows this list):

  1. We get today's snapshot and yesterday's cumulated data.
  2. We FULL OUTER JOIN them so we keep all the data. Using the user example seen before, the join surfaces people who were not in the database yesterday but are today, and people who were there yesterday but were removed today.
  3. We COALESCE values to keep everything around.
  4. Hang onto all of history.
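
Putting these components together, a hedged sketch of the daily cumulation step could look like the following (Spark-style SQL; the users_cumulated table name, the is_active_history array column, and the empty-array cast are illustrative assumptions, and array syntax varies by engine):

    -- Build today's cumulative partition from yesterday's cumulative
    -- partition and today's snapshot
    WITH today AS (
        SELECT * FROM user_snapshot   WHERE snapshot_date = DATE '2023-11-22'
    ),
    yesterday AS (
        SELECT * FROM users_cumulated WHERE snapshot_date = DATE '2023-11-21'
    )
    SELECT
        COALESCE(t.user_id, y.user_id) AS user_id,
        -- keep yesterday's history and append today's activity flag
        CONCAT(
            COALESCE(y.is_active_history, CAST(ARRAY() AS ARRAY<BOOLEAN>)),
            ARRAY(COALESCE(t.is_active, FALSE))
        ) AS is_active_history,
        DATE '2023-11-22' AS snapshot_date
    FROM today t
    FULL OUTER JOIN yesterday y
        ON t.user_id = y.user_id;

Because each run depends on the previous day's cumulative partition, this also explains why backfills have to be performed sequentially.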

Using this modeling design has strengths and drawbacks.

Strengths:

  • We can do historical analysis without a shuffle, since we don't need to GROUP BY the data; we can sum over the array or check whether a value exists in it (see the sketch after this list).
  • We get easy "transition" analysis. For example, the historical context sits in a single row, letting us map churned, resurrected, new, or deleted users.
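
As an illustration of the no-shuffle point, checking activity over the stored window against the cumulative table sketched above becomes a single array lookup on the latest partition (same assumed schema; ARRAY_CONTAINS is Spark-style syntax):

    -- Users active at least once in the stored window: no GROUP BY required
    SELECT user_id
    FROM users_cumulated
    WHERE snapshot_date = DATE '2023-11-22'
      AND ARRAY_CONTAINS(is_active_history, TRUE);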

Drawbacks:

  • It can only be backfilled sequentially. We can’t backfill multiple dates simultaneously; we must pick an initial point and create the table sequentially.
  • Handling PII data can be messy since deleted/inactive users get carried forward.

Snapshot Dimensions vs Cumulative Table Design

The first difference we notice is that the Cumulative Table Design stores data in arrays: instead of adding one row per day, a user's row grows by one array element per day.

So, at the end of the week, we would have a partition with one string and seven booleans per user, occupying much less storage than seven full rows. Also, we can recreate every previous daily record from the latest partition alone.

"How many users are active this week?" To answer this question with the snapshot design, we would need to check the is_active column across the seven daily partitions and classify a user as active if any of them contains a True value. With the Cumulative Table Design, we only need to go to the seventh partition and check whether there is any True value inside each user's is_active array, using far less compute time.

Compactness vs. Usability Tradeoff

Generally speaking, usability is more important than compactness, especially when a hard-to-use table costs employees' time.

Tables can be categorized in three ways:

  1. The most usable tables: These have no complex data types, only primitives, and are easy to manipulate with WHERE and GROUP BY.
  2. The most compact tables: These are not meant to be human-readable; they must be decoded before use, they are not directly queryable, and they are built to be as small as possible.
  3. The middle-ground tables: This is where we, as data engineers, can have a massive impact, using more complex data types (e.g., Array, Map, and Struct) that make the tables more compact but the queries trickier (a sketch follows this list).
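
A hedged sketch of what a middle-ground table might look like with complex types (Spark-style SQL; the table and column names are illustrative):

    -- Complex types trade query simplicity for compactness
    CREATE TABLE user_activity_mid (
        user_id BIGINT,
        metrics MAP<STRING, BIGINT>,                     -- e.g., 'clicks' -> 12
        devices ARRAY<STRUCT<os: STRING, model: STRING>> -- nested device info
    )
    PARTITIONED BY (snapshot_date DATE);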

When do we use each type of table?

  1. The most usable ones are used when the analytics people are the primary consumers of the table.
  2. The most compact tables are used when the consumers are very technical, usually software engineers. They fit online systems where latency and data volume matter greatly.
  3. Middle-ground tables are used mainly as upstream staging, where most consumers are other data engineers or technical data scientists.

Slowly changing dimensions

Sometimes, we will have dimensions that change throughout history. Let's imagine the following scenario: we are working with a vehicle insurance company that wants to track each user's car. One person may have had insurance for his Civic from 2013 to 2015, then traded the Civic for a Corolla, bought a new policy in 2016, and has kept that same policy ever since. If the company wants to track each user's cars over the last decade, it must store this record.

The key with slowly changing dimensions is finding attributes that do change over time, but not at a high frequency, unlike something like a person's weight.

What alternatives do we have to track these changes?

  1. Use the Snapshot dimensions design that we saw before: store daily partitioned snapshots, with one record per entity per day, so the changes show up across partitions.
  2. Record the data with two extra columns: start_date and end_date. With this pattern, we store a row only when something changes, minimizing the storage required to retain the data. A table for the vehicle insurance example above would look like this:

    user_id | car     | start_date | end_date
    1       | Civic   | 2013       | 2015
    1       | Corolla | 2016       | 9999

As we can see, instead of recording every user's data daily to capture changes, we only record a row when a change happens, using 9999 as the end_date of the current record (NULL can be used instead, too).
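
With this layout, a point-in-time lookup becomes a simple range filter. A minimal sketch, assuming a hypothetical user_car_scd table shaped like the example above, with start_date and end_date stored as dates:

    -- Which car did user 1 have insured in mid-2015?
    SELECT car
    FROM user_car_scd
    WHERE user_id = 1
      AND DATE '2015-06-01' BETWEEN start_date AND end_date;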

In conclusion, dimensional data modeling, including snapshot dimension design, cumulative table design, and the handling of slowly changing dimensions, is a critical part of database management and post-collection analysis. Snapshot dimension design lets us track data periodically, though it can produce massive, heavy tables. To counter this, the cumulative table design was introduced, which, despite its drawbacks, helps reduce storage costs. The choice among usable, compact, and middle-ground tables is typically dictated by how technical the table's primary consumers are, and these decisions must be deliberate and strategic.

Slowly changing dimensions can be accommodated, but they must be balanced carefully so the database keeps resembling the real-life scenario. Doing so gives us the foundation to maintain our data cost-effectively and efficiently. Ultimately, dimensional data models empower businesses to make better decisions, and mastering these techniques is an invaluable tool in the arsenal of database management.

Fact Data Modeling

What is a fact?

A fact can be described as something that happened; it is an event. Since it has already happened, a fact is not slowly changing, which makes it easier to model than dimensions in some respects.

An example of a fact is a transaction: it happened from one party to another, at a particular time, with a certain amount, and it will not change.

How does fact modeling work?

When we model a fact table, we should answer the following questions:

  • Who? Who are the actors involved in the action? Usually, they are pushed out as IDs.
  • Where? Where did the action take place? It can be modeled with IDs to join, like who, but it is more likely than who to be brought in as a dimension value directly.
  • How? Similar to where, and it can be a bit blurry for digital events: for a click on a website, the how might be "on an iPhone" while the where is the specific button.
  • What? It tells us the nature of the fact, so we understand what event is happening.
  • When? All facts have a temporal component; it is fundamental to the nature of the fact.

Answering these questions makes the modeling much easier; a sketch of a fact table shaped by them follows.
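
A hedged sketch of a fact table whose columns answer those questions (Spark-style SQL; the table and column names are illustrative, not from the article):

    CREATE TABLE fct_notification_events (
        user_id    BIGINT,     -- who performed or received the action
        device_id  BIGINT,     -- how (joinable device dimension, e.g., an iPhone)
        element_id STRING,     -- where (the specific button or screen)
        action     STRING,     -- what ('sent', 'delivered', 'clicked')
        event_time TIMESTAMP   -- when the event happened
    )
    PARTITIONED BY (date_partition DATE);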

What makes fact modeling hard?

The main difficulty in fact modeling is volume. Fact data is usually 10 to 100x the volume of dimension data. Imagine a platform with 2 billion active users that sends 50 billion notifications per day across all of them: tracking just one week of data means storing 350 billion notification events.

Fact data often needs a lot of context for practical analysis; storing only what happened may not be very useful on its own, so we may also need to bring in a few dimensions.

Duplicates are far more common in facts than in dimensional data and may need treatment. An interesting approach is to collapse the duplicated data, for example treating all duplicates within a day as one event. We need to be aware that duplicated data can be valid in some cases, but not every duplicate is worth storing.
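
One possible way to collapse same-day duplicates, sketched against the hypothetical fact table above (this is just one of several deduplication strategies):

    -- Treat all copies of the same (user, action, day) as a single event
    SELECT
        user_id,
        action,
        date_partition,
        MIN(event_time) AS first_event_time
    FROM fct_notification_events
    GROUP BY user_id, action, date_partition;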

When should you model facts with dimensions?

Generally speaking, in fact data modeling we do not bring anything into the row except identifiers, keeping every other attribute in its own dimension table.

In data engineering, the rule is to keep data normalized when designing a fact table. Still, we need to be aware that joining every fact event to fetch a dimension can make the pipeline slow, so sometimes this rule should be broken and some dimensions modeled into the fact table; it depends on the client's needs, and every decision has trade-offs.

Reducing fact data modeling gives you superpowers

Fact tables can be huge, but they can be compacted through successive steps of reduction, and there are three levels to work with.

Fact data often has this schema:

  • user_id, event_time, action, date_partition

It has a very high volume, with one row per event.

We can reduce it to a second level by doing a daily aggregation:

  • user_id, date_partition, action_cnt

This has a medium volume, with one row per user per day, and is ideal to deliver to analytics people.
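
A hedged sketch of this daily reduction, reusing the hypothetical fact table from earlier:

    -- Level 2: one row per user per day
    SELECT
        user_id,
        date_partition,
        COUNT(*) AS action_cnt
    FROM fct_notification_events
    GROUP BY user_id, date_partition;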

For a third level of reduction, we will use a long-array metrics design:

  • user_id, month_start_partition/year_start_partition, action_cnt array

This schema has low volume and only one row per user per month (or year), so 90 days of data fits in just three rows per user. The long-array metrics table is usually best suited for downstream pipelines, since the array needs to be exploded before it can be explored.

The daily dates are stored as offsets from month_start or year_start: the array's first index holds the date month_start + 0 days, and the last index holds month_start + array_length - 1 days.
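
Under that offset convention, recovering a single day is an index calculation rather than an explode. A minimal sketch, assuming a hypothetical fact_actions_monthly table with a month_start_partition column and a 1-based action_cnt array (ELEMENT_AT is Spark-style syntax):

    -- Count of actions on 2023-11-22, read straight out of the monthly array
    SELECT
        user_id,
        ELEMENT_AT(
            action_cnt,
            DATEDIFF(DATE '2023-11-22', month_start_partition) + 1
        ) AS actions_on_nov_22
    FROM fact_actions_monthly
    WHERE month_start_partition = DATE '2023-11-01';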

One point to notice with the long-array metrics design is that dimensional joins can get weird if we want things to stay performant. The slowly changing dimension's accuracy is frozen at the month_start or year_start value until the next partition, so we give up 100% accurate slowly changing dimension tracking for a massive performance gain: analysis runs much faster this way than by exploding the arrays to get a perfectly precise answer. That is why we pick snapshots in time, the month_start value, the month_end value, or even both, and treat the dimensions as fixed when analyzing changes over time.

In summary, fact data modeling revolves around events or occurrences and necessitates answering certain fundamental questions for successful implementation, such as Who, Where, How, What, and When. The challenges of fact modeling often arise from its high volume, potential for duplicate data, and requirements for context. Striking a balance between storing fact data and relevant dimensions can greatly enhance data analysis but must be handled wisely, considering performance implications.
One way to handle the large volume of fact data is through reduction techniques, moving from a high-volume per-event grain to more aggregated levels, ultimately arriving at the much smaller long-array metrics design. That design can drastically improve performance, but it sacrifices some precision on slowly changing dimensions.

Overall, fact data modeling is a complex yet supremely invaluable process in data engineering. It allows for a comprehensive understanding and analysis of the occurring events within a system. While it requires careful planning and decision-making, successful implementation can produce highly effective systems that enhance not only data storage but also the accuracy and speed of analytics, subsequently leading to more informed decision-making.

Properties of Facts vs Dimensions

Generally speaking, some rules can help us classify something as a dimension or a fact, but in some cases they overlap.

Dimensions:

  • Usually, these are the things we can GROUP BY when doing analytics.
  • They can be "high cardinality" or "low cardinality"; usually, the high-cardinality ones are the IDs.
  • Generally, they come from snapshots of state.

Facts:

  • Usually aggregated when doing analytics by SUM, AVG, and COUNT.
  • Almost always higher volume than dimensions, although some fact sources are low-volume; we can think of those as "rare events."
  • Generally, they come from events and logs.

To summarize, dimensions and facts are two crucial building blocks of data analysis, each with unique roles and characteristics. Dimensions, commonly sourced from snapshots of state, are the attributes we group by during analytics; they can have high or low cardinality, with IDs typically being the high-cardinality ones. Facts, primarily derived from events and logs, are usually aggregated in analytics with operations such as SUM, AVG, and COUNT. They typically come in far higher volumes than dimensions, though some fact sources can be low-volume when they represent rare events. While facts and dimensions occasionally overlap, distinguishing between them correctly is essential for accurate and effective data analysis, and understanding their differences provides a solid base for insightful business decisions.

Conclusion

Indeed, understanding and effectively implementing data modeling concepts, particularly dimension and fact data modeling, is imperative in any data-driven business. Dimensional data modeling lets us capture and structure the key attributes of entities, accommodating change with snapshot designs, cumulative table designs, or slowly changing dimensions. However, the choice between compactness and usability must be strategic and aligned with who the primary consumers will be. Fact data modeling, on the other hand, focuses on capturing events and the dimensions relevant to analyzing them. Fact data is usually high volume and can be made more manageable through reduction techniques at multiple levels.

While dimensions and facts carry different but equally important roles and characteristics, together they enrich data analysis. Dimensions, typically derived from snapshots, offer a grouping mechanism for data in analytics, while facts, commonly obtained from events, provide an aggregation mechanism.

Understanding these subtle differences between dimensions and facts paves the way for deeper data insights and supports the formulation of actionable business strategies. Judicious, strategic planning and implementation of these data modeling techniques can power more effective decision-making, optimize costs, and enhance process efficiency, underlining the enormous potential of competently managing data models.

I hope this post has been interesting to you. If you want to join a team of data-driven professionals willing to share knowledge and grow, check out Poatek Careers. There may be a job opportunity waiting for you!
