Dimensional Modeling Basics

Published in

Trendyol Tech

13 min readDec 9, 2022

In this article, I will share my notes from the Star Schema The Complete Reference by Christopher Adamson that I read to learn about dimensional modeling. Dimensional modeling is a very comprehensive topic, so only the most important concepts are briefly mentioned in this article. I suggest you read the book to understand the details of dimensional modeling and how it differs from database modeling methods with examples. In addition, the book describes how to start modeling a data warehouse from scratch, and what stages to go through, as well as real-life examples.

Information systems fall into two major categories: those that support the execution of business processes and those that support the analysis of business processes. An operational system supports the execution of a business process, while an analytical system supports the evaluation of the process, through measurement of a business process. Before going into the details of Dimensional design, it is essential to know the difference between these two systems.

Operational Systems

An operational system directly supports the execution of a business process. By capturing details about significant events or transactions, it constructs a record of activities. A sales system, for example, captures information about orders, shipments, and returns. The activities recorded by these systems are known as transactions. The systems themselves are called online transaction processing (OLTP) systems, or transaction systems for short.

To facilitate the execution of a business process, operational systems must enable several database interactions, including inserts, updates, and deletes. The focus of these interactions is almost always atomic: a specific order, a shipment, or a refund. Because it is focused on process execution, the operational system is likely to update data as things change, and purge or archive data once its operational usefulness has ended.

When implemented in a relational database, it is widely accepted that the optimal schema design for an operational system is in the third normal form. The design may be depicted as an entity-relationship model or ER model.

Analytic Systems

While the focus of the operational system is the execution of a business process, the analytic system supports the evaluation of the process. How are orders trending this month versus last month? Is a particular marketing promotion having an impact on sales?

In analytical systems, the focus is not on creating or modifying data, but on retrieving data through queries. These queries can involve large numbers of transactions, in contrast to the operational system’s specific focus on individual transactions. Historic data will continue to be important to the analytic system long after its operational use becomes obsolete.

The principles of dimensional modeling address the unique requirements of analytic systems. A dimensional design is optimized for queries that may query large volumes of transactions, not just individual transactions. It does not support simultaneous, high-performance updates. It supports the maintenance of historic data, even though the operational systems change or delete information.

Dimensional Design

Two simple concepts, measurement, and context are the foundation of dimensional design. The dimensional design supports the analysis of a business process by modeling how it is measured. Every dimensional solution describes a process by capturing what is measured and the context in which the measurements are evaluated.

Without context, a measurement is meaningless. If you are told “sales were $10,000,” there is not much you can do with this information. Is that sales of a single product, or many products? Does it represent a single transaction or the company’s total sales since its foundation? Without some context, the measurement is useless.KPI, trend, channel, date, and transaction type are the most commonly used contexts.

As with the measurements themselves, context is revealed in business questions or reports. For example, gross margin is viewed in the context of product categories and time (the month of January). Sick days are viewed in the context of a department (marketing) and time (last year).

Facts and Dimensions

In a dimensional design, measurements are called facts, and context descriptors are called dimensions.

In a spoken or written statement, the word “by” and “for” is almost always followed by a dimension. Consider the question “What are order dollars by product category for January?” The person asking this question wants a separate measurement for each product category, as indicated by the words “by product category.” Product category is a dimension. Here, “for” is followed by “January.” January is a value or instance of a dimension. The dimension is not named, but we can interpret that it is a month.

Facts tend to be numeric in value, and users want to see them at different levels of detail. You can identify facts by looking for data that people want to “roll up” or “break out.” Again, look at the question “What are order dollars by product category for January?” Here, order dollars will presumably be numeric, and they can be rolled up by categories.

Not everything that is numeric is a fact. Sometimes, a numeric data element is a dimension. The key is to determine how it is being used. Is it something that can be specified at varying levels of detail? If so, it is a fact. Is it providing context? If so, it is a dimension. “Show me margin dollars by order number,” contains two numeric elements. Margin dollars can be specified at various levels of detail, and the speaker is asking that it be rolled up by order number. Clearly, “margin dollars” is a fact. What about the order number? It is numeric too, but the user is not asking to add up the order numbers. Instead, the order number is being used to specify the context for margin dollars. The order number is a dimension.

Other examples of numeric data elements defined as a dimension include size, age, phone numbers, and document numbers. The clue that these elements are dimensions lies in their use. The numbers themselves are not rolled up or down according to the question. Instead, they specify the context by which something else may be rolled up or down.

The Star Schema

A dimensional design for a relational database is called a star schema. Related dimensions are grouped as columns in dimension tables, and the facts are stored as columns in a fact table. The star schema gets its name from its appearance: when drawn with the fact table in the center, it looks like a star.

Dimension Tables

In a star schema, a dimension table contains columns representing dimensions. Dimension tables are needed to analyze data in fact tables. In queries and reports, the dimensions will be used to specify how facts will be rolled up — their level of aggregation. Dimension values may be used to filter reports. They will be used to provide context for each measurement, usually in the form of textual labels that precede facts on each row of a report. They may also be used to drive master-detail relationships, subtotaling, cross-tabulation, or sorts.

Experienced readers designing operational systems will notice that the dimension tables are not in the third normal form. At first, many developers find this disconcerting. For example, there may be a strong desire to move brand, brand_code, and brand_manager into a separate table called brand. A dimensional model serves a different purpose from an ER model. It is unnecessary to isolate repeating values in an environment that does not support transaction processing. Designers do occasionally perform additional normalization within dimensions, although they usually avoid doing so. In such cases, the schema is referred to as a snowflake. The additional tables that result are sometimes called outriggers.

Keys and history

In a star schema, a surrogate key is given to each dimension table. This column is a unique identifier, created exclusively for the data warehouse. Surrogate keys are assigned and maintained as part of the process that loads data to the star schema. The surrogate key is the primary key of the dimension table.

Dimension tables also contain key columns that uniquely identify “something” in an operational system. Examples in the figure include customer_id, SKU, and salesperson_id. In the operational systems, these columns identify specific customers, products, and salespeople, respectively. These key columns arereferred to as natural keys. The separation of surrogate keys and natural keys allows the data warehouse to track changes, even if the originating operational system does not.

Slowly Changing Dimensions

Data in dimension tables originates in operational systems. Once information makes its way into a dimension table, it may change in the operational source. For example, a customer’s date of birth is updated to correct an error, or a customer’s address is updated when they move.

Because the downstream star schema uses a surrogate key as the primary key of each dimension table, it does not have to handle changes the same way the source does. The operational system may track the change history of each data element, or it may simply overwrite the changed values. Regardless, the star schema can respond to each change in whatever manner makes the most sense for measuring the overall business process. In every dimensional design, it is crucial to identify how changes in source data will be represented in dimension tables. This phenomenon is referred to as slowly changing dimensions.

There are many types of SCD, but in this article, we will explain the two most used. (For more details: SCD Types)

Type 1 Change — Overwriting

When the source of a dimension value changes, and it is not necessary to preserve its history in the star schema, a type 1 response is employed. The dimension is simply overwritten with the new value. This technique is commonly used in situations where a source data element is being changed to correct an error.

By overwriting the corresponding dimension in the star schema, the type 1 change obliterates the history of the data element. The dimension has no evidence that the column ever contained a different value. While this is generally the desired effect, it can also lead to confusion. Their historic context is retroactively altered if there were any associated facts before the change occurred.

Type 2 Change — Inserting New Row

The second method for responding to a change in source data is to insert a new record into the dimension table. Each record contains the effective and expiration times to identify the period between which the record was active. Any previously existing records are unchanged. This type 2 response preserves context for facts that were associated with the old value while allowing new facts to be associated with the new value.

For example, suppose that customer ABC Wholesalers is identified by customer_id 10711 in an operational system. If the customer changes its headquarters location, the operational system may simply overwrite the address for customer_id 10711. For analytic purposes, however, it may be helpful to track the history of ABC Wholesalers. Since the star schema does not rely on customer_id to identify a unique row in the customer dimension, it is possible to store multiple versions of ABC Wholesalers, even though both have the same customer_id — 10711. The two versions can be distinguished by different surrogate key values. While it would also be possible to support change tracking by supplementing a natural key with a sequence number, the surrogate key allows fact and dimension tables to be joined based on a single column.

Type 2 changes preserve the dimensional detail surrounding facts. They may confuse users, however, by appearing to duplicate information in dimension tables. Avoid this confusion by issuing browse queries that select distinct values, and by offering a flag to indicate whether each row represents the current version for its natural key value.

An example of Type 2 SCD

Fact Tables

At the core of a star schema, exists a fact table. Every fact table represents a business process by capturing measurements that describe it. In addition to presenting the facts, the fact table includes surrogate keys that refer to each of the associated dimension tables. For example, the simple order’s star in the figure includes the facts quantity_ordered, cost_dollars, and order_dollars. It also includes surrogate keys that refer to products, salespeople, customers, and order dates.

The fact table is the engine for business process measurement. It holds detailed measurements that describe the process. The facts are accompanied by foreign keys that provide a dimensional context for each measurement.

If dimension tables are wide, fact tables are deep. A fact table usually accumulates rows more quickly than the associated dimension tables. Fortunately, the rows of fact tables tend to be nice and compact. The foreign keys are usually integers, and the facts are usually integers or floating-point decimal numbers.

The compact nature of fact table rows enables the table to accumulate a large number of rows without generating inordinate storage requirements. A list of sample rows from a dimension table would be significantly wider. Not only do dimension tables have more columns but many of those columns contain textual data. Compared to fact tables, however, dimension tables contain relatively few rows.

Grain

The level of detail represented by a fact table row is referred to as its grain. Declaring the grain of a fact table is an important part of the schema design process. It ensures there is no confusion about the meaning of a fact table row, and guarantees all facts will be recorded at the same level of detail.

Grain may be described in several ways. Many schema designers describe grain simply by enumerating the associated dimensions. In most situations, schema designers try to avoid aggregating data before loading it into a fact table. By keeping the most granular data possible, the star can address a wider variety of analytic needs. Initial business requirements may not call for detailed data, but analytic requirements tend to change. If an aggregated schema is built, future requests for detailed information will require starting over.

There are three types of fact tables.

The transaction fact table tracks the individual activities that define a business process and supports several facts that describe these activities. It can provide rich analytic possibilities, and often serves as a granular atomic data repository.

The transaction fact table in the figure tracks the activity in bank accounts. Its grain is defined as one row per transaction per account. The account_facts star provides for rich and varied analysis of transaction activity. It is possible to produce a list of transactions for a specific account, group aggregated transactions by branch or study transactions of a particular type. All of these options may be quite useful. Missing, however, is an effective way to study account balances, which is a crucial metric.

While balances can be determined from transaction history, doing so for a single account requires aggregating all transactions starting from its inception. The sample data at the bottom of the figure represents a series of transactions stored for a single account during the first two weeks of February 2009. Each of these transactions is represented by a row in the fact table. To determine the balance of this account, it is necessary to aggregate all transactions up to the point in time in question. If the account was opened on 2/1/2009, this would be a simple matter of aggregating the transactions shown. If, however, the account has been active since the year 2002, computing the balance for February 12, 2009, might involve aggregating thousands of transactions. This is a simple process, but not effective for studying the combined balance of thousands of accounts.

When the measurement of status is important, the solution is an alternative design called a periodic snapshot fact table, or simply snapshot fact table. The snapshot fact table samples the measurement in question at a predetermined interval. This makes it easy to study the measurement in question, without the need to aggregate a long chain of the transaction history.

The account snapshot in Figure 11–2 records a row for each account at the end of each snapshot period. Since the period is a day, each account will receive exactly one row per day in the fact table. Figure 11–3 shows how this differs from what is stored in the transaction fact table. The table on the left shows transactions for a particular account over two weeks. Each of these transactions generates a single row in the account_facts table. The table on the right illustrates the balance of the account at the end of each day. Each of these periodic balances is represented by a single row in account_snapshot_facts.

As you can see, the snapshot is dense; each account is represented for each day. Not so in the transaction fact table. If you examine the data closely, you will see there were a total of nine transactions in this account during these two weeks. In the snapshot, however, 14 rows are recorded for the account during the two weeks — one for each snapshot period.

Tracking time elapsed at one or more steps of a business process can be supported with a third kind of fact table, called an accumulating snapshot. This kind of design contrasts sharply with transaction and snapshot designs in that fact table rows will be updated regularly. However, the grain of an accumulating snapshot allows it to be used in correlating the dates that various status milestones are achieved and the time spent at each processing stage.