Data Warehouse Schemas Explained: Star, Snowflake, Galaxy

Stefano Meloccaro
Jul 4, 2024

In the world of SQL and data warehousing, organizing and structuring data for efficient querying and reporting is paramount. Three of the most common schema designs used for this purpose are the Star, Snowflake, and Galaxy schemas. As each of these schemas has its pros and cons, it is crucial to have a good understanding of their main characteristics in order to design an optimal and robust data warehouse that meets business requirements. Before we discuss their main differences and use cases, let’s introduce the concepts of Fact and Dimension tables, which are essential for understanding how these schemas work.

Fact and Dimension Tables

As mentioned in a previous article, OLTP and OLAP are two types of database systems optimized for different workloads: OLTP is designed to process transactions, while OLAP is optimized for analysis (see that article to learn more about OLTP and OLAP systems). Another crucial difference between the two is how data is organized in tables. In OLTP systems, a single table might contain both quantifiable fields (e.g., quantities, prices) and descriptive fields (e.g., product names, customer details, dates). In contrast, OLAP systems typically store these two categories in separate tables. Tables containing numeric or quantifiable fields are called “Fact” tables, while those containing descriptive fields are called “Dimension” tables. With these premises in place, let’s look at how the Star, Snowflake, and Galaxy schemas differ.
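To make the split concrete, here is a minimal sketch using Python's built-in sqlite3 module. The order rows and the table names (dim_product, fact_sales) are made up for illustration; a real warehouse would have more dimensions and a proper loading pipeline.

```python
import sqlite3

# Hypothetical flat, OLTP-style rows: descriptive and numeric fields mixed together.
orders = [
    # (order_id, product_name, category, quantity, unit_price)
    (1, "Laptop", "Electronics", 2, 900.0),
    (2, "Mouse", "Electronics", 5, 20.0),
    (3, "Laptop", "Electronics", 1, 900.0),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive fields only.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT UNIQUE,
        category     TEXT
    );
    -- Fact table: numeric fields plus a foreign key to the dimension.
    CREATE TABLE fact_sales (
        order_id    INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        unit_price  REAL
    );
""")

for order_id, name, category, qty, price in orders:
    # Each distinct product is stored once; repeated names are ignored.
    conn.execute(
        "INSERT OR IGNORE INTO dim_product (product_name, category) VALUES (?, ?)",
        (name, category),
    )
    (product_key,) = conn.execute(
        "SELECT product_key FROM dim_product WHERE product_name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        (order_id, product_key, qty, price),
    )

dim_rows = conn.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0]
fact_rows = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(dim_rows, fact_rows)  # 2 3
```

Three orders produce three fact rows but only two dimension rows, because the descriptive details of "Laptop" are stored once and referenced by key.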

Star Schema

The Star Schema is one of the simplest and most straightforward data warehouse schema designs. It gets its name because its structure resembles a star, with a central fact table connected to multiple dimension tables. Its main characteristic is that every dimension table is directly connected to the fact table, which, apart from the numeric and quantifiable fields, also contains a foreign key referring to the primary key of each dimension table (check this article if you don’t know what a Primary Key or a Foreign Key is). Moreover, dimension tables are usually denormalized. This introduces some redundancy, but it keeps queries simple and typically fast, because each dimension is only one join away from the fact table.
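As an illustration, the sketch below builds a tiny star schema in an in-memory SQLite database. The tables, columns, and sample values are assumptions chosen for this example, not a prescribed layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized dimensions: the category is stored directly in dim_product.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,
        full_date TEXT,
        month     TEXT
    );
    -- Central fact table: measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        quantity    INTEGER,
        amount      REAL
    );
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_date VALUES (20240701, '2024-07-01', '2024-07'),
                                (20240702, '2024-07-02', '2024-07');
    INSERT INTO fact_sales VALUES (1, 20240701, 2, 1800.0),
                                  (2, 20240701, 1, 300.0),
                                  (1, 20240702, 1, 900.0);
""")

# Every dimension is exactly one join away from the fact table.
star_rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date    AS d ON d.date_key    = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category
""").fetchall()
print(star_rows)
# [('Electronics', '2024-07', 2700.0), ('Furniture', '2024-07', 300.0)]
```

The price of this simplicity is visible in dim_product: the category text is repeated on every product row that belongs to it.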

Snowflake Schema

The Snowflake Schema, on the other hand, is a more complex version of the Star Schema. It gets its name because the diagram of the schema resembles a snowflake, with dimension tables that are normalized into multiple related tables. Like the Star Schema, the Snowflake Schema has a fact table at the center connected to multiple dimension tables. However, in a Snowflake Schema, not all dimension tables connect directly to the fact table; many of them connect to the fact table through one or more intermediate dimension tables.

The main characteristic of the Snowflake Schema is its low level of data redundancy, a direct result of normalization. The trade-off is more complex queries and potentially slower response times, because more joins are involved. In exchange, the data model is more scalable and easier to maintain while preserving data integrity, since each descriptive value is stored in only one place.
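The sketch below shows the same idea with a product dimension normalized into two tables. As before, the table names (dim_category, dim_product, fact_sales) and the data are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The category is normalized out of the product dimension.
    CREATE TABLE dim_category (
        category_key  INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        amount      REAL
    );
    INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
    INSERT INTO dim_product VALUES (1, 'Laptop', 1), (2, 'Mouse', 1), (3, 'Desk', 2);
    INSERT INTO fact_sales VALUES (1, 1800.0), (2, 100.0), (3, 300.0);
""")

# Reaching the category now takes two joins: fact -> product -> category.
snowflake_rows = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_key  = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(snowflake_rows)  # [('Electronics', 1900.0), ('Furniture', 300.0)]

# Renaming a category touches a single row, however many products use it.
renamed = conn.execute(
    "UPDATE dim_category SET category_name = 'Consumer Electronics' "
    "WHERE category_key = 1"
)
print(renamed.rowcount)  # 1
```

The extra join is the cost of normalization; the single-row update is the benefit, since the category name exists in exactly one place.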

Galaxy Schema

Adding another layer of complexity, we can introduce the Galaxy Schema (also called Fact Constellation Schema), which extends the previous designs. The main difference is that in a Galaxy Schema there are two or more fact tables that share some of the same dimension tables, and those dimensions can be either denormalized, as in a Star Schema, or normalized, as in a Snowflake Schema. This is a typical situation in many medium-to-large companies, where growth also increases the need for multiple fact tables. There are several reasons why a company might need different fact tables; some of the main ones include:

  1. Handling multiple business processes that need to be analyzed separately: for example, sales transactions and inventory levels.
  2. Supporting different levels of granularity: one fact table might store transaction-level data (each individual sale), while another might contain aggregated data such as total sales per day, week, or month.
  3. Optimizing query performance: instead of a single sales fact table, a large company might keep a separate fact table for each region (e.g., one for America, one for Europe, one for Asia). Queries that target a single region can then run faster because they scan less data.
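The first case in the list, two business processes sharing a dimension, can be sketched as follows. The fact tables (fact_sales, fact_inventory) and their columns are assumptions made for this example. Each fact table is aggregated on its own before being combined through the shared dimension, which avoids double counting when one product has several rows in either fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One dimension shared by two fact tables.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT
    );
    CREATE TABLE fact_sales (
        product_key   INTEGER REFERENCES dim_product(product_key),
        quantity_sold INTEGER
    );
    CREATE TABLE fact_inventory (
        product_key   INTEGER REFERENCES dim_product(product_key),
        units_on_hand INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Mouse');
    INSERT INTO fact_sales VALUES (1, 2), (1, 1), (2, 5);
    INSERT INTO fact_inventory VALUES (1, 10), (2, 40);
""")

# Aggregate each fact table separately, then join the results on the shared key.
galaxy_rows = conn.execute("""
    WITH sales AS (
        SELECT product_key, SUM(quantity_sold) AS sold
        FROM fact_sales
        GROUP BY product_key
    ),
    stock AS (
        SELECT product_key, SUM(units_on_hand) AS on_hand
        FROM fact_inventory
        GROUP BY product_key
    )
    SELECT p.product_name, sales.sold, stock.on_hand
    FROM dim_product AS p
    JOIN sales ON sales.product_key = p.product_key
    JOIN stock ON stock.product_key = p.product_key
    ORDER BY p.product_name
""").fetchall()
print(galaxy_rows)  # [('Laptop', 3, 10), ('Mouse', 5, 40)]
```

Because both fact tables use the same product_key, sales and inventory figures line up product by product without duplicating the product details.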

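Summary of the main differences between the three

  1. Star Schema: one central fact table; denormalized dimension tables connected directly to it; higher redundancy; fewer joins and simpler queries.
  2. Snowflake Schema: one central fact table; dimension tables normalized into multiple related tables; lower redundancy; more joins and more complex queries.
  3. Galaxy Schema: two or more fact tables sharing dimension tables; the most complex of the three; suited to multiple business processes or levels of granularity.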

Conclusion

As discussed in this article, each schema has its own advantages and disadvantages, and the choice between them depends on specific business requirements. However, as a general guideline, it’s common to start with a Star Schema because it is the simplest to implement and provides good performance for analysis. As business needs grow more complex, you might transition to a Snowflake Schema to handle more detailed hierarchical data and improve data integrity. Eventually, for highly complex and large-scale business processes, you may adopt a Galaxy Schema, which supports multiple fact tables and can accommodate diverse analytical requirements.
