Snowflake Schema for Data Warehousing explained with example.

Gaurav Pandey
4 min readJan 16, 2023

--

A snowflake schema is a type of data warehouse schema where the dimension tables are normalized, creating a “snowflake” shape. This means that the dimension tables are split into multiple related tables, reducing data redundancy and improving query performance.

To demonstrate a snowflake schema, let’s consider a simple example of a retail store. The store has a sales table that records all transactions, and several dimension tables for the products, customers, and store locations.

Sales Table:
+------------+-------------+-------+
| Sale ID | Product ID | Price |
+------------+-------------+-------+
| 1 | 100 | 10.99 |
| 2 | 101 | 5.99 |
| 3 | 102 | 15.99 |
+------------+-------------+-------+

Product Dimension Table:
+-------------+--------------+
| Product ID | Product Name |
+-------------+--------------+
| 100 | T-Shirt |
| 101 | Jeans |
| 102 | Jacket |
+-------------+--------------+

Customer Dimension Table:
+-------------+--------------+
| Customer ID | Customer Name |
+-------------+--------------+
| 1 | John Doe |
| 2 | Jane Smith |
| 3 | Bob Johnson |
+-------------+--------------+

Store Location Dimension Table:
+-------------+--------------+
| Location ID | Location Name |
+-------------+--------------+
| 1 | New York |
| 2 | Los Angeles |
| 3 | Chicago |
+-------------+--------------+

In this example, the sales table references the product, customer, and store location dimension tables using the foreign keys (Product ID, Customer ID, and Location ID). This allows for easy querying and reporting on various aspects of the data, such as total sales by product, customer, or location.

It should be noted that while the snowflake schema can improve query performance, it can increase the complexity of data modeling and data loading. Additionally, it can cause a degradation of performance in reporting and analytics if the dimension tables are very large.

To Sum up, Snowflake schema is a useful data modeling technique for data warehousing that can improve query performance while reducing data redundancy. It is important to balance the benefits of the snowflake schema with the potential drawbacks in terms of complexity and performance.

Now that we have an understanding of the snowflake schema and its benefits, let’s take a look at an example of how to implement it using SQL.

To begin, we will create our three dimension tables: product, customer, and store location.

CREATE TABLE product (
product_id INT PRIMARY KEY,
product_name VARCHAR(255) NOT NULL
);

CREATE TABLE customer (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(255) NOT NULL
);

CREATE TABLE store_location (
location_id INT PRIMARY KEY,
location_name VARCHAR(255) NOT NULL
);

Next, we will create our fact table, sales, and specify the foreign keys for the dimension tables.

CREATE TABLE sales (
sale_id INT PRIMARY KEY,
product_id INT,
customer_id INT,
location_id INT,
price DECIMAL(10, 2) NOT NULL,
FOREIGN KEY (product_id) REFERENCES product(product_id),
FOREIGN KEY (customer_id) REFERENCES customer(customer_id),
FOREIGN KEY (location_id) REFERENCES store_location(location_id)
);

Now that our schema is set up, we can start inserting data into our tables.

INSERT INTO product (product_id, product_name)
VALUES (100, 'T-Shirt'), (101, 'Jeans'), (102, 'Jacket');

INSERT INTO customer (customer_id, customer_name)
VALUES (1, 'John Doe'), (2, 'Jane Smith'), (3, 'Bob Johnson');

INSERT INTO store_location (location_id, location_name)
VALUES (1, 'New York'), (2, 'Los Angeles'), (3, 'Chicago');

INSERT INTO sales (sale_id, product_id, customer_id, location_id, price)
VALUES (1, 100, 1, 1, 10.99), (2, 101, 2, 2, 5.99), (3, 102, 3, 3, 15.99);

With the data in place, we can now run queries to analyze the data in various ways. For example, we can find the total sales for a specific product:

SELECT product_name, SUM(price) as total_sales
FROM sales
JOIN product ON sales.product_id = product.product_id
WHERE product_name = 'Jeans'
GROUP BY product_name;

This query will return the total sales for the product “Jeans”

+-------------+------------+
| product_name | total_sales|
+-------------+------------+
| Jeans | 5.99 |
+-------------+------------+

In this example, we used a join between the sales table and the product dimension table to retrieve the product name and then used the WHERE clause to filter by the product name and finally used the GROUP BY clause to group by product_name and sum the price.

In conclusion, Snowflake schema is a powerful data modeling technique for data warehousing that can improve query performance while reducing data redundancy.

It is important to note that the snowflake schema is not always the best option for every situation. In certain cases, a denormalized schema, such as a star schema, may be more appropriate. The choice between a snowflake schema and a star schema will depend on the specific requirements of your data warehouse and the types of queries that will be run.

Another important aspect of data warehousing is the process of ETL (Extract, Transform, Load) which is the process of extracting data from various sources, transforming it to fit the schema of the data warehouse, and loading it into the warehouse. In the case of a snowflake schema, the ETL process can be more complex due to the normalized structure of the dimension tables. However, this complexity is often outweighed by the benefits of improved query performance and reduced data redundancy.

In summary, Snowflake schema is a powerful data modeling technique for data warehousing that can improve query performance while reducing data redundancy. It is important to weigh the benefits and drawbacks of the snowflake schema and consider other data modeling options such as star schema, and also to consider the complexity of ETL process when implementing the data warehouse.

--

--