Data Modeling in Power BI

Obalanatosin
7 min read · Jul 12, 2022


One of the most essential jobs a data analyst can carry out in Microsoft Power BI is creating a great data model. By executing this task effectively, you make it simpler for others to comprehend your data, which in turn makes it simpler for both you and them to create useful Power BI reports. What, then, is a data model? In essence, a data model is a conceptual representation of data elements. Tables, columns, and relationships between tables are all parts of data models. They may also provide details about data types and keys.

A good data model provides the following advantages: faster data exploration, easier aggregate creation, more precise reporting, quicker report creation, and simpler report maintenance. Because every dataset is unique and is used in different ways, it is hard to give a single set of rules for what makes a good data model. Above all, aim for simplicity when developing your data models.

Database normalization

Database normalization is a set of guidelines and procedures for data modeling. You can think of it as the process of designing or organizing a database.

Normalization has a few primary goals.

  • Removing redundant data: that is, when the same data is stored in multiple locations. Reducing redundancy shrinks file sizes and helps avoid inconsistent data.
  • Creating a design that accurately depicts the real world, including how different business entities interact. The data is therefore divided among multiple tables and linked together by relationships (a relationship is simply a connection between two tables).

Table relationships are established using primary and foreign keys. A primary key is the column (or combination of columns) that uniquely identifies each row and never contains empty values. A foreign key is a column in one table that references the primary key of another table; foreign keys are what let you connect rows across tables.
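To make this concrete, here is a minimal sketch in Power Query M of splitting a redundant flat table into a fact and a dimension joined by a key. All of the table and column names here (FlatSales, Customers, Sales, CustomerID, and so on) are hypothetical, and in Power BI Desktop each table would normally live in its own query; a single let expression just keeps the sketch self-contained.

```
let
    // Hypothetical flat table where each sales row repeats customer details
    FlatSales = Table.FromRecords({
        [OrderID = 1, CustomerID = "C1", CustomerName = "Acme", Amount = 120],
        [OrderID = 2, CustomerID = "C1", CustomerName = "Acme", Amount = 80],
        [OrderID = 3, CustomerID = "C2", CustomerName = "Globex", Amount = 200]
    }),
    // Dimension table: one row per customer, CustomerID marked as primary key
    Customers = Table.AddKey(
        Table.Distinct(Table.SelectColumns(FlatSales, {"CustomerID", "CustomerName"})),
        {"CustomerID"},
        true
    ),
    // Fact table: keeps CustomerID as a foreign key pointing at Customers
    Sales = Table.RemoveColumns(FlatSales, {"CustomerName"})
in
    Sales
```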

The Kimball Model

There are two key concepts in the Kimball model: facts and dimensions. Facts are the metrics from your business process. Dimensions provide the context surrounding a business process. These combine to form a star schema.

Fact Table

Facts are the measurements or metrics from your business process. Fact tables hold values for observational or event data, such as sales orders, product counts, prices, transaction dates, and quantities. Keys are how we establish relationships between fact tables and dimension tables. Because fact tables contain a lot of rows, you should keep both the number and the size of their columns small.

Dimension Table

Dimension tables contain the details about the data in fact tables: products, locations, employees, and order types. Dimensions provide the rest of the story while a fact may only provide the quantity or frequency. These tables are connected to the fact table through key columns. Dimension tables are used to filter and group the data in fact tables. Dimension tables are typically short and wide. They don’t contain that many rows, but do contain a large amount of context for the facts.

In the example star schema, Establishment Survey is the fact table, and Age, Industry, and Time are the dimension tables. Together they show what a star schema looks like.

Creating a star schema

Datasets commonly come in a “wide” format. These files can be divided into facts and dimensions to enhance queries and boost efficiency on bigger datasets. For this process, I’ll start by creating the “Establishment Survey” fact and “Industry” dimension tables.

Starting with only the “Establishment Survey” table, I’ll duplicate it and rename the copy “Industry”.

I want to eliminate a few columns from the “Industry” table because they are unnecessary there. I’ll keep only the following columns: NAICS code, NAICS Code Description, Industry group code, Industry group, Subsector Code, Subsector, Sector Code, and Sector, and then remove duplicate rows from the table.
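If you prefer scripting this step over clicking through the ribbon, the same transformation can be written in Power Query M via the Advanced Editor. This is only a sketch: it assumes the fact data lives in a query named Establishment Survey and uses the column names listed above.

```
let
    // Start from the existing fact query
    Source = #"Establishment Survey",
    // Keep only the industry-related columns
    IndustryColumns = Table.SelectColumns(Source, {
        "NAICS code", "NAICS Code Description",
        "Industry group code", "Industry group",
        "Subsector Code", "Subsector",
        "Sector Code", "Sector"
    }),
    // Remove duplicate rows so each industry appears only once
    Industry = Table.Distinct(IndustryColumns)
in
    Industry
```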

Going back to the “Establishment Survey” table, I’ll remove the columns that now live in “Industry,” making sure to keep the NAICS code. I’ll use this column as the connector between the dimension table and the fact table.
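The matching M sketch for the fact side drops the descriptive columns while keeping the NAICS code key. In practice this would be added as a step inside the “Establishment Survey” query itself rather than created as a new query.

```
let
    Source = #"Establishment Survey",
    // Drop the columns that now live in the Industry dimension,
    // keeping "NAICS code" as the foreign key to Industry
    Trimmed = Table.RemoveColumns(Source, {
        "NAICS Code Description", "Industry group code", "Industry group",
        "Subsector Code", "Subsector", "Sector Code", "Sector"
    })
in
    Trimmed
```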

Back in Model view, I’ll verify that there is a relationship between “Industry” and “Establishment Survey”.

You may occasionally need to import a separate file as a dimension. Doing so lets you add even more pertinent context to your facts. In this situation, it would be helpful to have more time-related data.

I’ll import the file named Time.txt from the Datasets folder and verify that there is a relationship between the “Time” and “Establishment Survey” tables.
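For readers who want the equivalent M, a file import like this can be scripted as well. The path, the tab delimiter, and the header row below are all assumptions about Time.txt, since the original import goes through the Get Data UI; adjust them to match the actual file.

```
let
    // Assumed location and format of the file
    Raw = Csv.Document(
        File.Contents("C:\Datasets\Time.txt"),
        [Delimiter = "#(tab)", Encoding = 65001]
    ),
    // Treat the first row as column headers
    Time = Table.PromoteHeaders(Raw, [PromoteAllScalars = true])
in
    Time
```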

Next, I’ll import the file named EstablishmentAge.csv from the Datasets folder, rename the table “Age”, and then remove the duplicate rows from it.
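The Age dimension follows the same import pattern, with a deduplication step at the end. Again, the file path is a placeholder:

```
let
    Raw = Csv.Document(File.Contents("C:\Datasets\EstablishmentAge.csv")),
    Promoted = Table.PromoteHeaders(Raw, [PromoteAllScalars = true]),
    // Remove duplicate rows so each age group appears only once
    Age = Table.Distinct(Promoted)
in
    Age
```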

The foundation of well-organized reports is provided by star schemas and the underlying data model; the more time you spend setting up these relationships and designs, the simpler it will be to generate and manage reports.

Snowflake schema

Similar to a star schema, the snowflake schema provides relationships between dimensions. In the example diagram, the Lender and Property dimensions each link to further dimension tables; note that the fact table stays the same. The main distinction between the two approaches is how they handle hierarchical data: star dimensions typically hold all levels of a hierarchy in a single table, whereas snowflake dimensions explicitly break hierarchy levels out into separate tables.

Here are some drawbacks of star schemas that make using snowflake preferable. Star schemas duplicate a significant amount of data, which increases storage costs and may have an impact on performance. Star schemas are also not the best choice for constantly updated data, especially when the dimensions are large.

Keeping with the earlier star diagram, I’ll explore the Industry dimension’s modeling in more detail. Given that the hierarchy is Sector > Subsector > Industry group > NAICS Code, I’ll duplicate the Industry table four times and name the copies “NAICS code,” “Industry group,” “Subsector,” and “Sector.”

In each table I’ll keep the code and name for that table’s level of the hierarchy, as well as the code for the level above it (so each table can link to its parent), and then remove any duplicate rows; a sketch of this pattern follows the list below.

  • The “NAICS code” table should contain NAICS code, NAICS code description, and Industry group code.
  • The “Industry group” table should contain Industry group code, Industry group, and Subsector code.
  • The “Subsector” table should contain Subsector code, Subsector, and Sector code.
  • The “Sector” table should contain Sector code and Sector.
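All four tables follow the same select-and-deduplicate pattern, so here is a hedged M sketch for just the “Industry group” table; the other three differ only in which columns they keep. It assumes the “Industry” dimension from earlier exists as its own query.

```
let
    // Start from the existing Industry dimension query
    Source = Industry,
    // Keep this level's code and name, plus the code of the level above it
    Kept = Table.SelectColumns(Source, {
        "Industry group code", "Industry group", "Subsector Code"
    }),
    // One row per industry group
    #"Industry group" = Table.Distinct(Kept)
in
    #"Industry group"
```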

I’ll delete the automatically created relationships between the “Industry” star dimension and the four tables I just created.

I’ll create a relationship between the “NAICS code” and “Establishment Survey” tables using the NAICS code column. I’ll also check that the other snowflake dimensions are connected to one another in the right order.

With snowflake schemas there is less duplication, which makes updating records more efficient. However, star schemas are easier for business users to understand, and most BI tools optimize their performance for this schema.

The Performance Analyzer

Power BI has a built-in Performance Analyzer. When activated, it tracks three essential metrics for each visual. The first is the time it took to retrieve the data from the internal database and execute any DAX operations on it. The second is how long the visual took to render. The third captures everything else, which is often time spent waiting on other activities, such as cross-filtering.

There are several ways to improve performance. If the DAX query takes a long time to execute, you could tune your DAX operations or speed up your data loading; this might include enhancing your data model! If visual display is the main issue, use simpler visuals and display less information on the screen; because Power BI must render each data point, plotting tens of thousands of points can take a while. If the Other value is causing the delay, consider reducing the number of visuals on the page.
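On the data-loading side, one of the simplest model-level fixes is to stop importing columns the report never uses and to set explicit data types. Here is a sketch in Power Query M, where the kept column names (Year, Establishments) are hypothetical:

```
let
    Source = #"Establishment Survey",
    // Keep only the columns the report actually uses (hypothetical list)
    Slim = Table.SelectColumns(Source, {"NAICS code", "Year", "Establishments"}),
    // Explicit types help the engine compress and scan the data efficiently
    Typed = Table.TransformColumnTypes(Slim, {
        {"Year", Int64.Type},
        {"Establishments", Int64.Type}
    })
in
    Typed
```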

Thank you for reading. You can reach me here.
