Deepening into Data Warehouse…

Alex Souza
blog do zouza
Published in
10 min readFeb 6, 2022

Hey guys!
This article aims to give an overview of Data Warehouse which is, so to speak, the basis for traditional Business Intelligence … we will present some concepts and techniques used on a daily basis… but before that, let’s talk a little about BI ( Business Intelligence ).

#Tip1 — Power BI, Tableau… are not BI. BI is a concept, not a tool…

What is Business Intelligence?

Term used to describe management decisions made on the basis of evidence and facts (data).

Another definition for BI: it is responsible for strategically using data in organizations from the collection , treatment and analysis of any and all types of relevant information, enabling the best decisions for the business.

The figure above shows that BI is a very broad concept, surrounded by several other more specific concepts and techniques, so to speak.

Opening a parenthesis (in relation to #Tip1): Power BI, Tableau enter the specific concept in the image above called: Self-service BI

Business Intelligence is not JUST a visualization with effects and animations, but it is also… Quality data (consistent, complete, standardized data, with numerical precision…), that have a Data Catalog and a Data Warehouse (multidimensional modeling — fact | dimensions) well done and documented… All this to meet the needs of decision makers , generating analyzes and insights that really add value to the business.

Beautiful dashboards , full of animations, colors, etc., if they don’t add value to the business , it’s simply a vanity metric .

Dashboards offer visual displays of important information consolidated and organized on a single screen , so they can be digested at a glance and easily explored and deepened.

Data Warehousing

A Data warehouse is a collection of data produced to inform decision making; it is also a central repository of historical and current data of potential interest to managers across the organization.

Data Warehouse is like a relational database designed and optimized for analytical needs (OLAP).

Data Warehousing is the act of organizing and storing data in a way that makes its retrieval efficient and insightful. It is also called the process of transforming data into information.

Can you explain this image there?
Yes, there is a way… come on…

Data Sources are the multiple sources of data that we have in our organization, they can be databases (on-premises or in the cloud), spreadsheets and etc, these data are Extracted, Transformed and Loaded ( ETL ) to a temporary repository called ODS (O perational Data Store , by many also called Stage area ) and then the entire process and application of Dimensional Modeling (OLAP) techniques is carried out to create the optimized tables for analysis in the Data Warehouse ( Data Warehouses (DW) , can be divided into Data Marts, a kind of sectorial DW) and finally, users use this Data Warehous and/or Data Marts to perform their analysis and generate their insights .

Let’s go now to some concepts that permeate the Data Warehouse, I won’t go into too much depth, but it will serve as a guide for your studies..

OLAP operations

OLAP — Online analytical processing
It is the ability to manipulate and analyze a large volume of data from multiple perspectives (cubes).

A Data Warehouse is modeled on the OLAP concept , while traditional (relational) databases are modeled on the OLTP ( Online Transaction Processing) concept — this is optimized for insertions, updates, deletions, that is, for everyday needs of the operation of its systems: ERP, CRM, Accounting, Financial and etc).

OLTP — Optimized for insertions, updates and deletions of data there in the day to day of the operation | OLAP — Optimized for analytics (focus of our study here)

operations

Let’s use the cube below to demonstrate some of the operations, it contains 3 dimensions: Location , Time and Items

study cube

Slicing or slicing ( Slice )
It selects a single dimension of the OLAP cube which results in the creation of a new subcube (“slicing to a fixed value”). Based on the given study cube, the Slice runs on the Time = “Q1” dimension.

Split or mince (dice)
It returns a subcube of the OLAP cube by selecting two or more dimensions (“value range selection”).

Increase/decrease focus ( drill-down / roll-up )
User navigates through data levels, going from the most summarized (less focus — roll-up ) to the most detailed (more drill-down focus )

In the drill down operation , less detailed data is converted to highly detailed data. It can be done by: — Moving down the concept hierarchy — Adding a new dimension

In roll up , it is just the opposite of the previous one. It performs aggregation on the OLAP cube. It can be done by:
- Moving up the concept hierarchy
- Reducing dimensions

Pivot
Used to modify the dimensional orientation of a report or query view page. It is also known as a rotate operation as it rotates the current view to get a new view of the representation. On the subcube obtained after the slice operation , executing the pivot operation provides a new view of it.

Developing a Data Warehouse

Data warehouse development approaches

The Inmon model : EDW approach (top-down)
Kimball’s model : data mart approach (bottom-up)

Which model is the best?
What meets your business needs.

Detailing the approaches a little…

Top Down (Inmon)
Makes a whole, which will serve all sectors
Advantages : centralization, globalization vision
Disadvantages : slow implementation and high risk

Botton Up (Kimball)
Does it by sectors and then goes up to a whole
• Advantages: implementation and quick
turnaround • Disadvantages: integrations between areas, decentralization

Combined
Top Down Planning and Bottom Up
Development • Whole planning (not just sectoral) and implementation done by sector rather than as a whole.

Top-Down x Bottom-Up

Data representation in Data Warehouse

Dimensional Modeling
A system based on data retrieval that supports high volume access via queries.

Star Schema :
Most used and simplest dimensional modeling
style • Contains a central fact table surrounded and connected by several dimensional tables

Snowflake Schema :
An extension of the star schema such that the entity relation diagram looks like a snowflake shape .

Galaxy Schema:
Also known as the Constellation of Facts scheme. Contains more than 1 fact table. Dimensions that are shared are called conformed dimensions .

Multidimensionality

The ability to organize, present, and analyze data by multiple dimensions , such as sales by regions, by products, by salespeople, and by time (four dimensions, in the example)

Multidimensional presentation :
Dimensions : products, sales, market, segments, business units, geographic locations, distribution channels, countries or industries
Measures : money, sales volume, people, inventory profit, reality versus forecast
Time : daily , weekly, monthly, quarterly or yearly

Star Scheme versus Snowflake Scheme (Contextualized above)

Star Schema (Star Schema)

Here we will delve a little deeper into the Star schema…

The name “ star ” is due to the arrangement in which the tables are located, with the fact table being centralized and relating to several other dimension tables. See an example of the Star Schema structure below. In this model, data is “ denormalized” to avoid joins between tables, reducing query time , however, due to data repetition, it uses more disk space .

The advantage of this model is the efficiency in extracting data, which is a great advantage when it comes to a data warehouse

Example of a Star Schema Modeling

Elements of Multidimensional Modeling

  • Dimension tables
  • Facts tables
  • measurements

Dimension Table

A dimension is a collection of textual attributes that are highly correlated with each other.

Examples:
Product (product name, product detail, product line, product sales unit…)
Customer (client name, cpf, cnpj, address, city, state…)

In a retail database, for example, dimensions such as product, warehouse, customer, promotion and time are common.

Example dimensions (highlighted in red)

Dimension Table — Date
A valuable resource without a doubt is a good Date (or even calendar or time) dimension . Although a star (a fact-dimension set) is in itself a great analytical resource, it is of little use if it does not add the ability to analyze data against time (Year, Semester, Month, Week, day…)

What the Kimball Group says… | Generating by SQL | Generating by Power BI using DAX

Facts Table

The fact table summarizes the relationship between the various dimensions. This is because the fact table key is the association of the dimension tables’ primary keys.

Usually this type of table has one or more numeric facts (eg keys, dates, values, quantities) of “facts” that occurred (eg a sale, purchase, production), in addition to the keys to the dimensions (eg customer , product, seller) that are part of that “fact”.

Example of a Fact Table (highlighted in red)

Measurements

Measures are the numerical attributes that represent a fact , the performance of a business indicator relative to the dimensions that participate in that fact.

Examples of Measures are:

  • The total value in reais of sales by product;
  • The number of product units sold;
  • The quantity of products in stock;
  • The cost of sale per Seller, among others.

Multidimensional Modeling Techniques

Here I will show a sequence of images of how to transform a Relational Modeling — OLTP (Relational Database — Operational) to a Multidimensional Modeling .

From Relational to Multidimensional

Relational Model (OLTP)
Model that is used in operational databases (ERP, CRM…).
In our example, we have a relational modeling of a sales database.

Relational Model (OLTP) — Model that is used in operational databases (ERP, CRM…)

Example 1 — Dimensions involving Person

  • From Relational…
  • to the Multidimensional
This is how it looks in the Multidimensional model (dimensions)

Example 2 — Object dimension

  • From Relational to Multidimensional

Example 3 — Sales fact

  • From Relational to Multidimensional

Example 4 — complete
Here is the complete model, transforming a Relational Modeling (OLTP) to a Multidimensional Modeling

  • From the Relational
  • to the Multidimensional

Storytelling

Storytelling is the detailed description of concepts, opinions and personal experiences through narratives that provoke emotional responses and insights . It’s about leveraging stories to engage your audience or clarify anything. And it’s one of the easiest ways to get your audience together to make a conversation lively and engaging.

Safety

Information security and privacy are the primary concerns of a data warehouse
professional • Protect your most valuable assets
• Government regulations ( LGPD , etc.) • Must be explicitly planned and executed

Massive Data Warehouses and Scalability

Scalability — The main issues surrounding scalability are:

  • amount of data in the data warehouse;
  • the expected pace of growth;
  • the number of concurrent users;
  • the complexity of your queries.

Having good scalability means that queries and other data access functions will grow linearly (at best) with the size of the data warehouse .

I know it’s not all wonderful… what are the most common problems in everyday life?

Lack

  • Documentation
  • No Entity and Relationship Diagrams
  • No Data Dictionary — Generate Dictionary (SQL Server)
  • Most of the time, we only have the database tables
  • “Data Warehouses” taking data straight from production
  • “Data Warehouses” without using multidimensional modeling

Issues in implementing data warehouses

  • Starting with the wrong sponsorship chain
  • Setting expectations you can’t meet
  • Behaving in a politically naive way
  • Load the data warehouse with information just because it’s available
  • Believing that designing databases with data warehouses is the same as designing transactional databases
  • Choose a technology-centric, rather than user -centric, data warehouse manager
If you found this article helpful. Leave your applause and follow me on Medium.

--

--