Data Modelling Basics

Published in

Data Engineering

7 min readApr 3, 2023

Data modelling allows businesses to organize and analyze data efficiently, enabling informed decisions. Data Modelling is a way of designing tables in a database to allow for efficient storage of data and retrieval for analytics.

In this article, we will explore the basics of Data Modelling, including the types of data models, key terms, and the process of creating a data model. We will also dive deeper into the concept of normalisation, which is a critical step in creating an efficient data model. Whether you are a business owner looking to improve your data management or a data analyst looking to expand your knowledge, this article will provide valuable insights into the world of data modelling.

Part of the main story: Medium blog link

Best book ever if you already know some data modelling or data warehousing background

Kimball Book: Click here

What is Data modelling?

Let’s say we have raw data for our retail shop that looks like this:

To organize and analyze this data, we can create a data model with a fact table that represents sales and dimension tables that provide additional details about customers and products.

Using the tables I described earlier, we could create the following fact table to represent sales:

We can also create dimension tables to provide additional details about customers and products:

Customer dimension table:

Product dimension table:

We can then use these tables to create an ER diagram that represents our data model:

Types of Data Model:

Conceptual data model:

A conceptual data model defines the high-level, abstract relationships between entities in a system.
It does not include specific details about how the data will be implemented in a database. For example, in the retail shop example, the conceptual data model might include entities such as “Customer,” “Product,” and “Sale,” and show how they are related to each other.
This model would be useful for discussing the overall structure of the system and ensuring that all stakeholders have a shared understanding of how it works.

Logical data model:

A logical data model translates the conceptual model into more concrete terms by specifying the details of how the entities, relationships, and attributes will be implemented in a database. For example, in the retail shop example, the logical data model would include tables such as “Customers,” “Products,” and “Sales,” and define the columns for each table.
It would also specify the primary keys, foreign keys, and other constraints necessary to ensure data integrity.

Physical data model:

A physical data model specifies the physical implementation of the database on a specific platform, such as a particular database management system (DBMS).
It includes details such as the file structures, storage requirements, and indexing strategies. For example, in the retail shop example, the physical data model would specify the specific database software being used (such as MySQL or PostgreSQL), and the physical implementation of the tables, such as specifying the data types for each column, the indexing strategy, and the storage requirements.

Various Data Modelling Terms Explained:

Entities: The main objects in the database, such as customers, products, stores, and transactions.
Attributes: Characteristics or properties that describe the entities, such as name, address, brand, price, location, date, and time.
Relationships: Connections or associations between the entities, such as a transaction being associated with a customer, a store, and one or more products.
ER Diagram: A visual representation of the entities, attributes, and relationships in a database.
Dimension: A set of attributes that describe a particular aspect of the data, such as brand, category, and location in the retail chain.
Fact: A numeric measure associated with an event or transaction, such as the number of products sold, total revenue generated, and popular products in the retail chain.

By thinking about these different data modeling terms, we can create a database that accurately captures the information we need and makes it easy to analyze and understand. This can help the retail chain make informed business decisions and improve its overall performance.

So where do I start data modelling?

Identify the business problem: Begin by understanding the business’s needs and the questions they want to answer. This helps identify the data needed to answer those questions.
Identify the entities: Once the business problem is clear, identify the key entities that need to be tracked in order to answer the questions. For example, in a retail chain, entities might include customers, products, stores, and transactions.
Identify the attributes: For each entity, define the specific attributes that need to be tracked. These may include things like names, addresses, phone numbers, email addresses, or dates and amounts of transactions.
Define the relationships: Establish how the entities are related to each other. For example, each transaction may be associated with a customer and a store, and each transaction may also include one or more products.
Create an ER diagram: The ER diagram is a visual representation of the database structure, showing how the different entities and attributes are related to each other. This helps in understanding the data model and identifying any missing entities or relationships.
Normalize the data: Normalize the data to eliminate redundancy and improve data integrity. This ensures that each piece of data is stored in only one place and eliminates data inconsistencies or errors.
Create dimension and fact tables: Use dimension tables to store descriptive attributes that help understand the context of the data and fact tables to store numerical values that can be analyzed. For example, a product dimension table may contain information about different products, while a sales fact table may contain information about sales transactions.
Implement the data model: Finally, implement the data model in a database management system, creating tables to store the data, defining relationships between the tables, and setting up any necessary indexes or constraints. This enables easy access and analysis of the data.

By following these steps, we can create a data model that accurately captures the information we need, and that can be used to answer the business questions that we have identified. This can help the retail chain make informed decisions and improve its overall performance.

Now to some deapth: What is normalization?

Normalization involves several normal forms (1NF, 2NF, 3NF, BCNF, 4NF, 5NF) that define increasingly strict rules about how data should be organized in a relational database.

First Normal Form (1NF) eliminates repeating groups and ensures each column in a table contains atomic values.
Second Normal Form (2NF) eliminates partial dependencies by ensuring that each non-key column is fully dependent on the primary key.
Third Normal Form (3NF) eliminates transitive dependencies by ensuring that no non-key column depends on another non-key column.
Boyce-Codd Normal Form (BCNF) further eliminates redundant data by requiring that each determinant be a candidate key.
Fourth Normal Form (4NF) and Fifth Normal Form (5NF) handle complex relationships by decomposing tables into smaller tables that maintain certain dependencies.

By applying these normal forms, data is organized in a way that is more efficient, less prone to errors, and easier to maintain.

More details here: https://www.guru99.com/database-normalization.html

What is Dimensional Modelling:

Dimensional modeling is a data modeling technique that optimizes data for querying and analysis in data warehousing.

It involves designing two types of tables: fact tables (numeric measurements) and dimension tables (descriptive attributes).
A fact table can be created for sales transactions, with columns for date, product ID, store ID, and sales amount. Dimension tables could be created for product, store, and time, with columns such as product name, product category, store name, store location, and date.
Data is transformed into a star schema, consisting of a central fact table connected to dimension tables through foreign key relationships.
This allows for efficient querying and analysis by enabling users to filter and group data by various dimensions.
The star schema can be implemented using SQL or ETL tools.

Overall, dimensional modeling simplifies and accelerates data analysis, supports scalability and flexibility, and allows for the addition of new dimensions and facts as business needs change.

Whats new in data modelling for Columnar Data store?

Data modeling in a columnar data store requires a few important steps:

Understand the Data Store: Columnar databases store data in columns rather than rows, optimizing querying and analytics workloads.
Identify the Data Entities: Identify the relevant data points and business entities to be stored.
Design the Schema: Design a schema optimized for columnar storage, grouping related data elements into tables or columns.
Choose a Compression Method: Select a compression method appropriate for the type of data being stored and query patterns.
Optimize for Query Performance: Use partitioning, indexing, and materialized views to improve query latency and overall system performance.
Test and Refine the Model: Continuously monitor the data store, testing and refining the model to improve query performance and efficiency.

Modelling in an Object store like S3:

Define Data Entities: Identify the types of data that need to be stored in S3.
Determine Data Access Patterns: Understand the types of queries that will be performed on the data.
Organize Data into Buckets and Folders: Organize data into logical buckets and folders for efficient access.
Use Object Metadata: Add additional information to each object stored in S3 using metadata.
Use Versioning: Keep track of changes to objects over time using S3’s versioning capability.
Consider Data Governance and Security: Manage access to data stored in S3 and ensure data is encrypted in transit and at rest.
Use Data Lake Architecture: If storing large amounts of data in S3, consider using a data lake architecture.

By following these steps and designing a schema that is optimized for object storage, you can create a secure and efficient data store in S3.

Glossary of terms:

https://medium.com/glossary-of-terms-data-architecture-c51e60bcdfab