Part 5 - Building Your AI-Ready Data Stack: A Masterclass in Data Modeling

Gunjan Titiya
7 min read · Jul 4, 2024


A 10-part series about building your first data stack from 0 to 1 and getting it ready for AI implementation.

Hello Readers,

Imagine you’ve inherited a vast, ancient library. But instead of neatly organized shelves, you find mountains of books, scrolls, and papers scattered haphazardly across the floor. This is the state of raw data in many organizations — a wealth of information, but chaotic and difficult to use.

Well-modeled data is the opposite of that pile. Data modeling is the art and science of transforming this chaos into a well-curated library where every piece of information has its place, and knowledge can be easily accessed and combined in powerful ways.

So, before we jump into the next phase of this series, building data pipelines and analytics, we need to model our data into this well-curated library. This is a crucial step for any analytics or machine learning use case, which is why I felt it deserved its own masterclass.

In this masterclass, we’ll embark on a journey through the realm of data modeling, using our e-commerce predictive analytics project as our guide. We’ll explore:

  1. The different types of data models and their purposes
  2. The step-by-step process of building a data model
  3. How to choose the right modeling approach for your needs
  4. Implementing our e-commerce data model with dbt

So, let’s roll up our sleeves and start organizing our data library!

Types of Data Models: From Abstract to Concrete

Just as a library has different levels of organization — from broad subject areas to specific shelf locations — data modeling involves several layers of abstraction. Let’s explore each type:

a) Conceptual Data Model: The Library Catalog

  • Purpose: Provides a high-level view of the main data entities and their relationships
  • Audience: Business stakeholders, project managers
  • Example: In our e-commerce world, this might show entities like “Customer,” “Product,” “Order,” and “Session” and their basic relationships

b) Logical Data Model: The Dewey Decimal System

  • Purpose: Defines the structure of data elements and their relationships, independent of any specific database system
  • Audience: Data architects, business analysts
  • Example: For our “Customer” entity, this would define attributes like customer_id, name, email, and their data types

c) Physical Data Model: The Shelf Layout

  • Purpose: Specifies how data is physically stored in the database, including tables, columns, indexes, and constraints
  • Audience: Database administrators, developers
  • Example: This would include the SQL to create our “customers” table in Redshift, with appropriate distribution/sort keys and foreign key relationships

d) Relational Data Model: The Card Catalog

  • Purpose: Organizes data into tables (relations) with defined relationships between them
  • Audience: Database designers, application developers
  • Example: Our e-commerce data in tables like “customers,” “orders,” “products,” with foreign key relationships

e) Hierarchical Data Model: The Library’s Organizational Chart

  • Purpose: Represents data in a tree-like structure with parent-child relationships
  • Audience: Developers working with specific hierarchical data structures
  • Example: Not commonly used in e-commerce, but might apply to product categories (Category -> Subcategory -> Product), as sketched below
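
To make this concrete, here is a minimal sketch of how such a hierarchy is often stored as a self-referencing table and walked with a recursive query. The table and column names are assumptions for illustration, not part of our e-commerce model:

-- Sketch only: Category -> Subcategory -> Product hierarchy in a self-referencing table
CREATE TABLE product_categories (
    category_id        INT PRIMARY KEY,
    parent_category_id INT,             -- NULL for top-level categories
    category_name      VARCHAR(100)
);

-- Walking the tree with a recursive CTE (supported by Redshift and most modern warehouses)
WITH RECURSIVE category_tree (category_id, parent_category_id, category_name, depth) AS (
    -- anchor: top-level categories
    SELECT category_id, parent_category_id, category_name, 1
    FROM product_categories
    WHERE parent_category_id IS NULL
    UNION ALL
    -- recursive step: attach children to their parents
    SELECT c.category_id, c.parent_category_id, c.category_name, t.depth + 1
    FROM product_categories c
    JOIN category_tree t ON c.parent_category_id = t.category_id
)
SELECT * FROM category_tree;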

f) Network Data Model: The Library’s Social Network

  • Purpose: Represents data as a network of interconnected nodes
  • Audience: Developers working with graph databases or complex relationships
  • Example: Modeling product recommendations based on purchase history and user similarities

g) Dimensional Data Model: The Reading Room

  • Purpose: Optimizes data for analytical queries, often used in data warehousing
  • Audience: Business intelligence developers, data analysts
  • Example: A star schema with a central fact table of sales transactions, surrounded by dimension tables for customers, products, and time

The Data Modeling Process: Building Our E-commerce Library

Now, let’s walk through the process of building a data model for our e-commerce predictive analytics project:

Step 1: Requirements Gathering

  • Interview stakeholders to understand their analytical needs
  • Identify key business processes and metrics
  • Define the scope of the data model

Example:

  • Marketing team needs customer segmentation for targeted campaigns
  • Product team wants to analyze product performance and recommend related items
  • Finance team requires accurate sales and revenue reporting

Step 2: Conceptual Modeling

  • Identify main entities and their relationships
  • Create a high-level entity-relationship diagram (ERD)

Example Entities:

  • Customer
  • Product
  • Order
  • Session
  • Campaign

Step 3: Logical Modeling

  • Define attributes for each entity
  • Specify relationships and cardinality
  • Normalize data to reduce redundancy

Example for Customer entity:

  • Attributes: customer_id, first_name, last_name, email, signup_date
  • Relationships: One Customer has many Orders (1:M)

Step 4: Physical Modeling

  • Choose a database system (e.g., Amazon Redshift)
  • Define tables, columns, data types, and constraints
  • Create indexes (or, in Redshift, distribution and sort keys) for performance optimization

Example SQL for the customers table (Amazon Redshift):

CREATE TABLE customers (
    customer_id     INT PRIMARY KEY,      -- informational only; Redshift does not enforce it
    first_name      VARCHAR(50),
    last_name       VARCHAR(50),
    email           VARCHAR(100) UNIQUE,  -- informational only
    signup_date     DATE,
    last_login_date TIMESTAMP
)
DISTKEY (customer_id)    -- distribute rows by customer for efficient joins
SORTKEY (signup_date);   -- speeds up date-range scans

-- Note: Redshift does not support CREATE INDEX; distribution and sort keys
-- take the place of traditional indexes for query performance.
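
To round out the physical model, here is a hedged sketch of a matching orders table with a foreign key back to customers. The column names mirror the staging models used later in this article, but the exact DDL would depend on your source data:

-- Sketch: orders table; the REFERENCES constraint is informational in Redshift
CREATE TABLE orders (
    order_id     INT PRIMARY KEY,
    customer_id  INT REFERENCES customers(customer_id),
    order_date   TIMESTAMP,
    total_amount DECIMAL(12, 2),
    status       VARCHAR(20)
)
DISTKEY (customer_id)   -- co-locate a customer's orders for joins
SORTKEY (order_date);   -- range-restricted scans by date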

Step 5: Dimensional Modeling

  • Identify facts (measurable events) and dimensions (context for facts)
  • Design star or snowflake schemas for analytical querying

Example:

  • Fact table: sales_transactions
  • Dimension tables: dim_customers, dim_products, dim_time (see the query sketch below)
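
To show why this layout pays off, here is a sketch of the kind of analytical query a star schema makes straightforward. Column names such as order_month, category_name, date_key, and revenue are assumptions for illustration:

-- Illustrative star-schema query: monthly revenue by product category
SELECT
    d.order_month,
    p.category_name,
    SUM(f.revenue) AS total_revenue
FROM sales_transactions f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_time d ON f.date_key = d.date_key
GROUP BY d.order_month, p.category_name
ORDER BY d.order_month, total_revenue DESC;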

Choosing the Right Modeling Approach: Decision Matrix

When deciding on the best modeling approach for your e-commerce project (or any predictive analytics project), weigh factors such as the query patterns you need to support, data volume and growth, how often the schema will change, and whether the workload is primarily transactional or analytical.

For our e-commerce predictive analytics project, a combination of relational and dimensional modeling would be ideal:

  • Relational model for transactional data (orders, customer info)
  • Dimensional model for analytical queries (sales analysis, customer segmentation)

This hybrid approach allows us to maintain data integrity while optimizing for analytical performance.

Implementing Our E-commerce Data Model with dbt

Now that we’ve designed our data model, let’s implement it using dbt (data build tool). dbt allows us to define our data transformations as SQL select statements, making it easy to implement our logical and physical models.

If you need a crash course on how dbt works, an introductory article on dbt fundamentals is a worthwhile detour before continuing.

These dbt models work together to transform our raw data into clean, analysis-ready, business-facing datasets. The staging models clean our raw data, intermediate models perform initial joins and aggregations, fact and dimension models build our core analytical structures, and mart models create specific views for business users.

a) Staging Models (cleaning raw data):

-- models/staging/stg_customers.sql
SELECT
    customer_id,
    first_name,
    last_name,
    email,
    signup_date
FROM {{ source('raw', 'customers') }}
WHERE email IS NOT NULL

-- models/staging/stg_orders.sql
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount,
    status
FROM {{ source('raw', 'orders') }}
WHERE order_id IS NOT NULL
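
Note that the fact model in section (c) below also references stg_order_items and stg_products. Those models are not shown here, but minimal sketches might look like this (the raw column names are assumptions):

-- models/staging/stg_order_items.sql (sketch; raw column names assumed)
SELECT
    order_item_id,
    order_id,
    product_id,
    quantity,
    unit_price
FROM {{ source('raw', 'order_items') }}
WHERE order_item_id IS NOT NULL

-- models/staging/stg_products.sql (sketch; raw column names assumed)
SELECT
    product_id,
    product_name,
    category_id,
    price
FROM {{ source('raw', 'products') }}
WHERE product_id IS NOT NULL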

b) Intermediate Models (joining and enhancing data):

-- models/intermediate/int_customer_orders.sql
SELECT
    c.customer_id,
    c.email,
    COUNT(o.order_id) AS total_orders,
    SUM(o.total_amount) AS lifetime_value,
    MIN(o.order_date) AS first_order_date,
    MAX(o.order_date) AS last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.email

c) Fact Models (representing business processes):

-- models/facts/fct_daily_sales.sql
SELECT
    DATE_TRUNC('day', o.order_date) AS sale_date,
    p.product_id,
    p.category_id,
    SUM(oi.quantity) AS units_sold,
    SUM(oi.quantity * oi.unit_price) AS revenue
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_order_items') }} oi ON o.order_id = oi.order_id
JOIN {{ ref('stg_products') }} p ON oi.product_id = p.product_id
WHERE o.status = 'completed'
GROUP BY 1, 2, 3

d) Dimension Models (providing context):

-- models/dimensions/dim_customers.sql
-- Joins cleaned customer records with their order aggregates from int_customer_orders
SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    c.email,
    c.signup_date,
    COALESCE(co.total_orders, 0) AS total_orders,
    COALESCE(co.lifetime_value, 0) AS lifetime_value,
    co.first_order_date,
    co.last_order_date,
    DATEDIFF('day', co.last_order_date, CURRENT_DATE) AS days_since_last_order
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('int_customer_orders') }} co ON c.customer_id = co.customer_id

e) Mart Models (business-specific datasets):

-- models/marts/customer_segmentation.sql
SELECT
    customer_id,
    email,
    CASE
        WHEN lifetime_value > 1000 AND total_orders > 10 THEN 'High Value'
        WHEN lifetime_value > 500 OR total_orders > 5 THEN 'Medium Value'
        ELSE 'Low Value'
    END AS customer_segment,
    CASE
        WHEN days_since_last_order <= 30 THEN 'Active'
        WHEN days_since_last_order <= 90 THEN 'At Risk'
        ELSE 'Churned'
    END AS activity_status
FROM {{ ref('dim_customers') }}
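
As a usage example, the marketing team could query this mart directly, for instance to size each segment. This is a sketch, assuming the model is materialized as a table or view named customer_segmentation:

-- How many customers fall into each segment / activity combination?
SELECT
    customer_segment,
    activity_status,
    COUNT(*) AS customer_count
FROM customer_segmentation
GROUP BY customer_segment, activity_status
ORDER BY customer_segment, activity_status;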

So, there you have it.

Data modeling is the cornerstone of turning raw data into actionable insights. By thoughtfully structuring our e-commerce data, we’ve created a solid foundation for predictive analytics, customer segmentation, and data-driven decision making.

Remember, data modeling is an iterative process. As your business evolves and new data sources emerge, you’ll need to revisit and refine your models. The key is to strike a balance between flexibility for future changes and stability for current analytical needs.

With this comprehensive data model in place, implemented through dbt, our e-commerce predictive analytics project is now poised for success. We’ve transformed our chaotic pile of data into a well-organized library of insights, ready to power everything from marketing campaigns to product recommendations and financial forecasts.

In our next article, we’ll dive into the mechanics of setting up our data pipeline to bring these models to life in our AWS environment. Until then, happy modeling!

Questions? Feedback? Connect with me on LinkedIn or contact me directly at gunjan@bytesandbayes.com!

This article is proudly brought to you by Bytes & Bayes, the consulting firm dedicated to guiding you towards data excellence. We also offer an AI Literacy for Business Leaders workshop that provides a deeper understanding of how to make your organization ready for AI.


Written by Gunjan Titiya

Founder, Bytes & Bayes | Data and AI strategy consultant | Speaking and writing about the data world | "LLM models won't improve a company's bottom line; their data will."