Part 5 - Building Your AI-Ready Data Stack: A Masterclass in Data Modeling

Gunjan Titiya
7 min read · Jul 4, 2024


A 10-part series about building your first data stack from 0 to 1 and getting it ready for AI implementation.

Hello Readers,

Imagine you’ve inherited a vast, ancient library. But instead of neatly organized shelves, you find mountains of books, scrolls, and papers scattered haphazardly across the floor. This is the state of raw data in many organizations — a wealth of information, but chaotic and difficult to use.

Well-modeled data is the opposite of that pile. Data modeling is the art and science of transforming this chaos into a well-curated library where every piece of information has its place, and knowledge can be easily accessed and combined in powerful ways.

So, before we jump into the next phase of this series, building data pipelines and analytics, we need to model our data into this well-curated library. This is a crucial step for any analytics or machine learning use case, which is why I felt it deserved its own masterclass.

In this masterclass, we’ll embark on a journey through the realm of data modeling, using our e-commerce predictive analytics project as our guide. We’ll explore:

  1. The different types of data models and their purposes
  2. The step-by-step process of building a data model
  3. How to choose the right modeling approach for your needs
  4. Implementing our e-commerce data model with dbt

So, let’s roll up our sleeves and start organizing our data library!

Types of Data Models: From Abstract to Concrete

Just as a library has different levels of organization — from broad subject areas to specific shelf locations — data modeling involves several layers of abstraction. Let’s explore each type:

a) Conceptual Data Model: The Library Catalog

  • Purpose: Provides a high-level view of the main data entities and their relationships
  • Audience: Business stakeholders, project managers
  • Example: In our e-commerce world, this might show entities like “Customer,” “Product,” “Order,” and “Session” and their basic relationships

b) Logical Data Model: The Dewey Decimal System

  • Purpose: Defines the structure of data elements and their relationships, independent of any specific database system
  • Audience: Data architects, business analysts
  • Example: For our “Customer” entity, this would define attributes like customer_id, name, email, and their data types

c) Physical Data Model: The Shelf Layout

  • Purpose: Specifies how data is physically stored in the database, including tables, columns, indexes, and constraints
  • Audience: Database administrators, developers
  • Example: This would include the SQL to create our “customers” table in Redshift, with appropriate distribution/sort keys and foreign key relationships

d) Relational Data Model: The Card Catalog

  • Purpose: Organizes data into tables (relations) with defined relationships between them
  • Audience: Database designers, application developers
  • Example: Our e-commerce data in tables like “customers,” “orders,” “products,” with foreign key relationships

e) Hierarchical Data Model: The Library’s Organizational Chart

  • Purpose: Represents data in a tree-like structure with parent-child relationships
  • Audience: Developers working with specific hierarchical data structures
  • Example: Not commonly used in e-commerce, but might apply to product categories (Category -> Subcategory -> Product), as sketched below
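
To make this concrete, here is a minimal sketch of how such a hierarchy is often stored as a self-referencing table and walked with a recursive query. The table and column names are assumptions for illustration, not part of our e-commerce model:

-- Sketch only: Category -> Subcategory -> Product hierarchy in a self-referencing table
CREATE TABLE product_categories (
    category_id        INT PRIMARY KEY,
    parent_category_id INT,             -- NULL for top-level categories
    category_name      VARCHAR(100)
);

-- Walking the tree with a recursive CTE (supported by Redshift and most modern warehouses)
WITH RECURSIVE category_tree (category_id, parent_category_id, category_name, depth) AS (
    -- anchor: top-level categories
    SELECT category_id, parent_category_id, category_name, 1
    FROM product_categories
    WHERE parent_category_id IS NULL
    UNION ALL
    -- recursive step: attach children to their parents
    SELECT c.category_id, c.parent_category_id, c.category_name, t.depth + 1
    FROM product_categories c
    JOIN category_tree t ON c.parent_category_id = t.category_id
)
SELECT * FROM category_tree;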

f) Network Data Model: The Library’s Social Network

  • Purpose: Represents data as a network of interconnected nodes
  • Audience: Developers working with graph databases or complex relationships
  • Example: Modeling product recommendations based on purchase history and user similarities

g) Dimensional Data Model: The Reading Room

  • Purpose: Optimizes data for analytical queries, often used in data warehousing
  • Audience: Business intelligence developers, data analysts
  • Example: A star schema with a central fact table of sales transactions, surrounded by dimension tables for customers, products, and time

The Data Modeling Process: Building Our E-commerce Library

Now, let’s walk through the process of building a data model for our e-commerce predictive analytics project:

Step 1: Requirements Gathering

  • Interview stakeholders to understand their analytical needs
  • Identify key business processes and metrics
  • Define the scope of the data model

Example:

  • Marketing team needs customer segmentation for targeted campaigns
  • Product team wants to analyze product performance and recommend related items
  • Finance team requires accurate sales and revenue reporting

Step 2: Conceptual Modeling

  • Identify main entities and their relationships
  • Create a high-level entity-relationship diagram (ERD)

Example Entities:

  • Customer
  • Product
  • Order
  • Session
  • Campaign

Step 3: Logical Modeling

  • Define attributes for each entity
  • Specify relationships and cardinality
  • Normalize data to reduce redundancy

Example for Customer entity:

  • Attributes: customer_id, first_name, last_name, email, signup_date
  • Relationships: One Customer has many Orders (1:M)

Step 4: Physical Modeling

  • Choose a database system (e.g., Amazon Redshift)
  • Define tables, columns, data types, and constraints
  • Create indexes (or, in Redshift, distribution and sort keys) for performance optimization

Example SQL for the customers table (Amazon Redshift):

CREATE TABLE customers (
    customer_id     INT PRIMARY KEY,      -- informational only; Redshift does not enforce it
    first_name      VARCHAR(50),
    last_name       VARCHAR(50),
    email           VARCHAR(100) UNIQUE,  -- informational only
    signup_date     DATE,
    last_login_date TIMESTAMP
)
DISTKEY (customer_id)    -- distribute rows by customer for efficient joins
SORTKEY (signup_date);   -- speeds up date-range scans

-- Note: Redshift does not support CREATE INDEX; distribution and sort keys
-- take the place of traditional indexes for query performance.
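
To round out the physical model, here is a hedged sketch of a matching orders table with a foreign key back to customers. The column names mirror the staging models used later in this article, but the exact DDL would depend on your source data:

-- Sketch: orders table; the REFERENCES constraint is informational in Redshift
CREATE TABLE orders (
    order_id     INT PRIMARY KEY,
    customer_id  INT REFERENCES customers(customer_id),
    order_date   TIMESTAMP,
    total_amount DECIMAL(12, 2),
    status       VARCHAR(20)
)
DISTKEY (customer_id)   -- co-locate a customer's orders for joins
SORTKEY (order_date);   -- range-restricted scans by date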

Step 5: Dimensional Modeling

  • Identify facts (measurable events) and dimensions (context for facts)
  • Design star or snowflake schemas for analytical querying

Example:

  • Fact table: sales_transactions
  • Dimension tables: dim_customers, dim_products, dim_time (see the query sketch below)
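
To show why this layout pays off, here is a sketch of the kind of analytical query a star schema makes straightforward. Column names such as order_month, category_name, date_key, and revenue are assumptions for illustration:

-- Illustrative star-schema query: monthly revenue by product category
SELECT
    d.order_month,
    p.category_name,
    SUM(f.revenue) AS total_revenue
FROM sales_transactions f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_time d ON f.date_key = d.date_key
GROUP BY d.order_month, p.category_name
ORDER BY d.order_month, total_revenue DESC;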

Choosing the Right Modeling Approach: Decision Matrix

When deciding on the best modeling approach for your e-commerce project (or any predictive analytics project), weigh factors such as the query patterns you need to support, data volume and growth, how often the schema will change, and whether the workload is primarily transactional or analytical.

For our e-commerce predictive analytics project, a combination of relational and dimensional modeling would be ideal:

  • Relational model for transactional data (orders, customer info)
  • Dimensional model for analytical queries (sales analysis, customer segmentation)

This hybrid approach allows us to maintain data integrity while optimizing for analytical performance.

Implementing Our E-commerce Data Model with dbt

Now that we’ve designed our data model, let’s implement it using dbt (data build tool). dbt allows us to define our data transformations as SQL select statements, making it easy to implement our logical and physical models.

If you need a crash course on how dbt works, an introductory article on dbt fundamentals is a worthwhile detour before continuing.

These dbt models work together to transform our raw data into clean, analysis-ready, business-facing datasets. The staging models clean our raw data, intermediate models perform initial joins and aggregations, fact and dimension models build our core analytical structures, and mart models create specific views for business users.

a) Staging Models (cleaning raw data):

-- models/staging/stg_customers.sql
SELECT
    customer_id,
    first_name,
    last_name,
    email,
    signup_date
FROM {{ source('raw', 'customers') }}
WHERE email IS NOT NULL

-- models/staging/stg_orders.sql
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount,
    status
FROM {{ source('raw', 'orders') }}
WHERE order_id IS NOT NULL
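
Note that the fact model in section (c) below also references stg_order_items and stg_products. Those models are not shown here, but minimal sketches might look like this (the raw column names are assumptions):

-- models/staging/stg_order_items.sql (sketch; raw column names assumed)
SELECT
    order_item_id,
    order_id,
    product_id,
    quantity,
    unit_price
FROM {{ source('raw', 'order_items') }}
WHERE order_item_id IS NOT NULL

-- models/staging/stg_products.sql (sketch; raw column names assumed)
SELECT
    product_id,
    product_name,
    category_id,
    price
FROM {{ source('raw', 'products') }}
WHERE product_id IS NOT NULL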

b) Intermediate Models (joining and enhancing data):

-- models/intermediate/int_customer_orders.sql
SELECT
    c.customer_id,
    c.email,
    COUNT(o.order_id) AS total_orders,
    SUM(o.total_amount) AS lifetime_value,
    MIN(o.order_date) AS first_order_date,
    MAX(o.order_date) AS last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.email

c) Fact Models (representing business processes):

-- models/facts/fct_daily_sales.sql
SELECT
    DATE_TRUNC('day', o.order_date) AS sale_date,
    p.product_id,
    p.category_id,
    SUM(oi.quantity) AS units_sold,
    SUM(oi.quantity * oi.unit_price) AS revenue
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_order_items') }} oi ON o.order_id = oi.order_id
JOIN {{ ref('stg_products') }} p ON oi.product_id = p.product_id
WHERE o.status = 'completed'
GROUP BY 1, 2, 3

d) Dimension Models (providing context):

-- models/dimensions/dim_customers.sql
-- Joins cleaned customer records with their order aggregates from int_customer_orders
SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    c.email,
    c.signup_date,
    COALESCE(co.total_orders, 0) AS total_orders,
    COALESCE(co.lifetime_value, 0) AS lifetime_value,
    co.first_order_date,
    co.last_order_date,
    DATEDIFF('day', co.last_order_date, CURRENT_DATE) AS days_since_last_order
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('int_customer_orders') }} co ON c.customer_id = co.customer_id

e) Mart Models (business-specific datasets):

-- models/marts/customer_segmentation.sql
SELECT
    customer_id,
    email,
    CASE
        WHEN lifetime_value > 1000 AND total_orders > 10 THEN 'High Value'
        WHEN lifetime_value > 500 OR total_orders > 5 THEN 'Medium Value'
        ELSE 'Low Value'
    END AS customer_segment,
    CASE
        WHEN days_since_last_order <= 30 THEN 'Active'
        WHEN days_since_last_order <= 90 THEN 'At Risk'
        ELSE 'Churned'
    END AS activity_status
FROM {{ ref('dim_customers') }}
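
As a usage example, the marketing team could query this mart directly, for instance to size each segment. This is a sketch, assuming the model is materialized as a table or view named customer_segmentation:

-- How many customers fall into each segment / activity combination?
SELECT
    customer_segment,
    activity_status,
    COUNT(*) AS customer_count
FROM customer_segmentation
GROUP BY customer_segment, activity_status
ORDER BY customer_segment, activity_status;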

So, there you have it.

Data modeling is the cornerstone of turning raw data into actionable insights. By thoughtfully structuring our e-commerce data, we’ve created a solid foundation for predictive analytics, customer segmentation, and data-driven decision making.

Remember, data modeling is an iterative process. As your business evolves and new data sources emerge, you’ll need to revisit and refine your models. The key is to strike a balance between flexibility for future changes and stability for current analytical needs.

With this comprehensive data model in place, implemented through dbt, our e-commerce predictive analytics project is now poised for success. We’ve transformed our chaotic pile of data into a well-organized library of insights, ready to power everything from marketing campaigns to product recommendations and financial forecasts.

In our next article, we’ll dive into the mechanics of setting up our data pipeline to bring these models to life in our AWS environment. Until then, happy modeling!

Questions? Feedback? Connect with me on LinkedIn or contact me directly at gunjan@bytesandbayes.com!

This article is proudly brought to you by Bytes & Bayes, the consulting firm dedicated to guiding you towards data excellence. We also offer an AI Literacy for Business Leaders workshop that provides a deeper understanding of how to make your organization ready for AI.


Written by Gunjan Titiya

Founder, Bytes & Bayes | Data and AI strategy consultant | Speaking and writing about the data world | "LLM models won't improve a company's bottom line; their data will."