Cracking the Data Modeling Interview: Part 4: Designing Efficient Data Structures

Sean Coyne
10 min read · Nov 11, 2023

You can crush the data engineering interview by learning the tips and tricks in my book, Ace The Data Engineering Interview, available on Kindle and paperback, and my free companion app on iOS.

As part of the data modeling interview, you will be asked to design an efficient database to solve a real-world problem. In this article we will cover a variety of database design patterns and work through an example problem.

Designing Efficient Data Structures

Fact Tables vs. Dimension Tables:

In the context of data warehousing, fact tables and dimension tables are central elements of star and snowflake schemas. A fact table primarily contains the measurable, quantitative data for analysis, known as facts, such as sales revenue, quantities, or counts. It also has foreign keys that reference associated dimension tables. Dimension tables, on the other hand, include descriptive, textual, or categorical information, typically the attributes of dimensions, like details of products, customers, or time periods. For instance, a dimension table might store details like product names, categories, and manufacturers.

Star and Snowflake Schemas:

In a star schema, the fact table sits at the center, directly linked to a set of dimension tables, forming a pattern reminiscent of a star. The relationships between the fact and dimension tables are typically “many-to-one” with denormalized dimension tables, ensuring simplicity and query efficiency. The snowflake schema is a more normalized version of the star schema, where the dimension tables are often split into related sub-dimensions. As a result, dimension tables in the snowflake schema may have relationships among themselves, resembling a complex, branching snowflake. While this normalization reduces data redundancy, it might increase the complexity of queries when compared to the star schema.

We will use a simple sales scenario as an example for both the star and snowflake schemas. The star schema has a direct, denormalized relationship from the fact table to each of its dimensions. In the snowflake schema, some dimensions are further normalized, branching or “snowflaking” out into other tables, like the Product and Category relationship shown below.

Star Schema:

Fact Table - Sales:
| Sale_ID | Date_Key | Product_Key | Customer_Key | Quantity_Sold | Total_Sale |
|---------|----------|-------------|--------------|---------------|------------|
| 1       | 101      | 201         | 301          | 5             | $50        |
| 2       | 102      | 202         | 302          | 2             | $40        |

Dimension Table - Date:
| Date_Key | Date       | Month | Year |
|----------|------------|-------|------|
| 101      | 2023-01-01 | Jan   | 2023 |
| 102      | 2023-01-02 | Jan   | 2023 |

Dimension Table - Product:
| Product_Key | Product_Name | Category  | Price |
|-------------|--------------|-----------|-------|
| 201         | Product A    | Category1 | $10   |
| 202         | Product B    | Category2 | $20   |

Dimension Table - Customer:
| Customer_Key | Customer_Name | Location |
|--------------|---------------|----------|
| 301          | Alice         | CityA    |
| 302          | Bob           | CityB    |
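
To make this concrete, here is a minimal DDL sketch of the star schema above. Table and column names are adapted from the example tables, and the types and key constraints are illustrative assumptions rather than prescriptions:

```sql
-- Illustrative star-schema DDL; types and constraints are assumptions.
CREATE TABLE dim_date (
    date_key  INT PRIMARY KEY,
    full_date DATE,            -- the "Date" column above
    month     VARCHAR(3),
    year      INT
);

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50),  -- denormalized: the category name lives on the dimension
    price        DECIMAL(10, 2)
);

CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,
    customer_name VARCHAR(100),
    location      VARCHAR(100)
);

CREATE TABLE fact_sales (
    sale_id       INT PRIMARY KEY,
    date_key      INT REFERENCES dim_date (date_key),
    product_key   INT REFERENCES dim_product (product_key),
    customer_key  INT REFERENCES dim_customer (customer_key),
    quantity_sold INT,
    total_sale    DECIMAL(10, 2)
);
```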

Snowflake Schema:

Fact Table - Sales (remains the same):
| Sale_ID | Date_Key | Product_Key | Customer_Key | Quantity_Sold | Total_Sale |
|---------|----------|-------------|--------------|---------------|------------|
| 1       | 101      | 201         | 301          | 5             | $50        |
| 2       | 102      | 202         | 302          | 2             | $40        |

Dimension Table - Date (remains the same):
| Date_Key | Date       | Month | Year |
|----------|------------|-------|------|
| 101      | 2023-01-01 | Jan   | 2023 |
| 102      | 2023-01-02 | Jan   | 2023 |

Dimension Table - Product (breaking down the category into another table):
| Product_Key | Product_Name | Category_ID | Price |
|-------------|--------------|-------------|-------|
| 201         | Product A    | 401         | $10   |
| 202         | Product B    | 402         | $20   |

Sub-Dimension Table - Category:
| Category_ID | Category_Name |
|-------------|---------------|
| 401         | Category1     |
| 402         | Category2     |

Dimension Table - Customer (similar to the star schema's customer table for simplicity):
| Customer_Key | Customer_Name | Location |
|--------------|---------------|----------|
| 301          | Alice         | CityA    |
| 302          | Bob           | CityB    |
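
The practical difference between the two schemas shows up at query time. Assuming the tables sketched earlier, plus snowflake variants where dim_product carries a category_id into a dim_category sub-dimension, a revenue-by-category query needs one extra join in the snowflake layout:

```sql
-- Star schema: category is denormalized onto the product dimension.
SELECT p.category, SUM(f.total_sale) AS revenue
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY p.category;

-- Snowflake schema: one extra hop through the category sub-dimension.
SELECT c.category_name, SUM(f.total_sale) AS revenue
FROM fact_sales f
JOIN dim_product p  ON f.product_key = p.product_key
JOIN dim_category c ON p.category_id = c.category_id
GROUP BY c.category_name;
```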

One Big Table Data Model

The “One Big Table” data model is an approach where all the data is stored in a single, large table rather than being distributed across multiple normalized tables. Each row in this table represents a unique record, and each column captures a distinct attribute. While this model can simplify queries and reduce the need for complex joins, it may also introduce redundancy, demand more storage, and potentially degrade performance as the table grows. The One Big Table paradigm is often seen in big data platforms and NoSQL databases where horizontal scalability, flexibility, and speed of writes might be prioritized over traditional normalization rules and relational integrity.

If we were to apply the “One Big Table” model to our sales data, it would look something like this:

Sales Big Table:
| Sale_ID | Date       | Month | Year | Product_Name | Category  | Price | Quantity_Sold | Total_Sale | Customer_Name | Location |
|---------|------------|-------|------|--------------|-----------|-------|---------------|------------|---------------|----------|
| 1       | 2023-01-01 | Jan   | 2023 | Product A    | Category1 | $10   | 5             | $50        | Alice         | CityA    |
| 2       | 2023-01-02 | Jan   | 2023 | Product B    | Category2 | $20   | 2             | $40        | Bob           | CityB    |

In this “One Big Table” model, all the relevant information about a sale, including the date, product details, and customer details, is present in a single row. As you can see, instead of referring to separate dimension tables or using keys to link to them, every attribute of interest is directly included in the table. While this might simplify query writing, you can also notice potential redundancies. For instance, if Alice made another purchase in January 2023, her name and location would be repeated in the table.
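
With everything in one row, that same revenue-by-category question becomes a single-table aggregation. Assuming a table named sales_big_table with the columns shown above, no joins are needed; the trade-off is that keeping repeated attributes like Customer_Name consistent becomes the load process's responsibility rather than the schema's:

```sql
-- One Big Table: filter and aggregate over a single wide table, no joins.
SELECT category, SUM(total_sale) AS revenue
FROM sales_big_table
WHERE year = 2023
GROUP BY category;
```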

Here’s a breakdown of the pros and cons for the star, snowflake, and One Big Table data models:

Star Schema:

Pros:

1. Simplicity: With denormalized dimension tables, the structure is easier to understand and navigate.

2. Query Performance: Fewer joins are required, which can improve the speed of querying large datasets.

3. Adaptability: It’s relatively easy to add new dimensions without altering the existing structure.

Cons:

1. Redundancy: Denormalization can lead to data redundancy in dimension tables.

2. Storage: The redundant data might require more storage space.

3. Maintenance: Redundancy can make updates and inserts more complex, potentially leading to inconsistencies.

Snowflake Schema:

Pros:

1. Normalization: Reduced data redundancy due to the normalization of dimension tables.

2. Storage Efficiency: Uses less storage space compared to the star schema because of reduced redundancy.

3. Clear Structure: The normalized structure can be clearer for those familiar with relational database design.

Cons:

1. Query Complexity: Requires more joins, which can make queries more complex and potentially slower.

2. Maintenance: Changes in one table might necessitate changes in related tables.

3. Less Intuitive: For business users, the snowflaked structure can be harder to navigate and understand compared to the star schema.

One Big Table:

Pros:

1. Simplicity: All data is in one place, making it straightforward to query without needing joins.

2. Flexibility: Can easily accommodate new columns or attributes without major structural changes.

3. Write Speed: Often optimized for fast writes, especially in systems like NoSQL databases.

Cons:

1. Redundancy: High potential for data redundancy, especially with repeated attributes.

2. Scalability Concerns: As the table grows, query performance can degrade without proper indexing or partitioning.

3. Maintenance: Ensuring data consistency can be challenging due to the lack of normalization, with potential for anomalies during data modifications.

Each model serves specific use cases and system requirements. The choice of which to use depends on the nature of the application, the importance of query speed versus storage efficiency, and the anticipated future growth and changes to the data.

Example Data Modeling Challenge

During your data modeling interview you will be given a prompt, and you will need to ask clarifying questions in order to design the ideal data model.

Design the data model for a ride sharing app.

If asked to design a data model for a ride-sharing app, there are several clarifying questions you will want to consider asking to ensure you cover all relevant aspects and understand the depth and breadth of the expected solution:

1. Scope and Specificity:

  • Are we focusing on a specific aspect of the ride-sharing app, or are we considering the entire end-to-end process?
  • Should I include aspects like rider and driver reviews, promotions, or loyalty programs in the model?

2. Scale and Performance:

  • How will this data model be used? Will it serve transactional workloads, batch analytics, or both?

3. Functionalities and Features:

  • Does the platform support carpooling or ride-sharing with multiple riders?
  • Should the model account for dynamic pricing or surge pricing based on demand?

4. Geographical and Legal Considerations:

  • How detailed should the location data be? Do we need to consider things like zoning, or are general GPS coordinates sufficient?

5. Historical Data and Analysis:

  • Do we need to maintain a history of past rides, driver performance, or pricing changes for analytics purposes?

These questions can help you get a clearer picture of what’s expected and ensure that the data model you design is robust, comprehensive, and aligned with the business and technical goals.

Next, you will want to identify and define the primary entities within the system. Reflecting on the user journey within such apps, a few entities emerge prominently: the Rider, representing the user, and the Driver (or Partner), representing the individual offering the service. When a user interacts with the app, they’re essentially booking a Trip, connecting a Rider with a suitable Cab at a designated Location. Post-trip, there’s usually an exchange involving Payment. While real-world scenarios would extend this model to include aspects like reviews, ratings, and customer support, for the purpose of this interview exercise, let’s concentrate on the main flow up until the ride’s completion. Thus, the core entities to model would include: Rider, Driver, Cab, Trip, Location, and Payment.

Fields

For the Rider entity, fields would typically encompass rider_id as a unique identifier, demographic attributes such as first_name, last_name, date_of_birth, email, phone_number, and current_payment_method which could store details about the rider’s chosen payment method. Given the dynamic nature of payment preferences, the table might also include fields like is_active and updated_at to accommodate Slowly Changing Dimensions (SCD) scenarios, allowing for tracking historical changes in payment methods.

The Driver entity would have fields like driver_id, first_name, last_name, date_of_birth, email, phone_number, and current_cab_id to indicate which cab they’re currently driving. To facilitate the history of vehicles used by the driver, we might also consider fields like start_date and end_date for each cab assignment.

For Cab, the primary attributes would be cab_id, vehicle_type (like sedan, SUV, etc.), registration_number, and base_rate_per_km, indicating the charge rate per kilometer. This rate might vary based on vehicle type or other special amenities the cab offers.
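
A minimal DDL sketch of these first three entities might look as follows. Names and types are illustrative, and driver_cab_assignments is one hypothetical way to capture the start_date/end_date cab history described above:

```sql
CREATE TABLE riders (
    rider_id               INT,
    first_name             VARCHAR(50),
    last_name              VARCHAR(50),
    date_of_birth          DATE,
    email                  VARCHAR(100),
    phone_number           VARCHAR(20),
    current_payment_method VARCHAR(50),
    is_active              BOOLEAN,     -- SCD Type 2: marks the current row
    updated_at             TIMESTAMP,   -- when this version took effect
    PRIMARY KEY (rider_id, updated_at)  -- history keeps multiple rows per rider
);

CREATE TABLE cabs (
    cab_id              INT PRIMARY KEY,
    vehicle_type        VARCHAR(20),    -- sedan, SUV, etc.
    registration_number VARCHAR(20),
    base_rate_per_km    DECIMAL(6, 2)
);

CREATE TABLE drivers (
    driver_id      INT PRIMARY KEY,
    first_name     VARCHAR(50),
    last_name      VARCHAR(50),
    date_of_birth  DATE,
    email          VARCHAR(100),
    phone_number   VARCHAR(20),
    current_cab_id INT REFERENCES cabs (cab_id)
);

-- Hypothetical history table for which cab a driver used and when.
CREATE TABLE driver_cab_assignments (
    driver_id  INT REFERENCES drivers (driver_id),
    cab_id     INT REFERENCES cabs (cab_id),
    start_date DATE,
    end_date   DATE,                    -- NULL while the assignment is active
    PRIMARY KEY (driver_id, start_date)
);
```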

Beyond the core entities, a supporting Rider_Bookmarks table can store a rider’s saved locations. Its fields would mainly focus on the rider_id, a location_id for each saved location, and perhaps a description field where the rider can tag the location with labels like “home”, “work”, or “gym”.

The Map_Grid table could comprise grid_id as a unique identifier, along with fields defining its geographical boundaries like start_latitude, end_latitude, start_longitude, and end_longitude. This grid-based structure streamlines querying by grouping locations into manageable blocks.

For Location, the table might include location_id, name (human-readable form of the location), landmark_type (airport, mall, etc.), latitude, longitude, and related_location_id to link locations that are related but distinct.
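
The location-related tables could be sketched like this. The grid_id column on locations is one way to realize the grid-to-location grouping; as before, types are illustrative:

```sql
CREATE TABLE map_grid (
    grid_id         INT PRIMARY KEY,
    start_latitude  DECIMAL(9, 6),
    end_latitude    DECIMAL(9, 6),
    start_longitude DECIMAL(9, 6),
    end_longitude   DECIMAL(9, 6)
);

CREATE TABLE locations (
    location_id         INT PRIMARY KEY,
    name                VARCHAR(100),
    landmark_type       VARCHAR(50),    -- airport, mall, etc.
    latitude            DECIMAL(9, 6),
    longitude           DECIMAL(9, 6),
    grid_id             INT REFERENCES map_grid (grid_id),
    related_location_id INT REFERENCES locations (location_id)  -- related but distinct places
);

CREATE TABLE rider_bookmarks (
    rider_id    INT,  -- logical reference to riders; no FK since riders keeps history rows
    location_id INT REFERENCES locations (location_id),
    description VARCHAR(50),            -- "home", "work", "gym"
    PRIMARY KEY (rider_id, location_id)
);
```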

Lastly, the Trip entity would have fields such as trip_id, rider_id, driver_id, start_location_id, end_location_id, start_time, end_time, trip_status (like “ongoing”, “completed”), and payment_id to link to payment details, among others.
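
Finally, the Trip entity ties everything together. Since the text doesn’t enumerate Payment’s fields, the payments table below is a minimal hypothetical stand-in:

```sql
-- Hypothetical minimal payments table; its real fields aren't specified above.
CREATE TABLE payments (
    payment_id INT PRIMARY KEY,
    amount     DECIMAL(10, 2),
    status     VARCHAR(20)
);

CREATE TABLE trips (
    trip_id           INT PRIMARY KEY,
    rider_id          INT,              -- logical reference to riders
    driver_id         INT REFERENCES drivers (driver_id),
    start_location_id INT REFERENCES locations (location_id),
    end_location_id   INT REFERENCES locations (location_id),
    start_time        TIMESTAMP,
    end_time          TIMESTAMP,
    trip_status       VARCHAR(20),      -- "ongoing", "completed"
    payment_id        INT REFERENCES payments (payment_id)
);
```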

Relationships

Relationships stitch these entities together to form a cohesive, relational model:

1. Rider to Trip: One-to-many. A single rider can have multiple trips, but each trip is associated with one rider.

2. Driver to Trip: One-to-many. One driver can undertake multiple trips, but each trip has one driver.

3. Driver to Cab: One-to-many with history. Over time, a driver might be associated with several cabs, but at any given time, they’re driving one cab.

4. Rider to Rider_Bookmarks: One-to-many. A rider can bookmark several locations.

5. Location to Rider_Bookmarks: One-to-many. Each bookmark references exactly one location on the map, but the same location can appear in many riders’ bookmarks.

6. Map_Grid to Location: One-to-many. A single grid block can encompass multiple locations.

7. Trip to Location: Many-to-many. Trips have both start and end locations, so a location can be the starting point for many trips and the endpoint for others.

This setup creates a comprehensive representation of the ride-sharing app’s data model, catering to various operational and analytical needs.
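
As a sanity check that these relationships support real questions, here is an illustrative query against the sketched tables: completed trips and total spend per rider for January 2023, walking the Rider-to-Trip and Trip-to-Payment links (amount comes from the hypothetical payments table above):

```sql
SELECT r.rider_id,
       r.first_name,
       COUNT(*)      AS completed_trips,
       SUM(p.amount) AS total_spend
FROM trips t
JOIN riders r   ON r.rider_id = t.rider_id
               AND r.is_active                 -- current rider row (SCD Type 2)
JOIN payments p ON p.payment_id = t.payment_id
WHERE t.trip_status = 'completed'
  AND t.start_time >= TIMESTAMP '2023-01-01'
  AND t.start_time <  TIMESTAMP '2023-02-01'
GROUP BY r.rider_id, r.first_name;
```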

Here are five more potential data modeling scenarios that could be posed during an interview:

1. E-commerce Platform:

  • Design a data model for an e-commerce platform that supports product listings, user reviews, vendor profiles, and order processing. The platform should also cater to promotional offers and loyalty programs.

2. Hospital Management System:

  • Design a data model for a hospital management system that caters to patient admissions, billing, medical records, pharmacy inventory, and appointment scheduling. Consider scenarios like emergency cases, long-term treatments, and telemedicine consultations.

3. Streaming Service:

  • Develop a data model for an online video streaming platform, similar to Netflix or Hulu. It should handle user profiles, watch history, content libraries, subscriptions, and recommendations based on viewing patterns.

4. Learning Management System (LMS):

  • Design a data model for a Learning Management System where educators can upload courses, students can enroll, and progress is tracked through quizzes and assignments. The system should also accommodate discussion forums and certificates of completion.

5. Real Estate Portal:

  • Develop a data model for a real estate listing portal where sellers can list properties, buyers can search and filter listings, and agents can represent multiple properties. The platform should also support property valuations, virtual tours, and user feedback.

Each of these scenarios requires an understanding of the domain and will challenge you to consider various aspects, including relationships between entities, scale, user experience, and potential future enhancements.

Sean Coyne

Author of "Ace the Data Engineer Interview", Director Software Engineering at GoodRx, Adjunct Professor at USD