Most Asked Data Modeling Questions in 2024

Solon Das
Towards Data Engineering
9 min read · Apr 13, 2024
Data Modeling Techniques

Medium Level Questions

  1. How do you decide between a star schema and a snowflake schema?
  • Star Schema: Suitable for scenarios where simplicity and ease of understanding are important. For example, in a retail setting, a star schema might have a fact table for sales transactions connected to dimension tables for products, customers, and time.
  • Snowflake Schema: Useful when there’s a need for more normalization and reduction of redundancy. In a snowflake schema, the product dimension might be further normalized into sub-dimensions like product category and product subcategory. Both variants are sketched below.
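As a rough illustration, here is a minimal sketch of the two layouts using Python’s built-in sqlite3 module. All table and column names are hypothetical, not taken from any particular system:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: a sales fact table joins directly to denormalized dimensions.
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,  -- surrogate key
    product_name TEXT,
    category     TEXT,                 -- denormalized: category lives in the dimension
    subcategory  TEXT
);
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    amount       REAL
);

-- Snowflake variant: the product dimension is normalized into a sub-dimension,
-- trading an extra join at query time for less redundancy.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_sf (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
""")
```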

2. Describe the process of dimensional modeling?

  • Identify business processes: Understand the business’s key activities and how data relates to them.
  • Identify dimensions and facts: Determine the entities (dimensions) and measurable data (facts) important to the business processes.
  • Normalize dimensions: Decide whether to normalize dimensions or keep them denormalized based on query patterns and performance considerations.
  • Create dimension and fact tables: Design tables to represent dimensions and facts.
  • Establish relationships: Define how dimensions and facts are related, usually through foreign keys.
  • Optimize for query performance: Consider indexing, partitioning, and other optimization techniques to improve query performance.

3. What are the advantages and disadvantages of using NoSQL databases over traditional SQL databases?

  • Advantages: NoSQL databases like MongoDB offer a flexible schema, which allows for easier adaptation to changing data requirements. They also scale horizontally for large datasets and perform well for certain access patterns, such as key-based lookups and whole-document retrievals.
  • Disadvantages: NoSQL databases may lack strong consistency guarantees (e.g., eventual consistency), which can lead to data integrity issues. They also typically have less mature tooling compared to SQL databases.

4. How do you handle slowly changing dimensions in a data warehouse?

  • Type 1: Overwrite existing data. For example, if a customer changes their address, update the address directly in the customer dimension table.
  • Type 2: Add a new record with a new surrogate key. For example, if a product’s price changes, add a new record for the product with the new price and a new surrogate key, while retaining the old record for historical purposes (see the sketch after this list).
  • Type 3: Add columns to track changes. For example, adding columns like “previous address” and “date of change” to the customer dimension table.
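A minimal Type 2 write path, sketched with Python and SQLite. The dim_customer layout and the helper function are illustrative assumptions, not a standard API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id TEXT,                               -- natural/business key
    address     TEXT,
    valid_from  TEXT,
    valid_to    TEXT,    -- NULL marks the current version
    is_current  INTEGER
)""")

def scd2_update(conn, customer_id, new_address, change_date):
    """Type 2 change: expire the current row, then insert a new version."""
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id))
    conn.execute(
        "INSERT INTO dim_customer (customer_id, address, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_address, change_date))

conn.execute("INSERT INTO dim_customer VALUES (NULL, 'C001', '12 Oak St', '2024-01-01', NULL, 1)")
scd2_update(conn, "C001", "99 Pine Ave", "2024-04-01")
print(conn.execute("SELECT * FROM dim_customer").fetchall())
# Both versions remain: the old row is closed out, the new row is current.
```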

5. Explain the concept of surrogate keys and their role in data modeling?

  • Surrogate keys are artificially generated keys used as primary keys in a database. They are useful because they are unique and have no inherent meaning, which can simplify joins and ensure data integrity. For example, in a table of customers, a surrogate key could be an auto-incremented integer that uniquely identifies each customer record.

6. What is a data mart, and how does it differ from a data warehouse?

  • Data Mart: A data mart is a subset of a data warehouse that is focused on a specific business line, department, or team. It contains a subset of the data warehouse’s data that is relevant to that specific group.
  • Data Warehouse: A data warehouse is a central repository for all organizational data. It contains data from various sources and is designed to support analytical and reporting needs across the organization.

7. Describe the ACID properties of database transactions and why they are important?

  • Atomicity: Ensures that either all operations within a transaction are completed successfully, or none are. For example, in a banking application, transferring money from one account to another should either complete in full or not at all (see the sketch after this list).
  • Consistency: Ensures that the database remains in a consistent state before and after the transaction. For example, if a transaction violates a constraint, it should not be committed.
  • Isolation: Ensures that transactions are isolated from each other, so they do not interfere with each other’s operations. For example, two transactions modifying the same data should not affect each other.
  • Durability: Ensures that once a transaction is committed, its effects are permanent and survive even a system failure.
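A small sketch of atomicity and consistency using Python’s built-in sqlite3 module; the accounts table and transfer helper are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 50.0)])

def transfer(conn, src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
    except sqlite3.IntegrityError:
        print("transfer rejected; no partial update was applied")

transfer(conn, "alice", "bob", 30.0)   # succeeds atomically
transfer(conn, "alice", "bob", 500.0)  # violates the CHECK -> whole transaction rolls back
print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
# [('alice', 70.0), ('bob', 80.0)] -- the failed transfer left no trace
```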

8. How do you ensure data integrity in a database?

  • Use constraints such as primary keys, foreign keys, and unique constraints to enforce data integrity rules.
  • Perform regular data validation checks to ensure data consistency and accuracy.
  • Use transactions to ensure that operations are atomic and that data remains in a consistent state.

9. Explain the concept of referential integrity and how it is enforced in a database?

  • Referential integrity ensures that relationships between tables remain valid. It is typically enforced through foreign key constraints, which guarantee that values in a column (or columns) of one table match values in the referenced table’s primary key column, as demonstrated below.
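For example, with SQLite (which enforces foreign keys only after the PRAGMA below is set); the customers/orders tables are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    total       REAL
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")       # OK: parent row exists
try:
    conn.execute("INSERT INTO orders VALUES (101, 999, 10.0)")  # no such customer
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # FOREIGN KEY constraint failed
```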

10. How do you optimize a database schema for query performance?

  • Use appropriate indexing, including primary key, foreign key, and composite indexes, to speed up query execution (see the indexing sketch after this list).
  • Normalize or denormalize the schema based on query patterns and performance requirements. Normalize for data integrity and reduce redundancy, and denormalize for improved query performance on frequently joined tables.
  • Partition large tables to distribute data and queries evenly across storage resources.
  • Use materialized views to store precomputed query results for faster access.
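As a rough sketch, SQLite’s EXPLAIN QUERY PLAN shows how a composite index changes a full table scan into an index search; the sales table here is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, sold_at TEXT, amount REAL)")

query = "SELECT SUM(amount) FROM sales WHERE region = ? AND sold_at >= ?"

# Without an index, the planner typically reports: SCAN sales
print(conn.execute("EXPLAIN QUERY PLAN " + query, ("EU", "2024-01-01")).fetchall())

# A composite index matching the filter columns lets the planner seek instead:
# SEARCH sales USING INDEX idx_sales_region_date (region=? AND sold_at>?)
conn.execute("CREATE INDEX idx_sales_region_date ON sales (region, sold_at)")
print(conn.execute("EXPLAIN QUERY PLAN " + query, ("EU", "2024-01-01")).fetchall())
```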

11. Describe the differences between vertical and horizontal partitioning?

  • Vertical partitioning involves splitting a table into smaller tables based on columns. This can improve query performance by reducing the amount of data that needs to be accessed for certain queries.
  • Horizontal partitioning involves splitting a table into smaller tables based on rows. This can improve performance by distributing data across multiple storage devices or servers, allowing for parallel processing of queries.

12. What are the common pitfalls to avoid when designing a database schema?

  • Overlooking normalization/denormalization: Failure to properly normalize or denormalize the schema can lead to data redundancy or poor query performance.
  • Ignoring query patterns: Designing a schema without considering how the data will be queried can result in inefficient queries and suboptimal performance.
  • Not considering scalability and performance from the beginning: Failure to design the schema with scalability and performance in mind can lead to difficulties when the system needs to scale up to handle larger datasets or increased traffic.

13. How do you model hierarchical data in a relational database?

  • Use parent-child tables to represent the hierarchical relationships between entities. Each row in the table represents a node in the hierarchy, and the table contains columns to store the parent-child relationships.
  • Use closure tables or nested sets for hierarchies that need efficient subtree queries; if nodes can have multiple parents, store the edges in a separate table. For simpler cases, a recursive query over the parent-child table (sketched below) is often enough.
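A sketch of a parent-child (adjacency list) table queried with a recursive CTE, which SQLite, PostgreSQL, and most modern engines support; the category data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Adjacency list: each row points to its parent (NULL for the root).
conn.executescript("""
CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER REFERENCES categories(id));
INSERT INTO categories VALUES
  (1, 'Electronics', NULL),
  (2, 'Computers',   1),
  (3, 'Laptops',     2),
  (4, 'Phones',      1);
""")

# Walk the whole subtree under 'Electronics' with a recursive CTE.
rows = conn.execute("""
WITH RECURSIVE subtree(id, name, depth) AS (
    SELECT id, name, 0 FROM categories WHERE id = 1
    UNION ALL
    SELECT c.id, c.name, s.depth + 1
    FROM categories c JOIN subtree s ON c.parent_id = s.id
)
SELECT name, depth FROM subtree
""").fetchall()
print(rows)  # e.g. [('Electronics', 0), ('Computers', 1), ('Phones', 1), ('Laptops', 2)]
```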

14. How do you ensure scalability and flexibility in your data models?

  • Use flexible schema designs, such as NoSQL databases, that allow for schema-less or schema-on-read approaches to accommodate changing data requirements.
  • Use sharding to horizontally partition data across multiple servers to distribute the load and improve scalability.
  • Consider data partitioning strategies, such as range partitioning or hash partitioning, to optimize data access and distribution (both are sketched below).
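A toy sketch of the two routing strategies in plain Python; the partition count and date boundaries are arbitrary assumptions:

```python
import hashlib

NUM_PARTITIONS = 8

def hash_partition(key: str) -> int:
    """Hash partitioning: spreads keys evenly, but range scans hit every partition.
    md5 is used (not Python's hash()) so the mapping is stable across processes."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

RANGE_BOUNDARIES = ["2024-04-01", "2024-07-01", "2024-10-01"]  # illustrative cutoffs

def range_partition(order_date: str) -> int:
    """Range partitioning: keeps related rows together, good for time-bounded queries."""
    for i, boundary in enumerate(RANGE_BOUNDARIES):
        if order_date < boundary:
            return i
    return len(RANGE_BOUNDARIES)

print(hash_partition("customer-42"))   # stable partition for this key
print(range_partition("2024-05-15"))   # -> 1 (second-quarter bucket)
```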

Hard Level Questions

  1. How would you design a schema to handle multi-tenancy in a database?
  • Use a single shared schema with a tenant ID column in each table to differentiate data belonging to different tenants (sketched below).
  • Use separate schemas for each tenant, with each schema containing tables specific to that tenant.
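A minimal sketch of the shared-schema approach, assuming an illustrative invoices table; the key point is that the tenant ID is part of every key and every query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Shared-schema multi-tenancy: every table carries a tenant_id discriminator.
conn.execute("""
CREATE TABLE invoices (
    tenant_id  TEXT NOT NULL,
    invoice_id INTEGER NOT NULL,
    amount     REAL,
    PRIMARY KEY (tenant_id, invoice_id)  -- the tenant is part of every key
)""")

def invoices_for(conn, tenant_id):
    """Every query path must filter on tenant_id so tenants never see each other's rows."""
    return conn.execute(
        "SELECT invoice_id, amount FROM invoices WHERE tenant_id = ?",
        (tenant_id,)).fetchall()

conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                 [("acme", 1, 10.0), ("acme", 2, 20.0), ("globex", 1, 99.0)])
print(invoices_for(conn, "acme"))  # [(1, 10.0), (2, 20.0)]
```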

2. Describe the differences between vertical and horizontal scaling in the context of database design?

  • Vertical scaling involves increasing the capacity of a single server, such as by adding more CPU, memory, or storage resources to the server.
  • Horizontal scaling involves adding more servers to distribute the load, such as by using a load balancer to distribute incoming requests across multiple servers.

3. How do you design a schema to support real-time analytics?

  • Use columnar storage formats, such as Apache Parquet or Apache ORC, to store data in a way that is optimized for analytical queries (see the Parquet sketch after this list).
  • Use stream processing frameworks, such as Apache Kafka or Apache Flink, to ingest and process data in real-time.
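As a small illustration of the columnar idea, assuming the pyarrow package is installed (pip install pyarrow); the column names are made up:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Columnar layout: each column is stored (and compressed) contiguously,
# so an analytics query can read only the columns it needs.
table = pa.table({
    "event_time": ["2024-04-13T10:00:00", "2024-04-13T10:00:01"],
    "user_id":    [42, 43],
    "amount":     [9.99, 14.50],
})
pq.write_table(table, "events.parquet")

# Reading back just one column avoids scanning the rest of the file.
amounts = pq.read_table("events.parquet", columns=["amount"])
print(amounts.to_pydict())  # {'amount': [9.99, 14.5]}
```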

4. Explain the concept of data skew and how you would address it in a distributed database system?

  • Data skew occurs when certain partitions or nodes in a distributed database have significantly more data than others, leading to performance issues.
  • Address data skew by choosing partitioning strategies that distribute data evenly across partitions or nodes, such as hash partitioning, and by salting hot keys that would otherwise overload a single partition (illustrated below).
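A toy demonstration of skew and of key salting, with made-up key names and partition counts:

```python
import hashlib
from collections import Counter

def partition_of(key: str, n: int = 4) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

# A hot key ("tenant-1" here) dominates the data set, so hashing alone
# still piles most rows onto a single partition.
keys = ["tenant-1"] * 9_000 + [f"tenant-{i}" for i in range(2, 1_002)]
print(Counter(partition_of(k) for k in keys))

# One common mitigation is key salting: split the hot key across several
# sub-keys, at the cost of re-aggregating the pieces afterwards.
SALT_BUCKETS = 4
salted = [f"{k}#{i % SALT_BUCKETS}" if k == "tenant-1" else k
          for i, k in enumerate(keys)]
print(Counter(partition_of(k) for k in salted))  # noticeably more even
```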

5. Describe the process of designing a schema for a data lake architecture?

  • Use a schema-on-read approach, where data is stored in its raw format and schema is applied when querying the data.
  • Use a data catalog to manage metadata and schema information for the data lake.

6. How do you model complex relationships between entities in a database?

  • Use junction tables to represent many-to-many relationships between entities.
  • Use inheritance or subtyping to represent entity subtypes and their relationships.

7. Explain the concept of data virtualization and how it can be used in data modeling?

  • Data virtualization is the process of abstracting and presenting data from multiple sources as a single, unified data source.
  • Data virtualization can be used in data modeling to combine and query data from different sources without physically integrating the data.

8. How would you design a schema to handle hierarchical data structures, such as XML or JSON documents?

  • Use nested structures or document databases, such as MongoDB or Couchbase, to store and query hierarchical data.
  • Use specialized data types, such as XMLType in Oracle or JSONB in PostgreSQL, to store and query XML or JSON data efficiently. A plain-Python traversal of such a nested document is sketched below.
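A plain-Python sketch of navigating a hierarchical document; the order document and the dotted-path helper are illustrative, not a database API:

```python
import json

# A hierarchical order document, as it might be stored in a document
# database or a JSONB column (field names are made up).
doc = json.loads("""
{
  "order_id": 7,
  "customer": {"name": "Ada", "address": {"city": "Paris"}},
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B2", "qty": 1}
  ]
}
""")

def get_path(node, path):
    """Resolve a dotted path like 'customer.address.city' against nested dicts/lists."""
    for part in path.split("."):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node

print(get_path(doc, "customer.address.city"))  # Paris
print(get_path(doc, "items.1.sku"))            # B2
```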

9. Describe the differences between a star schema and a snowflake schema in the context of Big Data analytics?

  • In the context of Big Data analytics, a star schema may be more suitable for simpler, more predictable queries where denormalization can improve query performance.
  • A snowflake schema may be more suitable for more complex, ad-hoc queries where normalization can reduce redundancy and improve storage efficiency.

10. How do you ensure data privacy and security in a database schema?

  • Use encryption to protect sensitive data at rest and in transit.
  • Implement access controls to restrict who can access and modify data.
  • Use auditing and logging to track access and modifications to data.

11. Explain the concept of sharding and how it can be used to improve database performance?

  • Sharding is a technique used to horizontally partition data across multiple servers.
  • Sharding can improve database performance by distributing the load across multiple servers, allowing for parallel processing of queries.

12. How would you design a schema to handle streaming data in real-time?

  • Use stream processing systems, such as Apache Kafka or Apache Flink, to ingest and process streaming data in real-time.
  • Use storage systems built for high write throughput, such as Apache Cassandra, or managed streaming services, such as Amazon Kinesis, to store and serve streaming data efficiently.

13. Describe the process of designing a schema for a graph database?

  • Use nodes to represent entities and edges to represent relationships between entities.
  • Use graph traversal algorithms, such as breadth-first search or depth-first search, to query graph databases (a breadth-first traversal is sketched below).
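A breadth-first traversal over a toy adjacency list in plain Python; real graph databases such as Neo4j expose this through a query language instead:

```python
from collections import deque

# Property-graph style model: nodes are entities, edges are relationships.
edges = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave"],
    "dave":  [],
}

def bfs(start):
    """Breadth-first traversal: visits closer relationships before distant ones."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

print(bfs("alice"))  # ['alice', 'bob', 'carol', 'dave']
```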

14. How do you handle schema evolution in a database system?

  • Plan for schema changes from the beginning by designing flexible schemas that can accommodate future changes.
  • Use versioning and migration scripts to manage schema changes and ensure data compatibility across different versions of the schema.

15. Explain the concept of data vault modeling and its advantages in data warehousing?

  • Data vault modeling is a methodology for designing data warehouses that focuses on capturing and storing historical data changes.
  • Data vault modeling uses hubs, links, and satellites: hubs represent business entities, links represent relationships between entities, and satellites contain the descriptive attributes and their historical changes (a minimal layout is sketched below).
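A minimal hub/link/satellite layout sketched as SQLite DDL via Python; the names follow common data vault conventions, but the specific tables are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hubs hold business keys, links hold relationships, satellites hold
# the changing descriptive attributes, appended over time.
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,  -- hash of the business key
    customer_id   TEXT NOT NULL,     -- business key
    load_date     TEXT,
    record_source TEXT
);
CREATE TABLE hub_product (
    product_hk TEXT PRIMARY KEY, product_id TEXT NOT NULL,
    load_date TEXT, record_source TEXT
);
CREATE TABLE link_order (
    order_hk    TEXT PRIMARY KEY,
    customer_hk TEXT REFERENCES hub_customer(customer_hk),
    product_hk  TEXT REFERENCES hub_product(product_hk),
    load_date   TEXT, record_source TEXT
);
CREATE TABLE sat_customer_details (
    customer_hk TEXT REFERENCES hub_customer(customer_hk),
    load_date   TEXT,   -- each change appends a new satellite row
    name        TEXT,
    address     TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
""")
```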

16. How would you design a schema to support geospatial data?

  • Use spatial data types, such as geometry or geography, to represent geospatial data. Most modern databases offer dedicated types, for example:
  • The PostGIS extension for PostgreSQL provides geometry and geography types.
  • SQL Server has built-in geometry and geography types.
  • MongoDB stores geospatial data as GeoJSON objects.
  • Model Spatial Entities: Determine the spatial entities you need to represent, such as points, lines, or polygons. Each entity type will correspond to a column in your database table.
  • Use spatial indexes to optimize queries that involve geospatial data, such as proximity searches or spatial joins. A simple haversine-based proximity check is sketched below.
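A simple proximity check using the haversine formula in plain Python; a real system would push this into a spatial index rather than scan in application code, and the coordinates below are made up:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

stores = {"downtown": (48.8566, 2.3522), "airport": (49.0097, 2.5479)}
user = (48.8600, 2.3400)

# Proximity search: which stores are within 5 km of the user?
nearby = {name for name, (lat, lon) in stores.items()
          if haversine_km(user[0], user[1], lat, lon) <= 5}
print(nearby)  # {'downtown'}
```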

17. Describe the differences between a star schema and a normalized schema in the context of performance optimization?

  • In the context of performance optimization, a star schema may be more suitable for OLAP (Online Analytical Processing) workloads where query performance is critical and data integrity can be maintained through other means, such as data validation checks.
  • A normalized schema may be more suitable for OLTP (Online Transaction Processing) workloads where data integrity is paramount and the focus is on transactional processing rather than analytical querying.

18. How do you design a schema for a time-travel feature, where you can query data at any point in time?

  • Use temporal databases or implement custom temporal tables to store and query historical data changes.
  • Keep track of historical data changes using effective dating or versioning techniques (a point-in-time lookup is sketched below).
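A sketch of a point-in-time lookup over an effective-dated table using Python and SQLite; the product_price layout and the sentinel date are illustrative conventions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Effective dating: every version of a row carries the interval it was valid for.
conn.executescript("""
CREATE TABLE product_price (
    product_id TEXT,
    price      REAL,
    valid_from TEXT,
    valid_to   TEXT   -- '9999-12-31' marks the open-ended current row
);
INSERT INTO product_price VALUES
  ('P1', 10.0, '2024-01-01', '2024-03-01'),
  ('P1', 12.0, '2024-03-01', '9999-12-31');
""")

def price_as_of(conn, product_id, as_of):
    """'Time travel' read: return the row whose validity interval covers as_of."""
    return conn.execute(
        "SELECT price FROM product_price "
        "WHERE product_id = ? AND valid_from <= ? AND ? < valid_to",
        (product_id, as_of, as_of)).fetchone()

print(price_as_of(conn, "P1", "2024-02-15"))  # (10.0,)
print(price_as_of(conn, "P1", "2024-04-13"))  # (12.0,)
```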

19. Explain the concept of data mesh and its implications for data modeling in a modern data architecture?

  • Data Mesh is a decentralized approach to data architecture that focuses on domain-driven distributed data ownership.
  • In a data mesh architecture, data models are designed to be flexible and adaptable to changing business requirements, with data ownership and governance distributed across different domains or teams.
