Database design can be a tricky and laborious project and is often associated with stringent timelines. Below are some questions that you need to have answered before starting a database design initiative. In this article, we go through each of the below questions in a detailed manner.
•What are some things you should plan while doing your database design?
•What are the steps involved in a database design project?
•Should indexing be done before loading data into the database or after loading data into the database?
•When to use Row-store Indexing/When to use Column-store Indexing?
•What are some nice features of SQL Server with respect to Indexes and Constraints?
•Should I use only one type of indexing (Rowstore/Columnstore) throughout?
•Which should be built first? — Clustered Index or Non-Clustered Index
•How do I tackle issues with fragmentation caused by indexing?
Let’s see the answers to these questions in a detailed manner.
What are some things you should plan while doing your database design?
The following are some aspects that should be planned while planning your database design -
•Identify constraints, indexes and relationships
•Note that indexes are only a best-case estimate and will have to be tweaked after database load
•Determine whether your tables need a Row-store or a column-store.
•You might have to do a mix of Row-store indexing and column-store indexing based on the need for each table.
•As a best practice, non-clustered indexes must be created on all foreign keys in a table.
Moving on to the next -What are the steps involved in a database design project?
A database design project involves the following steps -
Ø Conceptual Database Design
A conceptual database design/conceptual data model is a high-level conceptual model which does not include any details about the database structure. It is created by business stakeholders and data architects and is intended for business audience. It is independent of any infrastructure/capabilities like data storage, databases, etc. It is primarily designed as a high-level abstract document in order to showcase data flow to a real-world business user.
Ø Logical Database Design
A logical data model is usually created using the Conceptual Database design model (or some other specific data model). In this stage, the design document will contain the relationships between tables and will also contain datatypes for the attributes. In most cases, the data is 3NF normalized at this stage.
Ø Physical Database Design
This is the next step from the logical database design and is related to implementation on secondary storage. This is a more detailed document having details on base relationships, indexes, constraints, and so on.
Ø Application Design
Application Design is a parallel activity to the Database Design. As the name implies, this is the design of the user interface and the applications and programs that are used by the database and those that are used to process (e.g. feed the database) the database.
The schematic below shows the different steps involved in a database planning activity. (This could be done in any of the above design phases.)
Moving on to the next question- Should indexing be done before loading data into the database or after loading data into the database?
•When we start creating the Database Design document, we plan for Indexes, Constraints based on the available business rules, and based on the relationships identified between the tables. At this point, we do not know how data will be accessed once data has been loaded. This is a best case estimate of what indexes to have. Hence, this is called ‘best effort’ estimation of indexing plans.
•Once data is loaded and we have users accessing the data, we start understanding data access patterns. This is when we add more indexes as required.
•Hence, index planning will happen both before and after loading data into the tables of the database.
Let’s move to the next question- When to use Row-store Indexing/When to use Column-store Indexing?
Before we see the answer to this question, let’s understand Row-store indexes and column-store indexes.
Both Row-store and Column-Store are data storage formats provided by Microsoft SQL Server. Until before 2012, data storage was always in a row-store. In 2012, Microsoft introduced a new format called Column Store promising increased compression rates. Both types of indexes (1. clustered 2. non-clustered) can be done in both types of storage formats.
Let’s understand row-store and column-store indexes with an analogy -
•Imagine that you have a file of size 5GB. Let’s assume that the file contains the columns City, State and Country. There are 1 million records. You zip the file and send it to person X, your colleague who wants to see the number of occurrences of a city. X unzips it and looks for records containing the city of ‘New York’. The search takes around 10 minutes as each of the 1 million records needs to be scanned. This is analogous to a table scan without indexing.
•Let’s say that now you decide to split the data files into 1000 files with a size of 5 MB each. You send this to Person X. Now, Person X opens each 5 MB file and looks for records with city ‘New York’. The search is faster than the previous one. However, there is the work of having to unzip and open each file 1000 times. This is analogous to a row store storage format.
•Now, you decide to split the data files not by their records but by their columns. Now, we will have 3 files one for each column. Person X now can open the necessary file (in this case the ‘city’ file) just once and search for New York to see the number of occurrences. This is analogous to a column store storage format.
Row-store (using the analogy above)
Column-store (using the analogy above)
Column-store indexes have some limitations in the datatypes that they support. The following are some unsupported datatypes in Column-store indexes -
•ntext, text, and image
•nvarchar(max), varchar(max), and varbinary(max) (Applies to SQL Server 2016 (13.x) and prior versions, and nonclustered columnstore indexes)
•rowversion (and timestamp)
•CLR types (hierarchyid and spatial types)
•uniqueidentifier (Applies to SQL Server 2012 (11.x))
Having learnt about the Row-store and Column-store formats, let’s see when to use Row-store and when to use Column-store indexes.
Row-store Indexing can be used in cases where the query involves highly selective queries. A highly selective query is that which involves a well-defined predicate i.e., ‘where’ clause or ‘join’ that results in very few query results. E.g. Searching for a student by the Student ID.
Column-store Indexing can be used for low selective queries. A low selective query is that which uses aggregate operations such as Sum, Avg, etc. and that results many matching results. E.g Querying the total marks for all subjects by all students of a certain grade in a particular year.
When bulk-insert operations need to be done, it is better to use Row-store indexes. When update operations are involved, it is better to use Column-store indexes. Column-store indexes are recommended for tables that are large and wide. It is better to avoid indexing for small tables.
What are some nice features of SQL Server with respect to Indexes and Constraints?
When a Primary key is defined, SQL Server automatically creates a clustered index on the primary key. This doesn’t imply that for a clustered index to exist, the primary key should always exist. Clustered indexes can be created even if there is no primary key in the table.
When a unique constraint is defined, SQL Server automatically creates a non-clustered index on that column. An index can be created while creating a table or can be added later.
Should I use only one type of indexing (Rowstore/Columnstore) throughout?
The answer is ‘No’. A mix of Row-store and Column-store indexes can be used wherever necessary in the database.
Which should be built first? — Clustered Index or Non-Clustered Index
There are two ways to work with clustered and nonclustered indexes in database tables.
1. Creating clustered indexes and then creating multiple nonclustered indexes on tables.
2. Creating multiple nonclustered indexes, creating clustered indexes on tables, and then rebuilding the nonclustered indexes.
Out of these two ways, the second way is a recommended practice that can be adopted in a real-time environment. Why?
SORT_IN_TEMPDB — determines whether the sorted results are to be stored in the TEMPDB.
STATISTICS_NORECOMPUTE — When it is ON, SQL Server does not automatically recompute outdated data statistics for that particular index.
- How do I tackle issues with fragmentation caused by indexing?
Before delving into this, let’s see how to identify data objects that are fragmented.
In this case, I am trying to find fragmented data objects on the ‘AdventureWorks’ database.
The above query shows the percentage of fragmentation in SQL Server.
In order to tackle issues with fragmentation caused by indexing, one needs to re-organize or rebuild indexes. This is done using the REORGANIZE/REBUILD commands.
LOG_COMPACTION = ON ensures that the pages of Large Object data (LOB) datatypes such as VARCHAR(MAX),VARBINARY(MAX),NVARCHAR(MAX),TEXT,NTEXT, IMAGE are also compacted.
When re-organize is used for defragmentation, SQL Server compresses the objects using minimal resources. Re-organize physically re-orders the leaf data pages and compacts the indexes accordingly.
Rebuild syntax is given by
This article aims to answer the most basic questions related to database design. There may be a lot more complex and convoluted scenarios pertaining to one’s business that needs to be addressed before database design (may be while doing the Conceptual Database Design phase.