SQL Server Index Architecture

Published in

Trendyol Tech

12 min readAug 1, 2021

Hello everyone! In this article, we are going to lead in the index architecture of the SQL server and talk about index types, design, operations, and options, respectively.

An index is database objects that are defined on database tables and access data faster with fewer actions. A classic example of indexes is a phone book. In case the entries in the phone book are not indexed, that is to say, if each entry is kept random in the phone book, we will have to browse around the entire phone book for a name for which we search. But, if the entries in our phone book were indexed, we could see whether the name we were looking for was ahead of or behind the name in the middle of the phone book. In this way, we could reach the desired result in a few steps by adding the data we are looking for. The objects that allow for keeping the data in order, such as in this example, are called indexes.

Index Principle

In this section, we will touch upon how indexes work. But first, let’s look at how data is stored in the SQL server.

When we create a new database, we indicate the files where our database is. SQL Server splits these files into 8 KB blocks logically, not physically. These blocks are called pages. Hereby, the first 8 KB of the file becomes a page0, and the following 8 KB becomes a page1, and so on. Inside the pages, there are structures that are like the rows in tables and we call them rows. SQL Server makes another logical batching on the pages; a data structure of 64 KB formed by the aggregation of 8 consecutive pages is called extent. When an extent is fully loaded, the subsequent entry is made into a new extent at the size of the entry.

https://www.c-sharpcorner.com/UploadFile/ff0d0f/how-sql-server-stores-data-in-data-pages-part-1/Images/data%20page.jpg

The number of rows on each page is not the same. Pages have rows varying according to data size, and a row can only be on one page. Indeed, SQL Server doesn’t read the rows and instead accesses the data by reading the pages.

When an index is designated to a table in SQL Server, it organizes the data in that table according to a tree structure as follows.

The top node of this tree structure is called the Root level. In fact, it starts from there and tries to find the entry by branching it to the right or left, as in the phone book example. There are Intermediate levels below the root level. While there should be one Root level, there can be several Intermediate levels depending on the amount of data in the table. At the bottom, there are Leaf Nodes, that’s to say, the structures that store the data. The search starts from the top and comes to the bottom level. The data stored in the leaf nodes will vary according to the index type.

https://s33162.pcdn.co/wp-content/uploads/2017/05/word-image-15.png

As seen in the picture above, we can find the data we are searching for in three steps. But, if an index was not used, in other words, the data was not organized in the tree structure as above, the data could be accessed by browsing around all records. It is always required to go down to the leaf level to access the data.

Heap Table

There is no concept called heap table in SQL Server. Indeed, designating a table as a heap depends on whether an index has been defined on it. When adding data to a non-indexed table, the SQL Server does not store it on the disk as indexed, and the data is written to random data pages. Such tables are also called the heaps. In other words, we can say that tables without clustered indexes are heap tables. When data is searched on the heap table, the SQL Server accesses the table entries in sequence and matches them with the entry we are searching for. Even if an entry is found, it compares all entries to see if there is any other entry to match. This operation done by SQL Server is called Table Scan. This operation will take a very long time based on the number of entries in the table. It also has advantages over the tables defined with Clustered Index. These tables do not need extra index maintenance cost and extra space for the clustered index tree structure.

https://www.mssqltips.com/tipimages/1254_heap.gif

Clustered Table

Tables defined with clustered index are called clustered tables. If the query uses index-defined columns, the data is accessed very rapidly. Data pages are inter-connected for rapid access to the data. There is an extra index maintenance cost for INSERT, UPDATE, and DELETE operations compared to heap tables.

Index Types

In principle, the indexes in SQL Server are divided into two as clustered and non-clustered indexes. The data stored in the leaf nodes are designated as clustered while the pointer indicating, on which page the data is stored, is called a non-clustered index.

Clustered Index

As we mentioned above, clustered indexes sort the data in the table physically. Since a table is sorted physically, only one clustered index can be defined on the table. The column(s) to be selected for the clustered index should be the most used columns in the queries. As the data will be physically sorted according to these columns, it is accessed very rapidly. Furthermore, the selected column should be a field that has not been changed much. This is because the change of the column concerning the index means reorganizing the entire index, that’s to say, resorting to it physically. Indeed, the SQL server determines the need for the index itself. It makes its decision whether or not to use the indexes that we will define.

CREATE CLUSTERED INDEX IX_IndexName ON TableName (Column1);

Non-Clustered Index

Non-Clustered Index sorts the data logically, not physically. The leaf nodes of these indexes store the information of where the data is, not the data itself. A maximum of 999 non-clustered indexes can be defined in the table. Non-clustered indexes cannot access data directly. It can be accessed over the heap or a clustered index. This index should be created from the columns we frequently use in the condition section of our query.

CREATE NONCLUSTERED INDEX IX_IndexName ON TableName (Column1);

We have said that there can be a maximum of one clustered index and 999 non-clustered indexes. An index in SQL Server can contain a maximum of 16 columns, and its total size should not exceed 900 bytes. Moreover, any index cannot be defined on large-size fields, namely the columns having varchar(max), nvarchar(max), xml, text, and image types. We have always mentioned the advantages of the index, but its costs are very high. Each time you create an index, a space from your database is occupied. Indexes are required to be reorganized in insert, update, and delete operations, and this adversely affects the table performance. When an index is started to be created on a table, SQL Server locks the table and blocks off access. The index creation process may take a short or long time based on the amount of data in the table. Thus, we should think about it a lot and make decisions carefully in the selection of indexes.

Unique Index

It is used to ensure the deduplication of the data. It prevents duplication of the data and accelerates data retrieval according to the column we define. When we define a primary key or unique constraint for our table, a unique index is automatically defined. When we define this index to several columns, the deduplication occurs not over a single column, but over the columns where it is defined. A null value can be added to the column where it is defined, only once. Both clustered and non-clustered indexes can be defined as unique.

CREATE UNIQUE INDEX AK_IndexName ON TableName (Column1);

Filtered Index

In this index type, instead of defining an index for the entire table, an index is defined for the data that fulfills the specified requirement. It both increases performance and reduces the index maintenance costs. It occupies less space compared to a usual non-clustered index.

CREATE NONCLUSTERED INDEX IX_IndexName ON IndexName (Column1, Column2) WHERE …

Composite Index

If the index has been defined on the table over several columns, not over a single column, this type of index is called composite index. As we mentioned in the index constraints, a composite index is defined up to a maximum of 16 columns in a table, and it should not exceed 900 bytes. Both clustered and non-clustered indexes can be defined as composite. In this index definition, the order in which the columns are written is also very significant. To increase the index performance, the column whose diversity is higher should be written first. In other words, the column having more individual data than the data in the table is written first.

CREATE NONCLUSTERED INDEX IX_IndexName ON IndexName (Column1, Column2)

Covered Index

First, let’s explain why we need this index type. Normally, if the fields we want to access in a query are available in the index definition, we can access the data directly from the leaf nodes with this index. But when we want to retrieve a column or columns different from the index definition, first, the data meeting the index condition is retrieved, and the key value is determined. Then, the values of the columns that are not defined in the index are accessed over this key value. This process is called key lookup.

SQL Server makes an extra key lookup to access data, and this adversely affects I/O performance. You may ask why we don’t define the composite index. We cannot add as many columns as we want, as there is a limit of 16 columns and 900 bytes in the composite index. Even if there is no constraint, having too many columns in our index definition will increase the index size and give rise to performance loss since the index will be reorganized in operations such as updates and inserts.

We can use covered indexes to solve this problem.

CREATE NONCLUSTERED INDEX IX_IndexName ON TableName (Column1) INCLUDE (Column2, Column3);

With the INCLUDE option specified in the index definition, we can add the columns that stay out of the index and are available in our query. While SQL Server is accessing this data, it does not make an extra key lookup and accesses it directly. The INCLUDE option also removes some constraints. If you remember, we cannot add the fields storing big-size data such as varchar(max), nvarchar(max), xml, text, and image to the index definition. They can be added with the INCLUDE option. Moreover, the constraints of 16 column and 900-byte are thereby removed. The INCLUDE option can be only defined with non-clustered indexes.

We should be attentive to select as few columns as possible in our queries and avoid the statement select *. Before filling the include section of the indexes with columns, we should calculate the cost and define the index accordingly. Otherwise, the performance of our queries may decrease further.

Column Store Index

In the index types we have explained so far, indexes are kept on a row basis. In this index type, column-based indexing is done. Storage cost is low compared to other indexes. It is generally used in systems where data is written less but read more and in applications containing big data such as data warehouses where complex filtering and grouping operations are conducted. It compresses up to 15 times more than a non-clustered index and increases the performance up to 10 times in queries.

CREATE COLUMNSTORE INDEX IX_IndexName_ColumnStore ON TableName (Column1, Column2);

Full-text Index

In text-based searches in SQL Server, the performance of the like operator decreases, especially as the text size increases. Apart from that, the 900-byte limit constrains us to create an index, as we have mentioned in the index constraints. We can use this index type to make a fast search, especially in fields containing big data (varchar(max), nvarchar(max), xml, text). That is a service offered by SQL Server. I won’t go into more detail as it is the subject of an article in itself.

Index Design

Up to this section, we have touched upon how to define index types. In summary, these are as follows:

We should choose as few columns as possible in tables with intensive data updating for the index definition.
We can define more indexes in tables with less data updating.
We should define the clustered index in as few columns as possible. Ideally, our clustered index should be defined in a unique column and not contain a null value.
The more repetitive data we have in the column where we define the index, the lower our index performance will be.
We should be attentive to the order of the columns in composite indexes.
An index can be also defined on computed columns as long as they meet the requirements. That’s to say, the computed value must be deterministic.
Columns should be chosen carefully in index definitions due to storage and sorting effects.
The number of columns in the index definition will directly affect the performance of insert, update and delete operations.

Apart from these clauses, you can find more detailed information at this link.

Index Operations

Indexes do not always make a positive contribution to performance, and they may affect it negatively. When we notice that the indexes are not used, we may need to either delete or disable them so that they do not affect performance negatively.

Making indexes passive; does not delete the index cookies from system catalogs but deletes the actual data that the index contains. Making a non-clustered index passive only prevents access to the index, but for clustered indexes, this prevents access to the table data unless it is dropped or rebuilt. We can use this option if we need to delete the index for a short time and then create it again.

ALTER INDEX INDEX IX_IndexName ON TableName DISABLE

Deleting indexes; When clustered indexes are deleted, the data stored in leaf nodes starts to be stored in unsorted heap tables. Both cookie and index data are deleted from the disk. The clustered index defined as the primary key cannot be deleted. First, this constraint should be removed from the table.

DROP INDEX INDEX IX_IndexName ON TableName

Index Options

We use some options when creating indexes. These are as follows;

FILLFACTOR: It is used to adjust the density of the data pages in the leaf nodes where the data is stored. While the incoming new entries are written to the pages, the fill factor is checked, and if there is a space, it is written to the relevant place, but if there is no space, the page is divided into two and organized for the incoming new data. We adjust the fill factor of these pages with FILLFACTOR. The default value is zero and all data pages are filled. Keeping the Fillfactor high will increase the number of pages. When reading the data, we reach more data by reading fewer pages. Keeping this value low, in other words, leaving the pages blank, will raise our performance in the insert process. This value should be well calculated as both cases have positive and negative aspects.
PAD_INDEX: We have stated that the Fillfactor value is implemented in leaf nodes. If we want to apply this value to the nodes at the Intermediate level, we should also use the pad_index option together with the fill factor option.
SORT_IN_TEMPDB: Activating this option means that our index operations will be in the Tempdb system database, not in the database.
IGNORE_DUP_KEY: When we add the same entry to the unique index-defined table again, this reduces the level of the error message and provides a warning with this option. An entry is not added, but if the processing is in a transaction, the transaction is not terminated, other processings are continued.
DROP_EXISTING: It is used when a new index is desired to be created with the identical name to the index name to be created. It deletes the old index and creates a new one.
ONLINE: We have stated that the table was locked while creating the index. With this option, the table is not locked when the index is being created and the data is accessed.
MAXDOP: With this option, we can indicate how many processors on our server will be used for index creation. This value indicates the processor value which will be used most.
DATA_COMPRESSION: This option provides compression of the data of our index, especially when creating large-size indexes.

These are frequently used index options. I have not given a place to all options here. Those who are interested in these can check this link.

There are many more things to mention about indexes. I have tried to explain what I used from a developer’s point of view. I hope I can achieve to come in useful. I will always be glad to get your comments about the article. In any case, you can reach me on Linkedin and Twitter.