Why not to use unique identifiers as SQL server clustered keys

Published in

Julius Baer Engineering

5 min readJan 11, 2023

Clustered index

Indices are used to improve the speed of database operations of a table containing many rows. Much like a book index, which allows a reader to easily jump to a specific topic without having to skim through the entire book, a database index facilitates the location of a specific row of a table without the need to scan through the entire table. There are two kinds of indices: a clustered index, and a non-clustered index. A clustered index defines the order in which data rows are physically stored on a disk. A non-clustered index does not require that the data rows be ordered physically. In this blog, I want to take a close look at the clustered index of an SQL Server and its performance implications when suboptimal cluster values are used.

What is a page split?

In an SQL server, data rows are stored as a series of 8 KB pages. A page is the smallest allocatable storage unit. Within a page, data rows are stored in slots. When a row needs to be stored on a certain page and that page doesn’t have enough storage space, the SQL server allocates a new page and moves some rows of the existing page to the newly allocated page to make room for the new row. This process is called a page split.

In the following example, we create a ‘SampleTable’ with a clustered index on the ‘Code’ column. Each row of the table has an approximate size of 1.8 KB, so four rows can be stored on each page. We use some scripts to illustrate the page split. To make the physical location of a row visible, we use the stored function ‘sys.fn_PhysLocFormatter’ of the SQL Server.

CREATE TABLE SampleTable (Id int Identity(1,1) NOT NULL, Code nvarchar(10) NOT NULL, Data nvarchar(900) NOT NULL)
CREATE CLUSTERED INDEX IX_SampleTable_Code ON SampleTable (Code)
INSERT SampleTable (Code, [Data]) VALUES(‘B’, replicate(‘A’,900))
INSERT SampleTable (Code, [Data]) VALUES(‘A’, replicate(‘A’,900))
INSERT SampleTable (Code, [Data]) VALUES(‘E’, replicate(‘E’,900))
SELECT sys.fn_PhysLocFormatter(%%physloc%%) As ‘Physical Location (file, page, slot)’, * FROM SampleTable

The result shows that rows are indeed physically stored on the same page and, within the same page, they are ordered according to the sorting order of the ‘Code’ column, as indicated by their slot numbers.

INSERT SampleTable (Code, [Data]) VALUES(‘D’, replicate(‘D’,900))
SELECT sys.fn_PhysLocFormatter(%%physloc%%) As ‘Physical Location (file, page, slot)’, * FROM SampleTable

Inserting a new row with ‘D’ in its ‘Code’ column causes the existing row with ‘E’ in its ‘Code’ column to be relocated from its previous slot 2 to slot 3. And the new row is then stored in slot 2.

INSERT SampleTable (Code, [Data]) VALUES(‘C’, replicate(‘C’,900))
SELECT sys.fn_PhysLocFormatter(%%physloc%%) As ‘Physical Location (file, page, slot)’, * FROM SampleTable

Inserting another row with ‘C’ in its ‘Code’ column causes a page split, since there is not enough storage space on page ‘16192’. This forces the SQL server to allocate a new page with the page id ‘16194’, relocate rows ‘B’, ‘D’ and ‘E’ to the newly allocated page, and store row ‘C’ on page “16194”.

Why do page splits hurt performance?

When a page split occurs, existing rows on a page need to be relocated. Whenever a row on a page is relocated, all indices of that row also need to be updated, and all relocation operations are logged on the transaction log. Thus, page split is expensive in terms of disk IO. Page split also causes index fragmentation. Fragmentation occurs when there is a lot of unused storage on a data page (internal fragmentation) or when there is a mismatch between the logical order and physical order of the pages. Fragmentation can cause SQL servers to read more pages than necessary and waste valuable memory that, could otherwise be used for caching. To sum up, page split slows down insert and update operations, whereas index fragmentation slows down query operations.

Can page splits be avoided?

In this example, we will recreate the same ‘SampleTable’. But the clustered index of this table is now in the ‘Id’ column, which is an auto-incremented identity column. And we use the same insertion statements to insert the same data into the table.

CREATE TABLE SampleTable (Id int Identity(1,1) NOT NULL, Code nvarchar(10) NOT NULL, Data nvarchar(900) NOT NULL)
CREATE CLUSTERED INDEX IX_SampleTable_Code ON SampleTable (Id)
INSERT SampleTable (Code, [Data]) VALUES(‘B’, replicate(‘A’,900))
INSERT SampleTable (Code, [Data]) VALUES(‘A’, replicate(‘A’,900))
INSERT SampleTable (Code, [Data]) VALUES(‘E’, replicate(‘E’,900))
INSERT SampleTable (Code, [Data]) VALUES(‘D’, replicate(‘D’,900))
SELECT sys.fn_PhysLocFormatter(%%physloc%%) As ‘Physical Location (file, page, slot)’, * FROM SampleTable

The result shows that the physical records of the rows are now sorted by Id, as specified by the clustered index.

INSERT SampleTable (Code, [Data]) VALUES(‘C’, replicate(‘C’,900))
SELECT sys.fn_PhysLocFormatter(%%physloc%%) As ‘Physical Location (file, page, slot)’, * FROM SampleTable

Inserting row ‘C’ into the table causes a new page ‘24290’ to be allocated. But none of the existing records on page ‘24288’ were relocated. That’s because the SQL server, knows that Id is of an auto-incremented identity, and no other values can be inserted on page ‘24288’. It therefore just needs to allocate a new page for the row with the Id 5. So, by using strictly monotonic increasing values as cluster keys, resource-intensive page splits can be avoided.

Another way to reduce the number of page splits is by specifying a fill factor value for the clustered index. When an index is rebuilt, the fill-factor value determines the percentage of space on each leaf-level page to be filled with data. It is important to note that the fill factor value doesn’t impact the update or insert operation. It is only used for rebuilding or reorganising an index.

CREATE TABLE SampleTable (Id int Identity(1,1) NOT NULL, Code nvarchar(10) NOT NULL, Data nvarchar(1000) NOT NULL)
CREATE CLUSTERED INDEX IX_SampleTable_Code ON SampleTable (Code) WITH (Fillfactor = 70)
INSERT SampleTable (Code, [Data]) VALUES(‘B’, replicate(‘A’,1000))
INSERT SampleTable (Code, [Data]) VALUES(‘A’, replicate(‘A’,1000))
INSERT SampleTable (Code, [Data]) VALUES(‘E’, replicate(‘E’,1000))
INSERT SampleTable (Code, [Data]) VALUES(‘D’, replicate(‘D’,1000))
SELECT sys.fn_PhysLocFormatter(%%physloc%%) As ‘Physical Location (file, page, slot)’, * FROM SampleTable

The result shows that the fill factor has no impact on inserts. All four rows are stored on the same page, which exceeds 70% as specified by the index.

ALTER INDEX IX_SampleTable_Code ON SampleTable REBUILD
SELECT sys.fn_PhysLocFormatter(%%physloc%%) As ‘Physical Location (file, page, slot)’, * FROM SampleTable

After the index rebuild, the previous almost 100%-filled page (8440) is now split into two pages (16536,16544), obeying the 70% fill factor rule of the index.

INSERT SampleTable (Code, [Data]) VALUES(‘AB’, replicate(‘AB’,500))
INSERT SampleTable (Code, [Data]) VALUES(‘BC’, replicate(‘BC’,500))
SELECT sys.fn_PhysLocFormatter(%%physloc%%) As ‘Physical Location (file, page, slot)’, * FROM SampleTable

After the index rebuild, new rows can be inserted without causing a page split. For the fill factor to reduce the number of page splits, the index needs to be rebuilt or reorganised regularly.

Although a correctly chosen fill factor value and an index rebuild interval can reduce potential page splits by leaving enough storage space on a page for row insertion, this also increases index fragmentation at the same time. Using a fill factor is therefore a suboptimal solution.

Conclusion

A clustered index is a type of index that, defines the physical ordering of the records on the disk. It boosts the performance of range queries and sorting. Ideally, strictly monotonic increasing values should be used as cluster keys. The use of random values, such as unique identifiers, as cluster keys can cause page splits. Excessive page splits cause excessive disk IOs and thus hurt the performance of a SQL server. By default, the SQL server creates a clustered index on the primary key. Therefore, we should be aware of the performance implications when choosing a unique identifier as a primary key of a table.

Why not to use unique identifiers as SQL server clustered keys

Written by Datong Xu