Clustered Indexes in SQL Server: Things You Need to Know

This article covers a few more advanced topics about the use of clustered indexes in SQL Server. Not only will I try to convince you of the absolute necessity of clustered indexes, I'll also give you some tips on using them in a less obvious context.

Why You Need Clustered Indexes!

As you know, the SQL Server engine treats a table without a clustered index, also called a heap, differently from a clustered table. In contrast with a heap, a clustered table has its data pages ordered by the index key and linked together in a linked list. A heap uses the IAM and PFS pages to find a page with enough free space to hold a new row when the current page has none (more info in BOL).
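If you want to verify which of your tables are heaps, the catalog views will tell you. A minimal sketch, assuming SQL Server 2005 or later (on SQL Server 2000, the sysindexes table with its indid column plays the same role):

-- List each user table and whether it is a heap or a clustered table.
-- In sys.indexes, index_id 0 means heap and index_id 1 means clustered index.
SELECT o.name AS table_name,
       CASE i.index_id WHEN 0 THEN 'heap' ELSE 'clustered' END AS storage
FROM sys.indexes AS i
JOIN sys.objects AS o ON o.object_id = i.object_id
WHERE i.index_id IN (0, 1)
  AND o.type = 'U'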

To illustrate this difference in behavior, I created a small script you can run for yourself.

-- Init
USE tempdb
SET NOCOUNT ON
IF OBJECT_ID('dbo.TestTable', 'U') IS NOT NULL DROP TABLE dbo.TestTable
CREATE TABLE dbo.TestTable (
    GUID_col_indexed uniqueidentifier,
    CHAR_col char(10),
    GUID_col uniqueidentifier)
CREATE CLUSTERED INDEX IX1 ON dbo.TestTable ( GUID_col_indexed )
GO
-- Step 1: Load TestTable with 100,000 records
declare @t int
set @t = 1
while @t <= 100000
begin
    insert TestTable values (newid(),
        char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) +
        char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65),
        newid())
    select @t = @t + 1
end
go
-- Step 2: Select 100 random values
select top 100 * from TestTable order by GUID_col
go
-- Step 3: Delete 1000 random values
delete TestTable where GUID_col in
    (select top 1000 GUID_col from TestTable order by GUID_col desc)
go
-- Step 4: Insert 1000 more values
declare @t int
set @t = 1
while @t <= 1000
begin
    insert TestTable values (newid(),
        char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) +
        char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65),
        newid())
    select @t = @t + 1
end
go

This script is split into 4 steps:

  1. Load 100,000 records into a table using randomly generated values for all fields.
  2. Select 100 random values.
  3. Delete 1,000 random values.
  4. Load 1,000 more records.

To generate random values, I used uniqueidentifiers. I deliberately did not use the indexed column as a SARG (search argument) in the queries, to avoid any advantage related to the physical ordering of the "random" values.

You should run this script twice: first using the script exactly as it appears above (creating the index as clustered), and then leaving out the CLUSTERED keyword. You will get results, I hope, comparable to those I got after a few test runs:
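If you want to capture comparable figures yourself without setting up a Profiler trace, SET STATISTICS TIME and SET STATISTICS IO report CPU time, elapsed time, and logical reads per statement. A minimal sketch (run it before executing the steps above; for the heap run, simply create IX1 without the CLUSTERED keyword):

-- Report CPU time and elapsed time (ms) for each statement.
SET STATISTICS TIME ON
-- Report logical and physical reads for each statement.
SET STATISTICS IO ON
GO
-- ... now run steps 1 through 4 ...
-- Heap variant of the index, for the second run:
-- CREATE INDEX IX1 ON dbo.TestTable ( GUID_col_indexed )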

                                           CPU             Reads              Duration
                                           Heap   Cluster  Heap     Cluster   Heap   Cluster
Step 1: Load table with 100,000 records    5094   4063     374627   281336    5640   4926
Step 2: Select 100 random values             78     78        671      942      97     84
Step 3: Delete 1,000 random values          641    719       9551     7698     811    834
Step 4: Insert 1,000 more values            218     47       5717     3076     210     45
Even without exploiting any specific characteristic of the clustered index, these results clearly show the internal overhead that comes with heap structures.

Clustered Indexes on uniqueidentifier Columns: Out of the Question! Or Not?

Most whitepapers discourage using a clustered index on a uniqueidentifier column. Because of its random generation, records are inserted randomly across all data pages belonging to your table's filegroup. This is real overhead: many more extents are swapped into and out of memory, page splits are more likely to occur, and so on.

But in some VLDB environments with high transactional throughput, it can be a solution when I/O throughput is a potential bottleneck. Suppose you have a large (>100 GB) table on a filegroup spread over 10 physical disks, with a clustered index on a [date-inserted] column. When many inserts occur in combination with lots of updates of recently added records, all data throughput is focused on a small set of physically adjacent data pages, producing a hot spot. The volume of dirty pages the checkpoint process wants to flush to disk can become too high for one disk to handle. This is where a clustered index on a uniqueidentifier column can help: instead of flushing all dirty pages to one disk, the I/Os are spread over all physical disks.
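To make this concrete, here is a minimal sketch of the two designs. The table and column names (dbo.Orders, order_id, date_inserted) are hypothetical, not taken from a real system:

-- Hot-spot design: all inserts land on the last pages of the clustered index.
CREATE TABLE dbo.Orders (
    order_id      uniqueidentifier NOT NULL DEFAULT newid(),
    date_inserted datetime NOT NULL DEFAULT getdate(),
    payload       char(200) NOT NULL)
CREATE CLUSTERED INDEX IX_Orders ON dbo.Orders ( date_inserted )
GO
-- Alternative: cluster on the random uniqueidentifier instead, so inserts
-- (and the checkpoint's dirty-page flushes) are spread over all pages and disks.
CREATE CLUSTERED INDEX IX_Orders ON dbo.Orders ( order_id ) WITH DROP_EXISTING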

Pay special attention to fragmentation when you use this solution. Because of the random inserts, page splits are more likely to occur, causing more fragmentation. When your data volumes are too high for a daily rebuild, perform a general index rebuild every week. I personally prefer the CREATE INDEX ... WITH DROP_EXISTING statement to rebuild indexes, as sketched below.
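As an illustration on the TestTable from the earlier script, you can check fragmentation with DBCC SHOWCONTIG (on newer versions, sys.dm_db_index_physical_stats serves the same purpose) and then rebuild the index in place:

-- Inspect fragmentation; watch Scan Density and Logical Scan Fragmentation.
DBCC SHOWCONTIG ('dbo.TestTable', 'IX1')
GO
-- Rebuild the clustered index in place, keeping its name and definition.
CREATE CLUSTERED INDEX IX1 ON dbo.TestTable ( GUID_col_indexed )
    WITH DROP_EXISTING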

Conclusion

  1. Always use a clustered index on an OLTP database table.
  2. If you consider using a clustered index on a uniqueidentifier column:
    1. Test your solution in a stress test environment.
    2. Rebuild indexes regularly.