Techniques for Indexing Low-Selectivity Columns in SQL Server
SELECT col1, col2, col3 FROM aLargeTable WHERE flag = 1
On the surface, this seems like an easy thing to solve. Looking closer, though, it’s likely that our offending “flag” column consists of only two values, and is a BIT column — or, what amounts to the same thing, a column of another type containing only a few different values. That means that indexing it in the usual fashion will not work, because the index will not be selective enough to provide the database engine with an advantage when selecting the rows and the optimizer is not likely to even use it. This, incidentally, would be correct behavior, because using the index to look up many individual rows would be slower than just scanning the whole table.
So, what to do? There are a few possible actions here, but some will help and some will not. What follows is an analysis of some techniques, with their performance impact, using SQL Server 2005.
To simulate this situation, we need an example with the following characteristics.
- A large number of rows.
- Some columns that have low or extremely low selectivity, such as BIT columns or CHAR columns with only a few different values.
- The need to select from the table using the low selectivity columns as selection criteria.
I’m going to use a hypothetical mailing list table to emulate this situation. The mailing list consists of people’s names, states, and two bit-column “flags” indicating whether their address is valid, and whether they have “opted in.” (After all, all mailing lists allow people to opt in or out, right? In a perfect world, at least, they would.) The query we want to optimize is one that chooses every row having both of these flag values set to 1/True.
Create Sample Data
Here is the sample table:
CREATE TABLE baseTable ( aKey int identity( 10000000,1 ) PRIMARY KEY, fname VARCHAR(50), lname VARCHAR(100), validAddr BIT, getsMail BIT, state CHAR(2) )
I’ll populate the table with a great many rows of sample data, using random selections from a cross join, broken into manageable chunks by a loop. On my test system, I have two tables of sample first and last names, obtained from the U.S. Census Bureau Web site — handy for producing legible test data — that you’ll see in the select statements below. There is also a table of U.S. states. If you try something like this, be sure you have enough transaction log and disk space, and mind the overall size of the cross join to keep it reasonable:
declare @i int set @i = 0 while @i < 10 — Will yield 10 * 500,000 rows begin insert into basetable select top 500000 f.[name], l.[name], va.val, gm.val, s.code from ( select top 500 [name] from sampDataFNames order by newid() ) as f, ( select top 500 [name] from sampDataLNames order by newid() ) as l, ( select 0 val union all select 1 val ) as va, ( select 0 val union all select 1 val ) as gm, ( select top 5 code from states order by newid() ) as s order by newid() set @i = @i + 1 end