Techniques for Indexing Low-Selectivity Columns in SQL Server

This method will probably produce some duplicates, but in the context of what we are testing, it makes no difference. (The technique for random selection is derived from one posted on Pete Freitag’s Web site and elsewhere.)

We should end up with a distribution where about 25% of the rows in the table have both “flags” set to 1. It’s possible to verify that with a query like this:

select
     ( select count(*) from baseTable ) as Total,
     ( select ( select count(*) from baseTable where validAddr = 1 ) * 100.0
       / ( select count(*) from baseTable ) ) as PercentValidAddr,
     ( select ( select count(*) from baseTable where getsMail = 1 ) * 100.0
       / ( select count(*) from baseTable ) ) as PercentGetsMail,
     ( select ( select count(*) from baseTable where getsMail = 1 and validAddr = 1 ) * 100.0
       / ( select count(*) from baseTable ) ) as PercentBoth

We have enough rows, at 5 million, to hit the 50% / 50% / 25% split almost exactly through probability alone: each flag is set on about half the rows independently, so roughly 0.5 × 0.5 = 25% of the rows end up with both.

Measure Baseline Query Performance

The first step in optimizing this is to get a baseline for the query. For this, I use the Display Execution Plan option in Management Studio and turn on I/O statistics:

set statistics io on
GO
select fname, lname from baseTable where validAddr = 1 and getsMail = 1

As expected, this results in a Clustered Index Scan, with an execution plan cost of 25.4 and 21,561 reads. The database is examining the entire table to fish out the 25% of rows that match our criteria. Intuitively, some of that work ought not to be required, but avoiding it is trickier than it looks.
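As an aside, if you want CPU and elapsed times alongside the read counts while taking the baseline, the same session-level switches extend naturally. This is just an optional extra measurement, not part of the baseline numbers above:

-- Optional: capture CPU and elapsed time alongside the I/O counts
set statistics io on
set statistics time on
GO
select fname, lname from baseTable where validAddr = 1 and getsMail = 1
GO
set statistics io off
set statistics time off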

Indexing

So what about a conventional index? We suspect it will not help, because it would not be selective enough, but it’s important to really test that assumption:

create nonclustered index IX_Flags on baseTable( validAddr, getsMail )
GO
select fname, lname from baseTable where validAddr = 1 and getsMail = 1

There’s no advantage here. SQL Server still scans the table at the same cost. As we suspected, the index is not selective enough to be of any use; we are asking for too many rows for individual index lookups to be practical.
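Before dropping the index, it’s worth seeing why the optimizer ignores it. One quick, admittedly artificial experiment is to force the index with a table hint; this isn’t part of the original test, but it makes the trade-off concrete:

-- Force the optimizer to use IX_Flags (illustration only).
-- The seek on the narrow index finds the matching keys, but each of the
-- roughly 1.25 million matching rows then needs a key lookup back into the
-- clustered index to fetch fname and lname, which is far more work than one scan.
select fname, lname
from baseTable with ( index ( IX_Flags ) )
where validAddr = 1 and getsMail = 1

With 25% of 5 million rows qualifying, those key lookups dwarf the cost of simply scanning the clustered index, which is exactly why the optimizer won’t choose the index on its own.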

drop index baseTable.IX_Flags

Next, it may be possible to create a covering index instead, ordered so that all the data for the records we are interested in sits together in one contiguous range. The catch that makes this a less attractive option is size: because the index has to carry every column the query needs, it will be essentially as large as the table itself, so we’ve doubled, or nearly doubled, our storage requirement. Ordered correctly, such an index would group the records we care about together, but we would be storing all of the rows when we are actually interested in only 25% of them. Ideally, we’d want the covering index to contain only that subset of rows rather than all of them, but that’s not possible with a conventional index.
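For reference, a covering index along those lines would look something like this. I’m not actually creating it here, and the INCLUDE list simply mirrors the columns our test query touches:

-- Covering index sketch (not created here): the keys group the flagged rows
-- together, and INCLUDE carries fname and lname so the query never has to go
-- back to the base table, but every row in the table is still stored in it.
create nonclustered index IX_Flags_Covering
    on baseTable ( validAddr, getsMail )
    include ( fname, lname )

That INCLUDE list is where the near-doubling of storage comes from: the index repeats the queried columns for all 5 million rows, even though only a quarter of them ever satisfy the predicate.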

Continues…
