Techniques for Indexing Low-Selectivity Columns in SQL Server
Impact on Updates and Inserts
This technique will have some cost in terms of changes to the base table, so it’s important to estimate how much, and whether that impact will be detrimental. The sample data I created happened to include about 150,000 rows having state = “MT”, so to at least get some idea of the update performance hit, I performed some updates on those rows and compared the execution plans. This is not a comprehensive test of inserts and updates, but just a brief check.
First, with the indexed view, a change to the base table has to also be affected in the view’s data. For a statement like this one:
update baseTable set getsMail = 0 where state = ‘MT’
An update has to be performed against both the base table and the view content, and each will show up in the execution plan as a node labeled as a “clustered index update.” It might seem odd, but that is what is happening: the view data is persisted, and has its own clustered index, so that clustered index has to be updated along with the table itself. The cost in this case is 28.79.
Next, I’ll remove the view, and perform a similar update:
drop view completeSubset
update baseTable set getsMail = 1 where state = ‘MT’
The cost to perform this update without the indexed view is 24.70, which is slightly less than the cost when the view is present, as one would expect. But this difference does not seem like a significant impact compared to performance gain we achieved on select queries. If the workload on this hypothetical system favors selects over changes — and many systems do by even a factor of ten — then the benefits are probably worth it. Let me roughly generalize the costs above into a guesstimate, understanding that this is an oversimplification: Suppose that the select statements run five times as often as updates on our system. Our test provided a four-fold improvement in selection, multiplied by five is a 20x weighted gain in selection performance, and a relatively small 0.86x loss in update performance.
Conventional indexes are not much help when the data they index is not selective. However, if you have adequate storage space and data cache available, and the hit in insert/update performance is not an overriding concern, then it is possible to get significant gains in select performance on this type of data by denormalizing into separate indexed views. This type of denormalization is relatively safe, because — unlike denormalized data in tables — the database engine itself maintains the consistency of the data. So there is little or no worry that the data in the indexed views will get out of sync with the base tables, and there is no additional effort or coding required to keep the view data up to date.
Good candidates for this technique include tables with large numbers of rows, where it’s common to select from those rows based on columns that have low selectivity, but where the result sets are expected to be smaller than the base table itself. For example, if the selected set comprises 80% of the base table, then the benefit of this technique is not likely to be great, but if the selected set is only 25% of the table, then it will be significant. There is a trade-off in terms of the storage space required, but the space requirement is mitigated by storing only the matching rows from the base table, rather than all of them, as with a covering index. There is also some price in insert/update performance, but on many systems, the demands made by select queries far outweigh those made by inserts and updates.