Not All SQL Server Indexes Are Created Equal
If you have much experience with indexes at all, you are probably already familiar with the difference between clustered and non-clustered indexes. But this article is not about them. This article is about whether or not the SQL Server Query Optimizer will use your carefully crafted indexes. You may not be aware of this, but just because a column has an index doesn’t mean the Query Optimizer will use it. As you can imagine, creating an index that never will be used is a waste of time, and in the worst cases, it can even reduce the performance of your application. Let’s learn why.
To start out, let’s look at a simple example. Let’s assume we have an accounting database. In that database is a table called “orders”. Among a number of different columns in this table, we are interested in two columns: “orderid” and “employeeid”. This table has 150,000 rows and there is non-clustered index on the “employeeid” table. Now let’s say we want to run the following query:
SELECT orderid FROM orders WHERE employeeid = 5
The first thing to notice about the query is that the “employeeid” column used in the WHERE clause of the query has a non-clustered index on it. Because of this, you would most likely assume that when this query is run through the Query Optimizer, that the Query Optimizer would use the index to produce the requested results.
Unfortunately, you can’t automatically make this assumption. Just because there is an available index doesn’t necessarily mean that the Query Optimizer will use it. This is because the Query Analyzer always evaluates whether or not an index is useful before it is used. If the Query Analyzer examines an index and finds it not useful, it will ignore it, and if necessary, it will perform a table scan to produce the requested results.
So what is a useful index? In order to answer this question, we need to understand that one of the biggest underlying goals of the Query Optimizer is to reduce the amount of I/O, and the corresponding amount of time it takes to perform execute a query. In other words, the Query Optimizers evaluates many different ways a query can be executed, and finds the one it thinks will produce the least amount of I/O. But what may be surprising is that using an available index does not always mean that it will result is the least amount of I/O used. In many cases, especially with non-clustered indexes, a table scan can produce less I/O than an available index.
Before the Query Optimizer uses an index, the Query Optimizer evaluates the index to see if it is selective enough. What does this mean? Selectivity refers to the percentage of rows in a table that are returned by a query. A query is considered highly selective if it returns a very limited number of rows. A query is considered to have low selectivity if it returns a high percentage of rows. Generally speaking, if a query returns less than 5% of the number of rows in a table, it is considered to have high selectivity, and the index will most likely be used. If the query returns from 5% – 10% of the rows, the index may or may not be used. If the query returns more than 10% of the rows, the index most likely will not be used. And assuming there are no other useful indexes for the query, a table scan will be performed.
Let’s return to our example query:
SELECT orderid FROM orders WHERE employeeid = 5
Just by looking at the query we don’t know if the available index on employeeid will be used or not. Let’s say that we know that of the 150,000 rows in the table, that “employeeid = 5” is true for 5,000 of the records. If we divide 5,000 by 150,000 we get 3.3%. Since 3.3% is less than 5%, the Query Optimizer will most likely use the available index. But what if “employeeid = 5” is true for 25,000 instead. In this case, we divide 25,000 by 150,000 and we get 16.6%. Since 16.6% is greater than 5%, or even 10%, the Query Optimizer most likely will not use the index and instead perform a table scan.
So how can a table scan use less I/O than using an index, such as the non-clustered index in our example? Non-clustered indexes are great if the index is highly selective, especially if you will be returning one record. But if many records will be returned, and the index is not very selective, it is very expensive in I/O to retrieve the data. The reason for this is that the Query Optimizer has to first go to the index to locate the data (using up I/O) and then it has to go to the table to retrieve it (more I/O). At some point, the Query Optimizer determines that it takes less I/O to scan an entire table than it does to go back and forth between the index and the table to retrieve the requested rows.
The example given above applies mostly to non-clustered indexes. If the available index is a clustered index, then the index may be used, even if there is low selectivity, because the index is the table and I/O operations can be very efficient.
So how does the Query Optimizer know if an available index is selective enough to be useful? It does this by maintaining index statistics on each index in every table. Index statistics are a histogram of values that are stored in the sysindexes table. These statistics are a sampling of the available rows that tells the Query Optimizer approximately how selective a particular index is.
Index statistics are created every time an index is created, rebuilt, when the UPDATE STATISTICS command is run, and automatically by the Query Optimizer as the need arises. Index statistics are not maintained in real time because that would put too much overhead on the server. But because index statistics are not real time, they can get out of date, and sometimes, the Query Optimizer can make a poor choice because the index statistics it uses are not current.
But just because statistics are current doesn’t mean that the Query Optimizer will use an available index. Remember, the Query Optimizer bases its decision on the selectivity of an index, and the Query Optimizer uses the index statistics to determine selectivity.
So if the Query Optimizer can check to see if a particular index is useful or not, how can we do the same thing? Fortunately, there is a command that let’s us examine an index and find out if a particular index is selective enough to be used.
Pages: 1 2