Understanding Scans And Seeks

There is a popular theory on scans and seeks: “Scan reads through the object (table/index) from top to bottom looking for the records that it needs while seek goes directly to the part of the object that it needs and reads to where the data that it needs ends. This is obviously a must more efficient operation than a scan, as database server already knows where the data is that it is looking for”. This article analyzes the validity of the theory.

The problem

Some time back I had a situation where I had to find the last key of a table. The table was designed to have an integer column as the primary key with the key always incrementing. However the table did not have an identity column. It forced me to find the last key before the insert occurs. As the table grew over time, each operation against the table became increasingly costly.
I used a query similar to the one below to find the last value of the key. (This example uses AdventureWorks database.)

SELECT TOP 1 TransactionID
FROM Production.TransactionHistory

One day when I was checking something, I happened to see the execution plan.

I was puzzled. It was using a scan (In my case a clustered index scan). A clustered index is organized slightly differently than a non-clustered index. While all other indexes have separate index pages for the entire index, clustered indexes use the data pages for the leaf level entries. Does that mean SQL server is scanning the entire table to get just one row? It can easily find the last entry in the index and get the value. So in my opinion it should be a seek operation. So, why I observe this peculiar behavior? br />

My Solution

I remembered I read somewhere that SQL Server uses backward search as well. (That is, searching from last entry towards the first entry). May be it is not true then, I thought, and removed that DESC keyword from the query to check the execution plan. But it didn’t make any difference in the plan. It was a confirmation that SQL Server treats backward and forward searches the same way. However, it did not solve my problem.

I wrote the query in different ways, such as below:

DECLARE @Rows bigint
SET @Rows = 1
SELECT TOP (@Rows) TransactionID
FROM Production.TransactionHistory

To add to my frustration, it gave the same execution plan with scan operation
taking almost 100% of the total cost.

SELECT TransactionID
FROM Production.TransactionHistory

The execution plan changed only slightly. When the Top operator is removed, I saw a slight improvement in the total estimated cost, but still the major operation stayed as a clustered index scan.
The scan changed into seek when I wrote the query differently:

SELECT TransactionID
FROM Production.TransactionHistory
WHERE TransactionID <= 2147483647


2147483647 is the largest number “int” data type can hold. So, it is clear the results will not change because of the additional condition. But to my amusement, the execution plan changed.

Hurray! I have made the breakthrough. Now I need to find what improvement it makes. So I took my original query and the new one and executed them. Both queries executed within almost the same time. Except for the improvement I saw by removing the top clause, I saw no additional improvement. The execution plan said both queries share the same cost (50% of the total cost). What? Clustered index scan and clustered index seek having the same cost? Impossible, I thought. So, I decided to add some additional methods to do a calculation.

The first thing I added was SET STATISTICS TIME ON. Unfortunately, I didn’t see any difference there either.

Then I added SET STATISTICS IO ON and executed the statements, but they too returned the same numbers.

(1 row(s) affected)

Table ‘TransactionHistory’. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0

(1 row(s) affected)

Table ‘TransactionHistory’. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0

Note: I have removed some information to make it simple.
Now I have more questions than answers. How come, both index seek and scan complete with just 3 logical pages of read? Is this index that small? But I have millions of rows in the table. Even though I was working with an old copy of the production database, I know it has reasonably large table there.
I checked the Index depth and index pages using dynamic management views

SELECT i.name, p.* 
FROM sys.sysindexes i INNER JOIN
sys.dm_db_index_physical_stats(db_ID(), Object_ID('Production.TransactionHistory'), NULL, NULL, DEFAULT) p
ON i.id = p.object_id AND i.indid = p.index_id

I found that the primary key has many rows and the depth is 3 (a root page, one
intermediate level page and leaf level).

It took me some time to understand what seek and scan operations are and how
they work. Prior to that I was under the influence of the popular thinking that
seek reads a single row or few pages at a time and scan reads all or almost all
pages partially because I have combined two different statements:

  • Scan is better when many rows are returned and seek is better when only a few
    rows are returned.
  • SQL Server can determine the  execution plan

On the contrary, in my case both seek and scan performed with the same cost.
Also, the operation is not determined by the number of rows (or even the
percentage of the total rows).



Pages: 1 2 3


No comments yet... Be the first to leave a reply!