Improving 2D Range Query Performance in SQL Server

Can We Improve Performance?

In the fully general case, no, we cannot.
But odds are that if you're doing a BETWEEN on two different columns, those columns
aren't fully independent. There is some sort of relationship between them, a
relationship that you, the developer, know about, but SQL Server does not. Feeding
that information to the query optimizer (QO) can yield impressive results.

Example: Consider a table of events, each
having a start and end date. SQL Server doesn't know everything we do about
these two dates. For example, the end date must always be greater than the start
date. Let's also assume events have a limited duration: say no event will
last longer than 5 days. How can we pass this information to the QO?

Here’s our initial query:

SELECT * FROM Events
WHERE @date BETWEEN DateStart AND DateEnd

Because this is a 2D range query, it will
be very slow. But since we know no event can last more than five days, let’s
use a new predicate that encapsulates that:

SELECT * FROM Events
WHERE DateStart BETWEEN DATEADD(dd, -5, @date) AND @date

This new query is now a 1D range: it will
fully utilize a B-Tree index.  The only problem is that, while it returns all
the rows in the original query, it also returns many that are not part of that
query.  So let’s fix that by adding back the original predicate along with the
new one:

SELECT * FROM Events
WHERE DateStart BETWEEN DATEADD(dd, -5, @date) AND @date
AND @date BETWEEN DateStart AND DateEnd

This may look a little strange, but
conceptually it's easy to understand. The first BETWEEN uses a rapid index
search to narrow down our search. The second BETWEEN then limits those results
to exactly the ones we want.
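To see why adding the extra predicate is safe, here's a quick Python sketch (illustrative only; the table data is made up, not part of the benchmark) that checks the original and optimized predicates return the same rows whenever no event lasts longer than 5 days:

```python
from datetime import date, timedelta

# Hypothetical stand-in for the Events table: (DateStart, DateEnd) pairs,
# every event lasting at most 5 days.
events = [
    (date(2020, 1, 1), date(2020, 1, 3)),
    (date(2020, 1, 4), date(2020, 1, 9)),
    (date(2020, 1, 10), date(2020, 1, 10)),
]
dt = date(2020, 1, 5)  # the @date parameter

# Original predicate: @date BETWEEN DateStart AND DateEnd
original = [e for e in events if e[0] <= dt <= e[1]]

# Optimized predicate: the narrowing 1D range on DateStart,
# plus the original check to discard false positives
optimized = [e for e in events
             if dt - timedelta(days=5) <= e[0] <= dt
             and e[0] <= dt <= e[1]]

assert original == optimized
print(original)
```

Because every qualifying row satisfies DateEnd >= @date and DateEnd - DateStart <= 5 days, its DateStart is guaranteed to fall in the narrow window, so the extra predicate never drops a valid row.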

How much can this boost performance?  Let’s
do some benchmarking!

Test Data and Procedure

If you’re interested in a technique like
this, chances are you have some large tables and slow queries.  So let’s not
fool around with tiny data sets.  We’ll create a table with 30 million rows. 
Each row represents an event, with a random start date between 30 years ago and
today, and a random end date within 5 days of the start date.   A single
16-byte GUID is also present in each row (this doesn’t significantly affect the
test results; it’s present only to better model non-SARG columns in the data
row).

CREATE TABLE Events
(
    ID      UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    DtStart DATETIME NOT NULL,
    DtEnd   DATETIME NOT NULL
)

DECLARE @i           BIGINT
DECLARE @offset_days INTEGER
DECLARE @offset_min  INTEGER
DECLARE @offset_end  INTEGER
DECLARE @DtStart     DATETIME

SET @i = 0

WHILE @i < 30000000 BEGIN
    SET @offset_days = 365 * 30 * RAND(CHECKSUM(NEWID()))
    SET @offset_min  = 1440 * RAND(CHECKSUM(NEWID()))
    SET @offset_end  = 5 * 1440 * RAND(CHECKSUM(NEWID()))
    SET @DtStart = DATEADD(n, @offset_min, DATEADD(dd, -@offset_days, GETDATE()))
    INSERT Events (DtStart, DtEnd, ID)
        SELECT @DtStart, DATEADD(n, @offset_end, @DtStart), NEWID()
    SET @i = @i + 1
END

CREATE INDEX IX_EventInt ON Events (DtStart, DtEnd)

The query test script is below (one query
was commented out for each set of runs):

DECLARE @stime DATETIME
DECLARE @dt    DATETIME
SET @dt = '20000101'  -- example test date; any date in the 30-year range works

CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

SET @stime = GETDATE()

-- Original Query
SELECT * FROM Events WHERE @dt BETWEEN DtStart AND DtEnd

-- Optimized Query
SELECT DtStart, DtEnd FROM Events
WHERE DtStart BETWEEN DATEADD(dd, -5, @dt) AND @dt
  AND @dt BETWEEN DtStart AND DtEnd

PRINT DATEDIFF(ms, @stime, GETDATE())

Benchmark Results

Tests were run five times each, with a CHECKPOINT and DBCC DROPCLEANBUFFERS between each test to limit the effects
of caching. 

The first query had a mean run time of
19,400 ms (19.4s). 

The optimized query had a mean run time of
121 ms. That's over 150 times faster than the original query.
These sorts of gains are not uncommon for cases in which a range query is
preventing good use of the index.

Let’s look at another example.  You want to
query an employee table for individuals who applied for a position on or after
Jan 1, 2009, and were hired on or before Jan 1, 2010.  Your first attempt to
write this query would be something like this:

SELECT *
FROM Employees
WHERE Apply_Date >= '20090101' AND Hire_Date <= '20100101'

Again, we have a 2D range query and poor
performance.  How can we improve performance?  The key again is to build into
the query information about the relationship between the two date columns,
information the QO doesn’t have.  For instance, we know that someone cannot be hired
before they have applied.   With that, we can add a new predicate to our query:

SELECT *
FROM Employees
WHERE Hire_Date BETWEEN '20090101' AND '20100101'
  AND Apply_Date >= '20090101'

This may appear to be somewhat redundant,
but remember that the QO doesn’t know what we know.  Without the extra
predicate, it will choose one of the two possibilities, depending on index
stats:

Scan the index on Apply_Date for all rows where
Apply_Date is between Jan 1, 2009 and (max_date), or

Scan the index on Hire_Date for all rows where
Hire_Date is between (min_date) and Jan 1, 2010.

Statistics will let the optimizer do a little better
than scanning half the rows in your table, but the engine will still wind up
scanning multiple years' worth of data. With the extra predicate, it has to
examine only a single year.
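As with the events table, the rewrite is lossless precisely because a hire date can never precede its apply date. A small Python sketch (illustrative data only, not from the article's benchmark) makes the equivalence concrete:

```python
from datetime import date

# Hypothetical stand-in for the Employees table: (apply_date, hire_date) pairs,
# where hire_date is never earlier than apply_date.
employees = [
    (date(2008, 6, 1),  date(2008, 12, 15)),
    (date(2009, 2, 1),  date(2009, 8, 1)),
    (date(2009, 11, 1), date(2010, 6, 1)),
]

lo, hi = date(2009, 1, 1), date(2010, 1, 1)

# Original predicate: applied on or after Jan 1 2009, hired on or before Jan 1 2010
original = [e for e in employees if e[0] >= lo and e[1] <= hi]

# Rewritten predicate: the hire date must also fall on or after the apply cutoff,
# because hire_date >= apply_date >= lo for every qualifying row
rewritten = [e for e in employees if lo <= e[1] <= hi and e[0] >= lo]

assert original == rewritten
print(original)
```

The rewritten filter bounds Hire_Date on both sides, which is exactly what lets the engine do a tight range seek instead of an open-ended scan.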

Conclusion

This is a powerful technique that can yield
amazing performance gains for certain types of range queries. It is also
something to bear in mind for queries in general. Whenever you
have two or more columns that aren't fully independent, but have some sort of
relationship between them, you have an opportunity to feed that information to
SQL Server and optimize your queries.
