Improving 2D Range Query Performance in SQL Server


Can We Improve Performance?

In the fully general case—no, we cannot.  But odds are if you’re doing a BETWEEN on two different columns, those columns aren’t fully independent.  There is some sort of relationship between them, a relationship that you the developer knows, but SQL Server does not.  Feeding that  information to the QO can yield impressive results.

Example.Consider a table of events, each having a start and end date.   SQL Server doesn’t know everything we do about these two dates.  For example, end date must always be greater than start date.   Let’s also assume events have a limited duration: say no event will last longer than 5 days.   How can we pass this information to the QO?

Here’s our initial query:

SELECT*FROM Events

WHERE @date BETWEEN DateStart AND DateEnd

Because this is a 2D range query, it will be very slow. But since we know no event can last more than five days, let’s use a new predicate that encapsulates that:

SELECT*FROM Events

WHERE DateStart BETWEENDATEADD(dd,-5,@date)AND @date

This new query is now a 1D range: it will fully utilize a B-Tree index.  The only problem is that, while it returns all the rows in the original query, it also returns many that are not part of that query.  So let’s fix that by adding back the original predicate along with the new one:

SELECT*FROM Events

WHERE DateStart BETWEENDATEADD(dd,-5,@date)AND @date

AND @date BETWEEN DateStart AND DateEnd

This may look a little strange, but conceptually its easy to understand.  The first BETWEEN uses a rapid index search to narrow down our search.  The second BETWEEN then limits those results to exactly the ones we want.

How much can this boost performance?  Let’s do some benchmarking!

Test Data and Procedure

If you’re interested in a technique like this, chances are you have some large tables and slow queries.  So let’s not fool around with tiny data sets.  We’ll create a table with 30 million rows.  Each row represents an event, with a random start date between 30 years ago and today, and a random end date within 5 days of the start date.   A single 16-byte GUID is also present in each row (this doesn’t significantly affect the test results; it’s present only to better model non-SARG columns in the data row).

CREATETABLE Events

(

  ID     UniqueIdentifer NOTNULLDEFAULTNEWID(),

  DtStart DATETIMENOTNULL,

  DtEnd   DATETIMENOTNULL

)

DECLARE @i           BIGINT

DECLARE @offset_days INTEGER

DECLARE @offset_min  INTEGER

DECLARE @offset_end  INTEGER

DECLARE @DtStart     DateTime

SET @i = 0

WHILE @i <30000000 BEGIN

       SET @offset_days = 365*30 *RAND(CHECKSUM(NEWID()))

       SET @offset_min  = 1440 *RAND(CHECKSUM(NEWID()))

       SET @offset_end  = 5*1440 *RAND(CHECKSUM(NEWID()))

       SET @DtStart =DATEADD(n,@offset_min,DATEADD(dd,-@offset_days,GETDATE()))

       INSERT Events SELECT @DtStart,DATEADD(n,@offset_end,@DtStart),NEWID()

       SET @i = @i + 1

END

CREATEINDEX IX_EventInt ON Events(DtStart, DtEnd)

The query test script is below (one query was commented out for each set of runs)

DECLARE @stime DATETIME

CHECKPOINT

DBCC DROPCLEANBUFFERS

DBCC FREEPROCCACHE

SET @stime =GETDATE()

– Original Query

SELECT*FROM Events WHERE @dt BETWEEN DtStart AND DtEnd

– Optimized Query

SELECT DtStart, DtEnd FROM Events WHERE DtStart BETWEENDATEADD(dd,-5,@dt)AND @dt AND @dt BETWEEN DtStart AND DtEnd

PRINTDATEDIFF(ms, @stime,GETDATE())

Benchmark Results

Tests were run five times each, with a CHECKPOINT and DBCC DROPCLEANBUFFERS between each test to limit the effects of caching. 

The first query had a mean run time of 19,400 ms (19.4s). 

The optimized query had a mean run time of 121ms.   That’s over 150 times faster than the original query.  These sorts of gains are not uncommon for cases in which a range query is preventing good use of the index.

Let’s look at another example.  You want to query an employee table for individuals who applied for a position on or after Jan 1, 2009, and were hired on or before Jan 1, 2010.  Your first attempt to write this query would be something like this:

SELECT*

FROM Employees

WHERE Apply_Date >=’20080101′AND Hire_Date <=’20090101′

Again, we have a 2D range query and poor performance.  How can we improve performance?  The key again is to build into the query information about the relationship between the two date columns, information the QO doesn’t have.  For instance, we know that someone cannot be hired before they have applied.   With that, we can add a new predicate to our query:

SELECT*

FROM Employees

WHERE Hire_Date BETWEEN’20080101′AND’20090101′AND Apply_Date >=’20080101′

This may appear to be somewhat redundant, but remember that the QO doesn’t know what we know.  Without the extra predicate, it will choose one of the two possibilities, depending on index stats:

Scan index on Apply_Date for all rows where Apply_date between Jan 1 2009 and (max_date)

Scan index on Hire_Date for all rows where Hire_Date between (min_date) and Jan 1, 2009.

Statistics will let you do a little better than scanning half the rows in your table, but the engine will still wind up scanning multiple years worth of data.  With the extra predicate, we have to examine a single year only.

Conclusion

This is a powerful technique that can yield amazing performance gains for certain types of range queries.  It is also something you should bear in mind for all queries in general.  Whenever you have two or more columns that aren’t fully independent, but have some sort of relationship between them, you have an opportunity to feed that information to SQL Server, and optimize your queries.

Pages: 1 2




Related Articles :

One Response to “Improving 2D Range Query Performance in SQL Server”

  1. “Some” hints to Script in the “Test Data and Procedure”:
    1. UniqueIdentifer

    schould be UniqueIdentifier

    2. INSERT Events SELECT @DtStart,DATEADD(n,@offset_end,@DtStart),NEWID()

    schould be
    INSERT Events SELECT NEWID(),@DtStart,DATEADD(n,@offset_end,@DtStart)

    3. “Some” lack of spaces NOTNULLDEFAULTNEWID – looks like german numbers sechshundertsechsundsechzig :)

    4. @dt is not declared

    Was this inaccuracy well-planned to identify the feedback? bazinga :)

Software Reviews | Book Reviews | FAQs | Tips | Articles | Performance Tuning | Audit | BI | Clustering | Developer | Reporting | DBA | ASP.NET Ado | Views tips | | Developer FAQs | Replication Tips | OS Tips | Misc Tips | Index Tuning Tips | Hints Tips | High Availability Tips | Hardware Tips | ETL Tips | Components Tips | Configuration Tips | App Dev Tips | OLAP Tips | Admin Tips | Software Reviews | Error | Clustering FAQs | Performance Tuning FAQs | DBA FAQs |