Speed Up SELECT DISTINCT Queries

The performance improvement you get depends on the ratio of matching rows in the left and right (or inner and outer) tables. The two queries below will work in any SQL Server database. Try pasting them into Query Analyzer and comparing the execution plans and I/O costs they produce in different databases. The second query usually comes out as more efficient, though the actual performance gain varies.

SELECT DISTINCT o.name
FROM sysobjects o
JOIN sysindexes i
ON o.id = i.id
WHERE o.type = 'U'

SELECT o.name
FROM sysobjects o
WHERE o.type = 'U'
AND EXISTS (
SELECT 1
FROM sysindexes i
WHERE o.id = i.id
)
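
One way to make the I/O comparison concrete is to turn on SET STATISTICS IO before running both queries in the same batch. This is only a sketch of the comparison suggested above; the read counts you see depend entirely on your own database, so none are shown here.

```sql
-- Report logical and physical reads per table for each statement that follows.
SET STATISTICS IO ON;

-- Query 1: join first, then remove the duplicate names with DISTINCT.
SELECT DISTINCT o.name
FROM sysobjects o
JOIN sysindexes i
ON o.id = i.id
WHERE o.type = 'U';

-- Query 2: test for a matching index row, so no duplicates are produced.
SELECT o.name
FROM sysobjects o
WHERE o.type = 'U'
AND EXISTS (
SELECT 1
FROM sysindexes i
WHERE o.id = i.id
);

-- Switch the reporting off again for the rest of the session.
SET STATISTICS IO OFF;
```

The read counts for each statement appear with the messages output rather than in the result set, so compare the figures reported for sysobjects and sysindexes under each query.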

You need to understand the relationship between the two (or more) tables you are joining in order to execute this trick properly. The two Northwind database queries below are designed to return customer IDs where a discount of more than 2 percent has been given on any item. At first glance, these two queries appear to produce the same results because they follow the format in the examples above, but the results you get are actually different.

SELECT DISTINCT customerID
FROM orders o
JOIN [order details] od
ON o.OrderID = od.OrderID
WHERE discount > 0.02

SELECT customerID
FROM orders o
WHERE EXISTS (
SELECT *
FROM [order details] od
WHERE o.OrderID = od.OrderID
AND discount > 0.02
)

These two queries do not return the same results because it is OrderID that defines the relationship between the two tables, not the customer ID. The second query tests EXISTS once for each row in orders, so it returns the same customer ID multiple times, once for each of that customer's orders that contains a qualifying discount. Try adding the OrderID column into the SELECT list to see this.
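
As a sketch of how the EXISTS form could be brought back in line with the DISTINCT query, the outer query can be driven from the Northwind customers table so that each customer is tested only once. This assumes every order carries a non-null customerID that matches a row in customers; an order with a NULL customerID would appear in the DISTINCT result but not here.

```sql
-- Diagnostic: adding OrderID shows one row per qualifying order,
-- which is why customer IDs repeat in the EXISTS query above.
SELECT o.customerID, o.OrderID
FROM orders o
WHERE EXISTS (
SELECT *
FROM [order details] od
WHERE o.OrderID = od.OrderID
AND od.discount > 0.02
)
ORDER BY o.customerID, o.OrderID;

-- Possible rewrite: one row per customer, with no DISTINCT needed.
-- Assumes each order's customerID is non-null and exists in customers.
SELECT c.customerID
FROM customers c
WHERE EXISTS (
SELECT *
FROM orders o
JOIN [order details] od
ON o.OrderID = od.OrderID
WHERE o.customerID = c.customerID
AND od.discount > 0.02
);
```

Whether this rewrite is actually faster than the SELECT DISTINCT version depends on your data and indexes, so compare the two execution plans before adopting it.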

So the next time you find yourself using the SELECT DISTINCT statement, take a moment to see if it can be rewritten for improved performance. You may be surprised at what a little recoding can do for your application.

