As always, there is more than one way to get a correct result. But which way gives you the best performance? Let’s first build a simplified test scenario to illustrate what we’re after.
CREATE TABLE #t (ArticleGroup CHAR(2), Price DECIMAL(8,2))
INSERT INTO #t VALUES(‘SP’, 1.2)
INSERT INTO #t VALUES(‘SP’, 1.2)
INSERT INTO #t VALUES(‘SP’, 2.1)
INSERT INTO #t VALUES(‘GR’, 2.2)
INSERT INTO #t VALUES(‘GR’, 2.3)
INSERT INTO #t VALUES(‘GR’, 2.3)
INSERT INTO #t VALUES(‘GR’, 2.3)
According to our task the resultset should look like this:
ArticleGroup Price Number
———— ———- ———–
SP 1.20 2
GR 2.30 3
(2 row(s) affected)
Now how to achieve that result?
1. Approach
SELECT ArticleGroup, Price, COUNT(*) AS Number
FROM #t a
GROUP BY ArticleGroup, Price
HAVING COUNT(*) >=
(SELECT MAX(Number)
FROM
(SELECT ArticleGroup, Price, COUNT(*) AS Number
FROM #t b
WHERE b.ArticleGroup = a.ArticleGroup
GROUP BY ArticleGroup, Price) c)
ArticleGroup Price Number
———— ———- ———–
SP 1.20 2
GR 2.30 3
(2 row(s) affected)
2. Approach
SELECT ArticleGroup,Price, COUNT(*) Number
FROM #t a
GROUP BY ArticleGroup,Price
HAVING NOT EXISTS
(SELECT *
FROM #t b
WHERE b.ArticleGroup = a.ArticleGroup
GROUP BY b.Price
HAVING COUNT(b.ArticleGroup) > COUNT(a.ArticleGroup))
ArticleGroup Price Number
———— ———- ———–
SP 1.20 2
GR 2.30 3
(2 row(s) affected)
When you compare the execution plans, you’ll find a “Lazy Spool” in the second approach. Usually this is a bad sign and you should try to restate the query to work around that logical operator. The way to do this is by restating the first approach:
3. Approach
SELECT ArticleGroup, Price, COUNT(*) AS Number
FROM #t t
GROUP BY ArticleGroup, Price
HAVING COUNT(*) =
(SELECT MAX(Number)
FROM
(SELECT ArticleGroup, Price, COUNT(*) AS Number
FROM #t
GROUP BY ArticleGroup, Price) x
WHERE ArticleGroup = t.ArticleGroup)
ArticleGroup Price Number
———— ———- ———–
SP 1.20 2
GR 2.30 3
(2 row(s) affected)
So, now that we have our contestants, let’s see how they perform on a representative table. The table contains 250,000 rows, has 8 ArticleGroups with 358 different prices. Without any indexing on the table, you get the following runtimes:
1. Approach: 5.390 Milliseconds
2. Approach: 19.033 Milliseconds
3. Approach: 436 Milliseconds
If you now place an index on (ArticleGroup, Price) we can observe the following runtimes:
1. Approach: 153 Milliseconds
2. Approach: 500 Milliseconds
3. Approach: 110 Milliseconds
You’ll see that the third approach is in both cases the fastest one. But what is even more impressive is how much the “non-optimal” first and second approaches can gain from the index.
Just for completeness: you can replace MAX() with various TOP methods:
SELECT ArticleGroup, Price, COUNT(*) AS Number
FROM #t a
GROUP BY ArticleGroup, Price
HAVING COUNT(*) =
(SELECT TOP 1 COUNT(*)
FROM #t
WHERE a.ArticleGroup = ArticleGroup
GROUP BY ArticleGroup, Price
ORDER BY COUNT(*) DESC)
Or:
SELECT ArticleGroup, Price, COUNT(*) AS Number
FROM #t t
GROUP BY ArticleGroup, Price
HAVING COUNT(*) =
(SELECT TOP 1 Number
FROM
(SELECT TOP 100 PERCENT ArticleGroup, Price, COUNT(*) AS Number
FROM #t
GROUP BY ArticleGroup, Price
ORDER BY Number DESC) x
WHERE ArticleGroup = t.ArticleGroup)
These will also yield the correct results but they don’t give any significant performance benefit.
When you’re on SQL Server 2005, you can take advantage of its new ranking functions:
SELECT ArticleGroup, Price, Cnt
FROM
(SELECT ArticleGroup, Price, COUNT(*) as Cnt,
RANK() OVER (PARTITION BY ArticleGroup
ORDER BY COUNT(*) DESC) AS Rnk
FROM #t
GROUP BY ArticleGroup, Price
) x
WHERE Rnk = 1