How do I find the most frequently occurring value for all data in a table? How do I find that value per group?

As always, there is more than one way to get a correct result. But which way gives you the best performance? Let’s first build a simplified test scenario to illustrate what we’re after.

CREATE TABLE #t (ArticleGroup CHAR(2), Price DECIMAL(8,2))
INSERT INTO #t VALUES(‘SP’, 1.2)
INSERT INTO #t VALUES(‘SP’, 1.2)
INSERT INTO #t VALUES(‘SP’, 2.1)
INSERT INTO #t VALUES(‘GR’, 2.2)
INSERT INTO #t VALUES(‘GR’, 2.3)
INSERT INTO #t VALUES(‘GR’, 2.3)
INSERT INTO #t VALUES(‘GR’, 2.3)

According to our task the resultset should look like this:

ArticleGroup Price      Number
———— ———- ———–
SP           1.20       2
GR           2.30       3 (2 row(s) affected)

Now how to achieve that result?

1. Approach

SELECT ArticleGroup, Price, COUNT(*) AS Number
  FROM #t a
 GROUP BY ArticleGroup, Price
  HAVING COUNT(*) >=
   (SELECT MAX(Number)
      FROM
      (SELECT ArticleGroup, Price, COUNT(*) AS Number
         FROM  #t b
        WHERE b.ArticleGroup = a.ArticleGroup
         GROUP BY ArticleGroup, Price) c)

ArticleGroup Price      Number
———— ———- ———–
SP           1.20       2
GR           2.30       3 (2 row(s) affected)

2. Approach

SELECT ArticleGroup,Price, COUNT(*) Number
  FROM #t a
 GROUP BY ArticleGroup,Price
  HAVING NOT EXISTS
   (SELECT *
      FROM #t b
     WHERE b.ArticleGroup = a.ArticleGroup
     GROUP BY b.Price
     HAVING COUNT(b.ArticleGroup) > COUNT(a.ArticleGroup))

ArticleGroup Price      Number
———— ———- ———–
SP           1.20       2
GR           2.30       3 (2 row(s) affected)

When you compare the execution plans, you’ll find a “Lazy Spool” in the second approach. Usually this is a bad sign and you should try to restate the query to work around that logical operator. The way to do this is by restating the first approach:

3. Approach

SELECT ArticleGroup, Price, COUNT(*) AS Number
  FROM #t t
 GROUP BY ArticleGroup, Price
  HAVING COUNT(*) =
  (SELECT MAX(Number)
     FROM
      (SELECT ArticleGroup, Price, COUNT(*) AS Number
         FROM #t
        GROUP BY ArticleGroup, Price) x
         WHERE ArticleGroup = t.ArticleGroup)

ArticleGroup Price      Number
———— ———- ———–
SP           1.20       2
GR           2.30       3 (2 row(s) affected)

So, now that we have our contestants, let’s see how they perform on a representative table. The table contains 250,000 rows, has 8 ArticleGroups with 358 different prices. Without any indexing on the table, you get the following runtimes:

1. Approach: 5.390 Milliseconds
2. Approach: 19.033 Milliseconds
3. Approach: 436 Milliseconds

If you now place an index on (ArticleGroup, Price) we can observe the following runtimes:

1. Approach: 153 Milliseconds
2. Approach: 500 Milliseconds
3. Approach: 110 Milliseconds

You’ll see that the third approach is in both cases the fastest one. But what is even more impressive is how much the “non-optimal” first and second approaches can gain from the index.

Just for completeness: you can replace MAX() with various TOP methods:

SELECT ArticleGroup, Price, COUNT(*) AS Number
  FROM #t a
 GROUP BY ArticleGroup, Price
  HAVING COUNT(*) =
   (SELECT TOP 1 COUNT(*)
      FROM #t
     WHERE a.ArticleGroup = ArticleGroup
     GROUP BY ArticleGroup, Price
     ORDER BY COUNT(*) DESC)

Or:

SELECT ArticleGroup, Price, COUNT(*) AS Number
  FROM #t t
 GROUP BY ArticleGroup, Price
  HAVING COUNT(*) =
  (SELECT TOP 1 Number
     FROM
      (SELECT TOP 100 PERCENT ArticleGroup, Price, COUNT(*) AS Number
         FROM #t
        GROUP BY ArticleGroup, Price
         ORDER BY Number DESC) x
        WHERE ArticleGroup = t.ArticleGroup)

These will also yield the correct results but they don’t give any significant performance benefit.

When you’re on SQL Server 2005, you can take advantage of its new ranking functions:

SELECT ArticleGroup, Price, Cnt
  FROM
  (SELECT ArticleGroup, Price, COUNT(*) as Cnt,
   RANK() OVER (PARTITION BY ArticleGroup
   ORDER BY COUNT(*) DESC) AS Rnk
  FROM #t
  GROUP BY ArticleGroup, Price
  ) x
 WHERE Rnk = 1

]]>

Leave a comment

Your email address will not be published.