SQL Server Optimizer, Under the Hood – Partial Aggregates
Abstract: In this series of short articles, we lift the hood of the SQL Server Optimizer to examine a few of the many clever tricks used to optimize query performance. You’ll see that the Optimizer does not limit itself to using only the instructions provided by the written query syntax, but can correctly deduce a more efficient approach from information it gleans from the schema, the query and data statistical information.
In Part I, we look at partial aggregate operators – an extremely clever way of downsizing certain large joins into much smaller (and faster) joins.
In my previous series of articles, we examined the three different physical join operators (Nested Loops, Merge and Hash Match) and saw how the Optimizer chooses which is best to use for different types of queries and for varying data patterns. In this series, we will take things one step further and describe some of the other tricks the Optimizer has stored up its sleeve for us when it comes to optimizing query performance. The first sleight-of-hand, known as partial aggregation, is a technique commonly used for joins with particular characteristics. An example follows.
Note : To execute this article’s demo code you will need to download the AdvertureWorks2008R2 database from CodePlex. However, this article is written to be a full explanation without using the sample database or code.
So without further ado, let’s jump right in to the mysterious maze of the Optimizer’s mind.
Consider the following query:
USE AdventureWorks2008R2 GO -- Number of Orders Per Sales Person Name SELECT P.FirstName, P.LastName, COUNT(*) AS [Number Of Orders] FROM Person.Person AS P INNER JOIN Sales.SalesOrderHeader AS SOH ON P.BusinessEntityID = SOH.SalesPersonID GROUP BY P.FirstName, P.LastName ORDER BY [Number Of Orders] DESC GO
The query itself is quite simple and straightforward. It produces a list of names of sales people and their order counts, arranged so that the highest-producing sales person name is listed first.
It joins the Person table with the SalesOrderHeader table, groups the result by the sales people’s first name and last name, and counts the number of orders each sales person has obtained. Note that there is a potential logical issue with this query as grouping by first name and last name does not guarantee uniqueness of sales person because two sales people may share the same name. This may lead to potentially confusing results when “John Smith” is only listed once in the result set, and “his/hers” sales look much larger than either of them expected. But because the AdventureWorks database does not contain identical first and last names, we’ll ignore this issue for now.
At first glance, we might expect the Optimizer to take a straightforward approach to execute this simple query:
1. Join the two tables.
2. Group by the first name and last name columns.
3. Count the number of orders in each group.
Obvious enough, isn’t it?
Well, let’s take a look at the actual execution plan that was used, in the figure below.
Figure 1 – Actual Execution Plan
The execution plan seems to be exactly as expected, except for the Stream Aggregate operator  which follows the Index Scan operator  on the SalesOrderHeader Table. What is it doing there? It can’t aggregate order counts by First Name and Last Name quite yet… the First Name and Last Name won’t be available until after the join is complete, because those columns are from the Person table which has not been accessed yet… What the heck is going on?