Physical Join Operators in SQL Server – Hash Operator
Hash Match used when multiple lookups are the 2nd best alternative
In Query 9 below, both inputs are not small enough (P ~500 rows, SOD ~120,000 rows) to make nested loops efficient. It’s interesting to see that even though Product ID is indexed, nested loops will probably not be a good choice here. The main reason is that the index on Product ID does not cover the query and OrderQty will need to be looked up for each order which sums up to ~120,000 lookups.
![]() |
Query 9

*Exercise: In the demo code, you will find the same query without the OrderQty column in the select list (Query 9b). Try to guess what join operator will the optimizer choose when the need for lookups is eliminated? Try it…
Physical vs. Logical scan order
I would also like to draw your attention to another interesting property of the execution plan. If you look at the properties of either index scan above, you will see the value ‘false’ for the scan property ‘ordered’.
This means that the storage engine is not required to follow the logical chain links between the index leaf pages that point to the ‘next’ and ‘previous’ pages. Since the order of the retrieval of the rows is of no significance to the hash match operator, the storage engine will optimize the scan performance by scanning the index pages in their physical order (by following the IAM pages). This might prove to be a significant performance gain, especially for indexes with a high level of logical fragmentation. Go back and look at this property for index scans of the execution plans of the merge examples above.
Another thing we should note about hash joins is the fact that, in contrast to both nested loops and merge operators where the join operator may immediately start outputting joined rows for further processing, no rows can be outputted when using hash match until the whole ‘blue input’ is fully hashed.
Summary
In a nutshell, we can sum up the main properties of the three physical operators available to the SQL Server optimizer when carrying out Equi-Inner-Joins.
|
|
Nested Loops |
Merge |
Hash |
|
Good choice when |
Small outer input Inner input well indexed |
Large pre-sorted inputs Sorting required anyway |
Large inputs Inputs not well indexed |
|
CPU consumption |
Low |
Low *Unless requires sorting |
High |
|
Memory usage |
Low |
Low *Unless requires sorting |
High |
|
Logical reads |
High |
Low |
Low * ‘Hidden’ cost of probes |
|
Output matched rows |
Fast |
Fast |
Slow |
I hope this short discussion of the seemingly simple SQL construct raised your curiosity. There are many more aspects and considerations to joins which were not even mentioned in this article. This is a fascinating and highly complex subject. If you are interested to dive deeper and learn more on joins and their implementation in SQL Server, I highly recommend Craig Freedman’s SQL Server Blog on MSDN. Craig has published a series of excellent, in depth articles regarding many more aspects and types of joins.
I wish to thank my Editor, Bill Griffin for the excellent editing of this article and to Michael Zilberstein for a great technical review.
Pages: 1 2




First, I want to say I am a developer. Second, I am trying to understand Query 8. Why did the Optimizer choose IX_SalesOrderDetail_ProductID when that field isn’t being used or returned? Does it have to do with the way the secret sauce of the HASH function? Meaning, it saw that was the best index to use for hashing?
Hi Reece,
Very good question, and like all good questions, the answer is simple
You are correct that the ProductID is not needed for this query.
The only thing that is needed from the SalesOrderDetail table is a list of all SaleOrderIDs. The optimizer has several options to get it:
1. Scan the entire table using a clustered index scan = retrieving huge amount of unnecessary data = unnecessary IO and unnecessary memory usage.
2. Get the IDs by scanning a non-clustered index. Since SalesOrderID is part of the clustered index (that supports the primary key), it is included in every non-clustered index and used as the row pointer.
There are two non-clustered indexes on the table, one on ProductID and one on RowGUID. ProductID is a 4 byte INT, RowGUID is a 16 byte unique identifier.
Which index would you scan?
The optimizer chose the smaller one and simply discarded the ProductID.
You can see this in the execution plan if you look at the properties of the index scan operator, you will see that the output list consists only of SaleOrderID.
Try to create a non-clustered index just on SalesOrderID and see how that changes the plan.
Ami