Physical Join Operators in SQL Server – Hash Operator


Hash Match used when multiple lookups are the 2nd best alternative

In Query 9 below, both inputs are not small enough (P ~500 rows, SOD ~120,000 rows) to make nested loops efficient. It’s interesting to see that even though Product ID is indexed, nested loops will probably not be a good choice here. The main reason is that the index on Product ID does not cover the query and OrderQty will need to be looked up for each order which sums up to ~120,000 lookups.

Text Box: SELECT	SOD.SalesOrderID,
            P.Name,
            P.color,
            SOD.OrderQty
            FROM		Production.Product AS P
            INNER JOIN
            Sales.SalesOrderDetail AS SOD
            ON P.ProductID = SOD.ProductID
 

 


Query 9

Execution Plan 9

*Exercise: In the demo code, you will find the same query without the OrderQty column in the select list (Query 9b). Try to guess what join operator will the optimizer choose when the need for lookups is eliminated? Try it…


Physical vs. Logical scan order

I would also like to draw your attention to another interesting property of the execution plan. If you look at the properties of either index scan above, you will see the value ‘false’ for the scan property ‘ordered’.

This means that the storage engine is not required to follow the logical chain links between the index leaf pages that point to the ‘next’ and ‘previous’ pages. Since the order of the retrieval of the rows is of no significance to the hash match operator, the storage engine will optimize the scan performance by scanning the index pages in their physical order (by following the IAM pages). This might prove to be a significant performance gain, especially for indexes with a high level of logical fragmentation. Go back and look at this property for index scans of the execution plans of the merge examples above.

Another thing we should note about hash joins is the fact that, in contrast to both nested loops and merge operators where the join operator may immediately start outputting joined rows for further processing, no rows can be outputted when using hash match until the whole ‘blue input’ is fully hashed.

 

Summary

In a nutshell, we can sum up the main properties of the three physical operators available to the SQL Server optimizer when carrying out Equi-Inner-Joins.

 

Nested Loops

Merge

Hash

Good choice when

Small outer input

Inner input well indexed

Large pre-sorted inputs

Sorting required anyway

Large inputs

Inputs not well indexed

CPU consumption

Low

Low

*Unless requires sorting

High

Memory usage

Low

Low

*Unless requires sorting

High

Logical reads

High

Low

Low

* ‘Hidden’ cost of probes

Output matched rows

Fast

Fast

Slow

I hope this short discussion of the seemingly simple SQL construct raised your curiosity. There are many more aspects and considerations to joins which were not even mentioned in this article. This is a fascinating and highly complex subject. If you are interested to dive deeper and learn more on joins and their implementation in SQL Server, I highly recommend Craig Freedman’s SQL Server Blog on MSDN. Craig has published a series of excellent, in depth articles regarding many more aspects and types of joins.

I wish to thank my Editor, Bill Griffin for the excellent editing of this article and to Michael Zilberstein for a great technical review.

Pages: 1 2




Related Articles :

  • No Related Articles Found

2 Responses to “Physical Join Operators in SQL Server – Hash Operator”

  1. First, I want to say I am a developer. Second, I am trying to understand Query 8. Why did the Optimizer choose IX_SalesOrderDetail_ProductID when that field isn’t being used or returned? Does it have to do with the way the secret sauce of the HASH function? Meaning, it saw that was the best index to use for hashing?

  2. Hi Reece,
    Very good question, and like all good questions, the answer is simple :-)
    You are correct that the ProductID is not needed for this query.
    The only thing that is needed from the SalesOrderDetail table is a list of all SaleOrderIDs. The optimizer has several options to get it:
    1. Scan the entire table using a clustered index scan = retrieving huge amount of unnecessary data = unnecessary IO and unnecessary memory usage.
    2. Get the IDs by scanning a non-clustered index. Since SalesOrderID is part of the clustered index (that supports the primary key), it is included in every non-clustered index and used as the row pointer.
    There are two non-clustered indexes on the table, one on ProductID and one on RowGUID. ProductID is a 4 byte INT, RowGUID is a 16 byte unique identifier.
    Which index would you scan?
    The optimizer chose the smaller one and simply discarded the ProductID.
    You can see this in the execution plan if you look at the properties of the index scan operator, you will see that the output list consists only of SaleOrderID.
    Try to create a non-clustered index just on SalesOrderID and see how that changes the plan.

    Ami

Software Reviews | Book Reviews | FAQs | Tips | Articles | Performance Tuning | Audit | BI | Clustering | Developer | Reporting | DBA | ASP.NET Ado | Views tips | | Developer FAQs | Replication Tips | OS Tips | Misc Tips | Index Tuning Tips | Hints Tips | High Availability Tips | Hardware Tips | ETL Tips | Components Tips | Configuration Tips | App Dev Tips | OLAP Tips | Admin Tips | Software Reviews | Error | Clustering FAQs | Performance Tuning FAQs | DBA FAQs |