For this article series I am using an analogy of two sets of standard playing cards. One set with a blue back and another with a red back that need to be ‘joined’ together according to various join conditions. The two sets of cards simply represent the rows from the two joined tables, the red input and the blue input – “Now Neo – which input do you choose”?
A merge operator can be used only when both sets of rows are pre-sorted according to the join expression(s). For example, a Product table index and Order Detail table index both sorted by Product ID (recall Execution Plan 3 from the Nested Loops articles). The algorithm is extremely simple, elegant, and efficient:
Flowchart 2 – Merge
Since the rows are sorted according to the join expression(s), we can immediately begin the matching process. Simply get the first row from the blue input and the first row from the red input. If they match, output them and continue to the next row from the red input. If not, fetch the next row from the blue input and repeat the processes until all rows from the blue input have been processed. If we were to join our cards this way, we would first lay both sets on the table, sorted according to the join condition, let’s say suit and rank. We will probably spread them one row below the other and simply start picking the cards, moving from left to right on both rows and matching as we progress. Of course, we could decide to sort them just for this purpose even if they weren’t sorted to begin with.
Merge is a highly efficient operator
The merge join is probably the most efficient of all three operators. It combines the advantage of hash match where the actual data needs to be accessed only once with the advantages of nested loops – low CPU consumption and enabling of fast output of matched rows for further query processing. Moreover, it tops them both by eliminating the potential for a-sequential page flushes. Since the days of SQL Server 6.5, I have witnessed how the query optimizer tends to favor merge joins more and more with every new release. In SQL 2000, we would see merge joins almost exclusively for joins that had the appropriately pre-sorted indexes, and for queries that included an ORDER BY clause that required the sort. In SQL 2008 the optimizer cleverly realizes that the advantages of this join http://armodexperiment.com/hordenine operator justify pre-sorting of one or both inputs just for the sake of using merge in many more cases than previous versions did.
Let’s check out a few examples that use the Merge join operator.
Merge join with Clustered Indexes
The most obvious example, as can be seen in Query 5 , is a join that uses the keys of the clustered indexes of both tables. Since a clustered index is actually the table itself, sorted in the order of the clustered index key(s), the clustered index covers all queries and can be used to retrieve any (non BLOB) column in the table you specify in your SELECT clause. Even the SELECT * in the query below will not require any additional table lookups. But remember that there is still a penalty to pay for retrieving all columns unnecessarily as both tables will need to be fully loaded into memory and sent over the network.
*Challenge #2: There are no expression computations in this query. Try to guess what the ‘compute scalar’ operators in Execution Plan 5 above are for… Hint: The answer is at the columns node of these tables in SSMS object explorer.