Can some clever spark out there explain why, when I have a sorted stream (eg. using clustered index) which I then perform a hash join, result in an unordered stream? In the example below, basket table and basket_item table both have clustered indexed sorted by basket_id asc. without the inner hash join between product and basket, the query is happy merge joining without requiring a sort. As soon as I put the the hash join in it performs the hash join on the sorted stream (basket_item) - but then has to sort the resultant stream before merge joining to the basket table. I have over 800M rows in the basket table - so this is not a good plan! The query below is a little contrived to demonstrate the problem - my "real" query does not involve taking the min(major_department_code) - but if anyone has any idea how to remove this redundant sort, then please let me know. I have considered that it may have been to do with a parallel process, but seting MAXDOP to 1 does not resolve the query either. I'm sure there may be a technical reason for this, but the hash join process is described as "use each row from the top input to build a hash table, and each row in the bottom to probe into the hash table, outputing all matched results" - my "top" input is the product table - and my bottom query is the basket_item - so why, if the bottom is sorted, would it want to re-sort itself after the hash join? Code: select bb.basket_id , bi.min_mdc from mart.basket bb left merge join ( select a.basket_id , min_mdc = min(p.major_department_code) from mart.product p inner hash join mart.basket_item a on p.product_id = a.product_id group by a.basket_id ) bi on bi.basket_id = bb.basket_id
You are using an Index Scan here, and thus it is a VERY efficient query to begin with since it isn't even touching the base table at all. And a Hash is still probably the most efficient way to resolve this query. Try forcing a LOOP JOIN and MERGE JOIN to see if it changes.