Distinct vs. Group By | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Distinct vs. Group By

I’ll bet your paycheck this thread has been posted before. But hey, repetition is a good thing… I hope? How does SQL2k handle the distinct keyword? Does SQL filter the duplicates on the fly? Does it return the entire result set and then filter the duplicates? Or is it something else? Conversely how does SQL handle a Goup BY? I would like to think there is a difference between the two other than handling aggregates. In addition I am interested in a more in depth technical explanation. Therefore, routing me to the BOL is not the desired result. Thanks in advance, chili
Excerpt from Inside SQL Server 2K- GROUP BY Operations Prior to SQL Server 7, a GROUP BY operation was processed in only one way: SQL Server sorted the data (after applying any SARGs) and then formed the groups from the sorted data. SQL programmers noticed that queries involving grouping returned results in sorted order, although this was merely a by-product of the internal grouping mechanism. In SQL Server 2000, the query optimizer can choose to use hashing to organize the data into groups and compute aggregates. The pseudocode looks like this: For Each Record in the Input
Determine the hash bucket
Scan the hash bucket for matches
If a match exists
Aggregate the new record into the old
Drop the new record
Otherwise
Insert the record into the bucket
Scan and Output the Hash Table If the query optimizer chooses to use hashing to process the GROUP BY, the data does not come back in any predictable order. The order actually depends on the order in which records appear in the hash buckets, but without knowing the specific hash function, you can’t predict this order. If you use a lot of GROUP BY queries, you might be surprised that sometimes the results come back sorted and sometimes they don’t. If you absolutely need your data in a particular order, you should use ORDER BY in your query. NOTE
——————————————————————————–
If a database has its compatibility level flag set to 60 or 65, the query processor automatically includes an ORDER BY in the query. This allows you to mimic the behavior of older SQL Server versions. DISTINCT Operations As with GROUP BY queries, the only way to remove duplicates before SQL Server 7 was to first sort the data. In SQL Server 2000, the query optimizer can choose to use hashing to return only the distinct result rows. The algorithm is much like the algorithm for hashing with GROUP BY, except that an aggregate does not need to be maintained: For Each Record in the Input
Determine the hash bucket
Scan the hash bucket for matches
If a match exists
Drop the new record
Otherwise
Insert the record into the bucket
and return the record as output HTH. Gaurav
Moderator
Man thrives, oddly enough, only in the presence of a challenging environment- L. Ron Hubbard

Thanks for the info, What I gather from this article is there is little performance difference or impact between the two? -chili

Almost… Depends on the information that you are looking. If you don’t need the aggregate info, use DISTINCT Gaurav
Moderator
Man thrives, oddly enough, only in the presence of a challenging environment- L. Ron Hubbard

]]>