SQL Server Performance

Distinct vs. Group By

Discussion in 'T-SQL Performance Tuning for Developers' started by chilisauce, Sep 10, 2003.

  1. chilisauce New Member

    I'll bet your paycheck this thread has been posted before. But hey, repetition is a good thing... I hope?

    How does SQL2k handle the distinct keyword? Does SQL filter the duplicates on the fly? Does it return the entire result set and then filter the duplicates? Or is it something else? Conversely how does SQL handle a Goup BY?

    I would like to think there is a difference between the two other than handling aggregates. In addition I am interested in a more in depth technical explanation. Therefore, routing me to the BOL is not the desired result.

    Thanks in advance,

    chili
  2. gaurav_bindlish New Member

    Excerpt from Inside SQL Server 2K-

    GROUP BY Operations

    Prior to SQL Server 7, a GROUP BY operation was processed in only one way: SQL Server sorted the data (after applying any SARGs) and then formed the groups from the sorted data. SQL programmers noticed that queries involving grouping returned results in sorted order, although this was merely a by-product of the internal grouping mechanism. In SQL Server 2000, the query optimizer can choose to use hashing to organize the data into groups and compute aggregates. The pseudocode looks like this:

    For Each Record in the Input
    Determine the hash bucket
    Scan the hash bucket for matches
    If a match exists
    Aggregate the new record into the old
    Drop the new record
    Otherwise
    Insert the record into the bucket
    Scan and Output the Hash Table

    If the query optimizer chooses to use hashing to process the GROUP BY, the data does not come back in any predictable order. The order actually depends on the order in which records appear in the hash buckets, but without knowing the specific hash function, you can't predict this order. If you use a lot of GROUP BY queries, you might be surprised that sometimes the results come back sorted and sometimes they don't. If you absolutely need your data in a particular order, you should use ORDER BY in your query.

    NOTE
    --------------------------------------------------------------------------------
    If a database has its compatibility level flag set to 60 or 65, the query processor automatically includes an ORDER BY in the query. This allows you to mimic the behavior of older SQL Server versions.

    DISTINCT Operations

    As with GROUP BY queries, the only way to remove duplicates before SQL Server 7 was to first sort the data. In SQL Server 2000, the query optimizer can choose to use hashing to return only the distinct result rows. The algorithm is much like the algorithm for hashing with GROUP BY, except that an aggregate does not need to be maintained:

    For Each Record in the Input
    Determine the hash bucket
    Scan the hash bucket for matches
    If a match exists
    Drop the new record
    Otherwise
    Insert the record into the bucket
    and return the record as output

    HTH.

    Gaurav
    Moderator
    Man thrives, oddly enough, only in the presence of a challenging environment- L. Ron Hubbard
  3. chilisauce New Member

    Thanks for the info,

    What I gather from this article is there is little performance difference or impact between the two?

    -chili
  4. gaurav_bindlish New Member

    Almost... Depends on the information that you are looking. If you don't need the aggregate info, use DISTINCT

    Gaurav
    Moderator
    Man thrives, oddly enough, only in the presence of a challenging environment- L. Ron Hubbard

Share This Page