SQL Server Performance

Why DISTINCT takes so long time?

Discussion in 'T-SQL Performance Tuning for Developers' started by Jelly0228, Nov 17, 2008.

  1. Jelly0228 New Member

    Hi all,
    Could you help me check following query code? It takes 9 seconds to return the result, but rows of it is just 2. If I remove DISTINCT, the query will return an output with 3 rows, and the runing time is less than 1 second! So, it seems that SQL Server takes more than 8 seconds to remove the duplicate record from 3 rows to 2 rows.
    ---------------------------------------a begin---------------------
    select distinct b.demo_no,b.artic_no
    from
    (
    select podr_no from xmku_miss
    where cdj_date='2008/11/11'
    and sec_no='30013'
    union
    select podr_no from btjk_miss
    where jk_date='2008/11/11'
    and sec_no='30013' ) a
    join pagodrm_miss b on a.podr_no=b.podr_no
    ---------------------------------------a end---------------------
    I have another test here:
    --------------------------------------------------------------
    select podr_no into #tt from xmku_miss
    where cdj_date='2008/11/11'
    and sec_no='30013'
    union
    select podr_no from btjk_miss
    where jk_date='2008/11/11'
    and sec_no='30013'
    select distinct b.demo_no,b.artic_no
    from
    #tt a join pagodrm_miss b on a.podr_no=b.podr_no
    ------------------------------------------------------------------------------
    This SQL statement will return correct data within 1 second. Could you help me on this issue?
  2. Jelly0228 New Member

    My sql server version is SQL Server 2000 Enterprise Edtion SP4. Thanks for any advices.
  3. atulmar New Member

    I am not sure why your query is taking long time, but I have one workaround, store your intermediate result into table variable (as you result is not huge), and select distinct from table variable.
  4. TommCatt New Member

    One reason DISTINCT should be used only when absolutely necessary is that the result set must be sorted in order to find and remove duplicates. While that doesn't seem like that would make a whole lot of difference in your case as the result set is only three rows, the optimizer doesn't know ahead of time how many there will be. I'm sure it must plan for the possibility that there could be 1000's of rows.One thing you may want to do is compare the execution plans of both queries and see where they start to differ.
  5. Jelly0228 New Member

    [quote user="TommCatt"]One reason DISTINCT should be used only when absolutely necessary is that the result set must be sorted in order to find and remove duplicates. While that doesn't seem like that would make a whole lot of difference in your case as the result set is only three rows, the optimizer doesn't know ahead of time how many there will be. I'm sure it must plan for the possibility that there could be 1000's of rows.
    One thing you may want to do is compare the execution plans of both queries and see where they start to differ.
    [/quote]
    Thanks.
    According to your suggestion, I compared the two exection plans this morning, but I still couldn't find why DISTINCT takes so long time. Following are the two execution plans, Is there anybody has ideas?
    1. Execution Plan with DISTINCT (14 rows)
    select distinct b.demo_no,artic_no from ( select podr_no from xmku_miss where cdj_date='2008/11/11' and sec_no='30013' union select podr_no from btjk_miss where jk_date='2008/11/11' and sec_no='30013' ) a join pagodrm_miss b on a.podr_no=b.pod
    |--Sort(DISTINCT ORDER BY:(.[demo_no] ASC, .[artic_no] ASC))
    |--Bookmark Lookup(BOOKMARK:([Bmk1005]), OBJECT:([Prod].[dbo].[pagodrm_miss] AS ))
    |--Parallelism(Gather Streams)
    |--Nested Loops(Inner Join, OUTER REFERENCES:([Union1004]))
    |--Stream Aggregate(GROUP BY:([Union1004]))
    | |--Parallelism(Repartition Streams, PARTITION COLUMNS:([Union1004]), ORDER BY:([Union1004] ASC))
    | |--Stream Aggregate(GROUP BY:([Union1004]))
    | |--Sort(ORDER BY:([Union1004] ASC))
    | |--Concatenation
    | |--Index Scan(OBJECT:([Prod].[dbo].[xmku_miss].[PK_xmku_miss]), WHERE:([xmku_miss].[cdj_date]='2008/11/11' AND [xmku_miss].[sec_no]='30013'))
    | |--Bookmark Lookup(BOOKMARK:([Bmk1002]), OBJECT:([Prod].[dbo].[btjk_miss]))
    | |--Index Scan(OBJECT:([Prod].[dbo].[btjk_miss].[IDX_BTJK_MISS_BTL_NO]), WHERE:([btjk_miss].[jk_date]='2008/11/11' AND [btjk_miss].[sec_no]='30013'))
    |--Index Seek(OBJECT:([Prod].[dbo].[pagodrm_miss].[PK_pagodrm_miss] AS ), SEEK:(.[podr_no]=[Union1004]) ORDERED FORWARD)
    2. Execution Plan without DISTINCT (13 rows)
    select b.demo_no,artic_no from ( select podr_no from xmku_miss where cdj_date='2008/11/11' and sec_no='30013' union select podr_no from btjk_miss where jk_date='2008/11/11' and sec_no='30013' ) a join pagodrm_miss b on a.podr_no=b.podr_no
    |--Bookmark Lookup(BOOKMARK:([Bmk1005]), OBJECT:([Prod].[dbo].[pagodrm_miss] AS ))
    |--Parallelism(Gather Streams)
    |--Nested Loops(Inner Join, OUTER REFERENCES:([Union1004]))
    |--Stream Aggregate(GROUP BY:([Union1004]))
    | |--Parallelism(Repartition Streams, PARTITION COLUMNS:([Union1004]), ORDER BY:([Union1004] ASC))
    | |--Stream Aggregate(GROUP BY:([Union1004]))
    | |--Sort(ORDER BY:([Union1004] ASC))
    | |--Concatenation
    | |--Index Scan(OBJECT:([Prod].[dbo].[xmku_miss].[PK_xmku_miss]), WHERE:([xmku_miss].[cdj_date]='2008/11/11' AND [xmku_miss].[sec_no]='30013'))
    | |--Bookmark Lookup(BOOKMARK:([Bmk1002]), OBJECT:([Prod].[dbo].[btjk_miss]))
    | |--Index Scan(OBJECT:([Prod].[dbo].[btjk_miss].[IDX_BTJK_MISS_BTL_NO]), WHERE:([btjk_miss].[jk_date]='2008/11/11' AND [btjk_miss].[sec_no]='30013'))
    |--Index Seek(OBJECT:([Prod].[dbo].[pagodrm_miss].[PK_pagodrm_miss] AS ), SEEK:(.[podr_no]=[Union1004]) ORDERED FORWARD)
  6. Jelly0228 New Member

    My god, Icon Beer stands for TABLE B.
  7. FrankKalis Moderator

    [quote user="Jelly0228"]
    My god, Icon Beer stands for TABLE B.
    [/quote]
    Interesting. I didn't know that. [:D]
  8. Jelly0228 New Member

    Hi all, I found a solution to this issue, but I still couldn't explain why. Comparing code 1 and code 2, you will find that the only difference is UNION and UNION ALL. Any one has ideas about this?
    SQL CODE 1: (took 1 second)
    select distinct b.demo_no,artic_no
    from
    (
    select podr_no from xmku_miss
    where cdj_date='2008/11/11'
    and sec_no='30013'
    union all
    select podr_no from btjk_miss
    where jk_date='2008/11/11'
    and sec_no='30013' ) a
    join pagodrm_miss b on a.podr_no=b.podr_no
    SQL CODE 2: (took 9 second)
    select distinct b.demo_no,artic_no
    from
    (
    select podr_no from xmku_miss
    where cdj_date='2008/11/11'
    and sec_no='30013'
    union
    select podr_no from btjk_miss
    where jk_date='2008/11/11'
    and sec_no='30013' ) a
    join pagodrm_miss b on a.podr_no=b.podr_no

  9. Adriaan New Member

    You need to read up on T-SQL basics ...
    UNION ALL returns all rows without checking for uniqueness.
    UNION eliminates duplicates from the results. This is an additional step in the process (compare the execution plans) and therefore takes more time than UNION ALL.
  10. jagblue New Member

Share This Page