SQL Server Performance

Please help with the hash join performance

Discussion in 'ALL SQL SERVER QUESTIONS' started by xiebo2010cx, Sep 27, 2013.

  1. xiebo2010cx Member

    SQL Gurus, please help with my prod SQL join performance. it is really killing me...

    Table combineddata has 15 million rows
    Table tblSpreadsStage1 has 70million rows

    The below join takes too long, checked the execution plan, HASH MATCH takes 90% execution time.

    1. already have index on join columns and where columns, want to seek the best ideas from Gurus on indexing here.
    2. how can I re-write the the query to make it more efficient. now HASH JOIN takes 90% execution time...

    SQL query plan is attached, please change to .sqlplan post-fix before opening it.

    Thank you so, so much!!!

    SELECT co1.cusip ,
    AVG(co1.Yield) AS cusip1avgyield,
    STDEV(co1.Yield) AS cusip1stddev,

    co2.cusip AS cusip2 ,
    AVG(co2.Yield) AS cusip2avgyield,
    STDEV(co2.Yield) AS cusip2stddev,

    COUNT(co1.cusip + co2.cusip) AS datacount ,
    AVG(co1.Yield - co2.Yield) AS average ,

    CASE WHEN STDEV(co1.Yield - co2.Yield) = 0
    THEN .0000000000000001
    ELSE ISNULL(STDEV(co1.Yield - co2.Yield),
    .0000000000000001)
    END AS standarddev

    INTO [tblSpreadsStage2]
    FROM dbo.combineddata co1 ,
    dbo.combineddata co2 ,
    dbo.tblSpreadsStage1 o
    WHERE co1.Date = co2.date
    AND co1.cusip = o.cusip
    AND co2.Cusip = o.cusip2
    AND co1.yield IS NOT NULL
    AND co2.yield IS NOT NULL
    AND co1.Yield != 0
    AND co2.Yield != 0
    -- and co1.Yield != co2.Yield
    GROUP BY co1.cusip ,
    co2.cusip
  2. davidfarr Member

    It is difficult to evaluate that query without having access to your actual data to view some samples of the values in each table, to check for duplication, etc.

    How long does the query currently run on your server?
    You mention 70 million records and 15 million records, 3 joins involved including the 15 million table joining on itself. That will take a few minutes on most servers.

    If you have a fairly good server and you are waiting more than 5 minutes for the query to complete then I suspect that the columns that you are joining on do not contain a unique combination of values.

    Will cusip and cusip2 in tblSpreadsStage1 always contain unique values that are different from each other?
    Do the fields for Date and Cusip in Combineddata table always contain a unique combination of values?
    The join of "co1.Date = co2.date" seems especially risky because that table is joining on itself and a date field in such a large table is especially prone to having duplicate values somewhere.

    If relevant join columns contain duplicate row values on both tables participating in the join then the execution will multiply excessively (eg, 4 duplicates = 16 joins) and greatly slow the execution.
  3. xiebo2010cx Member

    thank you so much for your time, David,I really appreciate it.

    The query generally runs for 2 hours. I am tasked to fine tune it, either by indexing, or re-write the query, For indexing part, seems it has all necessary indexes on the columns.

    The combination of cusip and cusip2 in tblSpreadsStage1 are unique values.

    The combination of Date and Cusip in Combineddata table are also unique values.

    The Date column is smalldatetime, has a lot of duplicate dates, for each date value, there are around 14K duplicates, that is the part.

    Thank you so much, please let me know if you need more info...
  4. davidfarr Member

    I will try to make time later in the week to try work out a more suitable query for this (my own work schedule is a bit heavy this week);
    but I am already fairly sure that the "co1.Date = co2.date" join is the main cause of the slow execution. At 14K duplicates, joining on itself, that join is potentially executing up to 196 million times (14K x 14K) per date. A 2 hour execution is definitely indicating a problem with the query.

    I accept that the date cannot simply be removed from the query, because I can see that you need to group the Yields for cusip1 and cusip2 for the same date.
    A better way to apply this date join will require some further thought, I hope to be able to reply again at a later date.
  5. davidfarr Member

    I have put this suggestion together 'blindly', because I have no way to test the query or its actual performance on your data. It is still a long-running query, but hopefully a lot less than 2 hours.
    Code:
    --drop table #temp1
    --drop table tempStage3
    
    select distinct [Date] into #temp1 from combineddata
    GO
    select distinct o.cusip, o.cusip2, t1.[Date] into tempStage3
    from tblSpreadsStage1 o
    cross join #temp1 t1
    --This is a large table result, but will help with join execution later.
    --The purpose of this to create a unique combination of cusip,cusip2 and Date
    --so that the next join is a one-to-many and not a many-to-many.
    --You could also include this part earlier when creating the tblSpreadsStage1 table.
    GO
    create nonclustered index IX_cusip on tempStage3 (cusip)
    create nonclustered index IX_cusip2 on tempStage3 (cusip2)
    create nonclustered index IX_Date on tempStage3 ([Date])
    GO
    
    SELECT t1.cusip ,
    AVG(co1.Yield) AS cusip1avgyield,
    STDEV(co1.Yield) AS cusip1stddev,
    
    t1.cusip2 ,
    AVG(co2.Yield) AS cusip2avgyield,
    STDEV(co2.Yield) AS cusip2stddev,
    
    COUNT(co1.cusip + co2.cusip) AS datacount,
    AVG(co1.Yield - co2.Yield) AS average ,
    
    CASE WHEN ISNULL(STDEV(co1.Yield - co2.Yield),0) = 0
    THEN .0000000000000001
    ELSE STDEV(co1.Yield - co2.Yield)
    END AS standarddev
    
    INTO [tblSpreadsStage2]
    FROM dbo.tempStage3 t1,
    dbo.combineddata co1 ,
    dbo.combineddata co2
    WHERE t1.cusip = co1.cusip AND t1.[Date] = co1.[Date]
    AND t1.cusip2 = co2.cusip AND t1.[Date] = co2.[Date]
    AND co1.Yield != 0
    AND co2.Yield != 0
    -- and co1.Yield != co2.Yield
    GROUP BY t1.cusip,t1.cusip2
    
    Also; because this query above is now joining on both [cusip] and [Date] in table [combineddata], a composite index on those columns will help too.
    create nonclustered index IX_cusipdate on combineddata (cusip,[date]);
  6. AJITH123 Member

    There are twio things, first needs to check the server configuration, server may need more physical memory ans see teh allocated memmory for SQL server. Second is, when the table indexes are last rebuit and is there any block in these tables. Are you put any table hint while accessing the table data? also if you are dealing with millions of data, try to make it as smaller chunk than bulk.

Share This Page