SQL Server Performance

Query Performance VLDB

Discussion in 'T-SQL Performance Tuning for Developers' started by mcatet, Aug 3, 2004.

  1. mcatet New Member

    I have a table with about 200 million rows. (The table will be over 500 million rows when all data is loaded.) There is a possibility of getting even more data that would put this table close to 1 billion records.

    I need to query this table, filtering rows by joining 3 other tables.
    The 3 joined tables are a list of Stores (1,418 records), a list of SKUs (833 records), and a list of dates (13 records).

    zItems: the two fields form a composite clustered index
    SKUID INT
    UserID VARCHAR(20)

    zStores: the two fields form a composite clustered index
    StoreID INT
    UserID VARCHAR(20)

    zWDate: the two fields form a composite clustered index
    WDate SMALLDATETIME
    UserID VARCHAR(20)

    Sales: WDate, StoreID, and SKUID form a composite clustered index
    WDate SMALLDATETIME
    StoreID INT
    SKUID INT
    UnitsSold SMALLINT
    ProdSales Real
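
    For reference, in DDL terms the tables look roughly like this. This is only a sketch based on the descriptions above; the index names and NOT NULL constraints are my own assumptions:

    CREATE TABLE dbo.Sales (
        WDate     SMALLDATETIME NOT NULL,
        StoreID   INT           NOT NULL,
        SKUID     INT           NOT NULL,
        UnitsSold SMALLINT      NOT NULL,
        ProdSales REAL          NOT NULL
    )
    CREATE CLUSTERED INDEX CIX_Sales ON dbo.Sales (WDate, StoreID, SKUID)

    CREATE TABLE dbo.zStores (StoreID INT NOT NULL, UserID VARCHAR(20) NOT NULL)
    CREATE UNIQUE CLUSTERED INDEX CIX_zStores ON dbo.zStores (StoreID, UserID)
    -- zItems (SKUID, UserID) and zWDate (WDate, UserID) follow the same pattern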

    Basically, the query totals up the sales data for the stores, items, and dates held in the temp tables. These temp tables are populated by multiple users (hence the UserID field).


    SELECT dbo.Sales.SKUID AS ProdID, SUM(dbo.Sales.UnitsSold) AS UnitsSold, SUM(dbo.Sales.ProdSales) as TotalSales
    FROM dbo.Sales WITH (NOLOCK)
    RIGHT OUTER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
    RIGHT OUTER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
    RIGHT OUTER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
    WHERE (dbo.zWDate.UserID = 'TEST')
    AND (dbo.zStores.UserID = 'TEST')
    AND (dbo.zItems.UserID = 'TEST')
    GROUP BY dbo.Sales.SKUID

    This is running on a server with four 2.8GHz CPUs, 2GB RAM, and 300GB of disk, running SQL Server 2000 Standard on Win2k. This query takes 2.5 minutes to return 847 records. (This is directly on the server, so there is no network lag.) The server is currently not being used, so this query is the only activity on it. Monitoring does not show any problem with CPU, memory, or I/O. The box runs about 30% on 1 CPU and small amounts on the others. SQL is configured to use all CPUs, and the cost threshold for parallelism is at its default of 5.

    I have run the Index Tuning Wizard and it suggested adding a covering index on WDate, StoreID, SKUID, UnitsSold, ProdSales. I added this and it made very little difference (only a few seconds). It added so much space to the DB file that it wasn't worth the few seconds it gained me, so I took it off.
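
    For reference, the suggested covering index would have looked something like this (the index name is my own):

    CREATE INDEX IX_Sales_Covering ON dbo.Sales (WDate, StoreID, SKUID, UnitsSold, ProdSales)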

    Is there anything else I can do to improve the speed of this query or am I asking too much of SQL on this server?

    Any help will be greatly appreciated.

    Thanks

  2. vaxman New Member

    Try adding a HASH JOIN hint:

    SELECT dbo.Sales.SKUID AS ProdID, SUM(dbo.Sales.UnitsSold) AS UnitsSold, SUM(dbo.Sales.ProdSales) as TotalSales
    FROM dbo.Sales WITH (NOLOCK)
    RIGHT OUTER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
    RIGHT OUTER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
    RIGHT OUTER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
    WHERE (dbo.zWDate.UserID = 'TEST')
    AND (dbo.zStores.UserID = 'TEST')
    AND (dbo.zItems.UserID = 'TEST')
    GROUP BY dbo.Sales.SKUID
    OPTION (HASH JOIN)
  3. mcatet New Member

    I have a test version of the DB on my laptop that has 7 million rows in the Sales table.

    Testing the query I posted on this test DB, it took 7 seconds to run.
    Adding the OPTION (HASH JOIN) increased the query time to 13 seconds.
    So I used OPTION (LOOP JOIN) instead, and this cut the query time from 7 seconds to 5 seconds.

    Tomorrow I'll try this on the full database to see what kind of performance boost it will give.

    Every bit helps, but I still need to shave more time off this query. Any more suggestions?

    Thanks vaxman,

    mcatet
  4. Twan New Member

    Hi ya,

    what is it that you're actually trying to achieve?

    The right outer join is cancelled out by the WHERE clause, and having the three tables joined in means that it must match all three tables. Your query at the moment is the same as saying:

    SELECT dbo.Sales.SKUID AS ProdID, SUM(dbo.Sales.UnitsSold) AS UnitsSold, SUM(dbo.Sales.ProdSales) as TotalSales
    FROM dbo.Sales WITH (NOLOCK)
    INNER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
    INNER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
    INNER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
    WHERE (dbo.zWDate.UserID = 'TEST')
    AND (dbo.zStores.UserID = 'TEST')
    AND (dbo.zItems.UserID = 'TEST')
    GROUP BY dbo.Sales.SKUID

    If this is what you intend, then make sure that the indexes on the three smaller tables have UserID first, followed by the key field.
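
    Something along these lines (index names are just examples; you'd drop the existing clustered indexes first):

    CREATE UNIQUE CLUSTERED INDEX CIX_zStores ON dbo.zStores (UserID, StoreID)
    CREATE UNIQUE CLUSTERED INDEX CIX_zItems ON dbo.zItems (UserID, SKUID)
    CREATE UNIQUE CLUSTERED INDEX CIX_zWDate ON dbo.zWDate (UserID, WDate)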

    If this is not what you intend, then perhaps you can give a small example of the data that would be in each table and what result you expect?

    Cheers
    Twan
  5. mcatet New Member

    Basically, what I'm trying to do is sum up UnitsSold and ProdSales for a specific set of dates and for a specific set of StoreIDs and SKUIDs that each user specifies when they run a report. Users run reports simultaneously, so the UserID is needed in order to keep one user's selections from overwriting another's.

    Similar to this query:

    SELECT ...
    FROM Sales
    WHERE (WDate Between '1/1/2004' AND '7/1/2004') AND StoreID IN (1,2,3,4)
    GROUP BY Sales.SKUID
    HAVING SKUID IN (100, 200, 300, 400)

    Except that I have 1,418 StoreIDs and 833 SKUIDs. Using the IN clause is slower and is a pain for development, so I switched to using the joined tables.

    Initially I tried using INNER JOIN, but the RIGHT OUTER JOIN actually gave me the same results in less time. This query runs about 10% faster using RIGHT OUTER instead of INNER.

    This is a sample of what is in each table.

    Sales:

    WDate StoreID SKUID UnitsSold ProdSales SalesType
    4/10/2004 1 100 5 10.00 1
    4/10/2004 1 200 1 2.00 1
    4/10/2004 2 100 3 6.00 1
    4/10/2004 2 200 3 6.00 1

    zWDate:
    WDate UserID
    4/10/2004 TEST
    4/17/2004 TEST
    4/24/2004 TEST
    5/1/2004 TEST
    4/17/2004 DEVTEST

    zItems:
    SKUID UserID
    100 TEST
    200 TEST
    100 DEVTEST
    200 DEVTEST

    zStores:
    StoreID UserID
    1 TEST
    2 TEST
    3 TEST
    1 DEVTEST
    3 DEVTEST
    5 DEVTEST

    Each of the joined tables (z...) has a composite key consisting of both fields to make it unique. I had to do this because each user runs reports that may have different lists of stores, SKUs, and dates.

    So the query is using UserID to filter the records in zWDate, zItems, and zStores in order to get the list of dates, StoreIDs, and SKUIDs, which is the key to the Sales table.

    I tried ordering the fields with UserID first but it made no difference in the execution time.

    Thanks,
    mcatet
  6. Twan New Member

    Alrighty then, have you tried:

    SELECT dbo.Sales.SKUID AS ProdID, SUM(dbo.Sales.UnitsSold) AS UnitsSold, SUM(dbo.Sales.ProdSales) as TotalSales
    FROM dbo.Sales WITH (NOLOCK)
    WHERE dbo.Sales.StoreID in (select dbo.zStores.StoreID from dbo.zStores where dbo.zStores.UserID = 'TEST' )
    and dbo.Sales.SKUID in (select dbo.zItems.SKUID from dbo.zItems where dbo.zItems.UserID = 'TEST' )
    and dbo.Sales.WDate in (select dbo.zWDate.WDate from dbo.zWDate where dbo.zWDate.UserID = 'TEST' )
    GROUP BY dbo.Sales.SKUID

    What are the indexes on the Sales table? This query could benefit from the covering index you mentioned earlier; make sure the most selective fields come first (out of StoreID, SKUID, and WDate).

    What is the execution plan it is using? The fact that it is only using a single CPU, and not very much of it, would seem to indicate that there is physical IO going on, which should imply that running the statement a second time would make a difference...

    What do you get when you run the statement with IO stats on?
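
    That is, something like:

    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    -- ... run the SELECT here ...
    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF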

    Cheers
    Twan
  7. vaxman New Member

    When I tried it before posting, the HASH JOIN resulted in fewer logical reads. Twan's suggestion resulted in the same number of reads as the original. You might also try MERGE JOIN, which results in the fewest number of IO operations but takes more CPU time (at least on SQL 2005). As Twan suggested, SET STATISTICS IO ON (and SET STATISTICS PROFILE ON) and see if that helps to identify bottlenecks.
  8. mcatet New Member

    I now have this query down from 2:24 to 0:35. Still seems like it should run faster though.

    SELECT dbo.Sales.SKUID AS ProdID, SUM(dbo.Sales.UnitsSold) AS UnitsSold, SUM(dbo.Sales.ProdSales) AS TotalSales
    FROM dbo.Sales
    INNER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
    INNER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
    INNER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
    WHERE (dbo.zWDate.UserID = 'TEST')
    AND (dbo.zStores.UserID = 'TEST')
    AND (dbo.zItems.UserID = 'TEST')
    GROUP BY dbo.Sales.SKUID
    OPTION (MERGE JOIN)

    The difference is using MERGE JOIN (more on this in a minute). I tried different lock options and different JOIN options, and this was the fastest.

    This leads me to two observations that puzzle me (they may be related).

    1) MERGE JOIN is the fastest option on the large DB (200 million rows) but the slowest on the small DB (7 million rows). LOOP JOIN is the slowest on the large DB but the fastest on the small DB. I have several of these DBs with the same structure, just different data in them. Some are large, some are small. They each have their own application, but it is the same code. So I need to programmatically write the queries to use MERGE JOIN or LOOP JOIN depending on the size of the DB; I can't have common code for each application.

    2) On the server I have the large DB and the small DB. I open Query Analyzer, bring up two windows (one for each DB), and paste the same SQL (as above) into each window. The query execution plans are significantly different.

    The small DB has 77% of the cost doing an index seek on the Sales table covering index.
    zStores, zItems, and zWDate all have 0% of the cost. This index seek is executed 5,000 times.

    The large DB has 17% of the cost on the covering index, but 30% on the PK index of the zStores table.
    zItems and zWDate have 0% of the cost. This index seek is executed more than 2.5 million times.

    Why would the small DB spend the majority of its time on the Sales index and next to no time on the joined tables' indexes, while the large DB spends most of its time on one of the joined tables?

    This is using the LOOP JOIN for both. The plans for MERGE JOIN are more alike than those for LOOP JOIN, but MERGE JOIN on the small DB takes 5 times longer to execute.

    STATS ON LARGE DB USING MERGE JOIN:
    (763 row(s) affected)

    Table 'zItems'. Scan count 4, logical reads 5, physical reads 2, read-ahead reads 3.
    Table 'zStores'. Scan count 4, logical reads 7, physical reads 2, read-ahead reads 4.
    Table 'zWDate'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0.
    Table 'Sales'. Scan count 13, logical reads 264777, physical reads 0, read-ahead reads 220294.

    SQL Server Execution Times:
    CPU time = 125 ms, elapsed time = 36787 ms.

    ===============================================================
    STATS ON SMALL DB USING MERGE JOIN:
    (763 row(s) affected)

    Table 'zItems'. Scan count 1, logical reads 8, physical reads 1, read-ahead reads 7.
    Table 'Sales'. Scan count 1, logical reads 23695, physical reads 2, read-ahead reads 23787.
    Table 'zWDate'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0.
    Table 'zStores'. Scan count 1, logical reads 8, physical reads 1, read-ahead reads 7.

    SQL Server Execution Times:
    CPU time = 125 ms, elapsed time = 10704 ms.


    Does anything seem wrong about the reads on the Sales tables?

    Your help on this is greatly appreciated.

    mcatet
  9. vaxman New Member

    Hard to tell without knowing how many rows are actually being used. Since you are summing on UnitsSold and ProdSales, there could be > 23,000 rows required to aggregate into 763 rows. What is the query plan it is using? I would think your existing clustered index would be highly selective. I am surprised it takes 10 seconds.

    You might try creating an indexed view (not tested pseudocode :)

    SET ARITHABORT ON

    CREATE VIEW v_Sales WITH SCHEMABINDING AS

    SELECT dbo.Sales.WDate, dbo.Sales.StoreID, dbo.Sales.SKUID,
    dbo.zStores.UserID, dbo.Sales.UnitsSold, dbo.Sales.ProdSales
    FROM dbo.Sales
    INNER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
    INNER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
    INNER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
    WHERE dbo.zWDate.UserID = dbo.zStores.UserID
    AND dbo.zStores.UserID = dbo.zItems.UserID


    CREATE UNIQUE CLUSTERED INDEX v_s_uidx ON v_Sales (WDate, StoreID, SKUID)
    CREATE INDEX v_s_UserID_IDX ON v_Sales (UserID)

    Then:

    SELECT v_Sales.SKUID AS ProdID, SUM(v_Sales.UnitsSold) AS UnitsSold, SUM(v_Sales.ProdSales) AS TotalSales
    FROM v_Sales
    WHERE UserID = 'TEST'
    GROUP BY SKUID
  10. Twan New Member

    If you were going to create an indexed view (and this statement is the only one of its kind), then you'd be better off with:

    SET ARITHABORT ON

    CREATE VIEW v_Sales WITH SCHEMABINDING AS

    SELECT dbo.Sales.SKUID,
    dbo.zStores.UserID, SUM(dbo.Sales.UnitsSold) AS SumUnitsSold, SUM(dbo.Sales.ProdSales) AS SumSales,
    COUNT_BIG(*) AS RowCnt -- required: an indexed view with GROUP BY must include COUNT_BIG(*)
    FROM dbo.Sales
    INNER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
    INNER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
    INNER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
    WHERE dbo.zWDate.UserID = dbo.zStores.UserID
    AND dbo.zStores.UserID = dbo.zItems.UserID
    GROUP BY dbo.zStores.UserID, dbo.Sales.SKUID


    CREATE UNIQUE CLUSTERED INDEX v_s_uidx ON v_Sales (UserID, SKUID)


    That would be quick to select from, but have more overhead on inserting/updating...
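
    One caveat, since the original post says SQL 2000 Standard: as far as I know only Enterprise Edition will use an indexed view automatically, so on Standard you'd have to reference the view directly with the NOEXPAND hint, e.g.:

    SELECT SKUID AS ProdID, SumUnitsSold AS UnitsSold, SumSales AS TotalSales
    FROM dbo.v_Sales WITH (NOEXPAND)
    WHERE UserID = 'TEST'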

    Cheers
    Twan
  11. Twan New Member

    Actually, the number of read-aheads compared to logical reads seems high, and the CPU time relative to the elapsed time seems odd too...?

    How fragmented are the tables/indexes? How fragmented is the disk subsystem?
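
    On SQL 2000 you can check table/index fragmentation with something like:

    DBCC SHOWCONTIG ('dbo.Sales') WITH ALL_INDEXES
    -- look at Scan Density (closer to 100% is better) and Logical Scan Fragmentation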

    Others may have more of an idea about why they may be out of line with what I would have expected...? I would expect the cpu time to be almost all of the elapsed time, and would expect the read-aheads not to be as high as they are...?

    Cheers
    Twan
  12. vaxman New Member

    I think he would have to include WDate in the clustered unique index for the view (and a unique clustered index is required), because that is the primary key of the underlying Sales table. It also occurs to me that you may have a memory or paging problem: 23,000 logical IOs (from cache) should not take 10 seconds.
  13. vaxman New Member

    My bad Twan. I did not notice you added the group by to the view. Yes that seems better.
  14. mcatet New Member

    Thanks to some of the pointers from Twan and Vaxman, I have optimized my queries.

    I actually have 3 main queries (I only posted 1) that are very similar, differing primarily in their GROUP BY clauses. Through testing I found that each of the queries used different JOIN types. In addition, if I had more rows in the stores table and fewer in the items table, the JOIN types were different.

    So I set up a VB test app that executed each of my 3 main queries using each of these JOIN types:

    1) RIGHT OUTER JOIN
    2) MERGE JOIN
    3) LOOP JOIN
    4) HASH JOIN

    for each of these criteria:

    1) All Items, All Stores
    2) Small list of items, Small list of stores
    3) Small list of items, All stores
    4) All items, Small list of stores

    and recorded the execution time. This allowed me to determine the best-performing query for each of the 4 criteria for each of the 3 main queries.

    So I coded the application to use a certain JOIN type for each criterion, based on cutoff values for the number of stores and the number of items.
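
    Roughly like this (a sketch only; the cutoff values and procedure names are placeholders until my timing tests settle on real numbers):

    DECLARE @UserID VARCHAR(20), @StoreCount INT, @ItemCount INT
    SET @UserID = 'TEST'
    SELECT @StoreCount = COUNT(*) FROM dbo.zStores WHERE UserID = @UserID
    SELECT @ItemCount = COUNT(*) FROM dbo.zItems WHERE UserID = @UserID

    IF @StoreCount > 500 AND @ItemCount > 400
        -- large selections: the query ends with OPTION (MERGE JOIN)
        EXEC dbo.usp_SalesSummaryMerge @UserID
    ELSE
        -- small selections: the same query ends with OPTION (LOOP JOIN)
        EXEC dbo.usp_SalesSummaryLoop @UserID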

    I am in the process of building another test function to insert blocks of records into the items and stores tables, execute the queries, and record the execution time, so I can find the optimal cutoff values for items and stores (in other words, with 300 stores in the stores table a query may run faster with one JOIN type than the same JOIN type does with 500 stores; likewise for items).

    The query I posted runs in 38 seconds (MERGE JOIN) for all items and all stores. Using 310 items and 134 stores, it runs in 3 seconds (RIGHT OUTER JOIN). FYI: the 310/134 query runs in 2 minutes as a MERGE JOIN.

    Thanks for your help.


  15. vaxman New Member

    In 8 years of SQL programming I have never had to specify JOIN hints. Index tuning, proper statistics, and examining the query plans and output from SET STATISTICS PROFILE ON have always led to an optimal solution. I have occasionally specified selected indexes and rephrased queries, joins, and subqueries (probably achieving the alternate JOIN methods indirectly!). I'd like to get my hands on your data just to dig around in it. :) I have never gotten close to a billion-row table though! 20 million rows is about as large as I see.
  16. mcatet New Member

    In my 9 years of development I've never had to programmatically change a query depending on the number of records I'm expecting, regardless of the language or the backend DB. Changing queries based on criteria, yes, but never based on expected result size.

    It's not a great development experience, but there is a dramatic performance difference in changing the JOIN type. Then again, I've never had a billion records in 1 table before either. Perhaps if we had a larger server this would not be an issue, but as you know, big servers cost big bucks that we don't have at this time.

    I'm putting together a report showing my findings in the hopes that I can persuade the execs to spend some more money on a larger server.

    Thanks again for all your help.

  17. Brent New Member

    Do your processors have HTT (hyper-threading) enabled? If so, try turning it off and rerunning the query. I'm having a similar problem with one of my boxes: a box with four 2GHz CPUs is running slower than a box with two 870MHz CPUs, and the bigger box is not even active.
    Just a suggestion.


    KENNY
  18. derrickleggett New Member

    I'm going to pull my freaking hair out. Why do you people keep saying this stuff????? I run HTT on ALL my SQL Servers and have ZERO problems. Tune your SQL Servers....tune them.....tune them.....tune them.

    Bath water is dirty.
    Throw out the baby.

    MeanOldDBA
    derrickleggett@hotmail.com

    When life gives you a lemon, fire the DBA.
  19. Twan New Member

    I must agree with Derrick; I also run all servers with HTT and haven't had any issues to date. I always find it amusing that we tend to look at hardware issues first, sometimes without looking at the application software... 80-90% of performance issues are due to inefficient coding in the application. However, this keeps me gainfully employed in the field of application tuning, so I can't complain ;-)

    Cheers
    Twan
