
SQL Server Performance Forum – Threads Archive

Query Performance VLDB

I have a table with about 200 million rows. (The table will be over 500 million rows when all the data is loaded.) There is a possibility of getting even more data that would put this table close to 1 billion records. I need to query this table, filtering rows by joining three other tables.
The three joined tables are a list of stores (1,418 records), a list of SKUs (833 records), and a list of dates (13 records).

zItems (both fields form a composite clustered index):
SKUID INT
UserID VARCHAR(20)

zStores (both fields form a composite clustered index):
StoreID INT
UserID VARCHAR(20)

zWDate (both fields form a composite clustered index):
WDate SMALLDATETIME
UserID VARCHAR(20)

Sales (WDate, StoreID, SKUID form the composite clustered index):
WDate SMALLDATETIME
StoreID INT
SKUID INT
UnitsSold SMALLINT
ProdSales REAL

Basically, the query totals up the sales data for the stores, items, and dates held in the temp tables. These temp tables are populated by multiple users (hence the UserID field).
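For reference, the schema described above in rough DDL form. This is only a sketch: the composite clustered indexes are shown here as clustered primary key constraints, and the constraint names are made up, which is an assumption on my part.

-- Sketch of the tables as described; constraint names are hypothetical.
CREATE TABLE dbo.zItems (
    SKUID  INT         NOT NULL,
    UserID VARCHAR(20) NOT NULL,
    CONSTRAINT PK_zItems PRIMARY KEY CLUSTERED (SKUID, UserID)
)

CREATE TABLE dbo.zStores (
    StoreID INT         NOT NULL,
    UserID  VARCHAR(20) NOT NULL,
    CONSTRAINT PK_zStores PRIMARY KEY CLUSTERED (StoreID, UserID)
)

CREATE TABLE dbo.zWDate (
    WDate  SMALLDATETIME NOT NULL,
    UserID VARCHAR(20)   NOT NULL,
    CONSTRAINT PK_zWDate PRIMARY KEY CLUSTERED (WDate, UserID)
)

CREATE TABLE dbo.Sales (
    WDate     SMALLDATETIME NOT NULL,
    StoreID   INT           NOT NULL,
    SKUID     INT           NOT NULL,
    UnitsSold SMALLINT      NULL,
    ProdSales REAL          NULL,
    CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED (WDate, StoreID, SKUID)
)

The query itself: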
SELECT dbo.Sales.SKUID AS ProdID, SUM(dbo.Sales.UnitsSold) AS UnitsSold, SUM(dbo.Sales.ProdSales) AS TotalSales
FROM dbo.Sales WITH (NOLOCK)
RIGHT OUTER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
RIGHT OUTER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
RIGHT OUTER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
WHERE (dbo.zWDate.UserID = 'TEST')
AND (dbo.zStores.UserID = 'TEST')
AND (dbo.zItems.UserID = 'TEST')
GROUP BY dbo.Sales.SKUID

This is running on a server with four 2.8 GHz CPUs, 2 GB RAM, and 300 GB of disk, running Windows 2000 and SQL Server 2000 Standard. The query takes 2.5 minutes to return 847 records (run directly on the server, so there is no network lag). The server is not otherwise in use, so this is the only activity on it. Monitors do not show any problem with CPU, memory, or I/O; the box runs at about 30% on one CPU and small amounts on the others. SQL Server is configured to use all CPUs, and the cost threshold for parallelism is at its default of 5. I ran the Index Tuning Wizard and it suggested adding a covering index on WDate, StoreID, SKUID, UnitsSold, ProdSales. I added it and it made very little difference (only a few seconds), and it added so much space to the DB file that it wasn't worth the few seconds it gained me, so I took it off. Is there anything else I can do to improve the speed of this query, or am I asking too much of SQL Server on this hardware? Any help will be greatly appreciated. Thanks
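For reference, the covering index the wizard suggested would have looked roughly like this (the index name is made up):

-- Nonclustered covering index as suggested by the Index Tuning Wizard
-- (name is hypothetical). It covers the query entirely.
CREATE NONCLUSTERED INDEX IX_Sales_Cover
    ON dbo.Sales (WDate, StoreID, SKUID, UnitsSold, ProdSales)

Since SQL Server 2000 has no INCLUDE columns, UnitsSold and ProdSales have to be part of the index key, so the index is effectively a second copy of the Sales table, which is why it adds so much space to the DB file.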
Try adding a HASH JOIN hint:

SELECT dbo.Sales.SKUID AS ProdID, SUM(dbo.Sales.UnitsSold) AS UnitsSold, SUM(dbo.Sales.ProdSales) AS TotalSales
FROM dbo.Sales WITH (NOLOCK)
RIGHT OUTER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
RIGHT OUTER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
RIGHT OUTER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
WHERE (dbo.zWDate.UserID = 'TEST')
AND (dbo.zStores.UserID = 'TEST')
AND (dbo.zItems.UserID = 'TEST')
GROUP BY dbo.Sales.SKUID
OPTION (HASH JOIN)

I have a test version of the DB on my laptop with 7 million rows in the Sales table. Testing the query I posted against this test DB took 7 seconds.
Adding OPTION (HASH JOIN) increased the query time to 13 seconds.
So I tried OPTION (LOOP JOIN), and this cut the query time from 7 seconds to 5 seconds. Tomorrow I'll try this on the full database to see what kind of performance boost it gives. Every bit helps, but I still need to shave more time off this query. Any more suggestions? Thanks, vaxman.
mcatet
Hi ya, what is it that you're actually trying to achieve? The RIGHT OUTER JOINs are cancelled out by the WHERE clause, and having the three tables joined in means that a Sales row must match all three tables. Your query at the moment is the same as saying:

SELECT dbo.Sales.SKUID AS ProdID, SUM(dbo.Sales.UnitsSold) AS UnitsSold, SUM(dbo.Sales.ProdSales) AS TotalSales
FROM dbo.Sales WITH (NOLOCK)
INNER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
INNER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
INNER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
WHERE (dbo.zWDate.UserID = 'TEST')
AND (dbo.zStores.UserID = 'TEST')
AND (dbo.zItems.UserID = 'TEST')
GROUP BY dbo.Sales.SKUID

If this is what you intend, then make sure that the indexes on the three smaller tables have UserID first, followed by the key field. If this is not what you intend, then perhaps you can give a small example of the data that would be in each table and the result you expect? Cheers
Twan
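A sketch of what that reordering might look like (the index names are made up; if the existing composite clustered indexes are primary key constraints, they would have to be dropped with ALTER TABLE ... DROP CONSTRAINT first):

-- Recreate each lookup table's clustered index with UserID leading, so the
-- UserID = 'TEST' filter becomes a clustered range seek. Names are hypothetical.
CREATE UNIQUE CLUSTERED INDEX IX_zStores_User_Store ON dbo.zStores (UserID, StoreID)
CREATE UNIQUE CLUSTERED INDEX IX_zItems_User_SKU    ON dbo.zItems  (UserID, SKUID)
CREATE UNIQUE CLUSTERED INDEX IX_zWDate_User_Date   ON dbo.zWDate  (UserID, WDate)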

Basically, what I'm trying to do is sum up UnitsSold and ProdSales for a specific set of dates and for a specific set of StoreIDs and SKUIDs that each user specifies when they run a report. Users run reports simultaneously, so the UserID is needed to keep one user's selections from overwriting another's. It is similar to this query:

SELECT ...
FROM Sales
WHERE (WDate BETWEEN '1/1/2004' AND '7/1/2004') AND StoreID IN (1,2,3,4)
GROUP BY Sales.SKUID
HAVING SKUID IN (100, 200, 300, 400)

Except that I have 1,418 StoreIDs and 833 SKUIDs. Using the IN clause is slower and is a pain for development, so I switched to using the joined tables. Initially I tried using INNER JOIN, but the RIGHT OUTER JOIN actually gave me the same results in less time; the query runs about 10% faster using RIGHT OUTER instead of INNER. This is a sample of what is in each table.

Sales:
WDate      StoreID  SKUID  UnitsSold  ProdSales  SalesType
4/10/2004  1        100    5          10.00      1
4/10/2004  1        200    1          2.00       1
4/10/2004  2        100    3          6.00       1
4/10/2004  2        200    3          6.00       1

zWDate:
WDate      UserID
4/10/2004  TEST
4/17/2004  TEST
4/24/2004  TEST
5/1/2004   TEST
4/17/2004  DEVTEST

zItems:
SKUID  UserID
100    TEST
200    TEST
100    DEVTEST
200    DEVTEST

zStores:
StoreID  UserID
1        TEST
2        TEST
3        TEST
1        DEVTEST
3        DEVTEST
5        DEVTEST

Each of the joined tables (z...) has a composite key consisting of both fields to make it unique. I had to do this because each user runs reports that may have different lists of stores, SKUs, and dates. So the query is using UserID to filter the records in zWDate, zItems, and zStores to get the list of dates, StoreIDs, and SKUIDs, which is the key into the Sales table. I tried ordering the fields with UserID first but it made no difference in the execution time. Thanks,
mcatet

Alrighty then, have you tried:

SELECT dbo.Sales.SKUID AS ProdID, SUM(dbo.Sales.UnitsSold) AS UnitsSold, SUM(dbo.Sales.ProdSales) AS TotalSales
FROM dbo.Sales WITH (NOLOCK)
WHERE dbo.Sales.StoreID IN (SELECT dbo.zStores.StoreID FROM dbo.zStores WHERE dbo.zStores.UserID = 'TEST')
AND dbo.Sales.SKUID IN (SELECT dbo.zItems.SKUID FROM dbo.zItems WHERE dbo.zItems.UserID = 'TEST')
AND dbo.Sales.WDate IN (SELECT dbo.zWDate.WDate FROM dbo.zWDate WHERE dbo.zWDate.UserID = 'TEST')
GROUP BY dbo.Sales.SKUID

What are the indexes on the Sales table? This query could benefit from the covering index you mentioned earlier, ensuring that the most selective fields (out of StoreID, SKUID, and WDate) come first. What is the execution plan it is using? The fact that it is only using a single CPU, and not very much of it, would seem to indicate that there is physical I/O going on, which should imply that running the statement a second time would make a difference. What do you get when you run the statement with I/O stats on? Cheers
Twan
When I tried it before posting, the HASH JOIN resulted in fewer logical reads. Twan's suggestion resulted in the same number of reads as the original. You might also try MERGE JOIN, which results in the fewest I/O operations but takes more CPU time (at least on SQL 2005). As Twan suggested, SET STATISTICS IO ON (and SET STATISTICS PROFILE ON) and see if that helps to identify bottlenecks.
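A minimal sketch of capturing those statistics in Query Analyzer (only the session settings are shown; the report query itself is the one already posted, indicated here by a comment):

-- Turn on per-table I/O counts and the per-operator execution profile
-- for the current session, run the query, then turn them back off.
SET STATISTICS IO ON
SET STATISTICS PROFILE ON

-- ... run the report query here ...

SET STATISTICS IO OFF
SET STATISTICS PROFILE OFF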

I now have this query down from 2:24 to 0:35. It still seems like it should run faster, though.

SELECT dbo.Sales.SKUID AS ProdID, SUM(dbo.Sales.UnitsSold) AS UnitsSold, SUM(dbo.Sales.ProdSales) AS TotalSales
FROM dbo.Sales
INNER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
INNER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
INNER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
WHERE (dbo.zWDate.UserID = 'TEST')
AND (dbo.zStores.UserID = 'TEST')
AND (dbo.zItems.UserID = 'TEST')
GROUP BY dbo.Sales.SKUID
OPTION (MERGE JOIN)

The difference is using MERGE JOIN (more on this in a minute). I tried different lock options and different JOIN options and this was the fastest. This leads me to two observations that puzzle me (they may be related).

1) MERGE JOIN is the fastest option on the large DB (200 million rows) but the slowest on the small DB (7 million rows). LOOP JOIN is the slowest on the large DB but the fastest on the small DB. I have several of these DBs with the same structure, just different data in them; some are large, some are small. Each has its own application, but it is the same code. So I would need to programmatically write the queries to use MERGE JOIN or LOOP JOIN depending on the size of the DB; I can't have common code for each application.

2) On the server I have both the large DB and the small DB. I open Query Analyzer, bring up two windows (one for each DB), and paste the same SQL (as above) into each window. The query execution plans are significantly different. On the small DB, 77% of the cost is an index seek on the Sales table's covering index; zStores, zItems, and zWDate all show 0% of the cost, and that index seek is executed about 5,000 times. On the large DB, 17% of the cost is on the covering index, but 30% is on the PK index of the zStores table; zItems and zWDate show 0% of the cost, and that index seek is executed more than 2.5 million times.

Why would the small DB spend the majority of its time on the Sales index and next to no time on the joined tables' indexes, while the large DB spends most of its time on one of the joined tables? This is using LOOP JOIN for both. The plans for MERGE JOIN are more alike than those for LOOP JOIN, but MERGE JOIN on the small DB takes 5 times longer to execute.

STATS ON LARGE DB USING MERGE JOIN:
(763 row(s) affected)

Table 'zItems'. Scan count 4, logical reads 5, physical reads 2, read-ahead reads 3.
Table 'zStores'. Scan count 4, logical reads 7, physical reads 2, read-ahead reads 4.
Table 'zWDate'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0.
Table 'Sales'. Scan count 13, logical reads 264777, physical reads 0, read-ahead reads 220294.

SQL Server Execution Times:
   CPU time = 125 ms, elapsed time = 36787 ms.

===============================================================
STATS ON SMALL DB USING MERGE JOIN:

(763 row(s) affected)

Table 'zItems'. Scan count 1, logical reads 8, physical reads 1, read-ahead reads 7.
Table 'Sales'. Scan count 1, logical reads 23695, physical reads 2, read-ahead reads 23787.
Table 'zWDate'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0.
Table 'zStores'. Scan count 1, logical reads 8, physical reads 1, read-ahead reads 7.

SQL Server Execution Times:
   CPU time = 125 ms, elapsed time = 10704 ms.
Anything seem wrong about the reads on the Sales tables? Your help on this is greatly appreciated.
mcatet
Hard to tell without knowing how many rows are actually being used. Since you are summing UnitsSold and ProdSales, there could be > 23,000 rows required to aggregate into 763 rows. What is the query plan it is using? I would think your existing clustered index would be highly selective; I am surprised it takes 10 seconds.

You might try creating an indexed view (not tested, pseudo code):

SET ARITHABORT ON

CREATE VIEW v_Sales WITH SCHEMABINDING AS
SELECT dbo.Sales.WDate, dbo.Sales.StoreID, dbo.Sales.SKUID,
       dbo.zStores.UserID, dbo.Sales.UnitsSold, dbo.Sales.ProdSales
FROM dbo.Sales
INNER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
INNER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
INNER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
WHERE dbo.zWDate.UserID = dbo.zStores.UserID
AND dbo.zStores.UserID = dbo.zItems.UserID

CREATE UNIQUE CLUSTERED INDEX v_s_uidx ON v_Sales (WDate, StoreID, SKUID)
CREATE INDEX v_s_UserID_IDX ON v_Sales (UserID)

Then:

SELECT v_Sales.SKUID AS ProdID, SUM(v_Sales.UnitsSold) AS UnitsSold, SUM(v_Sales.ProdSales) AS TotalSales
FROM v_Sales
WHERE UserID = 'TEST'
GROUP BY SKUID
If you were going to create an indexed view (and this statement is the only one of its kind), then you'd be better off with:

SET ARITHABORT ON

CREATE VIEW v_Sales WITH SCHEMABINDING AS
SELECT dbo.Sales.SKUID,
       dbo.zStores.UserID,
       SUM(dbo.Sales.UnitsSold) AS SumUnitsSold,
       SUM(dbo.Sales.ProdSales) AS SumSales,
       COUNT_BIG(*) AS RowCnt   -- an indexed view with GROUP BY must include COUNT_BIG(*)
FROM dbo.Sales
INNER JOIN dbo.zStores ON dbo.Sales.StoreID = dbo.zStores.StoreID
INNER JOIN dbo.zItems ON dbo.Sales.SKUID = dbo.zItems.SKUID
INNER JOIN dbo.zWDate ON dbo.Sales.WDate = dbo.zWDate.WDate
WHERE dbo.zWDate.UserID = dbo.zStores.UserID
AND dbo.zStores.UserID = dbo.zItems.UserID
GROUP BY dbo.zStores.UserID, dbo.Sales.SKUID

CREATE UNIQUE CLUSTERED INDEX v_s_uidx ON v_Sales (UserID, SKUID)
That would be quick to select from, but have more overhead on inserting/updating… Cheers
Twan
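One caveat, since the server in question is SQL Server 2000 Standard Edition: as far as I know, only Enterprise Edition matches indexed views automatically, so the report would have to query the view directly with the NOEXPAND hint for its index to be used. Roughly:

-- Query the indexed view directly; WITH (NOEXPAND) tells the optimizer to use
-- the view's clustered index rather than expanding the view definition.
SELECT SKUID AS ProdID,
       SumUnitsSold AS UnitsSold,
       SumSales AS TotalSales
FROM dbo.v_Sales WITH (NOEXPAND)
WHERE UserID = 'TEST'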
Actually the number of read-aheads compared to logical reads seems high, and the cpu time to elapsed time seems odd too…? How fragmented are the tables/indexes? How fragmented is the disk subsystem? Others may have more of an idea about why they may be out of line with what I would have expected…? I would expect the cpu time to be almost all of the elapsed time, and would expect the read-aheads not to be as high as they are…? Cheers
Twan
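A sketch of how that fragmentation could be checked on SQL Server 2000 (table name taken from the thread; the rebuild command is shown only as the usual next step if fragmentation turns out to be high):

-- Per-index fragmentation report for the big table.
DECLARE @id INT
SET @id = OBJECT_ID('Sales')
DBCC SHOWCONTIG (@id) WITH ALL_INDEXES

-- If scan density is low or logical fragmentation is high, rebuild the
-- indexes (DBCC INDEXDEFRAG is the slower online alternative).
DBCC DBREINDEX ('Sales', '', 0)   -- 0 keeps the existing fill factor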
I think he would have to include WDate in the clustered unique index for the view (and a unique clustered index is required) because that is the primary key of the underlying sales table. It also occurs to me that you may have a memory or paging problem. 23,000 logical io’s (from cache) should not take 10 seconds.

My bad Twan. I did not notice you added the group by to the view. Yes that seems better.
Thanks to some of the pointers from Twan and vaxman, I have optimized my queries. I actually have three main queries (I only posted one) that are very similar, differing primarily in their GROUP BYs. Through testing I found that each of the queries did best with a different JOIN type. In addition, if I had more rows in the stores table and fewer in the items table, the best JOIN type changed again. So I set up a VB test app that executed each of my three main queries using each of these JOIN types:

1) RIGHT OUTER JOIN
2) MERGE JOIN
3) LOOP JOIN
4) HASH JOIN

for each of these criteria:

1) All items, all stores
2) Small list of items, small list of stores
3) Small list of items, all stores
4) All items, small list of stores

and recorded the execution times. This allowed me to determine the best-performing query for each of the four criteria for each of the three main queries. So I coded the application to use a certain JOIN type for each criterion based on a cutoff value for the number of stores and number of items (a sketch of the idea follows below). I am in the process of building another test function to insert blocks of records into the items and stores tables, execute the queries, and record the execution times so I can find the optimal cutoff values for items and stores (in other words, with 300 stores in the stores table a query may run faster with one JOIN type than the same JOIN type does with 500 stores, and likewise for items). The query I posted runs in 38 seconds (MERGE JOIN) for all items and all stores. Using 310 items and 134 stores it runs in 3 seconds (RIGHT OUTER JOIN). FYI: the 310/134 query runs in 2 minutes as a MERGE JOIN. Thanks for your help.
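A minimal T-SQL sketch of that idea (the cutoff values, variable names, and the decision itself are hypothetical; the real numbers would come from the timing tests):

-- Hypothetical sketch: pick a join hint based on how many stores/items the
-- user has selected, then run the report via dynamic SQL.
DECLARE @UserID VARCHAR(20), @StoreCount INT, @ItemCount INT
DECLARE @Hint VARCHAR(30), @Sql NVARCHAR(2000)

SET @UserID = 'TEST'
SELECT @StoreCount = COUNT(*) FROM dbo.zStores WHERE UserID = @UserID
SELECT @ItemCount  = COUNT(*) FROM dbo.zItems  WHERE UserID = @UserID

-- Cutoff values below are placeholders; the real ones come from the timing runs.
IF @StoreCount > 1000 AND @ItemCount > 500
    SET @Hint = ' OPTION (MERGE JOIN)'
ELSE
    SET @Hint = ''   -- small lists: let the optimizer choose

SET @Sql = N'SELECT s.SKUID AS ProdID, SUM(s.UnitsSold) AS UnitsSold, SUM(s.ProdSales) AS TotalSales
FROM dbo.Sales s
INNER JOIN dbo.zStores st ON s.StoreID = st.StoreID
INNER JOIN dbo.zItems  i  ON s.SKUID = i.SKUID
INNER JOIN dbo.zWDate  d  ON s.WDate = d.WDate
WHERE st.UserID = @U AND i.UserID = @U AND d.UserID = @U
GROUP BY s.SKUID' + @Hint

EXEC sp_executesql @Sql, N'@U VARCHAR(20)', @U = @UserID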

In 8 years of SQL programming I have never had to specify JOIN hints. Index tuning, proper statistics, and examining the query plans and the output from SET STATISTICS PROFILE ON have always led to an optimal solution. I have occasionally specified selected indexes and rephrased queries, joins, and subqueries (probably achieving the alternate JOIN methods indirectly!). I'd like to get my hands on your data just to dig around in it. :-) I have never gotten close to a billion-row table though! 20 million rows is about as large as I see.
In my 9 years of development I've never had to programmatically change a query depending on the number of records I'm expecting, regardless of the language or the back-end DB. Changing queries based on criteria, yes, but never based on expected result size. It's not a great development experience, but there is a dramatic performance difference in changing the JOIN type. Then again, I've never had a billion records in one table before either. Perhaps if we had a larger server this would not be an issue, but as you know, big servers cost big bucks that we don't have at this time. I'm putting together a report showing my findings in the hope that I can persuade the execs to spend some more money on a larger server. Thanks again for all your help.
Do your processors have HTT (hyper-threading) enabled? If so, try turning it off and re-running the query. I'm having a similar problem with one of my boxes: a 4 x 2 GHz machine is running slower than a box with two 870 MHz CPUs, and the bigger box is not even active.
Just a suggestion.
KENNY
I’m going to pull my freaking hair out. Why do you people keep saying this stuff????? I run HTT on ALL my SQL Servers and have ZERO problems. Tune your SQL Servers….tune them…..tune them…..tune them. Bath water is dirty.
Throw out the baby.
MeanOldDBA
[email protected]
When life gives you a lemon, fire the DBA.
I must agree with Derrick; I also run all servers with HTT and haven't had any issues to date. I always find it amusing that we tend to look at hardware issues first, sometimes without looking at the application software... 80-90% of performance issues are due to inefficient coding in the application. However, this keeps me gainfully employed in the field of application tuning, so I can't complain ;-)

Cheers
Twan