a better loop | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

a better loop

Hi there,
Here is my table:
—————————————————-
CREATE TABLE [dbo].[MyTable] (
[ColumnA] [int] NOT NULL ,
[ColumnB] [datetime] NOT NULL ,
[ColumnC] [decimal](5, 4) NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[MyTable] WITH NOCHECK ADD
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED
(
[ColumnA]
) ON [PRIMARY]
GO
—————————————————-
This table has about 20 million records, and there is a loop required for it.
Here is my method:
…code from inside a stored procedure
—————————————————-
DECLARE @VarB datetime
DECLARE @VarC decimal(5,4)

DECLARE MyCursor CURSOR FAST_FORWARD FOR
SELECT ColumnB, ColumnC FROM MyTable

OPEN MyCursor
FETCH NEXT FROM MyCursor INTO @VarB, @VarC
WHILE @@FETCH_STATUS = 0
BEGIN
-- some processing
-- it has to read row by row and do something
FETCH NEXT FROM MyCursor INTO @VarB, @VarC
END
CLOSE MyCursor
DEALLOCATE MyCursor
—————————————————-
For a simple processing block it takes about 18 minutes to run on my machine (Athlon XP 2200, 760 MB RAM and 35 GB free space), and my computer is very slow after executing this loop. Do you see any better solution for looping through these records?
I made MyTable smaller (about 1 million rows) and the run time was 1 min 17 sec, which is more than 1/20 of 18 minutes. Does it really help to split my table?
Thank you, Bolo
What type of processing are you doing with the cursor? Give more details.
Madhivanan
Failing to plan is Planning to fail
Here is the alternative:
<some processing>
from myTable

Please post some more information so that we can help you!
If you’re unsure what information is needed, have a look here: http://www.aspfaq.com/etiquette.asp?id=5006
Frank Kalis
Microsoft SQL Server MVP
http://www.insidesql.de
Blogged today? http://www.insidesql.de/blogs

Check this if you can’t avoid a cursor:
http://www.sql-server-performance.com/dp_no_cursors.asp
I’ve seen the method proposed in many articles. In almost every one there was the same claim that it significantly improves performance over the cursor solution. However, my tests always showed comparable performance; sometimes the cursor was even slightly faster. The slowness of the cursor comes mostly from the slowness of row-by-row processing, so replacing it with another row-by-row processing method doesn’t help performance much, if at all.
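For reference, here is a minimal sketch of the kind of cursor-free loop such articles usually propose, written against Bolo’s table. It still touches one row per iteration, which is the point being made above.

DECLARE @ID int, @VarB datetime, @VarC decimal(5,4)

SELECT @ID = MIN(ColumnA) FROM dbo.MyTable

WHILE @ID IS NOT NULL
BEGIN
    SELECT @VarB = ColumnB, @VarC = ColumnC
    FROM dbo.MyTable
    WHERE ColumnA = @ID

    -- some processing, still one row at a time

    SELECT @ID = MIN(ColumnA) FROM dbo.MyTable WHERE ColumnA > @ID
END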
Replacing cursors with a WHILE loop is the only alternative to cursors, but it doesn’t improve performance much. Only a set-based approach will play a major role. :)
Madhivanan
Failing to plan is Planning to fail
Which takes us back to the original question …

Bolo, you have to give more details about the processing that you need to do. Nine times out of ten, processing can be done in a single UPDATE or INSERT statement with no looping through cursors.

People who first start programming in T-SQL usually think in step-by-step processing like in procedural languages, but in most cases you can issue a single T-SQL statement to do the whole thing in one step.

You may well have that one-in-ten example where you can’t avoid a cursor, but without the details we can’t tell. :)
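To make that concrete with a purely hypothetical before/after: if the loop body only copies qualifying rows to a second table, the whole cursor collapses into one statement. The target table name and the predicate below are invented for illustration only.

-- one pass over MyTable instead of 20 million fetches;
-- dbo.MySecondTable and the WHERE clause are placeholders
INSERT INTO dbo.MySecondTable (RecordDate, Temperature)
SELECT ColumnB, ColumnC
FROM dbo.MyTable
WHERE ColumnC > 0.5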
quote: Originally posted by Madhivanan
Replacing cursors with a WHILE loop is the only alternative to cursors, but it doesn’t improve performance much. Only a set-based approach will play a major role. :)

I would express what you had in mind differently: besides the set-based approach, a WHILE loop is the only alternative to cursors. However, while the set-based approach is much more efficient, row-by-row loops without a cursor are equally inefficient: sometimes marginally faster, sometimes marginally slower.
>> sometimes marginally faster, sometimes marginally slower
Do you mean that it depends on the number of rows in the table?
Madhivanan
Failing to plan is Planning to fail
I don’t know. Since there was no significant performance difference I didn’t investigate the reason. Even for some cursor-friendly solutions I’ve found a set-based method that performs better. However, the best solution is to resolve such problems (where one column value depends on values from previous rows) on the client. SQL code should return just enough information; the client (middle tier or presentation layer) should produce all derived information based on the raw data.
By "sometimes faster, sometimes slower" I mean once I tested cusor was marginaly faster almost all the time, next time (could be slightly different loop solution or test table structure) it was marginaly slower.
Let’s not drown out Bolo on his first posting … … but anyway … If you use cursors where you don’t need them, in my experience you are also likely to use lookup queries to retrieve data related to the values you’ve fetched, etc. etc. Might not be too bad for a handful of rows being processed – then it is quite possibly performing better than set-based – but most definitely not if you need to process bigger numbers of rows. Then again this might be one of those processes that can only be done in a cursor.
quote: Let’s not drown out Bolo on his first posting …

How nice! :)

But indeed we should let Bolo explain what his
quote: -- some processing
-- it has to read row by row and do something
is all about. Anything else is like a shot in the dark. :)

--
Frank Kalis
Microsoft SQL Server MVP
http://www.insidesql.de
Blogged today? http://www.insidesql.de/blogs
First of all I want to thank you for your feedback. “Some processing” is something that has to be developed and will change from time to time. Here is a small example: MyTable is a historical table, ColumnB is a datetime and ColumnC is a decimal. Let’s say ColumnC is the outside temperature for a certain date/time recorded in ColumnB. Temperature was measured every minute, and that’s why MyTable ended up having 20 million records. ColumnA is an EntryID primary key and was incremented by 1 as new observations were added to MyTable. ColumnB is sorted ascending as data was added, so there is no need to sort on ColumnB. As MyCursor runs and @RecordDate and @Temperature get values, there should be a process that writes @RecordDate and @Temperature to a second table if @Temperature is greater than the average of previous readings, for example. If @RecordDate is a Sunday then the average should look at previous Sundays only, and so on.
This is what I call “some processing”. Requirements for “some processing” may change from time to time, and the above example is just a small scenario. I found cursors interesting and more like a VB approach (see Adriaan’s remark).
As I said, my computer is slow after 20 million loop iterations even if statements like this
CLOSE MyCursor
DEALLOCATE MyCursor
are used.
Any feedback on this? And also, what about having many historical tables instead of one, like dividing MyTable into 20 MySmallTable tables with 1 million records each? Thanks again and have a great day, Bolo

So you want to find all rows where the recorded temperature in ColumnC is higher than the average temperature for all previously recorded temperatures for the same weekday? Agreed: this is a complex query. But it can be done in a single statement, using a so-called correlated subquery. Whether it performs better or worse than doing it one row at a time in a cursor is rather hard to predict, though I’m pretty sure it will do better. Perhaps this is what you’re looking for …

SELECT T1.*
FROM MyTable T1
WHERE T1.ColumnC >
      (SELECT AVG(T2.ColumnC)
       FROM MyTable T2
       WHERE DATEPART(dw, T2.ColumnB) = DATEPART(dw, T1.ColumnB)
         AND T2.ColumnA < T1.ColumnA)

… or, to be really tidy, use this as the last line instead:

         AND T2.ColumnB < T1.ColumnB)

… as you should never assume that the values in the identity column are 100% dependable. Not unless you use a temp table with its own identity column and you’re inserting the data ordered by the datetime. The first row for each weekday will not be included in the result set, as the subquery should return NULL for it. As time goes by, the average will level out somewhat, which has an effect on how far out a record must be to top the average, so I’m not sure if the simple average is really what you’re looking for.
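Since Bolo mentioned writing the qualifying readings to a second table, the same correlated subquery can feed an INSERT directly. This is only a sketch; the target table name and its columns are assumptions, not something defined in the thread.

INSERT INTO dbo.TemperatureAlerts (RecordDate, Temperature)  -- hypothetical target table
SELECT T1.ColumnB, T1.ColumnC
FROM MyTable T1
WHERE T1.ColumnC >
      (SELECT AVG(T2.ColumnC)
       FROM MyTable T2
       WHERE DATEPART(dw, T2.ColumnB) = DATEPART(dw, T1.ColumnB)
         AND T2.ColumnB < T1.ColumnB)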
Hi ya,
To get any performance out of the query, you’d possibly need to create:
– an indexed view with the average temperatures for each weekday (since that would be 7 records…); see the sketch below
– appropriate indexes, possibly on computed columns
BUT what to do will depend on exactly what you are trying to achieve, so it is difficult to give exact answers and possible approaches. In general, as others have said, it is best to stay away from cursors if possible.
Cheers
Twan
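Here is a minimal sketch of the indexed view Twan mentions, for the overall per-weekday averages (not the "previous readings only" rule). Indexed views cannot contain AVG, and DATEPART(dw, ...) is non-deterministic because it depends on SET DATEFIRST, so the sketch stores SUM and COUNT_BIG grouped by a deterministic weekday expression and computes the average when querying. The object names, and the assumption that the expression is accepted as deterministic, are unverified.

CREATE VIEW dbo.WeekdayTemperature
WITH SCHEMABINDING
AS
SELECT DATEDIFF(day, 0, ColumnB) % 7 AS WeekdayNo,   -- 0 = Monday; deterministic expression
       SUM(ColumnC)                  AS TempSum,
       COUNT_BIG(*)                  AS TempCount
FROM dbo.MyTable
GROUP BY DATEDIFF(day, 0, ColumnB) % 7
GO

CREATE UNIQUE CLUSTERED INDEX IX_WeekdayTemperature
    ON dbo.WeekdayTemperature (WeekdayNo)
GO

-- average per weekday, computed from the materialized sums
SELECT WeekdayNo, TempSum / TempCount AS AvgTemp
FROM dbo.WeekdayTemperature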
The problem is similar to the running totals problem I wrote article about. It should be published soon. I’ll post the link here when it happens. Basically, there is a data set solution for "running average problem" using "ordered update". First step is to insert data into another table, you can use temporary table, with the same structure except for additional running average column. Next step is update statement with order of updates forced and local variables keeping current aggregate value. This solution is faster then cursor method, but 20 million rows may still be too much. Probably it is good idea to do average calculations in chunks, but it doesn’t mean you have to split the table. You can make 10000 calculations in one batch and repeat it untill complete work is done. If datetime uniquely identify rows in that historical table there is no need to have identity column as primary key, you can use datetime column instead. I assume that table is not referenced by other tables in the database.
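A rough sketch of the ordered-update idea described above, reduced to a single running average over all previous readings (the per-weekday split could be handled by processing one weekday at a time, or with extra state variables). It relies on behaviour that is not guaranteed or documented (left-to-right assignment and updating in clustered-index order), so treat it as an illustration of the technique rather than production code; the work table and its column names are made up.

-- work table keyed the way the updates should run
CREATE TABLE #Work (
    ColumnA    int          NOT NULL PRIMARY KEY CLUSTERED,
    ColumnC    decimal(5,4) NOT NULL,
    RunningAvg decimal(9,4) NULL           -- average of all *previous* readings
)

INSERT INTO #Work (ColumnA, ColumnC)
SELECT ColumnA, ColumnC
FROM dbo.MyTable

DECLARE @sum decimal(18,4), @cnt int, @avg decimal(9,4)
SELECT @sum = 0, @cnt = 0

-- "ordered update": carry the running aggregates in variables
UPDATE w
SET @avg = w.RunningAvg = CASE WHEN @cnt = 0 THEN NULL
                               ELSE @sum / @cnt END,
    @sum = @sum + w.ColumnC,
    @cnt = @cnt + 1
FROM #Work AS w WITH (INDEX(0))            -- try to force clustered-index order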
I’m looking forward to your article, Mirko! :)

Being in the financial business, I’ve tried to implement some technical stock analysis in SQL Server. One thing was also the running average. I only had some 50 different stocks and a history going back 3 years, I think. Here’s a bit of code I used:

select
    TradeDate, TradePrice,
    (select avg(TradePrice*1.0)
     from Stocks t2
     where t2.TradeDate between dateadd(dd, -10, t1.TradeDate) and t1.TradeDate
    ) as moving_average
from Stocks t1

The performance was not really overwhelming when computing the 38-day and 250-day running averages, even with a covering index. I ended up adding an indexed computed column, like Twan mentioned. So, breaking the theoretical rules and keeping redundant data. Another idea I never explored any further was to have another column with a UDF as DEFAULT, and in the UDF calculate the average at INSERT time.

Just my $0.02 anyway. :)

--
Frank Kalis
Microsoft SQL Server MVP
http://www.insidesql.de
Blogged today? http://www.insidesql.de/blogs
Frank, your solution is quadratic: the number of rows accessed is N * (N+1)/2 where N is the number of rows returned. A row-by-row solution is better since it has to access each row only once, which means even a cursor is better. A complete discussion and a faster set-based solution are in the article. The main piece of code is:

update t
   set @total = t.total = @total + t.amount
  from table t (index = <index that forces proper order>)

or:

update t
   set @total = t.total = @total + t.amount
  from (select top 100 percent a.id
          from table a
         order by <whatever order is needed>) as o
       inner loop join table t on o.id = t.id
option (force order)
Hey, I’ve never said that I think my code is optimal. :)
Probably a case for Analysis Services, but I’m not into that anyway. I finally ended up downloading the raw data into Excel and doing the stuff there. A whole lot easier, especially when doing comparisons between several stocks, drawing nicely coloured graphs for management and stuff like that. Running averages are just basic technical analysis. There are many very sophisticated methods which simply aren’t made for implementing in a database.

--
Frank Kalis
Microsoft SQL Server MVP
http://www.insidesql.de
Blogged today? http://www.insidesql.de/blogs
I agree, client-side processing of such a calculation is the best solution for sure.
For the problem at hand, where the running average is not a presentational issue but further processing is going to be done based on historical data, I would add real columns with the calculated aggregations. In that case it doesn’t matter how long it takes to calculate the aggregations initially; later processing based on them will be fast and not resource intensive.
Here’s a possible alternative that I use for some similar processing involving transaction activity at about 15K businesses/day – instead of building the average/max/min/count/etc. completely from scratch each time, I have a separate summary table which has the previous day’s summary totals in it. I run a summary view on the newest data only, and then compare/add it to the previous summary values. This gives me a constant running summary which I can then use in other processes. FWIW, there are about 7 million records in the live data table; updating the summary table with data for 1-week, 1-month, 6-month, 1-year, and lifetime numbers takes about 30-40 seconds total. No cursors involved at all.
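A hedged sketch of that incremental idea applied to the weekday averages discussed in this thread: keep one row per weekday with a running sum and count, and fold in only the rows added since the last run. The summary table, the cut-off variable and all names below are assumptions, not structures from the thread.

-- assumed summary table, maintained elsewhere:
-- CREATE TABLE dbo.WeekdaySummary (WeekdayNo int PRIMARY KEY,
--                                  ReadingSum decimal(18,4), ReadingCount int)

DECLARE @LastSummarised datetime
SET @LastSummarised = '20050101'            -- placeholder for the previous run's cut-off

UPDATE s
SET ReadingSum   = s.ReadingSum + n.SumNew,
    ReadingCount = s.ReadingCount + n.CountNew
FROM dbo.WeekdaySummary AS s
JOIN (SELECT DATEPART(dw, ColumnB) AS WeekdayNo,
             SUM(ColumnC)          AS SumNew,
             COUNT(*)              AS CountNew
      FROM dbo.MyTable
      WHERE ColumnB > @LastSummarised       -- only the rows added since the last run
      GROUP BY DATEPART(dw, ColumnB)) AS n
  ON n.WeekdayNo = s.WeekdayNo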
Perhaps just as valuable as knowing how to plow through 21k records that report the same temperature for six hours might be to consider that, as with Zip and other breakthrough technologies, we sometimes may wish to analyze our use of database space. Simply changing the database to record and insert a new record only when the temperature changes would effectively track all of the same information just as completely as blindly throwing down a new record every second, minute or whatever. Full historical data could be reconstructed or otherwise extrapolated by just recording the changes in temperature when they occurred. Just my 2 cents, from the "less of the trees, more of the forest" perspective.
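A small sketch of what "record only the changes" could look like at insert time, using Bolo’s table. The incoming values, the way the next ColumnA is generated, and the absence of any concurrency handling are all simplifications for illustration.

DECLARE @NewDate datetime, @NewTemp decimal(5,4)
SELECT @NewDate = GETDATE(), @NewTemp = 1.2345   -- placeholder incoming reading

-- insert only if the most recently stored reading has a different temperature
IF NOT EXISTS (SELECT *
               FROM dbo.MyTable
               WHERE ColumnA = (SELECT MAX(ColumnA) FROM dbo.MyTable)
                 AND ColumnC = @NewTemp)
    INSERT INTO dbo.MyTable (ColumnA, ColumnB, ColumnC)
    SELECT ISNULL(MAX(ColumnA), 0) + 1, @NewDate, @NewTemp
    FROM dbo.MyTable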
Good point, Informative. In this case, such "change only" records should also record the datetime of the previous change, otherwise you’ll have to do some heavy processing to determine the number of intervals, which you need to establish the average – if I’m not mistaken.
That solution may introduce its own complexity, depending on the statistics required and the exact table design. It may be feasible, but it is hard to judge without more details about the requirements and the solution. Anyway, the problem fits better in the analytical processing area; maybe it is not a bad idea to start a discussion there. I’m just reading Kimball’s "The Data Warehouse Toolkit", and one of his first recommendations is: do not populate the fact table with zero measure values meaning "nothing happened".
quote: The problem is similar to the running totals problem I wrote an article about. It should be published soon; I’ll post the link here when that happens.
http://www.sql-server-performance.com/mm_cursor_friendly_problem.asp
