SQL Server Performance

Execution Plan Behaving Strangely

Discussion in 'SQL Server 2005 Performance Tuning for DBAs' started by Zippy, Aug 31, 2007.

  1. Zippy New Member

    I've been working on a problem for the last two weeks: about twice a day, one page in our application that usually takes about half a second can take up to 6 seconds. It's a .NET web app using connection pooling with SQL Server authentication.
    I've been watching syscacheobjects, and the query has a single execution plan whose usecounts increases with each call, at about 900 calls per day. The usecounts does not reset when I execute the stored procedure WITH RECOMPILE. Not sure that's right, but it may be a clue.
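    For reference, this is roughly how I'm watching the cache (the filter and ordering are just how I surface the heavily reused proc plans):
    SELECT cacheobjtype, objtype, usecounts, sql
    FROM sys.syscacheobjects
    WHERE objtype = 'Proc'
    ORDER BY usecounts DESC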
    Using SET STATISTICS IO ON, I see something strange. When the query runs normally, the results look like this:
    Table 'SystemDefault'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Workflow'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Person'. Scan count 1, logical reads 3, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Message'. Scan count 5, logical reads 29, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'WorkflowStep'. Scan count 0, logical reads 124, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'WorkflowStepRole'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Order'. Scan count 2, logical reads 2058, physical reads 0, read-ahead reads 57, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Request'. Scan count 2, logical reads 1663, physical reads 0, read-ahead reads 184, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Organization'. Scan count 2, logical reads 47, physical reads 0, read-ahead reads 44, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Attachment'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 485, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Service'. Scan count 4, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    When the query is misbehaving, the results look like this:
    Table 'SystemDefault'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Workflow'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Person'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Message'. Scan count 5, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Service'. Scan count 10, logical reads 20, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Attachment'. Scan count 5, logical reads 25, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Organization'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'WorkflowStep'. Scan count 0, logical reads 188220, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'WorkflowStepRole'. Scan count 295732, logical reads 1182928, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Order'. Scan count 73332, logical reads 1034504, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Request'. Scan count 2, logical reads 110370, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    The scan counts and logical reads go crazy on the last four tables. I've also seen them go crazy on just the WorkflowStep table while Order and Request stay normal.
    WorkflowStep is a table with 31 rows and an average row size of 234 bytes. Can you have a scan count of 0 with 188,220 logical reads?
    WorkflowStepRole is a table with 279 rows. Scanning it end to end takes 4 to 5 reads, which is consistent both with the results when the query is behaving and with the table/clustered index size reported by SQL. The numbers above appear to suggest it scanned those 4 pages 295,732 times.
    The Order table has 149,100 rows, an average row size of 604 bytes, and uses 11,784 pages. The Request table has 36,669 rows, an average row size of 308 bytes, and uses 1,528 pages.
    When things are going well, Profiler shows about 600 CPU, 4,000 reads, and 700 ms duration. When things are going badly, Profiler shows 8,300 CPU, 2,510,000 reads, and 8,000 ms duration.
    The execution plan is not changing: the single row in syscacheobjects, with its climbing usecounts, shows the same execution plan being reused. The issue appears to be within the execution plan itself.
    Once it starts, it does not clear itself until I do something. The problem can be cleared with DBCC FREEPROCCACHE or by updating statistics on one or more tables. DBCC FREEPROCCACHE will obviously force a recompile, but updating statistics does not seem to alter syscacheobjects.usecounts. Executing the stored procedure WITH RECOMPILE does not seem to alter usecounts either, and I would expect it to reset. Perhaps it works out the plan, decides it's the same as the one already cached, and so does not update it.
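    Concretely, either of these clears it (the dbo schema is my assumption; WorkflowStep is just one of the affected tables):
    DBCC FREEPROCCACHE;                    -- flushes every cached plan on the server
    UPDATE STATISTICS dbo.WorkflowStep;    -- refreshing stats on one of the hot tables also clears it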
    The issue does not appear at any specific time of day, so I don't think it relates to a busy server, although I usually find it first thing in the morning after the server has been idle for some time. Statistics are updated every night at 1 a.m., and I can come across the problem an hour later. Auto Create and Auto Update Statistics are both turned on.
    I'm not able to recreate the problem, but I've had Profiler running and seen the query run perfectly at 2:03:03 and go south at 2:03:17. I've reviewed every event log and SQL log I could find, and there is nothing crazy going on. The server is often practically idle when it happens, although it will happen during busy times as well.
    SQL Server 2005 64-bit running on Windows 2003, 4 processors, 12 GB RAM. The database was originally a SQL 2000 database and was moved to SQL 2005, running happily for several months, with this issue arising only recently. It seems to be only this one stored procedure, although it could be others. This is one of the most frequently used queries and also the one that has to work the hardest.
    Anyone seen this type of activity before? Any suggestions on where to look next?
  2. satya Moderator

  3. ndinakar Member

    Do you have any dynamic SQL in the proc? Or do you see any pattern where the proc goes south for a particular set of parameters and behaves well otherwise?
    Sometimes it can be parameter sniffing. Do you have a lot of IFs in the proc?
  4. Zippy New Member

    I did find something today that may be the problem. The query uses a CTE (WITH ResultSet) to support front-end pagination and usually returns up to 25 rows. A couple of parameters, @StartRecord and @PageSize, specify which page of data to return.
    exec pOrderWorklist_managementcompany @SortBy=NULL, @Vendor=NULL, @ProductType=NULL, @Status='Created', @DateFrom=NULL, @DateTo=NULL, @State=NULL, @Zip='%', @County=NULL, @Region=NULL, @OrderIDs=NULL, @StartRecord='0', @PageSize='25'
    Another part of the application calls the stored procedure without those parameters but with a highly selective filter, @RequestID, which will normally return between 0 and 15 rows:
    exec pOrderWorklist_managementcompany @RequestID=1234, @SortBy=NULL
    The stored procedure goes something like this...
    WITH ResultSet AS (
        SELECT ..., ROW_NUMBER() OVER (ORDER BY ...) AS RowNumber
        FROM ... (inner/outer joins) ...
        WHERE ...
    )
    SELECT *, (SELECT COUNT(*) FROM ResultSet) AS RecordCount
    FROM ResultSet
    WHERE (ResultSet.RowNumber >= @FirstItem AND ResultSet.RowNumber <= @LastItem)
       OR (@StartRecord IS NULL AND @PageSize IS NULL)
    ORDER BY RowNumber
    The bit of code that deals with @StartRecord and @PageSize being NULL is only ever used when the second type of call is made, with @RequestID and not much else.
    I noticed on a couple of the Profiler traces that the @RequestID call was used shortly before the other query went south. I'm wondering if the optimizer assumed that all calls would probably need all the rows, or that all calls were likely to pass @RequestID. This, I think, is what you call parameter sniffing.
    Both parts of the application use pagination, so this section of code is not needed. I think it was used in the past for an export-to-Excel feature where all the rows are required, but that functionality uses its own stored procedure now.
    I've commented it out; let's see what happens. The query has already got a little quicker, as I suspect it was trying to predict whether the user wanted all of the results or just a subset. Now that I've told it the answer will always be a subset, I'm hoping it will behave and not try to second-guess me. I'll let you know how it goes.
    Thanks for your contributions.
  5. ndinakar Member

    [quote user="Zippy"]I noticed on a couple of the Profiler traces that the @RequestID call was used shortly before the other query went south. I'm wondering if the optimizer assumed that all calls would probably need all the rows, or that all calls were likely to pass @RequestID. This, I think, is what you call parameter sniffing.[/quote]
    Yes, SQL Server sniffs the parameter values of the first call and builds the plan around them, which can end up wrecking the query plan for later calls with different values. This happens mostly when you have a lot of IF branches: SQL Server compiles a plan for each statement using those sniffed values, and the plan "it thinks" is best may be wrong for everyone else. By changing the code - reducing the IFs, using query hints, or changing the way the tables are joined - you can avoid this parameter sniffing.
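    For example, something along these lines (the proc and table here are made up, but OPTIMIZE FOR and OPTION (RECOMPILE) are both available in 2005):
    -- hypothetical paging proc: pin the plan to the typical paging values
    -- instead of whatever values happen to be sniffed on the first call
    CREATE PROCEDURE dbo.pGetWorklistPage
        @StartRecord INT,
        @PageSize INT
    AS
    SELECT w.OrderID, w.RowNumber
    FROM dbo.Worklist w    -- made-up table
    WHERE w.RowNumber BETWEEN @StartRecord AND @StartRecord + @PageSize
    OPTION (OPTIMIZE FOR (@StartRecord = 0, @PageSize = 25));
    -- alternatively, OPTION (RECOMPILE) compiles a fresh plan for the
    -- statement on every call, trading CPU for plan stability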
    And what you did was a good catch, too. That's exactly how you need to debug an issue. Good luck.
  6. Zippy New Member

    I thought I had it sorted by altering the SQL, but this morning when I checked Profiler I saw huge reads again on the same 3 tables, and repeating the query with SET STATISTICS IO shows the same kind of result as before. One problem with this query is that it's everything to everyone. There are 9 parts to the WHERE clause, and they are not all exclusive: although the query will usually only have to work with one, users can occasionally select filters that affect more than one of the WHERE expressions. The @SortBy has 36 variations (18 columns, each ASC and DESC). I've already split this query up based on user group and simplified it as much as I can to reduce confusion for the optimizer. Looks like I'll have to do a little more.
    What's weird about this is that it's not on recompile that it goes south; it works fine 99% of the time. When it goes south and I UPDATE STATISTICS on any of the tables showing the high logical reads, the query usually sorts itself out. syscacheobjects.usecounts is not reset, suggesting that the execution plan is not rebuilt. Somehow, within the compiled plan, it decides to do something differently. I thought the whole idea of a compiled plan was to work out once what the best joins are and then reuse them whenever the plan is reused. If statistics get out of date, the optimizer may compile a new execution plan, but that's not happening in my case. The execution plan does not get updated.
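    For what it's worth, I understand sp_recompile can also mark the proc so that its next execution compiles a fresh plan; something like:
    EXEC sp_recompile 'pOrderWorklist_managementcompany';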
  7. josj New Member

    Just for fun, if possible, have you tried replacing the CTE with an old-fashioned temp table? I'm not sure how SQL Server maintains statistics for CTEs. Maybe someone else knows?
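    Something like this is what I mean (the column names are made up, just to show the shape):
    -- materialize the numbered result set into a temp table first
    SELECT o.OrderID,
           ROW_NUMBER() OVER (ORDER BY o.CreatedDate) AS RowNumber
    INTO #ResultSet
    FROM dbo.[Order] o;    -- CreatedDate is a made-up column
    -- then page out of the temp table, which has real statistics
    SELECT *, (SELECT COUNT(*) FROM #ResultSet) AS RecordCount
    FROM #ResultSet
    WHERE RowNumber BETWEEN @FirstItem AND @LastItem
    ORDER BY RowNumber;
    DROP TABLE #ResultSet;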
  8. satya Moderator

    Have you cleared the cache before executing this process?
  9. Zippy New Member

    If I clear the cache or update statistics, the query runs fine. It's usually after the plan has been created and used for several hours that it decides to slow down.
    I tried the temp table, but the initial result set can be thousands of rows, and writing that to a temp table and then selecting 25 rows from it uses a lot of resources. When this query runs well, it runs very well: less than 500 ms and about 4,000 reads. When it decides to go south, it's up to about 8 seconds and about 1,600,000 reads showing up in Profiler.
  10. satya Moderator

    Then you should refresh the statistics intermittently using UPDATE STATISTICS (say once every 2 days, or depending upon your data load).
    In that case, if tempdb is optimized enough to cope, there is no issue with using the temp table approach.
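    For example, a scheduled job along these lines (sp_updatestats is the built-in that refreshes statistics across the current database):
    -- run nightly or every couple of days from a SQL Agent job;
    -- updates statistics on tables that have had row modifications
    EXEC sp_updatestats;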
  11. Zippy New Member

    Statistics are updated every night at 1 a.m., and I've seen this go bad just one hour after that update. The current plan of attack is a combination of splitting the query out into separate queries as far as possible and adding hints to use specific indexes.
    What's interesting/confusing is how the exact same execution plan can run very well or very poorly. It does not recompile a new plan and then go bad; it uses the same execution plan, and something within that plan makes it take a slightly different route.
  12. satya Moderator

  13. Adriaan New Member

    Did someone already mention parameter sniffing? Try copying the parameter values into local variables, and use the local variables throughout the code.
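    A rough sketch of that pattern (the body here is hypothetical; only the proc and parameter names come from this thread):
    ALTER PROCEDURE dbo.pOrderWorklist_managementcompany
        @StartRecord INT = NULL,
        @PageSize INT = NULL
        -- ...remaining parameters unchanged
    AS
    BEGIN
        -- copy the parameters into locals; the optimizer cannot sniff the
        -- values of local variables at compile time, so it plans for
        -- average selectivity instead of the first caller's values
        DECLARE @lStartRecord INT, @lPageSize INT;
        SELECT @lStartRecord = @StartRecord, @lPageSize = @PageSize;
        -- ...then reference @lStartRecord / @lPageSize in the query body
    END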
