SQL Server Performance Forum – Threads Archive

Estimated cost heuristic

Hi guys, Do you use a rule of thumb for the estimated cost from the estimated execution path of a SQL query, when trying to review it ? I’m seeing numbers between 0.00328 to (absolute worst case) 445. My gut-feel is to use 1.0, and say that anything above that goes into the "definitely needs work" pile. Ideally, one would do all, but realistically, it is a DB with 700 + tables, >5000 SQL that have never been QA’d, against a 160 GB database. You input would be appreciated. CiaO

the plan cost in the execution plan is not a reliable indicator of actual cost,
use profiler to capture the query CPU cost.
the plan cost can be used to find parallel plans, for which one needs to make the decision on whether to leave it alone, or restrict the use of a parallel plan rather than go on query cost alone, you weight the query cost with the number of calls to that query, then focus your efforts on the query with the most impact,
also, if there are high number of calls to query with 0 cpu cost as reported in profiler, that needs to be looked separately, i would consider weighting the 0 cpu cost as 5-7, unless more info is available

quote:
rather than go on query cost alone, you weight the query cost with the number of calls to that query, then focus your efforts on the query with the most impact,
also, if there are high number of calls to query with 0 cpu cost as reported in profiler, that needs to be looked separately, i would consider weighting the 0 cpu cost as 5-7, unless more info is available

Ideally, I would definitely like the trace the production environment, then use duration and CPU cost to get my "hit list". but one of our "funs" is a very suspect HBA environment, where we get significant IO queue spikes. Given that, there is no way they will let me profile production ( they have REAL pains there). I am trying to get a dedicated test environment, but getting them to give me 300 GB on a server (backup and database), is an issue. I was starting to troll through the stuff in the meantime, and work-out my initial hit list. That was where I was trying to use the estimated execution path – I can generate those frlom production w/o impacting the production server. When I can, then I will try and go the route you outlined. CiaO

the io queue spike is possibly the checkpoint, typically ocurring every 2 min.
its going to happen anyways and is not impacted by profiler.
you can set the recovery interval to 5-10min to reduce the frequency of checkpoint see BOL for "recovery interval option"
"Keep recovery interval set at 0 (self-configuring) unless you notice that checkpoints are impairing performance because they are occurring too frequently. If this is the case, try increasing the value in small increments." profiler is a light load even on a production server that is heavily loaded.
just so long as you save the trace to a file
IMPORTANT: DO NOT save to a SQL Table on the production server. otherwise, you are wroking blind, and could be tuning many complex queries that have little overall impact on the production server

We’ve looked at the checkpointing for Queue lengths. If only it were that. We can cause, at will, on a stand-alone data server, with no other users, queue lengths that jump from 0 to 400+, and stay there. *sigh* We are trying to prove that it is San related. We have cache hit ratio on the Storage Array of between 1 and 10%, when they should be 80% + All of that is extraneous to this specific issue tho … hmmm … quite possibly I am trying to drag too much out of SQL Profiler. I usually take pretty much all events and data objects, so that I don’t miss anything. What I really should do, is take the time to learn more about profiler, and strip that selection down! I usually write to file, so that I can take it elsewhere. Is there a good article/link on proilfer usage for performance tracking and tuning hereabouts ? Then again, perhaps I should just try and find the time somewhere to BOL and hit the Manuals. Maybe I can get by with 3 hours sleep ! *grin*

Seehttp://www.sql-server-performance.com/sql_server_profiler_tips.asp HTH. Gaurav
Moderator
Man thrives, oddly enough, only in the presence of a challenging environment- L. Ron Hubbard

Thanx Gaurav. How embarrassed am I to find the Duration template !?! That should help.

you should also add the cpu and reads columns to the duration template
if disk queue length jumps to 400, what is avg sec/write, writes/sec, avg sec/read & read/sec
disk queue spikes by itself in not a big deal if it clears up quickly, a surge that does not clear up indicates inadequate disk throughput, commonly a result of:
1) using RAID5 instead of RAID1+0, or even worse RAID5 with advanced data guard (extra parity)
2) insufficient spindle count, common mistake is buying large capacity drives, instead of many smaller 18 or 36GB 15K drives
3) tempdb log if you have a query with plan of cost 445, it may involve hash joins, and other operations that spool to tempdb, in which case the tempdb should be on a separate drive as well as the main app log drive. Note: a 445 cost plan should generate large queue spikes, otherwise it wouldn’t cost that much
also, hyper-threading should be disabled for parallel plans

Joe, I’ll look at raise the IO queue issue as a seperate post. Unfortunately, they don’t clear up quickly – we can see at average over 100 for up to a minute, or longer. We don’t use local drives : we have SAN connected storage via Fibre Op cards. The theoretical numbers of spindles, bearing in mind that Queueing is basically what OS reports in terms of number of IO’s waiting to be processed, should be 2, since it hits a very large (8GB) cache. Due to the very low cache hit ratio on that, SAN is repeatedly having to run around, find data, load it, to resolve each IO request. Even the EMC guys are currently struggling to understand how we cna have such bad Cache hit ratio on the SAN. Back to this thread: *grin* The 445 query is : SELECT Distinct PlanNo AS KeyValue, PlanNo as SortValue
FROM PlanBasedTxnTrigger –34 mil rows
WHERE PBTxnTriggerStatusCode = 0 /* Pending */
AND PBTxnTriggerTargetDate <= (SELECT RunDate FROM Run where Runid = 1)[email protected]) –3192 rows
ORDER BY PlanNo This does a Index seek on Run, Bookmark lookup, "parallelism broadcast", Index scan on 34 Mil table (*frown*)(87% of total cost), sort of scan (12% of total cost), Then a nested loop/inner join to combine them, then a Stream aggreagate, Parallelism/reparation streams, Stream Aggregate, Parallelism/Gather streams. Re-wrote query as join, no difference. Need to improve the 34 Mil access – it is (87+12) 99% of the cost. But that is just that example. excuse me while I apply nose back to grindstone. CiaO

emc has a habit of recommending large caches on the disk controller, disks in RAID5 and a low spindle count, so they frequently run into the high persistent disk queue problem.
on a database server, the system memory is the data cache, hence large disk controller caches just gets in the way.
try setting the disk cache to write cache only, no read caching. on your query, assuminng PBTxnTriggerStatusCode is not a bit type,
PlanBasedTxnTrigger is indexed on:
PBTxnTriggerStatusCode, PlanNo, PBTxnTriggerTargetDate, this should get rid of the scan, and sort
also consider moving the subquery:
SELECT RunDate FROM Run where Runid = 1)[email protected]
out, as only the max RunDate is applicable?
ie, SELECT @MaxRunDate =MAX(RunDate) FROM …

Minor improvement after adding your index. Cost from 484 to 5.55 *grin* Worked like a charm.

]]>