SQL Server Performance Forum – Threads Archive

Write Disk Queue Length over 200

Hello, We have a strange problem and I can not figure out what is causing it. About every 20 min (sometimes 25 sometimes 15 min) we have a performance problem. The Write Disk Queue Length exceeds 200 at that time for about 30 sec. We can however not find any query that is writing more than normal during that time (checked with profiler). All other counters are fine, the Buffer Cache Hit Ration is over 99.6% The machine we are using is a Dell Poweredge 4 CPU; Raid 10 for the data and Raid 1 for the transaction log. What can cause this enormous queue length?

Do you have autosrink set to on? What are you database file settings?

autosrink is off. The database file settings are set to automatically grow file and the recovery model to bulk-logged. The database files have plenty of room (more than 10% unused)

Are you running transaction log backups? Do you have sepearte disks? If so, what disk is having the write queue lengths when this happens (transaction log file disk, data file disk, backup disk, etc)? It could be your transaction log file growing during this time. What kind of a disk system is this? Do you have any jobs that run every 30 minutes? It sounds like you already checked this. [<img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ />] MeanOldDBA [email protected] When life gives you a lemon, fire the DBA.

We run a transaction log backup every 12 hours and a database backup every 24 hours. The transaction log is on a seperate disksystem (RAID1) and the transaction log has at least 50% free space. The transaction log is not growing during this period. The queue length of this disk is also high during this period (about 12) but not nearly as high as the queuel length of the data drives.
We indeed do not have a job that runs every 30 min. The time period between the high queue length is also not exactly every 20 min. Sometimes it is 17 min and then it is 23 min. the period it is this high is always about 30 sec.

You don’t have VirusScan software on this server do you? If you do have it, have you verified that the SQL Server Program Files, .bak, .mdf, and .ldf files are excluded? MeanOldDBA
[email protected] When life gives you a lemon, fire the DBA.

No there is no virusscan software installed on the machine

Well, then the only thing I can recommend is to run a complete trace during this time. Write the trace to a text file so you don’t overburden the server. Capture an hour of everything. We have solved some pretty "impossible" situations doing this in the past. It produces an hour or so of slowness, but you usually get the visibility you need to identify the problem. Make sure you track reads/writes/cpu/recompiles/deadlocks/stored procs/rpc. If you have inline SQL, it gets a lot more difficult, because you have to trace statement level. MeanOldDBA
[email protected] When life gives you a lemon, fire the DBA.

We did this for reads, writes, duration and cpu but not yet for recompiles and deadlocks. However the queries in this period were very normal and don’t have a performance problem. Write queries were hardly found in the log. I will however do an extra check on deadlocks but I can hardly imagine this is the problem while it occurs so reguraly

What’s your recovery time set to on the SQL Server? I’m just wondering, with the timeframes you’re giving me, if it could be the recovery interval. You might want to set it to 5 minutes and see if that changes this behaviour. I have seen this cause this type of issue before, but it’s usually on disks without a very good backplane or cache system. Setting it to 5 minutes will override the 0 setting of automatic configuration. MeanOldDBA
[email protected] When life gives you a lemon, fire the DBA.

I’m not familiar with this. We did not change any recovery time so it will probably be automatic configuration. Can you explain how to set the recovery time and how I can see what the current configuration is?

EXEC sp_configure ‘show advanced options’,’1′ –You need to run this so the options are visible.
GO
RECONFIGURE
GO
EXEC sp_configure ‘recovery interval (min)’ –Shows the current setting.
GO
EXEC sp_configure ‘recovery interval (min)’,’5′ –Sets the value to 5 minutes.
GO
RECONFIGURE
GO
EXEC sp_configure ‘recovery interval (min)’ –Shows the current setting.
GO
MeanOldDBA
[email protected] When life gives you a lemon, fire the DBA.

Thanks. Does this mean you think the database checkpoints every +/-20 min? Is this something we can monitor for instance in perfMon?

The database recovery as you indicated is in bulk Logged mode. Are you doing any bulk logged operations during the time disk queue lenght shoots up. Also the disk queue length is it because of SQL Server? Check for following in perfmon: Under Performance Object:
Physical Disk:
Average Disk Read Queue length
Average Disk Write Queue Length
Current Disk Queue Length
Average Disk Queue Length
Average Disk sec/Read
Average Disk Sec/ Write
Average Disk Sec / Transfer Select Process
Counters from list select
Page Faults / Sec for Total and SQLServr
If maximum Page Faults are because of SQL Server then may it is related to SQL Server.

You cannot track it in perfmon or profiler. I don’t know if your database is checkpointing ever ~20 minutes. I have seen it cause this issue before though on a slow disk subsystem, so I would check it out as a possibility. MeanOldDBA
[email protected] When life gives you a lemon, fire the DBA.

Gurucb, When the queue length is this high, the process who is causing it is sqlsrv. The Disk writes/sec are at that moment much higher than normal (around 500 where normal is around 25) The Disk sec/Transfer is also much higher (0.23 instead of 0.005 at normal operation) The page Faults are not different at this moment from the normal operation. The total page faults are alway much higher than the sql server page faults. Around 3x as high.

everything described so far indicates this is a checkpoint,
if it takes 30 sec at 500 writes per sec (30×500=15,000),
and you want the checkpoint to be over in 5 sec,
then you need a disk system that can handle 3000 writes/sec.
the alternative is to determine if the number of dirty pages can be reduced,
a better cluster key and index strategy can do this.
i am somewaht puzzled that the checkpoint occur at 20min intervals instead of the more common 2-3 min

I have monitored the counter SQLServer:Buffer ManagerCheckpoint pages/sec. This value is continously 0 till the problem arises. Then the value is 626. Does this mean the database only checkpoints every +/- 20 min? If so changing the recovery interval will solve our problem?
The indexes are set to fullfill the query needs. The database has not so much updates or insert compared to selects. For this purposes the indexes are performing fine however I will check if dirty pages can be reduced.

changing the recovery interval does not solve any problems, it only changes the frequency of occurences of the checkpoints, so would you like a checkpoint to occur every 20min lasting for 30sec or every 10min lasting for 15sec. the index efficiency strategy is not an easy thing to do, it is something i have for a number of people.
the easier solution is to get more disk drives, 500 writes/sec is kind of low for a database server,
if you get 1 rack of 14 drives (15K) in RAID 10 (2 SCSI channels), that should be 1000-2000 writes/sec

Agree with Joe on the drives. I will disagree (a very, very little) with him on the recovery interval though. The amount you would be checkpointing with smaller intervals would be less. If you can keep from exceeding the maximum IO capacity of the disks during thie process, you should see an overall improvement in performance. It’s basically a band-aid until you can either seriously tune the system or add more disks (the cheaper solution in this case usually). Either way, you have nothing to lose at this point by trying it. It sounds like you’ve exceeded the capacity of your disks. If the disks are on a batter backup, have you checked to see if the write cache is enabled? MeanOldDBA
derricklegg[email protected] When life gives you a lemon, fire the DBA.

what i don’t understand is why the checkpoints occur at 20 min intervals,
it should be 2-3min.
because of the way SQL 7 & 2000 does checkpoints, dumping all dirty pages into the queue,
i think you will notice the checkpoint performance impact at any interval unless you have a solid disk system.
i have also heard of people extending the checkpoint out, forcing SQL to evict dirty pages without a checkpoint,
which does avod this problem, but i think this is a dangerous unless one is absolutely confident of the reliability of their disk system.
SQL 2005 does fix this problem

Today I had the change to double check the configuration. It appeared that the recovery interval was set to 5 without my knowledge. This explains the 20 min interval I think. I have set it back to 0 and now it occurs much reguraly. The period is also shorter as expected. However the queue length is still very high at this moment.
The disks we use has an External (I/O) Transfer Rate (max) of 320 MBytes/sec . We use 4 disks (RAID 10) which should be able to handle 4×320 MBytes/sec = 1280 Mbytes/sec according the specs. How many writes/sec should this configuration be able to handle?

forget the MB/sec,
each disk can do 200-300 random IOs at high queue depth, but divide by 2 for RAID 10, so 500 writes/sec is about for 4 disks, refer to my previous suggestion on going to 14 disks for data

At the moment the number of writes/sec is mostly between 200 and 300 when the queue length goes to 200.
If I understand you correct and 4 disks should be able to handle about 500 writes/sec we should not have a problem and the queue length should not rise this much am I right? Would this indicate there can be a hardware problem with e.g. the backplane?

You must be careful when adjusting the checkpoint interval. The checkpoint process is optimized to flush large amount of pages to disk. If the recovery interval is increased to much and dirty pages need to be freed the lazy writer will perform the flush of the page. The lazy writer is not optimized like the checkpoint process so if you adjust it make sure you watch lazy writer pages/sec counter. As far as 300 I/Os and should your disk handle itâ€¦well it depends. What is the RATIO of reads to writes. For RAID10 you can use the following formula Reads + (2 x Writes) = PhysicalIO Using your max of 300 IO/s lets say 30% are writes 210 + ( 2 x 90 ) = 390 During checkpoint the Ratio of writes I am sure is much higherâ€¦â€¦ Basically you need some more disksâ€¦. Bert

whether you have a problem or not depends on whether you think the checkpoint cause meaningful performance degradation.
if it does, then you should get more disks. let me see if i understand this correctly:
you have 4 SCSI disks for data
each is on a separate SCSI channel?

The checkpoint does cause meaningful performance degradation. The database is used for an active website. During the checkpoint the site is hardly available and user actions on the site has real bad performance at that time.
We have 1 RAID 10 configuration with 4 SCSI Disks + 1 hot spare on one SCSI channel. The transaction log is on a RAID 1 configuration with 2 SCSI drives on a seperate SCSI channel.

Do you have long running transactions? Recovery model is set to bulk logged, do you have regullar bulk inserts?

No we don’t have bulk inserts regulary and also no long running transactions. Al inserts and updates affects basically one row per statement/transaction (stored procedure).

ok, it looks like suggested hardware improvement is the only simple way to solve the problem.

Is there a possibility to reduce the dirty pages? I found one update statement that executes reguraly (more than 100 times in 5 min.) that results in a Clustered Index Update. I can not figure out why this happens but does this create many dirty pages?

scsi is a shread bus, if you have 4 drives on one bus, the bandwidth of that bus remains 320 not 4×320, but this is irrelevent anyways because the io issue here is random io capability, not sequential
the index strategy is a complicated matter that is not really documented anywhere

Ok, Thanks all for your input. I now understand the process what is causing our I/O problem and will go out for some more disks.

Marnix, there are still a couple bases we haven’t covered: 1. Which disk specifically is the queue length occurring on?
2. Is the write cache enabled on your controllers?
3. What type of controller and server are we talking about? Just out of curiousity, I don’t suppose these are old Compaq disk arrays are they? MeanOldDBA
[email protected] When life gives you a lemon, fire the DBA.

Derric, sorry for the late reply I forgot checking this forum topic. The queue length is occuring on the RAID 10 drive where the data resides (not the logs). Write cache is not enabled and is set on direct I/O. The machine is a complete new Dell PowerEdge 6600 with a PERC 4/DC dual channel RAID controller with 128 Mb cache.

You definately want to make sure the firmware/driver for the PERC is at the very newest; I was working on a performance issue the other day, with a PERC4 and a Powervault 220, and saw a drastic improvement in io performance after upgrading to the latest firmware and driver. I have also heard people suggest using the LSI firmware for a performance boost, but I definately would’nt do that on a production box. I believe 516a is the newest Dell firmware. Give that a shot before you buy more disks.

]]>