Major performance issues on the SAN | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Major performance issues on the SAN

SQL Server 2000 in a clustered environment with 4 physical (8 virtual) processors We’re having major performance issues on our system and I believe that what we’re seeing is an I/O problem because the average disk queue length is 112. Our SAN administration team says that they don’t see a problem can anyone validate our hypothesis? We are, of course attempting to optimize all of our operations, however, operations that are optimized in the development environment on a dual processor machine seem to go backwards in the production environment. Any assistance would be greatly appreciated Thanks,
Meg for reference and notes. HTH Satya SKJ
This posting is provided “AS IS” with no rights for the sake of knowledge sharing.
Thanks for your response and link. Currently the SAN is 500 GB divided into 3 logical drives that reside on the same, single physical drive. Although our data files and log files are on separate logical drives, because of the configuration of the SAN, they are on the same physical drive. I see this as a problem and I’m wondering if anyone agrees. Also, do you see the disk queue length statistic as a problem?
The development environment, is not a good parameter. This problem is constant or a "come and goes" ? Is the data volume in development and production, the same. Remember that there are algorithms, that are exponential in its behavior. You must check all of this. Another question , is the activity read or write ?, you could get this
from the performance monitor.
the disk queue depth by itself is not sufficient. 112 would not be bad if you had 100 physical disk drives, but would be bad if you only had 20. what are your disk latencies, avg disk sec/read & write.
if queue depth is less than 1 per physical disk, this should be less than 10ms for 15K drives and <15ms for 10K drives
also, high queue depth & latency during checkpoints is not a big deal if the checkpoint completes quickly, 10-15 sec for example. it is absolutely imperative for disk intensive DB apps to get separate physical disks for data and logs. it seems most SAN vendors believe their system is sufficiently powerful that they can disregard this, but every time, the DB ends up with absolutely horrible performance. do not partition physical disks into tiny 13GB LUNs. disk latencies need to be low (<15ms for random, <1ms for sequential). Log write latencies absolutely must be <1ms.
checkpoint need to complete quickly, otherwise the DB performance will be poor
Also, you need to get a workbook from your SAN team. What you are saying doesn’t really make sense. What kind of SAN is this? 1. What are the physical drives in the SAN (size, speed, fiber?)
2. What disks compose what RAID groups on the SAN (RAID type and size/disk type and size)?
3. What LUNs are composed of what RAID Groups? What RAID type is used to build LUNs? How are the LUNs divided between RAID groups if this was done? Going along with what joe was saying, what are your individual disk queue length counters on the server. Don’t look at the total. Get it down to the exact drive having long lengths. Is it log, tempdb, data, OS? When combined with the information from above does it constitute a bad queue length? Also, do you have any other issues on the server? What are you buffer cache hit ratio, processor utilization, etc counters looking like? Have you ran Profiler to look for long running queries? Have you looked at locking/blocking to see if you are "perceiving" slow time when actually it’s just a design issue? MeanOldDBA
[email protected] When life gives you a lemon, fire the DBA.
Thanks very much for the infromational posts on my issue. And let me follow up on some of the questions posed in your posts: JoeChang: It is my contention that our SAN is not configured properly. Although we have our data and logs on separate logical drives, it turns out that at the physical level our 500 GB of storage is spread over ONE PHYSICAL DRIVE. The disk queue counters I mentioned are for the single drive with our data files on it. derricklegget: We’ve looked at locking and blocking but there’s nothing out of the ordinary in those statistics.
Hi MegSebastian, as far as I know there is no such thing as a single 500GB hard disk drive, so there must be multiple drives in the single physical RAID set? Cheers
the strongest argument you can make will be based on the disk latencies, as read from the perfmon -> physical disk -> Avg. Disk Read/Write/sec.
if these numbers are not 0.015 (15ms) or less, your disk system will slow down the rest of the system
the san vendor may argue the OS disk latencies don’t reflect the true disk, it does not matter, thats what your application sees.
Agree with Joe and Twan on this though. Look at your counters. If they are good, you have no case. It’s impossible to have a SAN running with a partition running ONE physical drive. They probably mean one physical RAID drive. This is SAN terminology for what some people call a RAID array. An Array in SAN terminology is the entire system basically. Doesn’t that just make your head spin. <img src=’/community/emoticons/emotion-1.gif’ alt=’:)‘ /><br /><br />MeanOldDBA<br />[email protected]<br /><br />When life gives you a lemon, fire the DBA.
Ok guys–here’s the deal. Our disk latency counters were about 50 times the amount specified combined with Disk Queue lengths in the 200’s. Spoke with the SAN team and found out we’re on EMC Symmetrix with our single raid drive spread over 12 physical drives. It turns out that one of the HBA’s was malfunctioning resulting in it having a queue length in the high 200’s (a.k.a. the rat in the drain pipe). They’ve fixed this problem and now the Disk Queue lengths are lower and the latency better but still not in line completely (Disk Queue Length 50) (Disk sec/read .286) I’m having them make sure that the logs and data files are PHYSICALLY located on different disks as you suggested. Thanks for all the help!!
if you have 12 physical drives for data, with logs somewhere else, then avg disk queue (ignoring checkpoints) can be as much as 24,
checkpoints should complete as quickly as possible (15sec?)
but the avg disk sec/read needs to be in the 0.15 range (except during checkpoints)
the log disk latency should be <= 1ms
If they have a single RAID drive on 12 drives, please tell me it’s RAID 10. ????? If it’s RAID 5, they need to all be shot. (Is that even possible?) <img src=’/community/emoticons/emotion-1.gif’ alt=’:)‘ /> See if you can isolate the RAID arrays to their own SP on the SAN and see if that solves your issues. I really think putting the log files on their own RAID 10 array should solve a lot of your issues.<br /><br />MeanOldDBA<br />[email protected]<br /><br />When life gives you a lemon, fire the DBA.
Now for the latest on the SAN front. We are working with our SAN vendor to redo our implementation–especially now that we have learned our data will double in the current year and our user community grow by 3 times. Our database has several fairly large tables in it (over 10 million rows). Currently they are packed into a single data file and my plan is to spread this out over several files using more logical devices and of course as many underlying physical devices as required. We have the option of switching from a Symmetrix to a Clarion SAN and I’m wondering if anyone here has an opinion about that.
uhhh, the Clariion is a downgrade. I would definitely have an opinion about that. If you are going to a much newer model, the Clariion will outperform the Symmetrix. If it’s the same "generation era" though, you’re taking a step backwards. Of course, if the current Symmetrix is overloaded, and they’re going to let you have a dedicated Clariion just for your purposes, you would be looking at an upgrade.<br /><br />I would suggest going to and looking at the product information. They have a lot of good information, as they want to sell a lot of product. [<img src=’/community/emoticons/emotion-1.gif’ alt=’:)‘ />]<br /><br />MeanOldDBA<br />[email protected]<br /><br />When life gives you a lemon, fire the DBA.
the Symmetrix is old, so if you are buying a new Clarion, that should be better than the old Clarion,
i would recommend not letting the factory rep blindly configuration the standard one size fits all configuration, that typically has the worse possible performance characteristics for a DB server.
set random and sequential IO requirements for both read & writes, then insist that both the entire system and the actual configuration will meet the requirements