
SQL Server Performance Forum – Threads Archive

Serious Performance Issues with Compaq SAN and SQL

I work for a large company that has an OLTP SQL Server cluster that I am solely responsible for. This environment has about 150 databases serving different divisions within the organization. About 9 months ago, it was brought online using a Compaq StorageWorks Enterprise Virtual Array. The array is made up of 72 disks in a RAID-5 configuration for random read/write data and 18 disks in a RAID-10 configuration for sequential activity only (i.e., logs). I have been experiencing a SQL Server/disk I/O problem for 10 weeks now and still there is no resolution from HP/Compaq on the issue.

Here is a description of the problem. Those familiar with Windows 2000 internals know that when disk I/Os occur, they are handed off to the disk controller(s) to be serviced. What is happening is that whenever writes occur to the data segment and are handed off to the HBAs (Host Bus Adapters), the outstanding write queues go through the roof. I have seen outstanding write queue lengths of more than 170 sustained at times. This also shows up as the disks being 100% utilized according to Performance Monitor, but not according to the SAN monitoring tools.

The facts we know about our issue are:
1. The fabric switch ports are not over-utilized.
2. Server-side resources such as memory, CPU, and network are not breaking a sweat.
3. The disks on the SAN are serving other customers without any speed issues.
4. The fibre has been replaced for both HBAs in the server(s).
5. The latest firmware is in effect on the Compaq EVA.
6. Locally attached disks on the same SQL Server(s) do not reproduce the problem; only SAN disks do.
7. The problem is reproducible on any SQL Server that we present SAN-attached storage to.

We have gone through just about every configuration change, from the HBAs to the SAN switches themselves, to try and correct the issue, but none have worked. I have spun my wheels trying to convince HP/Compaq that the problem is real and that users are being impacted by it as perceived "slow response times" on their end. Any insight from experienced SQL Server DBAs/system engineers as to what could be causing the bottlenecks on the SAN would be greatly appreciated.
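One way to put numbers on this from the SQL Server side, rather than from Perfmon, is the fn_virtualfilestats function that ships with SQL Server 2000; it reports cumulative I/O counts and stall time per database file. A minimal sketch, assuming the query is run from a database that lives on the affected SAN LUN and that file id 1 is its primary data file:

-- Cumulative I/O statistics for one file of the current database (SQL Server 2000).
-- IoStallMS is the total milliseconds that I/O requests against this file have
-- spent waiting; a large value relative to NumberWrites points at the same
-- write-queue problem the counters are showing.
DECLARE @dbid int
SET @dbid = DB_ID()                      -- run this from the database under test

SELECT  DbId,
        FileId,
        NumberReads,
        NumberWrites,
        BytesWritten,
        IoStallMS
FROM    ::fn_virtualfilestats(@dbid, 1)  -- second argument = file id

Sampling this before and after a busy period and differencing the values gives a per-file view of where the stalls are accumulating.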

The issue to me is the RAID-5 choice, especially if you're experiencing poor write performance: RAID-5 is the worst-performing I/O configuration for writes. The ideal solution would be to switch to RAID-10, but I doubt that could happen. Can you add more write cache to the subsystem? Perhaps that could help cushion some of the writes. Or could you dedicate some of the read cache to write caching instead?
The only other thing I could suggest is to make sure your data files are spread over as many drives as possible, the indexes are "refreshed" by reindexing, and the data files are not auto-growing during execution.
There is a DBCC SQLPerf(waitstats) command that will tell you if you're having I/O problems in SQL Server. The values are cumulative, so run DBCC SQLPerf(waitstats, clear) first, then let your system run for a few minutes before running it again. If there is a lot of page latching, or latching in general, try to make sure the files that are getting latched on are spread over all the drives.
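For reference, the clear-then-sample pattern described above looks roughly like this; the five-minute delay is an arbitrary choice:

-- Reset the cumulative wait statistics, let a representative workload run, then sample.
DBCC SQLPERF(WAITSTATS, CLEAR)
WAITFOR DELAY '00:05:00'
DBCC SQLPERF(WAITSTATS)   -- high PAGEIOLATCH_* or WRITELOG waits point at disk I/O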
I have just posted something related to write performance and our IBM fast array. Apparently there is a bug introduced in Windows 2000 SP3 that causes it to issue write commands with the ForceUnitAccess bit set on every write. This bypasses your controller's cache and forces it to synchronize everything in cache. The article and hotfix can be read at: http://support.microsoft.com/default.aspx?scid=kb;en-us;332023 I agree RAID 5 is slowest on write activity, but if you have just upgraded to Windows SP3 and the problem has only just come up, this could be affecting you. Chad
On some occasions, I have seen SAN devices produce odd counter responses like you describe. In the cases I have seen, it appears that the way the counters work is not compatible with the SAN. I don't know if this applies to your case or not. Other than using counters, have you been able to perform any performance testing to establish that SQL Server is really experiencing a performance problem? If so, what did you do and what were the results?
—————————–
Brad M. McGehee, MVP
Webmaster
SQL-Server-Performance.Com
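To follow up on Brad's question, one crude way to demonstrate the problem outside of the counters is to time an identical batch of small writes against a database whose files sit on the SAN LUN and again against one on locally attached disk. This is only a throwaway sketch (the table name and row count are arbitrary), and because each insert commits on its own it mostly exercises log writes:

-- Run once in a SAN-hosted database and once in a locally hosted one; compare elapsed_ms.
SET NOCOUNT ON
CREATE TABLE dbo.io_test (id int IDENTITY(1,1), pad char(1000) NOT NULL DEFAULT ' ')

DECLARE @start datetime, @i int
SELECT @start = GETDATE(), @i = 0
WHILE @i < 10000
BEGIN
    INSERT dbo.io_test DEFAULT VALUES
    SET @i = @i + 1
END

SELECT DATEDIFF(ms, @start, GETDATE()) AS elapsed_ms
DROP TABLE dbo.io_test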
Hi, was this problem resolved and if so, HOW? I am running an HP EVA (Version 2; I think yours is a Version 1) and currently it is OK. But one of the things I would really like to understand is how to monitor the EVA for disk contention or likely performance issues as the databases (Unix and Win2K SQL2K, on separate zones) are thrashed/loaded. Any help or guidance appreciated. Regards, Brian
+ 61 400 870 295 Brian Sutherland

Marbos, have there been any changes after applying the thoughts of Chad and Brad? Kindly let us know.
RoyalSher.
*********
The world is the great gymnasium where we come to make ourselves strong.
I agree with Sqljunkie. In my experience you will get better performance with data on RAID 10 and the log on RAID 5. But can the RAID 10 hold all the data? Luis Martin
It is recommended that log devices be on RAID 10. For data and indexes RAID 10 is also preferred, especially if the workload is write-intensive; the penalty for RAID 5 is smaller for read-heavy workloads. However, if you do not have the recommended hardware, you can use alternative configurations. _________
Satya SKJ
Moderator
SQL-Server-Performance.Com
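Along the same lines, before reshuffling RAID levels it is worth confirming exactly where each database's data and log files currently live, so they can be matched to the RAID 10 and RAID 5 LUNs. The standard system procedures are enough for this:

-- Physical path, size, and usage (data vs. log) of every file in the current database.
EXEC sp_helpfile

-- Summary of every database on the instance; pass a database name to also list its files.
EXEC sp_helpdb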

The problem isn't confined to the HP EVA solution. It's fairly common in SAN implementations.

To better understand the problem, we first need to know a little something about the EVA architecture. In the EVA we assign physical drives to disk groups. All of the drives in a group are striped RAID 0, and this becomes the storage pool from which we allocate. If all of the physicals are the same size, IO is distributed evenly across them. If you have physicals of different sizes in the same storage group, the larger drives will have more chunks, and therefore a proportionally higher percentage of the IO. An HP best practice is that all drives in a group be the same size. This ensures an even distribution of the load.

Next, we need to figure out what kind of load our configuration can support. For this discussion, let's assume that the entire allocation pool of a storage group is presented as a single LUN. RAID 1 is fairly simple. The first chunk is a primary, and the second is a mirror. This pattern continues across the physicals until we come to the end of the LUN. A 10K drive can support ~100 IO/sec. For an eighteen-physical set, we would characterize the read performance as 18*100 = 1800 IO/sec. Each write results in two writes on the back end: one to the primary and one to the mirror. We can characterize write performance for an 18-physical RAID 1 as (18/2)*100 = 900 IO/sec. Assuming a 1:1 read/write ratio, our overall performance would be (1800+900)/2 = 1350 IO/sec.

RAID 5 is a bit more complicated. If you have more than eleven physicals, they are broken down into six-member RAID 5 sets. With seventy-two physicals, that makes twelve sets. For each set, we have five data chunks and one parity chunk. Parity chunks are dispersed across the set to avoid hot spots. Read performance can be characterized as (6-1)*100 = 500 IO/sec for each set. We multiply that by the number of sets, giving us 500*8 = 4000 IO/sec. Each write results in two reads and two writes on the back end: we read the old data, read the old parity, calculate the new parity, then write the data and the parity. We can characterize the write performance as (6*100)/4 = 150 IO/sec for each set. We sum the sets to calculate the write performance for the group: 8*150 = 1200 IO/sec. Given our 1:1 ratio, the average would be 2600 IO/sec.

Cache can provide some benefit, which varies depending on the load profile. For sequential IO, caching is very effective. We can achieve upwards of 4000 IO/sec in our 18-physical RAID 1. With RAID 5, we get the added benefit of eliminating one of the back-end reads: when the data is written, it is read back from cache when calculating parity. For this reason, you should always have some amount of cache on a RAID 5 set. This increases our write performance to 1600 IO/sec, and our overall base RAID 5 performance to 2800 IO/sec. For a sequential load profile, we can see upwards of 7500 IO/sec.

Random IO for large data sets does not perform anywhere near as well. When the size of the data set vastly exceeds the size of the cache, we only get about a 5% improvement from base. We still get the write improvement for RAID 5, however, so it is worth doing. For purely random load profiles we would expect about 1425 IO/sec on our RAID 1 set, and 2950 IO/sec on our RAID 5.
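To make the arithmetic above easier to follow, here is a small worked sketch of the same estimates, expressed as T-SQL since that is the lingua franca here. It uses the post's assumptions of roughly 100 IO/sec per 10K spindle and a 1:1 read/write mix, and it uses twelve six-drive RAID 5 sets for the 72-disk group, which is the figure the author corrects to later in the thread:

-- Back-of-the-envelope IO/sec estimates for the two disk groups described above.
DECLARE @io_per_disk int, @r1_disks int, @r5_sets int, @r5_members int
SELECT @io_per_disk = 100,   -- ~100 IO/sec per 10K spindle
       @r1_disks    = 18,    -- RAID 1 (mirrored) log group
       @r5_sets     = 12,    -- 72 disks split into six-member RAID 5 sets
       @r5_members  = 6

SELECT @r1_disks * @io_per_disk                     AS raid1_read_io,   -- 1800
       (@r1_disks / 2) * @io_per_disk               AS raid1_write_io,  --  900: two back-end writes per host write
       @r5_sets * (@r5_members - 1) * @io_per_disk  AS raid5_read_io,   -- 6000: five data members per set
       (@r5_sets * @r5_members * @io_per_disk) / 4  AS raid5_write_io   -- 1800: read data, read parity, write data, write parity

Averaging each read/write pair for the 1:1 mix gives the blended figures; they come out higher than the ones in the post because of the eight-versus-twelve-set slip the author corrects in his follow-up.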
With some architectures it's possible to allocate cache to individual LUNs and set the percentage of cache used for reads and writes. This type of cache tuning has its merits, but is moot in our case. In the EVA, the only real option we have is to add more cache. If you're really interested, just let me know and I'll explain how to set cache ratios for RAID 1 and RAID 5 on those systems that allow it.

So far, we've assumed that all the space in a group is occupied by a single LUN. This is rarely the case. Typically we carve out LUNs of varying sizes and present them to multiple servers. An important thing to consider is that the sum of the IO for all the LUNs carved from a group hits the same physicals. This is the equivalent of placing multiple logical drives on the same physical drive. The problem with this is that high utilization against a single LUN can adversely affect every other LUN in the group. In shared storage situations, this can be very difficult to troubleshoot and isolate. There are some things you can do to minimize the effects by manipulating driver settings, but the only way to achieve real isolation is to put the LUNs in separate groups. In the scenario described here, as long as we're not overloading the group, it doesn't make much difference for the databases, whose load pattern is random to begin with. The logs are a different story: their load pattern is only purely sequential if they have a group to themselves. When we put more than one set of logs in a group, the load pattern changes from sequential to semi-random and we lose much of the benefit of caching.

We have yet to discuss how the IO gets moved around. The drive chassis on the EVA has dual 2G fibre loops. This should be adequate for the number of drives you have. How you attach is equally important. Fibre channel is a, well, channel-based medium. Unlike packet-switched networks, we use the concepts of exchanges and sequences. Frames are the construct that moves the data. Sequences consist of sets of frames. Exchanges consist of sets of sequences. A frame has a payload of 2118 bytes. After some setup at the start of the conversation, all the frames are sent in order with basically just an RDY sent between each frame. At the end, the conversation either succeeded or the whole thing is resent. There is no out-of-order reassembly. Without getting into the nitty gritty of BB credits and so forth, it's simply important to know that there is some setup and then all the data is sent in order. This setup overhead is constant regardless of the IO size. One consequence of this is that smaller IO block sizes result in lower MB/sec rates. Even though a 1G fibre channel is capable of 100 MB/sec, for a 4K IO size the actual rate is around 25 MB/sec and for an 8K IO it is about 35 MB/sec. When you see those astronomically high IO/sec rates advertised by some vendors, realize that they were achieved using a 512-byte IO size. Knowing your IO request size and right-sizing your expectations is key.

So far, I've only talked about the back end. We also need to understand the impact of the file system on IO. It all starts with the application. SQL Server has an 8K request size for database pages and logs. For Exchange and Notes, it's 4K. For Oracle it can vary by table anywhere from 2K to 16K. It varies for a file server, but 64K is a good assumption. The OS maps the request to allocation units on the disk. By default, NTFS has a 4K allocation unit size. This is set when the disk is formatted, and can only be changed by reformatting the disk with a different allocation unit size. If the IO request from the application is larger than the allocation unit size, the OS splits the request.
If you are running SQL Server on an NTFS volume that was formatted with the default allocation unit size, each 8K request results in two 4K requests going over fibre channel to the disk. For locally attached SCSI, the penalty is small. For fibre channel, the penalty is a whopping 40%. I have recently verified this figure in HP's SAN lab. From your problem scenario, this is the most likely cause. To fix it, you will have to evacuate the data, reformat with an 8K allocation unit size, then restore the data. With an 8K allocation unit size, you can expect 4375 IO/sec to saturate a 1G fibre channel connection. Given your layout, this is right around the maximum capacity of your storage system. If you choose to tune the storage, you may want to consider a 2G infrastructure, or add additional HBAs to the server and use Securepath's load distribution feature. If you use load distribution, tuning of HBA parameters will be required to avoid exceeding the 32-command LUN queue limit on the EVA.

When you go to format the disk, you may want to consider performing disk alignment at the same time. Windows uses the first 63 sectors of a partition for hidden data. This data includes the volume information, the MBR, that sort of thing. The geometry the EVA presents is 64 sectors per track, numbered 0 to 63. With an 8K IO size, every fourth IO will cross a track boundary, resulting in two IOs. That's a 20% penalty. To resolve this, use diskpar to set the starting offset for the partition to 64. This will align the partition with the start of a track and eliminate the extra IOs. This issue is based on the geometry of the drive and affects locally attached storage equally.

I suppose it's time for the pitch. Windows is possibly the most highly instrumented, and configurable, server platform available today. There are literally hundreds of ways to optimize your server to suit your application's workload. I've only briefly touched on a couple of them. I have some history with Windows 2000, SQL Server, and the EVA storage platform. I'm also well versed in competing storage systems and, to some degree, other operating systems. You're not the first to experience problems that can be solved through tuning IO. If you'd like to go further, and are in need of an independent storage consultant, just drop me a line.

John Fullbright
MOSMWNMTK
[email protected]
Great post. A question about multiple logical drives on the same physical disk. For example, if you have 20 physical 9 GB disks and each disk is divided into 9 chunks of 1 GB (called hypers sometimes, I think), you could do a RAID 10 over the 20 disks with one 1 GB chunk from each disk and create a logical disk presented to the OS. You can create 9 such RAID 10 devices over the disks and give them to different servers. This seems fast to me, but I assume that if one server is pounding its logical disk it could affect all the other logical disks using the same physical disks. Would it instead be better to give each server its own dedicated physical disks, like a dedicated RAID 1 consisting of two 9 GB disks? What are the pros and cons here in terms of performance and flexibility? /Argyle

Well, first off, I got confused on my math by trying to do two things at once in my previous post. For the RAID 5, there are twelve 6-drive sets in the 72-drive group, not 8. The numbers are a bit low, but I'll leave it to anyone interested to recalculate them. The second thing: when placing multiple LUNs in the same group, for logs you change the IO pattern from sequential to random and lose the cache effect.

Hypers are EMC Symmetrix terminology. The Symmetrix architecture differs in a fundamental way that I'll get to in a moment. In the meantime, let's stick with "slice"; it more accurately describes what's going on in the EVA. No matter what, you have 20 physical 9 GB drives. Assuming 7200 RPM drives, each should be able to handle about 80 IO/sec, or 1600 IO/sec for the group. When we throw RAID 10 on top (which is what the EVA actually does, even though the GUI says RAID 1), we're looking at 1200 IO/sec on average. The problem with carving LUNs out of the same group is that there is no good way to limit the IO against any one LUN. A single SQL Server doing a DBCC can adversely affect every other server that has a LUN from the group. If you did two 10-member groups, each group would be capable of 600 IO/sec, but a heavy load in one group would not affect LUNs in the other group. You could break it down further and give each server a dedicated RAID 1 for the logs. You would have to create multiple groups to do this. Today, the only effective separation on the EVA and similar architectures is at the group level. In the EVA, we can't control which chunks are chosen to become part of a LUN. On the Symmetrix, you have complete control of mapping hypers to LUNs through the bin file. This requires hand tuning, but can be a very effective way to limit cross-LUN loading. This is one of the fundamental differences between the Symmetrix and the EVA.

At the HBA, you can modify the queue depth. This represents the number of tagged commands allowed to be outstanding against a specific target. Note that this is at the target level, not the LUN level. You can create a sort of crude IO governor by manipulating this value for all HBAs connected to a given port on the storage processor. I believe the EVA port depth is 256. The sum of the queue depths on all HBAs connected to that port cannot exceed the port depth. Internally, most architectures, including the EVA, have a 32-command queue depth for each LUN. No HBA, or combination of HBAs if using load distribution, should exceed the LUN queue depth; doing so will cause errors on the EVA. IO times, as seen in Perfmon, include the time the command hangs around in a queue. One way to reduce response times to an acceptable level is to lower the HBA queue depth. HP normally sets the depth at 30, or 15 if multiple HBAs are used. EMC sets the default to 8. You sacrifice a small percentage of the available IO/sec to get response times that are acceptable to the application. Hopefully, now you understand the reason for that. In the above cases, we're talking about an Emulex HBA. Qlogic gets the queue depth from the registry; it's the NumberOfRequests value. They have a throttling mechanism of their own, and do not recommend changing the queue depth. The problem is that in shared storage situations, troubleshooting or setting these values requires touching every server connected to a given port. On something like a Symmetrix, or an HDS 9900 series, this can be a nightmare. In response to the management challenge, SNIA sponsors the HBA API initiative.
The first product out of the chute that I’ve seen is Emulex’s HBAnywhere. This will greatly simplify the tuning process.
John
MOSMWNMTK
[email protected]
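As a footnote to the queue-depth discussion, the budgeting John describes boils down to simple division. A sketch of the constraint, using the port depth (256) and per-LUN limit (32) quoted above; the number of hosts sharing the port is made up for the example:

-- Crude queue-depth budget for one storage-processor port.
DECLARE @port_depth int, @lun_limit int, @hosts_on_port int
SELECT @port_depth = 256, @lun_limit = 32, @hosts_on_port = 10

SELECT @port_depth / @hosts_on_port              AS fair_share_per_hba,        -- 25 for ten hosts
       CASE WHEN @port_depth / @hosts_on_port < @lun_limit
            THEN @port_depth / @hosts_on_port
            ELSE @lun_limit END                  AS suggested_hba_queue_depth  -- never above the 32-command LUN limit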

John, I would suggest you go ahead and write an article on this. That way, all the effort you put into these posts will be visible to more people. Gaurav
Moderator
Man thrives, oddly enough, only in the presence of a challenging environment- L. Ron Hubbard

An article with some pictures describing the different technologies would be nice. I've been trying to learn more details about storage solutions like EMC etc. but haven't found any good sources. Some of the things you talk about here are still a bit over my head. In our organisation more and more environments with SQL Server are planning to go over to big SAN solutions, and being a DBA it's in my interest to know as much as possible about the different technologies.

/Argyle
My focus has been storage on the Windows platform for some time now. I deal with all the storage platforms, but tend to have more experience with EMC and HP; partly due to the numbers deployed in the field and partly due to the fact that they both have SAN labs reasonably close to my house. From the application side, I'd have to say I know Exchange in much more detail than SQL Server. I have had occasion to deal with Oracle and Notes also. The trend everywhere is to move to SAN technologies. In many cases, it's a sign of the economic times: companies will "leverage" their existing SAN infrastructure for the deployment of a new system. Usually, they're just looking at the space available, and not thinking about the impact of the added IO load. I've been on site at Fortune 500s on the order of 100 times or so dealing with these issues. I currently have some downtime, and wouldn't mind writing a few articles. John Fullbright
MOSMWNMTK
[email protected]
