
SQL Server Performance Forum – Threads Archive

Disk/SAN config for ad-hoc heavy duty analysis use

Hello everyone, I’m leading a strategy consulting team that is using SQL Server for ad-hoc analysis of large databases. I’m looking for advice on how best to tune the hardware we have to support this type of activity.

We have 11 users writing ad-hoc SQL to perform heavy-duty analytics on retail data, both POS (point of sale) and loyalty card / transaction data. The POS data is ~900 million rows, and the card data is up to ~4 billion rows. The data is sourced from Oracle, where we perform initial processing and aggregation before loading it into SQL Server. The work makes heavy use of one-off large transformations: we join very large tables together (e.g. join a 900 million row table to itself) and run one-off aggregate mathematical functions (e.g. aggregating a 4 billion row table to create a new 6 million row table).

Our server specs are:
* IBM xSeries 445, 4x 3GHz Xeon, 8GB RAM, AWE and /3GB enabled
* Windows 2000 Advanced Server SP4, SQL Server 2000 Enterprise using 7.5GB RAM
* Two small RAID1 local disks for OS, apps, and pagefile (no paging activity)
* 2TB of SAN storage accessed via dual QLogic QLA23xx fibre channel cards

The SAN is currently configured as 2x 800GB LUNs plus 32x 10GB LUNs from a "pool", which appear as 34x IBM 2105800 SCSI disk devices. These 34 LUNs have been configured as a single Windows spanned drive holding the data, log, and tempdb files. I queried this setup, since previously I have configured separate physical drives for data, log, and tempdb when using local SCSI RAID arrays, but was told the SAN would not have a problem with it. At present I do not have further details of the back-end SAN configuration (disks, speed, RAID configuration) or of the other users of other LUNs, but I may be able to find this out if you think it is important. I have been told it has a 16GB RAM cache, and that the 2x 800GB LUNs at least were ordered from IBM.

We have about 8 SQL databases, ranging from 100GB to 600GB. We need to be able to run 1-2 large ad-hoc queries on each database simultaneously. There are no other users of this server. Some single queries can easily add 50GB+ to a database’s log file.

Performance Monitor under typical initial activity shows no paging activity, CPU use averaging 30-50%, and total disk reads/writes averaging 45 MB/sec with occasional peaks of 80-100 MB/sec. It appears to me that our type of use is disk-bound, but I have no good benchmarks or further information to tell whether we are getting the maximum performance out of this system.

I would greatly appreciate any thoughts you have. Right now our project is server-bound, and I have a number of highly-paid analysts sitting waiting for their queries to finish. Many thanks,
David Trounce
[email protected]
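To go one step beyond Performance Monitor, SQL Server 2000’s ::fn_virtualfilestats function reports per-file read/write counts and accumulated I/O stall time, which makes it easy to see which database files the waits are piling up on. A minimal sketch, assuming a placeholder database name ('PosData') and the primary data file (file ID 1); substitute your own database names and file IDs:

-- Per-file I/O statistics on SQL Server 2000.
-- 'PosData' and file ID 1 are illustrative placeholders.
DECLARE @dbid int
SET @dbid = DB_ID('PosData')

SELECT  DbId, FileId, NumberReads, NumberWrites,
        BytesRead, BytesWritten, IoStallMS,
        -- average stall per I/O request, in milliseconds
        IoStallMS / NULLIF(NumberReads + NumberWrites, 0) AS AvgStallMsPerIO
FROM    ::fn_virtualfilestats(@dbid, 1)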
I am not sure I understand why everything is on the 32x 10G LUNs. What are the 2x 800G LUNs for?
This configuration is frequently advocated by SAN reps who know absolutely nothing about databases, and on several occasions (that I have personally been called in on) it has resulted in ridiculously bad performance: 3K peak IOPS, 70MB/sec, 300ms latency.
After removing the small LUNs and the other bad configuration recommended by the vendor, this came to 8K IOPS, 150MB/sec, 50ms latency, which was still not good considering the price paid. A different SAN generated 19K IOPS. Furthermore, if you are running very large DW queries, sequential bandwidth is the key: your target should be ~1GB/sec capability, which will not happen on 2 FC links.
The 16GB cache on the SAN has nearly zero value given that you have 7.5GB for SQL Server. I am giving a session on this topic at the spring SQL Connections conference, if you care to attend.
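For what it’s worth, a back-of-the-envelope ceiling for the two HBAs (assuming 2Gbit fibre channel with 8b/10b encoding; these figures are estimates, not measurements from this system) is:

\[ 2\ \text{links} \times 2\ \mathrm{Gbit/s} \times \tfrac{8}{10} \div 8\ \mathrm{bits/byte} \approx 400\ \mathrm{MB/s} \]

so even a perfectly laid-out back end cannot deliver ~1GB/sec through the server’s current pair of fibre connections.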

Joe, Thanks for your insight – your comments are extremely helpful. It certainly feels like this server [4 CPUs, 8GB, 2TB SAN, cost >$100k for server+disk] is not even running at the speed of a server we previously built for <$10k [2 CPUs, 4GB, 2TB local storage using 2 RAIDCore serial ATA controllers with 15 WD 250GB ATA disks in a 3U, with 3 separate arrays for log (RAID1), tempdb (RAID0), and data (RAID5)]. However, it is good to get an external perspective and benchmarks.

Re: the 32x 10G and 2x 800G LUNs, that is how the SAN disk has been made available to us. I suspect the 32x 10G LUNs are located on existing storage and the 2x 800G LUNs are on new storage purchased exclusively for this server. There is a single Windows spanned disk (E:, used for data, tempdb, and logs) over these 34 LUNs. Since the disk is not even striped across the LUNs, I’m guessing that database operations are using only a few of the LUNs at any one time. This doesn’t strike me as the best way to use the potentially available bandwidth.

I have now had the opportunity to run SQLIO on our SAN (albeit with one other query running simultaneously – it’s hard to find dead time right now to run benchmarks), and get the following.

SQLIO config:
* 64GB test file (4x larger than what I believe is a 16GB SAN cache)
* Sequential reads of 64KB for 6 minutes
* 8 outstanding IO requests, 4 threads for 4 CPUs

SQLIO results:
* 565 IOs/sec
* 35 MB/sec
* 56ms average latency (min 5ms, max 1097ms)

This doesn’t strike me as very good, given your benchmarks. As a result, I have asked the IS group the following questions:
* What are the physical drives in the SAN (size, speed, fibre?)
* What disks compose which RAID groups on the SAN (RAID type and size / disk type and size)?
* Which LUNs are composed of which RAID groups? What RAID type is used to build the LUNs? How are the LUNs divided between RAID groups, if this was done?

Can I ask – how would you configure physical drives, arrays, LUNs, logical drives, and log/tempdb/data file placement to maximize sequential throughput for this kind of large-scale data-mining use? Many thanks,
David
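As a quick sanity check on those SQLIO numbers (simple arithmetic, not additional measurements): the throughput and latency are internally consistent with the reported IOPS and the 32 outstanding requests (4 threads x 8 per thread):

\[ 565\ \mathrm{IO/s} \times 64\ \mathrm{KB} \approx 35\ \mathrm{MB/s}, \qquad \frac{4 \times 8\ \text{outstanding IOs}}{565\ \mathrm{IO/s}} \approx 57\ \mathrm{ms} \]

so the array really is delivering only ~565 sequential 64KB reads per second; the latency follows directly from that rate at this queue depth.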
I think we are using almost the same development system: a Supermicro chassis for 15 SATA drives (I shoved a boot drive in by the power supply so all 15 SATA drives carry DB load) with the RAIDCore controllers.
I am using the WD 74GB 10K drives, which have only 1/4 the capacity of yours but slightly higher sequential transfer rates (70MB/sec vs 60MB/sec).
Out of the entire array I can get 1GB/sec sequential reads.
A SQL Server table scan generates ~350MB/sec; it probably needs various partitioning tricks to get more. Most mid-range SANs have good random IO capability, but they are not designed for high sequential IO, especially considering the capabilities of modern systems.
35MB/sec is way low; it should be 160-170MB/sec per FC loop. For example, the CX700 can support up to 4 loops for 600MB/sec+. I would rebuild the entire SAN configuration from scratch, allocating entire groups of disks for data (one or more volumes), tempdb, and log.
If this is a DW, the recovery model should be SIMPLE to minimize logging. What is your specific SAN vendor/model? Sometimes the person installing it is following a set procedure really meant to prevent hot-spotting in random loads, which leads to really crappy sequential performance.
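On the recovery-model point, and on giving tempdb its own volume once dedicated disk groups exist, a minimal T-SQL sketch for SQL Server 2000 (the database name and drive paths below are placeholders, not this system’s actual names):

-- Minimize logging for DW-style bulk work ('PosData' is a placeholder name).
ALTER DATABASE PosData SET RECOVERY SIMPLE

-- Relocate tempdb to a dedicated volume (illustrative paths);
-- the new locations take effect at the next SQL Server service restart.
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'T:\tempdb\tempdb.mdf')
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'T:\tempdb\templog.ldf')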
Hey Joe, You’re right – we are also using the Supermicro 15-drive serial ATA 3U chassis for the development/archive server I referred to earlier. It’s obviously a good choice given that you use it too!

I have some more information on the SAN now. We have 2x IBM TotalStorage ESS 800 units, each with 16GB/sec of aggregate fibre bandwidth – though only 2x 2Gbit/sec to our server through our 2 HBA cards.

ESS1:
* Dedicated to our use
* 2x 800G LUNs, each consisting of a single disk pack with 8x 146GB 10,000RPM disks (RAID5: 6 data + 1 parity + 1 spare)

ESS2:
* Shared with other users
* 32x 10.5G LUNs spread over 9 disk packs, each with 8x 73GB 10,000RPM disks (RAID5: 6 data + 1 parity + 1 spare)

I’ve been told by our IS support that any given disk group in the ESS is limited to 40MB/sec. I don’t understand why this should be true, given that there are effectively 6 spindles per disk pack and each physical drive should be capable of serving >40MB/sec in parallel; that would give at least 240MB/sec per disk pack. If it is true, it would be a severe bottleneck under any reconfiguration.

Clearly, we could reconfigure the 2x 800G LUNs in Windows as two separate data volumes and figure out how to make best use of the shared 32 LUNs for log and tempdb volumes. However, given this configuration it doesn’t look like there’s much to be gained by resizing the LUNs, and resizing the 10.5G LUNs on the shared ESS is likely to be a non-starter, given the other users.

I’m therefore wondering whether we might switch to a directly-attached serial ATA solution instead. I’m thinking along the lines of the following:
* 32x Hitachi 7K400 400GB 7200RPM Serial ATA 8MB cache HDS724040KLSA80 – 32 @ ~$322 each (retail pricing before volume discount) = $10,304
* 4x Broadcom BC4852 8-channel RAIDCore Serial ATA RAID Host Bus Adapter, 133MHz PCI-X – 4 @ ~$309 each (retail pricing before volume discount) = $1,236
* Plus hot-swap disk enclosures and cables to house 32x 3.5" serial ATA disks
* Total cost: ~$13,000

Configuration:
* 5x 6-disk 2TB RAIDCore RAID-5 arrays for 10TB total redundant storage (30 disks), plus 2 hot spares
* Flexible configuration gives 5 physically separate volumes for placement of data, log, and tempdb files to maximize sequential performance
* Arrays can be striped together to generate >1GB/sec sustained RAID-5 reads and >300MB/sec sustained RAID-5 writes

Comparison vs. the SAN solution:
* Storage: 10TB serial-ATA RAID vs. 2TB SAN
* Cost: <$15k serial-ATA RAID vs. >$50k SAN
* Sequential read: >1GB/sec serial-ATA RAID vs. ~300MB/sec SAN
* Sequential write: ~300MB/sec serial-ATA RAID vs. ~300MB/sec SAN

This has the following advantages:
* It puts in place a more appropriate, faster, bigger, cheaper storage solution, and we could potentially recover some of the >$50k cost of the 1.6TB second ESS
* We would not lose 48+ hours of server time (plus the risk of data loss) to back up the data (currently 1.5TB), reconfigure the SAN, and restore; we could just detach the databases and copy them straight across to the new local RAID disks

However, one obvious requirement is finding 2-4 external enclosures to house 32x hot-swappable serial ATA disks, and figuring out how to cable these to the existing IBM xSeries 445 server. We could always get 2x 16-disk 3U/4U or 3x 12-disk 3U server cases and just use them to support the SATA disks, but the cabling (32x SATA cables to the server) might be a little messy. After extensive web searching I have not yet found a good alternative for external SATA enclosures or cabling simplification.

What do you think of this configuration – do you think it would perform well for this type of application? Have you come across any better external enclosure solutions to support this number of disks? Cheers,
David
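For the proposed layout, the capacity and raw sequential figures work out as follows (simple arithmetic on the numbers above; the >1GB/sec claim still depends on how the four RAIDCore controllers share PCI-X bandwidth in the x445, which I have not verified):

\[ 5 \times (6-1) \times 400\ \mathrm{GB} = 10\ \mathrm{TB}, \qquad 30\ \text{disks} \times 60\ \mathrm{MB/s} \approx 1.8\ \mathrm{GB/s}\ \text{raw sequential read} \]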
I do not have direct experience with the ESS 800, but that behavior is similar to the Clariion SANs. I suspect the SAN vendors may have implemented some kind of algorithm to randomize actual storage locations, so that even sequential IOs behave as random IO; presumably the purpose was to prevent hotspots in OLTP workloads.
In any case, each disk can do 150 64K IOPS, or 9.6MB/sec per disk; maybe the IBM only does 100 64K IOPS for 6.4MB/sec, plus RAID overhead, netting 40MB/sec per 8-disk group.

While I do like SATA for DW apps and development, using it in production is a tricky matter, partly because IT staff often don’t like white boxes.
The cable limit for SATA is 1 meter, but the only cables I have seen are 36in; if you can put one storage enclosure above and one below the x445, this might just work.
I have run 32 drives by attaching 15 drives to the Supermicro 3U chassis and additional disks in a 4-disk enclosure wired directly to a power supply, but it is a cabling mess and may be difficult in a rack.

However, I would also suggest you consider a SCSI solution, which is fully supported by IBM; you can also go white box (build it yourself).
The IBM ServeRAID 6M is a decent adapter (I use the LSI version of it).
Each SCSI enclosure holds 14 disks, which can be split across 2 SCSI channels.
You are probably looking at $1K per adapter, plus $8K per enclosure with 14x 73GB 10K SCSI drives, or $10.5K with 14x 146GB 10K SCSI drives.
With white box, it will be about $400 per 73GB disk and $600 per 146GB disk including the adapter and enclosure parts. This is more expensive than the SATA solution but is less likely to have cable-length problems.
If you are not buying this right away, SAS products should be available in a few months. In any case, the ESS 800 is totally inadequate for your data-mining app: you would get better performance from 4x 400GB SATA drives (60MB/sec sequential per drive) than the 35MB/sec you measured from the IBM SAN.
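Spelling out that per-group arithmetic (same assumptions as above, with 6 data-bearing disks per 8-disk pack):

\[ 150 \times 64\ \mathrm{KB} \approx 9.6\ \mathrm{MB/s}, \qquad 100 \times 64\ \mathrm{KB} \approx 6.4\ \mathrm{MB/s}, \qquad 6 \times 6.4\ \mathrm{MB/s} \approx 38\ \mathrm{MB/s} \]

which is close to the 40MB/sec per disk group that IS support quoted.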
