High-performance ad-hoc OLAP on multi-TB databases | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

High-performance ad-hoc OLAP on multi-TB databases

All, We are looking for a single, dedicated, high-performance and high performance-per-dollar machine to host a large (4TB+) ad-hoc data mart for a small team of analysts (<10 people). We will first load large amounts of data (including at least two 1TB+ tables with 1bn+ records) from multiple external data sources. We will then spend several months running very large ad-hoc OLAP-type queries that join these tables and subsequent derived tables together for investigative analysis. We expect to have up to 2-3 queries running at the same time. There will be no OLTP activity and no need to support any applications with response-time guarantees. The analysts will be able to coordinate their activity to share the resource if necessary.
It seems to me we have the following requirements:
Disk requirements:
* Large disk volumes (aggregate 5-10TB+) for very large databases, source data, temporary working space, online backups
* High speed sequential reads and writes for fast full table scan performance – >1GByte/second SQLIO sequential reads/table scans and writes
* Random reads/writes and disk-side caching are less important, given data volumes and access characteristics
Thus, so far I have recommended the following technology:
* Direct-attached storage (DAS, not SAN or NAS) (required for streaming bandwidth requirement, also minimizes cost and complexity)
* RAID10 – with the parallel multi-user sequential loads, this workload might look more like 1MB-block random reads/writes at the disks, so RAID10 might work better than RAID5 or RAID6, but I’d be interested in feedback
* SATA disk (required for disk space requirement, also minimizes cost)
* High-performance RAID controllers on separate PCI Express buses (required for bandwidth)
– in particular, the new Areca cards based on the Intel IOP341 processor look very promising with >800MB/s sequential reads and >750MB/s sequential writes on RAID5/6. I don’t know how they will perform on RAID10, which might be a better solution
Processor / RAM requirements:
* Large RAM (16GB+) for SQL database-side caching, and several RDP (remote desktop protocol) user sessions (to provide for client disconnects / reconnects to running queries)
* High CPU performance (8 processors), to support complex joins
Software / licenses:
* Windows Server 2003 R2 Standard x64
* SQL Server 2005 Standard x64
* Windows Terminal Services (10 RDP licenses)
* High-speed backup software to generate weekly online compressed backups onto separate internal disk volumes: LiteSpeed, SQL Backup, or similar (a minimal native-backup sketch follows this list)
* Backup software / scripts to perform weekly backups of these compressed backups to tape or near-line external storage (up to 1TB of external backups at full scale, less initially)
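For illustration only, here is a minimal T-SQL sketch of what the first-level backup step looks like using the plain native BACKUP syntax (SQL Server 2005 has no built-in backup compression, which is exactly why LiteSpeed or SQL Backup would be layered on top); the database name and the X: backup volume are hypothetical placeholders:

-- Weekly online full backup to a dedicated internal backup volume (X: is made up).
-- Native SQL Server 2005 backups are uncompressed; LiteSpeed / Red Gate SQL Backup
-- wrap the same operation with compression via their own extended stored procedures.
BACKUP DATABASE DataMart
TO DISK = 'X:\Backups\DataMart_full.bak'
WITH INIT,       -- overwrite the previous week's backup file
     CHECKSUM,   -- verify page checksums while reading the data
     STATS = 10; -- report progress every 10 percent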
This type of hardware looks more like the hardware used to support the new class of data warehouse appliances (e.g. DATAllegro, Greenplum/Sun Fire x4500) than traditional OLTP servers. However, we have found that it is even more cost-effective to build a "white box" server using commodity parts rather than buying a pre-packaged solution. The following references may also provide some interesting background material:
"Pare Down and Power Up"http://www.sqlmag.com/Articles/Print.cfm?ArticleID=49011
Homepage of Dr Jim Gray, Microsoft Research: http://research.microsoft.com/~Gray/
Example of a possible solution: For example, the following system (10TB usable disk, 16GB RAM, 8 Xeon processor cores) could be purchased today at http://www.thinkmate.com/product_info.php?cPath=2104&products_id=21904&customize=true&go=true for just under $25k. There would be a 10 business day build/test time.
i2SS40-8U
Processor: Quad-Core Intel® Xeon® 5355 2.66GHz 1333FSB 8MB Cache x 2 (8 processor cores)
Motherboard: SUPERMICRO X7DBE – motherboard – extended ATX – 5000P (any concern over this motherboard’s RAM bandwidth?)
Memory: 2GB FB PC5300 667MHz DDR2 x 8 (16GB RAM)
Rackmount: AIC RSC-8E-2 w/ 1350W 3+1 hot-swap redundant PFC (8U)
Floppy Drive: 1.44Mb 3.5" Floppy Drive Black
Optical Drive: Sony 16x DVD-ROM
Controller Cards:
– Areca PCI-Express ARC-1231ML SATA II RAID controller (Intel IOP341) – 12 ports (on a dedicated PCI-Express bus) x2
– Areca PCI-Express ARC-1261ML SATA II RAID controller (Intel IOP341) – 16 ports (on a dedicated PCI-Express bus)
Hard Drive(s)
– 80GB ATA-100 7200 RPM – Seagate Barracuda 7200.9 x 2 (80GB RAID1 for SYSTEM/SWAP) on motherboard SATA
– 500GB Serial ATA 7200 RPM – Seagate Barracuda 7200.10 x 24 (6TB RAID10 for DATA) on 1st and 2nd Areca controllers
– 500GB Serial ATA 7200 RPM – Seagate Barracuda 7200.10 x 12 (3TB RAID10 for TEMPDB/BACKUP) on 3rd Areca controller
– 500GB Serial ATA 7200 RPM – Seagate Barracuda 7200.10 x 4 (1TB RAID10 for LOG) on 3rd Areca controller
Network Interface: built-in dual-port Gigabit on X7DBE motherboard
Operating System: Microsoft Windows Server 2003 R2 x64 (Standard Edition)
SQL Server 2005 x64 (Standard Edition)
Warranty and Support: Thinkmate Server 3 Year Warranty
It is true that this type of solution is less expandable than a larger corporate server (e.g. IBM xSeries) and/or corporate disk solution (e.g. SAN disk array). However, it seems to me that this type of solution is an order of magnitude cheaper than a more traditional server and disk solution with similar or worse levels of performance. Thus, we could buy several such machines to 'scale out' if it proves necessary to expand in the future – and at lower prices and/or better performance, given future improvements in performance-per-dollar.
It is also true that in the past, SATA disks have not been as reliable as SCSI disks. However, a new class of enterprise-quality SATA disks is now being produced to support 24/7 operations. Also, much of the workload will be sequential rather than random in nature, which reduces the stress on the drives. We also plan to use RAID10 to survive individual disk failures, as well as a combination of both online and offline backups in case of the extremely unlikely simultaneous failure of 3 or more critical disks. If this disaster scenario happens, we can survive the several days of downtime required to restore.
We are getting a lot of pushback from the corporate IT department, which has a hard time understanding the usage profile (ad-hoc rather than application; small # of sequential streaming table scans rather than large # of small queries served mostly from SAN cache; no guaranteed service levels), and is also pushing back heavily against my current recommendations (SATA as DAS rather than SCSI as SAN disk; white-box rather than Dell or HP). I want to be flexible with the corporate IT standards, but I don't want to give up on performance (critical to minimize query time) and preferably also performance-per-dollar (important, but less so).
Where we might be heading is an HP or Dell server with SAS enclosure-based disk (SCSI or SATA). However, I am concerned with the RAID performance of the Dell and HP SAS controllers – the Dell PERC 5/E appears to be based on the IOP333, and I worry about the sequential performance I will achieve (Dell states 650MB/sec sequential read and 300MB/sec sequential write).
I'd be very interested in feedback and comments on the hardware and/or configuration to support this type of load, and/or any good alternatives. Many thanks,
David

1. Go with an HP ProLiant ML370 G5 or Dell PowerEdge 2900
use configurations similar to the TPC-H or TPC-C reports, with some substitutions:
a. use 2GB DIMMs instead of 4GB
b. use 4-6 SAS RAID controllers (4 for Dell, 6 for HP)
c. your choice of 73GB 15K or 146GB 10K drives
d. 8 or so external SAS storage units of 12-15 disks each,
should cost around $60-70K; don't quibble about the price, a SAN costs much more
White-box vendors just do not have the expertise to support very heavily loaded disk configurations, and 40 or so 7200RPM SATA disks seriously does not cut the mustard. Do stay away from a SAN; the worst thing is that the SAN engineer or vendor may want to contribute their recommendations.
2. DW/OLAP is highly sequential, but not purely large-block sequential, hence the 10K/15K SAS drives will really help.
3. RAID 5 will probably work. RAID 10 has better small-block random write, but RAID 5 is actually better at large-block sequential write.
4. File placement: instead of one group of disks for data and a second for temp and backup, combine the disks with 3 partitions: the 1st for data, the second for temp, the third for backup.
Joe, First, many thanks for your advice. I do think we will end up with something like what you suggest, especially SAS using SCSI disks rather than either a SAN or DAS using SATA disks. I did have a few follow-up questions, if you have a few minutes:
1. From everything I've read (including a great deal from you in various posts on this site – thank you so much for those) it seems that for DW/OLAP the constraint is often the RAID controllers, and their ability to sustain sequential bandwidth. Do you happen to know the actual achieved bandwidth you get from the HP and Dell controllers? On Dell's website, hidden away, it suggests that the PERC 5/E gives 650MB/s sequential reads and 300MB/s sequential writes. I can't find any specs for the HP P800 – HP is very careful to refer only to relative performance vs. prior controller cards. It seems to use an LSI Logic processor, for which I've heard worrying stories about sequential bandwidth. Do you have any data on the HP P800, and/or any real-life benchmarks on the PERC 5/E for DW/OLAP loads?
2. Is the 4 controllers recommendation for the Dell driven by a limited number of PCIe buses? In which case, is the Dell limited to 1.2GB/s sequential writes using the PERC 5/E? Do you know the HP equivalent?
3. You say to use 8 external SAS units of 12-15 disks each – presumably HP MSA60 (12 disks in a 2U) or Dell MD1000 (15 disks in a 3U). However, wouldn’t just one shelf of 12-15 disks max out the sequential read and write performance of a single controller? Is the advantage of going for more than one shelf per controller just for random read and write performance, given the limited number of PCIe slots?
4. I understand the rationale re: using RAID5 rather than RAID10 – I think we will do that. Would you ever consider using RAID6 rather than RAID5 plus a hot spare (for reliability), or do you think that the write performance hit makes it not worth it?
5. On file placement, I also understand the rationale for combining the disks with 3 partitions, 1st for data, second for temp, third for backup. However:
a) This seems to reduce the amount of sequential workload (e.g. data to temp, data to backup, etc.). Is the increased number of spindles (which seems to benefit mostly random rather than sequential work), at the cost of more random activity, really a benefit?
b) If you lose multiple disks simultaneously, don't you risk losing both the data and the online backups?
c) Where would you put flat files for BULK IMPORTS – on partition 1, 2, or 3? Do you know if the data goes to TEMP first, or straight to DATA?
6. Finally, I'm just trying to understand your advice against using SATA disks (I think we will end up using SCSI, due to IT preferences). I think you are saying that using only 40 disks with a 7.2K spindle speed will give insufficient random performance. However, presumably the sequential performance will still be very good, especially using controllers with the IOP341-based processor (750MB/s sequential writes per controller)? Could you get better random performance by increasing the number of disks? E.g. how about 8 shelves of 14 SATA disks (112 disks)? Given the IOPS difference between SCSI and SATA, it seems that you could have 1.5x the number of SATA disks as SCSI disks, and get similar random performance, better sequential performance, and much greater amounts of free space for ad-hoc work, all at a similar or lower price. Thoughts? Many thanks,
David

1. You just have to understand what each controller can do. Do not assume that because it has a PCIe x8 slot it can do 2GB/sec in each direction simultaneously. There is also a system limit; I think the Intel 5000P chipset can do 3GB/sec. I know the PERC 5's can do about 800MB/sec on read, which is probably the realizable limit of a PCIe x4. I think this is why HP was smart enough to offer a default config of 6 PCIe x4 slots instead of the 2 x8 + 2 x4 that other vendors offer; they knew enough to do actual testing. I think the P800 is really meant to drive more disks for peak random IO, not necessarily to get more sequential, but HP never sent me any test equipment.
2. I generally like to max out the PCIe slots: 4 for the Dell PE2900, 6 for the HP ML370 G5. I would not complain about 3GB/sec read, 1.2GB/sec+ write. Few people even bother to validate capability beyond 200-300MB/sec, even when they buy a very expensive SAN.
3. In pure large-block sequential you could get 80MB/sec per disk, meaning 10 disks could max out the controller. But in many SQL ops you will get 30MB/sec per disk, so go ahead and connect 24-30 disks per controller.
4. Test it. Too many people talk redundancy but never practice it, i.e. they never bother to practice an online disk replacement. When it actually happens, they goof it, so it is as if they never had redundancy even though they paid for it. Talk is cheap; do it.
5. Don't fret this; with enough disks, your backup performance will be fine. Use the end of the SAS disks for fast backup, and put flat files there too (a bulk-load sketch follows at the end of this post). Consider getting a few big mother SATA drives in the extra system disk bays to copy the fast backups to.
6. Consider buying the system with 1 rack of SAS and 1 rack of SATA, then run a full set of tests; I think you will see what I am talking about. There was another guy who compared new 7.2K SATA to older 15K SCSI: a few tests favored the new SATA, but other key tests heavily favored 15K. He just could not understand the difference between small- and large-block IO characteristics.
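On the flat-file point above (and David's earlier bulk-import question), a minimal, hypothetical sketch of a bulk load from a flat file sitting on the backup/source partition; the table name, path and delimiters are made up, but with TABLOCK and no indexes on the target, SQL Server 2005 loads the rows straight into the target table's filegroup (not via tempdb) and can use minimal logging:

-- Hypothetical: the flat file lives on the backup/source partition (K: is a placeholder)
BULK INSERT dbo.BigFactTable
FROM 'K:\source\extract_2007_01.txt'
WITH (
    FIELDTERMINATOR = '|',     -- column delimiter in the extract file
    ROWTERMINATOR   = '\n',    -- one record per line
    TABLOCK,                   -- table lock enables minimal logging on a heap
    BATCHSIZE       = 1000000  -- commit in batches to keep the log manageable
);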

Joe, Thank you again – this is really, really helpful. I think we will end up going the SAS route with 4-8 shelves of 10K disks for the main server, and maybe get a separate cheap white box with SATA disks to be used as a separate development box. I do have two last questions if you have time:
1. How would you configure the RAID disk arrays? For example, suppose you had 8 shelves of 15x 146GB 10K disks, and were going to use RAID5 (or presumably RAID50). Would you go:
a) 8 shelves of (2 LUNs of (7-disk 876GB RAID5), plus a hot spare), with three Windows RAID0 striped volumes across the 16 LUNs for data, temp, and backup/source
b) 8 shelves of (1 LUN of (14-disk 1898GB RAID5), plus a hot spare), with three Windows RAID0 striped volumes across the 8 LUNs for data, temp, and backup/source
c) Something else?
Alternatively, would you use RAID50 on the controller (if it was available) rather than use Windows stripes of RAID5 LUNs?
2. You say in another article not to use 300GB 10K SAS disks. However, if you wanted to get more working space to support ad-hoc work/potential expansion, would it be reasonable to use the same number of 300GB 10K rather than 146GB 10K SAS disks – potentially creating more stripes for file placement if necessary? Many thanks,
David

1. My preference is either 2 x 7-disk LUNs per enclosure + 1 hot spare, or, since this is a DW, not OLTP, 3 x 5-disk LUNs; then just buy a few extra disks to keep in the cabinet and change a disk when it fails. Also, be sure to PRACTICE changing a disk before release to production. DO NOT USE OS STRIPING.
Create 3 partitions on each LUN: the data first, temp second, backup/source last. If you can accurately size your data & temp partitions, use a raw partition. I actually like 4 partitions: main db in 2 filegroups, each getting a partition, + 1 for temp and 1 for backup. The big table gets its own filegroup up front, all other tables in the second FG.
Use the SQL Server capability of splitting a filegroup into multiple files: put 1 data file on each LUN, same with temp.
There is actually a trick for getting the critical data into the fastest part of the disk: do as above, but create an additional filegroup in the backup location, the slowest part of the disk. Load the data into a table on the slow filegroup, then create a clustered index on the fast filegroup, which moves the entire table. Do this just for the big table; don't bother for the others.
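To make the clustered-index move concrete, here is a minimal T-SQL sketch; the filegroup and table names (FG_FAST, FG_SLOW, dbo.BigTable) and columns are hypothetical placeholders, not anything mandated above:

-- Stage the big table on the slow filegroup (the backup/source partitions at the
-- inner, slower end of each LUN), then move it by building the clustered index
-- on the fast filegroup (the first, outer partitions).
CREATE TABLE dbo.BigTable
(
    RecordID   bigint   NOT NULL,
    EventDate  datetime NOT NULL,
    Amount     money    NOT NULL
) ON [FG_SLOW];

-- ... bulk load the 1bn+ rows into dbo.BigTable here ...

-- Creating the clustered index ON [FG_FAST] rewrites the entire table onto the
-- fast filegroup, sorted by the index key, and frees the space used on FG_SLOW.
CREATE CLUSTERED INDEX IX_BigTable_EventDate
    ON dbo.BigTable (EventDate, RecordID)
    ON [FG_FAST];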
2. Until you get to 6 controllers + 12 racks of SAS drives (on the HP ML370 G5; 4 controllers + 8 racks on the Dell PE2900), stick to 146GB 10K or 73GB 15K drives. My thoughts are: never buy a system with the intent to use it for 4+ years as a primary system. Buy for 2 years, then buy a new system for the next 2 years, rolling the previous system to development/QA.
Let's see: 1 rack of 15 x 146GB drives (decimal, i.e. 146×10^9 bytes = 136GB binary, 2^30), assuming 3 x 5-disk RAID 5 LUNs, gives you 3 x 4 x 136 = 1631GB on each rack, or 543GB per LUN. On each LUN I would probably allocate:
- 160GB for data (possibly in 2 partitions) (approx 30%, which is the really fast part)
- ~30GB for tempdb (you have to figure out what you need depending on your app)
- the remaining 350GB+ for backup, source & extra
Be sure to use LiteSpeed for backup; later I will tell you the key tuning parameters. This works out to 3.8TB of data on 8 racks and 5.7TB with 12 racks. In any case, carefully consider 2 x 146GB drives vs 1 x 300GB: the 2 drives will have higher performance for the same capacity, so it is worth an extra 20-30%.

Joe, Many, many thanks again for your help with all this. I wanted to give you an update on the outcome, and see if you had any final thoughts, particularly with respect to the logical drive configuration. We actually have two analytical projects starting up with two different clients. At this point, based on your advice, we have ordered (for client B) and are planning to order (for client A) the following hardware:
Client A: HP ML370 G5 (client is an HP shop)
– 2x Quad Core Intel Xeon X5355 (2.66GHz, 1333 FSB)
– 16GB RAM (8x 2GB)
– internal P400 controller
— 2x 72GB 15K SAS as 72GB RAID1 for OS/swap/apps
— 6x 300GB 15K SAS as 1.5TB RAID5 for LOGS/first-level backups
– 6x HP P800 RAID controllers
– 6x MSA60, each with 12x 146GB 15K SAS (total 72 disks)
Options for the logical configuration are:
-- (a) 12x LUNs of 730GB 6-disk RAID5 (note: no hot spares), or
-- (b) 12x LUNs of 584GB 5-disk RAID5; 6x 146GB single disks; 6x hot spares, or
-- (c) 6x LUNs of 730GB 6-disk RAID5; 6x LUNs of 584GB 5-disk RAID5; 6x hot spares
(a) is clearly the fastest, but has no hot spares and thus is not preferred.
(b): propose to use the 6x 146GB single disks for tempdb (over 6 SQL files; see the sketch below).
(c): would use the 6x extra 146GB disks on the 6x larger LUNs as a software RAID0 for source data.
Any thoughts? In particular, are we better off with (b), spreading tempdb over 6 independent disks with no RAID penalty and leaving 60 disks for data, or with (c), spreading tempdb over the main 66 disks with RAID5 write penalties and random patterns, as well as an uneven number of disks across the 12 LUNs (6x 6-disk, 6x 5-disk)?
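For reference, a minimal sketch of what "(b) tempdb over 6 SQL files" would look like in T-SQL; the drive letters (T:, U:, ...) and sizes are hypothetical placeholders for the six single disks:

-- One tempdb data file per dedicated single disk; size them equally so that
-- SQL Server's proportional fill spreads the I/O evenly. New tempdb files are
-- usable immediately, no restart is needed just to add files.
ALTER DATABASE tempdb ADD FILE
    (NAME = tempdev2, FILENAME = 'T:\tempdb\tempdb2.ndf', SIZE = 20GB, FILEGROWTH = 1GB);
ALTER DATABASE tempdb ADD FILE
    (NAME = tempdev3, FILENAME = 'U:\tempdb\tempdb3.ndf', SIZE = 20GB, FILEGROWTH = 1GB);
-- ... repeat for the remaining disks, one file per spindle ...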
Also, on the HP P800 cards:
* From http://www.supercomputingonline.com/article.php?sid=12665, it appears that the HP P800 uses LSI Logic's LSISAS1078 ROC (RAID-on-chip) I/O processor, which I believe might be a 500MHz PowerPC-based processor.
* From LSI Logic's website at http://www.lsilogic.com/news/product_news/2005_03_23.html, it appears that the LSISAS1078 is almost 2 years old.
* From HP's website at http://h18000.www1.hp.com/products/quickspecs/12616_na/12616_na.HTML, it appears that the HP P800 has 512MB of read/write cache (vs. 2GB for the Areca controllers in the white box)
* From HP’s comments by email, "With RAID-5, the 64KB sequential write rate will max out with not too many (<12) drives at around 300MB/s. And the RAID-5 64KB sequential read rate will scale with drives until about 900MB/s.".
Client B: Dell PE2900 (client is a Dell shop)
– 2x Quad Core Intel Xeon X5355 (2.66GHz, 1333 FSB)
– 24GB RAM (12x 2GB)
– internal PERC 5/i controller
— 2x 300GB 15K SAS as 300GB RAID1 for OS/swap/apps
— 8x 300GB 15K SAS as 2.1TB RAID5 for LOGS/first-level backups
– 4x Dell PERC 5/E RAID controllers
– 4x PowerVault MD1000, each with 15x 300GB 15K SAS (total 60 disks)
I remember your point about preferring 8x racks with 146GB disks over 4x racks with 300GB disks for 20-30% more cost, but cost (and space) was a concern. You don't always get your preferred way! Options for the logical configuration are:
-- (a) 12x LUNs of 1.2TB 5-disk RAID5 (note: no hot spares), or
-- (b) 8x LUNs of 1.8TB 7-disk RAID5; 4x hot spares
(a) is clearly the fastest, but has no hot spares and thus is not preferred; (b) is probably the preferred option.
In both cases, we will use 2-4 partitions per LUN for file placement, as you suggest. We will also run LiteSpeed or SQL Backup first-level high-speed compressed backups directly to the internal chassis RAID5 disk array, then back these files up to tape later.
It seems we should use a 256KB stripe size, rather than a 64KB stripe size. This looks unusual, but is based on Microsoft's recommendation of a 256KB RAID5 stripe size for very large SQL Server 2005 databases. From http://www.microsoft.com/technet/prodtechnol/sql/2005/physdbstor.mspx#E4NAC:
"The stripe size refers to the unit of data that is written and accessed from a disk in a RAID system. This is a configurable value that is set when designing the storage array system. A smaller stripe size allows data to be distributed to more disks and increase I/O parallelism. Note that the stripe size of a single SQL Server extent (64 KB) is the lower limit. For the same data, a larger stripe size means the data can be stored on fewer disks and decrease the I/O distribution and the degree of parallelism. We recommend a 64 KB or 256 KB stripe size for most workloads. When the workload includes table and index range scans on tables that are larger than 100 MB, a stripe size of 256 KB allows for more efficient read-ahead."
Any thoughts on the stripe size?
Finally, though I don't think we will go the white-box route, I did get a quote that still looks interesting:
* 5U 48x SATA disk top-load AIC RSC-5D-2Q1 chassis
* Supermicro X7DBE Xeon 5000P Serverboard
* 2x Intel Xeon X5355 Quad Core 2.66GHz 1333MHz 2x4MB Cache CPU
* 16GB RAM as 8x Kingston 2048MB DDRII-667MHz FB-DIMM ECC Memory
* 48x 7.2K 500GB WD5000YS 16MB cache SATA disks
* 3x Areca ARC-1261 16-port controllers, each connected to 16 SATA disks
* Battery backup and 2GB RAM cache for each Areca card
Total hardware cost – $28k, including shipping from serversdirect.com
* The Areca cards use the Intel IOP341 (81341) I/O processor, which is an 800MHz XScale-based processor (http://www.intel.com/design/iio/iop341_42.htm)
* Areca claims this card will do >800MB/s sequential reads and writes – from reports, it seems to hit this with 10 or more 10K SATA Raptor disks
Interesting links re: the SATA white-box route:
* http://forums.storagereview.net/index.php?showtopic=24663&hl=areca
* http://www.sqlmag.com/Articles/Print.cfm?ArticleID=49011
Hope this is interesting, and as usual, any thoughts would be most welcome! Cheers,
David

I see they couldn't resist the 300GB 15K drives. Oh well, you can lead a horse to water… See the post above on file placement.
Suppose there is 1 database with 2 data filegroups, and that the application makes extensive use of tempdb. Then on each array group, create 4 partitions:
The first partition is for the most active main database filegroup.
The second partition is for the other main db filegroup.
The third partition is for tempdb.
The fourth is for backup and other uses.
The idea for the picture below was stolen from bertcord.

Rack 1
ArrayGroup    Partition 1   Partition 2   Partition 3   Partition 4
ArrayGroup1   FG1-File1     FG2-File1     tempdb1       backup1
ArrayGroup2   FG1-File2     FG2-File2     tempdb2       backup2

Rack 2
ArrayGroup    Partition 1   Partition 2   Partition 3   Partition 4
ArrayGroup1   FG1-File3     FG2-File3     tempdb3       backup3
ArrayGroup2   FG1-File4     FG2-File4     tempdb4       backup4

Rack 3
ArrayGroup    Partition 1   Partition 2   Partition 3   Partition 4
ArrayGroup1   FG1-File5     FG2-File5     tempdb5       backup5
ArrayGroup2   FG1-File6     FG2-File6     tempdb6       backup6

Rack 4
ArrayGroup    Partition 1   Partition 2   Partition 3   Partition 4
ArrayGroup1   FG1-File7     FG2-File7     tempdb7       backup7
ArrayGroup2   FG1-File8     FG2-File8     tempdb8       backup8

Since there are 8 array groups, each of the main db filegroups is split into 8 files, same for tempdb.
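As an illustration of the layout above, a partial T-SQL sketch of the two data filegroups, each with one file per array group; the database name, drive letters, and sizes are hypothetical placeholders, and only the first two of the eight files per filegroup are shown:

-- FG1 holds the big table (first/fastest partition of each array group),
-- FG2 holds everything else (second partition); one file per array group,
-- so each filegroup ends up split across all 8 array groups.
CREATE DATABASE DataMart
ON PRIMARY
    (NAME = DataMart_sys, FILENAME = 'C:\SQLData\DataMart_sys.mdf', SIZE = 1GB),
FILEGROUP FG1
    (NAME = FG1_File1, FILENAME = 'E:\SQLData\FG1_File1.ndf', SIZE = 150GB),
    (NAME = FG1_File2, FILENAME = 'F:\SQLData\FG1_File2.ndf', SIZE = 150GB),
    -- ... FG1_File3 through FG1_File8 on the other array groups' first partitions ...
FILEGROUP FG2
    (NAME = FG2_File1, FILENAME = 'G:\SQLData\FG2_File1.ndf', SIZE = 100GB),
    (NAME = FG2_File2, FILENAME = 'H:\SQLData\FG2_File2.ndf', SIZE = 100GB)
    -- ... FG2_File3 through FG2_File8 likewise ...
LOG ON
    (NAME = DataMart_log, FILENAME = 'L:\SQLLogs\DataMart_log.ldf', SIZE = 50GB);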
Hot spares
I really do not feel that DW/BI really needs hot spare disks, assuming that the server resides in a semi-managed environment. Keep 3-4 disks in a locked closet nearby; if a disk fails, have someone walk over to change it out. A little exercise will not kill you (disclaimer: I am not licensed to practice medicine!). Just make sure someone practices this first; every RAID controller is different, and invariably the first try will be goofed.

Stripe size
I always felt Compaq/HP knew their stuff with RAID; they default to either 128K or 256K, and there is something in the OS that works well with 256K stripe sizes. Test it if you can, using the test scripts in the post above: the table definition, population, and indexes are on the first page; use the random/sequential/update tests from the 3rd page (which also has a more comprehensive performance info collection table).
