SQL Server Performance

RAID config advice - long

Discussion in 'Performance Tuning for Hardware Configurations' started by paul-w, Mar 3, 2008.

  1. paul-w New Member

    We have a dual quad core processor HP DL380 server with 2 x Storageworks MSA 50 arrays attached. Each has 10 drive bays. We have an HP P800 storage controller. There are 8 bays in the server itself controlled by a P400. It's running 32-bit Windows 2003 Enterprise with, currently SQL 2000 SP4 Standard (to be upgraded to Enterprise). There's 4 GB RAM (I plan adding another say 8 GB when we move to Enterprise SQL). Couldn't move to 64-bit Windows and SQL 2005 because of time constraints - the amount of testing required.
    Initially I had several 4-disk RAID 10 arrays, but performance was very poor. I decided that the data wasn't being split over enough physical drives, so I reconfigured. We now have:
    1st MSA 50 : Drive E, 10 x 146 GB 10,000 rpm SAS drives, RAID 10, 680 GB. This stores 5 databases occupying around 280 GB so nearly 60% free space.
    2nd MSA 50: Drive F, 4 x 146 GB drives, RAID 10, 270 GB. This stores 7 databases occupying 64 GB so around 75% free.
    2nd MSA 50: Drive H, 2 x 72 GB 15k rpm SAS drives, RAID 1, 68 GB. This stores the tempdb data and log files occupying 16 GB on average so around 75% free.
    2nd MSA 50: Drive I, 2 x 72 GB 15k rpm SAS drives, RAID 1, 68 GB. This is spare.
    Server drives: Drive M, 2 x 72 GB 15k rpm SAS drives, RAID 1, 68 GB. This stores the log files for all dbs apart from tempdb. About 50% free.
    Server drives: Drive C, 2 x 72 GB 15k rpm SAS drives, RAID 1, 68 GB. OS here.
    The server is for reporting. During the day users run queries on the databases which are mainly copies of live databases from the previous day. At night, various overnight processing takes place, including restoring the live data from the production servers across the network and running various batch processes.
    I've been running performance monitor overnight at 15 sec intervals and this is what led me to the conclusion that the I/O per disk was too high.
    Last night, after at least a month of adequate performance, we got this in the SQL log. As you can see, drives E, F and H all get a mention:
    2008-03-03 04:01:47.28 spid1 SQL Server has encountered 1 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [F:\SQL2000\AccountingData.mdf] in database [Accounting] (8). The OS file handle is 0x000003E8. The offset of the latest long IO is: 0x0000016e998000
    2008-03-03 04:01:47.28 spid1 SQL Server has encountered 1 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [H:\Sqldata\tempdb.mdf] in database [tempdb] (2). The OS file handle is 0x00000564. The offset of the latest long IO is: 0x00000077de8000
    2008-03-03 04:01:47.28 spid1 SQL Server has encountered 1 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [E:\SQL2000\REPORTING_Data.mdf] in database [Reporting] (14). The OS file handle is 0x000004B0. The offset of the latest long IO is: 0x00000289010000
    2008-03-03 04:01:47.28 spid1 SQL Server has encountered 1 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [F:\SQL2000\ProcessControl.mdf] in database [ProcessControl] (16). The OS file handle is 0x00000494. The offset of the latest long IO is: 0x00000000096000
    2008-03-03 04:02:37.28 spid1 SQL Server has encountered 1 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [F:\SQL2000\DASHBOARD.mdf] in database [DASHBOARD] (11). The OS file handle is 0x0000048C. The offset of the latest long IO is: 0x00000000254000
    2008-03-03 04:04:17.28 spid1 SQL Server has encountered 1 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [H:\Sqldata\templog.ldf] in database [tempdb] (2). The OS file handle is 0x00000478. The offset of the latest long IO is: 0x000000096ef000
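    To pin down which files are actually stalling, SQL 2000's fn_virtualfilestats is handy. A minimal sketch - the dbid/fileid pair below is just an example (tempdb is dbid 2), and IoStallMS is cumulative, so take two samples and diff them:

        -- Cumulative I/O counts and total stall time (ms) for one file (SQL 2000).
        -- 2 = tempdb, matching the "(2)" in the log lines above; file 1 = the mdf.
        SELECT DbId, FileId, NumberReads, NumberWrites, IoStallMS
        FROM ::fn_virtualfilestats(2, 1)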

    Those long-I/O messages were followed by many EventID 17052 messages and Error 17883, "Process 61:0 (1674) UMS Context 0x066BE1C8 appears to be non-yielding on Scheduler 0".
    Microsoft points to I/O problems being an issue - http://support.microsoft.com/kb/897284 - and also recommends turning off parallelism.
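    If it comes to trying the parallelism suggestion, a minimal sketch of the server-wide version (individual queries can also be hinted with OPTION (MAXDOP 1)):

        -- Turn off parallel plans server-wide, per the KB suggestion.
        -- 'max degree of parallelism' is an advanced option, so expose it first.
        EXEC sp_configure 'show advanced options', 1
        RECONFIGURE
        EXEC sp_configure 'max degree of parallelism', 1   -- 1 = no parallelism
        RECONFIGURE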
    Looking at the I/O side, I have updated all the firmware to the latest versions. With the CPU still pretty quiet (though around 55% for about an hour each night and around 25% for another hour), I'm left with I/O or perhaps memory as the issue.
    Memory is definitely a problem - the buffer cache hit ratio drops to 70% during busy periods.
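    That counter can also be watched from inside SQL Server. A minimal sketch against SQL 2000's sysperfinfo - the ratio counter only means something once divided by its matching base counter:

        -- Buffer cache hit ratio from inside SQL 2000.
        -- sysperfinfo exposes the raw perfmon counters; NULLIF guards the division.
        SELECT 100.0 * r.cntr_value / NULLIF(b.cntr_value, 0) AS buffer_cache_hit_ratio
        FROM master.dbo.sysperfinfo r
        JOIN master.dbo.sysperfinfo b ON b.object_name = r.object_name
        WHERE r.object_name LIKE '%Buffer Manager%'
          AND r.counter_name = 'Buffer cache hit ratio'
          AND b.counter_name = 'Buffer cache hit ratio base'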
    I have done all I can with the drive E array - no more disks will fit in the MSA50. I could split the tempdb mdf and log files using the spare RAID 1 72 GB array. I could buy some more 146 GB drives and expand drive F. I don't particularly want to throw resources at the problem without being reasonably confident I'll get some results though.
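    For what it's worth, moving the tempdb log onto the spare drive I array would be a one-liner, assuming the default logical file name; the path below is illustrative and the change only takes effect after a SQL Server restart:

        -- Move tempdb's log file to the spare RAID 1 array (drive I).
        -- 'templog' is the default logical name for the tempdb log.
        USE master
        ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'I:\Sqldata\templog.ldf')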
    One thing I wondered was whether, because reads are over 90% of the activity, RAID 5 might be an option. Would I end up getting better read performance with, say, a 9-disk RAID 5 than a 10-disk RAID 10? Note that I'd have a hot spare with RAID 5, hence 9 available disks. My thinking is that the data would be spread over all 9 disks with RAID 5, whereas with the RAID 10, half the capacity would be copies of the data.
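    A rough back-of-envelope check, using the drive E figures below and the usual write penalties (2 back-end I/Os per write for mirroring, 4 for RAID 5's read-data/read-parity/write-both cycle), suggests RAID 5 wouldn't actually help here even at 90%+ reads:

        -- Per-disk back-end I/Os for the drive E workload (1557 reads/sec, 48 writes/sec).
        SELECT (1557.0 + 2 * 48.0) / 10 AS raid10_ios_per_disk,  -- = 165.3
               (1557.0 + 4 * 48.0) / 9  AS raid5_ios_per_disk    -- = 194.3

    Reads hit every spindle in both layouts (RAID 10 services reads from both sides of each mirror), so the extra data spindle doesn't buy much, and the heavier write penalty pushes the per-disk load up.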
    Here are my overnight performance monitor results. I still see what seem to be I/O bottlenecks.
    Drive E
    Heavy usage start: 02:00
    Heavy usage finish: 03:00
    Space required: 281 GB
    Total space: 683 GB
    % free: 59%
    Number of disks: 10
    Disk Reads/sec: 1557.00
    Disk Writes/sec: 48.00
    Avg. Disk Queue Length: 39.00
    Average queue length per disk: 3.90
    Recommended max queue length: 2.00
    % Disk Time average: 3979
    Recommended max % Disk Time: 55
    Avg. Disk Sec/Read: 0.01
    Avg. Disk Sec/Write: 0.05
    RAID 10 I/Os per disk: 165.30
    Recommended max I/Os: 125.00
    Disks required: 13.22
    I'm working out the I/Os per disk using the following formula: (Disk Reads/sec + 2 x Disk Writes/sec) / number of physical disks, i.e. (1557 + 2 x 48) / 10 = 165.3.
    Drive F looks like this. Note that this 30 minute time slot is the only really busy period I could find:
    Heavy usage start: 05:40
    Heavy usage finish: 06:10
    Space required: 64 GB
    Total space: 273 GB
    % free: 77%
    Number of disks: 4
    Disk Reads/sec: 1058.00
    Disk Writes/sec: 0.58
    Avg. Disk Queue Length: 40.80
    Average queue length per disk: 10.20
    Recommended max queue length: 2.00
    % Disk Time average: 4083
    Recommended max % Disk Time: 55
    Avg. Disk Sec/Read: 0.03
    Avg. Disk Sec/Write: 0.06
    RAID 10 I/Os per disk: 264.79
    Recommended max I/Os: 125.00
    Disks required: 2.12
    Drives H and M are lightly used:
    Drive H
    Database: tempdb
    Heavy usage start: 02:00
    Heavy usage finish: 03:00
    Space required: 16 GB
    Total space: 68 GB
    % free: 76%
    Number of disks: 2
    Disk Reads/sec: 15.30
    Disk Writes/sec: 15.20
    Avg. Disk Queue Length: 1.08
    Average queue length per disk: 0.54
    Recommended max queue length: 2.00
    % Disk Time average: 108
    Recommended max % Disk Time: 55
    Avg. Disk Sec/Read: 0.00
    Avg. Disk Sec/Write: 0.03
    RAID 10 I/Os per disk: 22.85
    Recommended max I/Os: 125.00
    Disks required: 0.18
    Drive M
    Database: log files
    Heavy usage start: 02:27
    Heavy usage finish: 02:34
    Space required: 47 GB
    Total space: 68 GB
    % free: 31%
    Number of disks: 2
    Disk Reads/sec: 0.00
    Disk Writes/sec: 27.00
    Avg. Disk Queue Length: 0.33
    Average queue length per disk: 0.16
    Recommended max queue length: 2.00
    % Disk Time average: 32.80
    Recommended max % Disk Time: 55
    Avg. Disk Sec/Read: 0.00
    Avg. Disk Sec/Write: 0.01
    RAID 10 I/Os per disk: 27.00
    Recommended max I/Os: 125.00
    Disks required: 0.22
  2. satya Moderator

    Welcome to the forum!
    OK, you have been focusing on hardware and resources on the system, but have you looked at what kind of queries, processes and scheduled jobs are running during these busy times? All too often I see people simply blame SQL Server and never look at how badly the queries are written, which will by default cause such spikes or issues even for simple SELECT statements.
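    On SQL 2000 a quick way to see who is doing the I/O during the busy window is sysprocesses - a minimal sketch (physical_io and cpu are cumulative per connection, so sample twice and compare):

        -- Sessions doing the most physical I/O right now (SQL 2000).
        SELECT spid, dbname = DB_NAME(dbid), cmd, cpu, physical_io, lastwaittype
        FROM master.dbo.sysprocesses
        WHERE physical_io > 0
        ORDER BY physical_io DESC
        -- Then DBCC INPUTBUFFER(<spid>) shows the last statement a suspect spid ran.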
    See http://sqlserver-qa.net/blogs/perftune/archive/2007/12/19/2981.aspx about setting a performance baseline at your end, then proceed from there.
