SQL Server Performance

Please Help Analyze Collected Perf Monitor Stats

Discussion in 'Performance Tuning for Hardware Configurations' started by rweinstein, Aug 11, 2004.

  1. rweinstein New Member

    I set up Performance Monitor properly and ran it overnight through my nightly refresh, wrote the data to a .CSV log file and analyzed the results this morning. Would you guys take a quick look at my results and let me know if you have any conclusions or suggestions different from me?

    I run 2 servers, the SQL DB is on Server 1. Here is the config of the servers:

    Server 1 (SQL DB) - Not a dedicated Server for SQL
    2.6GB RAM
    4 X 2.8GB Processors
    10GB C: Drive
    135GB E: drive

    Server 2
    1.8GB RAM
    2 X 1GB Processors
    8.4GB C: Drive
    85GB E: drive

    Here are the statistics (the data was collected every 10 seconds):

    Physical Disc: % Disk Time
    Server 1:
    Exceeded 55% from 5:59pm - 7:35pm, 8:03pm - 8:11pm, 9:04pm - 11:19pm, and 11:33pm - 5:36am
    Server 2:
    Exceeded 55% from 11:31pm - 11:56pm
    Analysis: Need to increase I/O performance of Server 1 and possibly Server 2.

    Memory: Available Bytes
    Server 1: Never went below 459,333,632
    Server 2: Never went below 1,250,881,536

    Memory: pages/sec
    Server 1: Averaged 137 - 177 pages per second over 24-hour period
    Server 2: Averaged 51 - 67 pages per second over 24-hour period
    Analysis: Buffer Hit Cache Ratio never went below 99% and also running other applications on server, should remove other applications and have dedicated server.

    SQL Server Buffer Hit Cache Ratio:
    Server 1: Averaged 99.86% over 24 hours
    Server 2: Averaged 99.76% over 24 hours

    Physical Disk: Avg. Disk Queue Length
    Server 1:
    Averaged 4.6 from 6:04pm - 6:29pm
    Averaged 12.2 from 12am - 12:13 am
    Averaged 6.3 from 5am - 5:33am (SQL server DB backup during this time)
    Server 2: Barely ever reached 1
    Analysis: Possible I/O bottleneck, may need to increase I/O performance on Server 1.

    Processor: % Processor Time
    Server 1: Exceeded 1000% from 5:34pm - 7:35pm and 9:02pm - 5:35am (is this possible)
    Server 2: Exceeded 1000% from 9:06pm - 5:45am
    Analysis: CPU Bottleneck, more and/or faster CPU#%92s with L2 Cache. The numbers that I saw, in the thousands of percent continuously, is this possible?

    System: Processor Queue Length
    Server 1: Ok. Not more than 2 per CPU
    Server 2: Ok. Not more than 2 per CPU

    SQL Server Buffer Mgr: Pager Life Expectancy
    Server 1: Drops below and averages less than 300 for all of the following time periods:
    6 - 6:05 am, 5 - 5:08 am, 2:10 am, 1:36 - 1:37 am, 1:04-1:08am, 12:54-12:59am, 12:45-12:48am, 12:43am, 12:29-12:37am, 12:02-12:15am, 9:04-9:08pm
    Server 2: Consistently 2.8 Million - OK
    Analysis: Need more memory on Server 1


  2. Twan New Member

    CPU exceeding 1000% seems a little on the weird side!

    RAM seems ok, since there are hundreds of MB available at all times

    paging is high, but then you're not running a dedicated SQL srv, so that may just be normal file access


    what is your disk set up? RAID0, RAID1 RAID5? how many disks? Are you running Std edition of SQL? are we talking SQL2k on Win2K? Std edition of Win?

    server 1 dropping below 300s seems a little odd too

    How is the application performing? Does the database need tuning/indexing?

    Are you able to check the SQL buffer manager counters to see if the database is doing the page writes or whether it is other paging...? Also current and target SQL memory

    Cheers
    Twan
  3. Luis Martin Moderator

    How about SQL Buffer Manager: Page life expetancy?

    Luis Martin
    Moderator
    SQL-Server-Performance.com

    All postings are provided “AS IS” with no warranties for accuracy.

  4. rlahoty New Member

    Luis, you probably did't notice this in the original post :

    "SQL Server Buffer Mgr: Pager Life Expectancy
    Server 1: Drops below and averages less than 300 for all of the following time periods:
    6 - 6:05 am, 5 - 5:08 am, 2:10 am, 1:36 - 1:37 am, 1:04-1:08am, 12:54-12:59am, 12:45-12:48am, 12:43am, 12:29-12:37am, 12:02-12:15am, 9:04-9:08pm"

    I am not sure what to conclude from this. It is going below 300 for a period of 5 minutes or so and then it is probably back to normal. Does it really tell us that we need more memory on server? In my opinion, this is not a memory issue as it is not consistent (quite sporadic, in fact). It can happen because of an intensive job or a long running query, right?
  5. rweinstein New Member

    During all the time periods of the page life expectancy going so low, this is when we are doing our heavy processing and running the intensive long queries.

    It seems like my trouble areas are:

    % disk time
    Page life expectancy
    Avg. Disk Queue length

    But what does this really tell me? Do I need to improve any hardware, like I/O performace?

    Both Server 1 and Server 2 have RAID5 drives for the E: drive and RAID1 mirror drives for the C: drive.

    I am running SQL 2000 Enterprise edition on W2K environments.

    I am still learning Profiler and have a hard time setting up and reading the SQL buffer Mgr counters.

    I also do know that my DB needs some tuning. There are a couple of places that I am looking at adding some indexes, but I am not done with my research there.
  6. Luis Martin Moderator

    According those values (page life) I don't see you have memory issues.
    If you are plannig to add index, I suggest to place in diferent filegruoup in RAID 1.
    RAID 1 is faster for writes than RAID 5. Of course if you have space in disk C.
    RAID 5 vs RAID 1 or RAID 10, allways generate some discussion in Forum. In my customers (I free lance) I have better results with RAID 1,10 than RAID 5


    Luis Martin
    Moderator
    SQL-Server-Performance.com

    All postings are provided “AS IS” with no warranties for accuracy.

  7. rweinstein New Member

    Luis,

    The C: drive definitely does not have enough space for this use, but some other analysis shows that we have a lot more reads than writes on our DB, which indicates that RAID5 is better than RAID10, but I don't know too much about RAID1.

    From these stats, does it seem like we need more I/O, can any conclusions be drawn from this data?

    Thanks.
  8. Luis Martin Moderator

    I think no.
    I suggest to go on with actual hard and concentrate in tuning.

    Luis Martin
    Moderator
    SQL-Server-Performance.com

    All postings are provided “AS IS” with no warranties for accuracy.

  9. timingskey New Member

    rweinstein, are the data and log files on separate physical drives? The disk queue length seems high to me if all of your data and log files are on e:.
  10. derrickleggett New Member

    Server 1:
    Averaged 4.6 from 6:04pm - 6:29pm
    Averaged 12.2 from 12am - 12:13 am
    Averaged 6.3 from 5am - 5:33am (SQL server DB backup during this time)


    Let's drill into these a little bit more. On Server 1 exactly what RAID arrays do you have? What type are they? How many disks are in each array? Are these an average number for ALL arrays? Can you give us by individual array?

    MeanOldDBA
    derrickleggett@hotmail.com

    When life gives you a lemon, fire the DBA.
  11. rweinstein New Member

    Derrick,

    Thanks for the help, here are your answers:

    Server 1 has two drives, E: and C:. The numbers (4.6, 12.2, 6.3) are the total across both. THe documentation I read said to divide by 2, so the numbers result to (2.3, 6.1, and 3.15), which are all higher than '2' for long periods of time, the benchmark.

    Both Server 1 and Server 2 have RAID5 drives for the E: drive and RAID1 mirror drives for the C: drive and each server has dedicated Raid drives, they are not shared.

    I don't have the #'s by individual array, but can add these statistics to my daily counting.

    Please let me know if you have any other thoughts.

    Thanks.

  12. derrickleggett New Member

    You need to divide them out and let us know the individual counts per disk. How many physical disks are in the RAID 5 and RAID 1 arrays? I would assume 2 mirrored drives for C, so the big question is how many drives do you have on E. Also, are the database data and log files both on E?

    I'm just guessing here. I think you probably have a 3 disk RAID 5 array with the SQL Data and Log files both located on the same array. The C: drive is used for the OS and program files. Where is the tempdb located? If this is the case, when you start tracking the numbers seperate, you'll probably find the E: drive is vastly higher unless you use the tempdb a whole lot and it's on C:. Let us know. If this is the case, you are probably looking at some serious tuning time or new hardware. We can help with both.

    MeanOldDBA
    derrickleggett@hotmail.com

    When life gives you a lemon, fire the DBA.
  13. rweinstein New Member

    Derrick,

    You are right on about the hardware and DB config.

    I activated the individual disks for Perf. Monitor for Avg. Disk Queue length and almost 100% of the total number is coming from the E: drive. I also checked the timestamps and the high figures are showing up at the same time periods.
    So, here are the stats:

    E: drive - Server 1 Avg. Disk Queue Length:
    1. Averaged 3.176 from 6:01pm - 6:24pm (Please note that I am performing a full DB backup during this time)
    2. Averaged 3.6 from 10:42pm - 10:51pm
    3. Averaged 5.48 from 12:01am - 12:13am
    4. Averaged 2.96 from 12:34am - 12:51am
    5. Averaged 5.5 from 6am - 6:38am (please note that I am performing a full DB backup during this time)

    The E: drive is a RAID5 and only has one drive per array. My Network team told me that each RAID5 only has one drive.
    Yes, the DB Data and Log files and the Temp DB and backups are all on E: (I copy a backup to another server once a day for insurance, and there is a tape backup)

  14. Luis Martin Moderator

    I insist to change to RAID 1.

    Luis Martin
    Moderator
    SQL-Server-Performance.com

    All postings are provided “AS IS” with no warranties for accuracy.

  15. timingskey New Member

    I agree with Luis on switching to RAID 1. But it could make a big difference if you split your log, data, tempdb, and backups to separate spindles. Even then, you may need to split up the data file on separate spindles, such as putting the indexes on a separate spindle from the tables. Or if there are several tables which are hifghly active, then you may need to split put those tables in separate filegroups and then separate spindles.
  16. derrickleggett New Member

    The E: drive is a RAID5 and only has one drive per array. My Network team told me that each RAID5 only has one drive.

    This is impossible. RAID 5 requires a minimum of 3 drives. It performs optimal on 5-7 depending on the disk system.


    Other than that little oddity, I agree with everyone else here. You need more disks. Look at getting an enclosure. Try to get fiber if you can afford it. lol If you can't get an enclosure, at least buy more disks and split the load up.

    I would still like to know how many disks total that the E: drive has. I guarantee you it's not one if it's RAID 5. Also:

    1. What speed are your drives on this computer?
    2. Are you using an internal RAID controller?
    3. What type of server?

    If this is a Compaq more than a couple years old, the internal RAID controllers sucked. That's why I'm asking these kinds of questions.

    MeanOldDBA
    derrickleggett@hotmail.com

    When life gives you a lemon, fire the DBA.
  17. rweinstein New Member

    Luis - Timing - Derrick,

    Thanks for the ongoing help here. I will ask (again) my network team all the below questions indicating some of the things you said.

    I agree with you guys that we should have more/better drives, that does seem like a weak spot.

    I'll get the info tomorrow.

    Thanks.

  18. rweinstein New Member

    Here is some more detailed information on our Disk Drives:

    The RAID5 has 4 Drives total, 3 drives and a Hot Spare, and are all 15K speed. THe controller is a Smart array 5300 and I'm not sure if it is internal.

    The server is a CPQ Proliant DL 380 G3 and is less than a year old.

    On the RAID1 mirror, 2 drives, it holds the C: drive and also a P: drive for the Page file. I didn't notice this before. The P: drive is 7GB. Is this enough? What is the best size for the page file drive?

    Also, my network team indicated that we are using RAID5 because initially is was suiting our needs and also because RAID1 is not expendable. They also said that the 140GB RAID1's are only 10K speed and not 15K.

    Is this correct? What is the largest size a RAID1 mirror can handle and is it ok for it to be 10K.

    Any other thoughts/input?

    Thanks.
  19. timingskey New Member

    I think the 7GB page file is pretty large. With 2.6GB RAM, it seems you shouldn't need that much. If so, then perhaps the server is paging too much at certain times. Above, you gave averages for a 24 hour period. But I wonder if it's level throughout that time.

    Or maybe the page file is just sitting there. That should be ok.

    If avg disk queue length on c: is ok, I wouldn't worry about the fact that it's a 140GB disk spinning at 10k. And I would think RAID 1 for that would be fine.

    I'm still having a hard time getting your e: drive to add up. The 135Gb probably does not include the space on the hot spare. If they're Compaq drives, they sell 18GB, 36GB, and 72Gb drives which spin at 15k. 3 36GB drives = 108GB.

    Anyway, since we've found out that you have 3 drives in a RAID 5 scenario, your disk queueing numbers don't seem nearly as bad. As derrick stated, more would be better. But rule of thumb for disk contention is avg disk queue length greater than 2 * the number of disks. Given that, the disk queue length numbers you stated above are not too much out of line.

    To work with what you have, even still, if space allows, I would prefer to split up the 4 disks into two sets of RAID 1. Be sure to have the data on one set, and the transaction logs on another set. Actually, we write backups to the same spindle as our transaction logs, and it works for us. tempdb is on still a separate spindle for us. Perhaps you could put that to a separate partition on the c: spindle, but that could be risky - esp. since it spins at only 10k. If it's possible to get more disks and run RAID 1 separately for tempdb, data, log, and backups, that would still be best.

    Of course there is the possibility that hardware is not the issue at all, but rather locking/blocking/deadlocks. Blocking problems can drive the CPU pretty hard. Have you examined that aspect?
  20. chopeen Member

    quote:Originally posted by rweinstein

    Processor: % Processor Time
    Server 1: Exceeded 1000% from 5:34pm - 7:35pm and 9:02pm - 5:35am (is this possible)

    This article http://support.microsoft.com/default.aspx?scid=kb;en-us;310067&sd=tech) tells about % Disk Time, but maybe there is similar problem with other counters, too.

    --

    Marek 'chopeen' Grzenkowicz, MCP
    Poland
  21. rweinstein New Member

    Timing,

    Thanks. I see now, it makes much more sense. I would say that you are correct that the numbers are more inline with the norm having 4 disks as part of the set.

    What is a spindle?

    How would I check for blocking or locking problems?

    Thanks.
  22. Luis Martin Moderator

  23. Luis Martin Moderator

    Also this sp would help:

    DROP PROCEDURE sp_rk_blocker_blockee
    GO

    CREATE PROCEDURE sp_rk_blocker_blockee
    AS

    set transaction isolation level read uncommitted
    set nocount on

    declare @blocker_spid int, @blockee_spid int, @blockee_blocker int
    declare @blockee_waitime int

    IF EXISTS
    (SELECT * FROM master.dbo.sysprocesses
    WHERE spid IN (SELECT blocked FROM master.dbo.sysprocesses))
    BEGIN


    DECLARE blocker_cursor CURSOR FOR
    SELECT spid FROM master.dbo.sysprocesses WHERE spid IN (SELECT blocked FROM master.dbo.sysprocesses)
    AND blocked=0

    DECLARE blockee_cursor CURSOR FOR
    SELECT spid, blocked, waittime FROM master.dbo.sysprocesses WHERE blocked > 0


    OPEN blocker_cursor


    FETCH NEXT FROM blocker_cursor INTO @blocker_spid
    WHILE (@@FETCH_STATUS =0 )
    BEGIN
    Select 'Blocker: ',@blocker_spid
    exec sp_executesql N'dbcc inputbuffer(@Param)',N'@Param int', @blocker_spid
    --SELECT Blocked = spid FROM master.dbo.sysprocesses WHERE blocked = @blocker_spid
    OPEN blockee_cursor
    FETCH NEXT FROM blockee_cursor INTO @blockee_spid, @blockee_blocker, @blockee_waitime

    WHILE (@@fetch_status = 0)
    BEGIN
    --SELECT Blocked = spid FROM master.dbo.sysprocesses WHERE blocked = @blocker_spid
    --Select 'EE: ', @blockee_blocker, ' Er: ',@blocker_spid
    IF (@blockee_blocker=@blocker_spid)
    BEGIN
    SELECT 'Blockee: Waittime:', @blockee_spid, @blockee_waitime
    exec sp_executesql N'dbcc inputbuffer(@Param)',N'@Param int', @blockee_spid
    END
    FETCH NEXT FROM blockee_cursor INTO @blockee_spid, @blockee_blocker, @blockee_waitime
    END
    CLOSE blockee_cursor
    FETCH NEXT FROM blocker_cursor INTO @blocker_spid
    END
    CLOSE blocker_cursor
    DEALLOCATE blockee_cursor
    DEALLOCATE blocker_cursor

    --go


    END
    ELSE
    SELECT 'No blocking processes found!'
    go

    --exec sp_rk_blocker_blockee
    --SELECT * FROM master.dbo.sysprocesses where blocked = 0
    -- dbcc inputbuffer (129)
    -- exec sp_leadblocker
    -- sp_lock 90
    --



    Luis Martin
    Moderator
    SQL-Server-Performance.com

    All postings are provided “AS IS” with no warranties for accuracy.

  24. austinjpj New Member

    Given everything I have seen in this topic I would not say your disk are necessaryly the bottleneck.

    To determine if you have a disk bottleneck you need to look at the phyiscal disk counters in perf mon. The ones to look at are disk transfer/sec (IOPS), avg disk queue depth, avg disk sec/Transfer (latency), and disk Bytes/sec (MBPS).

    If the queue depth and the latency are both high then there is a good chance you need more spindles or need to re arrange the database (i.e. seperate data, logs, tempdb).

    A single 15K disk will do about 150-200 IOPS (8k random reads). If you have a (3+1) RAID 5 then the optimal IOPS rate would be 3 disk(spindle) * 150 (do not include parity disk) or 450 IOPS. Optimal is defined as the latency being less than 30ms, usually between 5-10ms. Anything greater than 30ms latency on the disk indicates thrashing on the disk. Latency less than or equal to 3ms indicate IO is in cache. what this really means is IO latecny greater than the average disk seek time as define by the disk manufacture is not in cache and pretty good chance it is a random IO. Sequential IO usually will be cached provided the cache is largest enough to handle the IO rate being submitted. Usually not the case in SCSI raid controllers. Now, SQLServer uses a block size of 8K which means most of the IO will be done in 8K transfer sizes (avg disk bytes transfer) thus the reason why we can use the IOPS number above based on 8K transfer sizes. We use the random read access type because most of the time databases are doing random reads and usually more reads than writes. Of course, if the writes are higher the IOPS number will be lower per disk drive. Writes simply consume more resources to satisfy an IO. You can calculate the number of spindles (disk) needed by dividing the total number of IOPS (disk transfer/sec) by 150. This will give you a good ball park figure to base line your performance. You will notice that a disk drive can do more IO, but it depends on the transfer size and the number of IO being submited simultaniously (queue depth).

    some other things to consider. If using NTFS, use the default for formating. Set the hardware stripe size on the RAID controller/RAID sets to 64K. 64K is the extent size for SQLServer. Seperate data and logs/tempdb on different disk (RAID 5 data, RAID 10/1 log/tempdb). Baseline your performance and then monitor over time (3-6 months) tracking utilization and growth. This will help you to determine when you need to buy more storage or more disk (spindles) for performance. General rule of thumb in performance is to drive the processors to 100%. an 80/20 split between user (80) and system (20) processor time. If the system processor time is higher than 20% that would indicate a bottleneck. If the system is paging (i.e. paging io is a larger percentage of all other IO) then you have to much memory allocated to SQLServer and you need to adjust the memory so the OS does not page. If the system is not paging, the cpu utilization splits are ok, the queue depth is low and disk latency is low, then the bottleneck is in the application. As said previously, locks and latches are usually the culprits and the root cause is usaully a poorly written query or poorly designed database. The where clause in queries are the key to finding this type of bottleneck.

    Hope this helps... if you have addtional questions or need more clarification post them here.

    James Jordan
    Sr. Engineering Consultant/Performance Engineer
  25. chopeen Member

    quote:Originally posted by rweinstein

    What is a spindle?

    Just a fancy word for an individual physical disk.

    --

    Marek 'chopeen' Grzenkowicz, MCP
    Poland

Share This Page