SQL Server Performance

Write Speed Performance - Queue Depth Issue?

Discussion in 'SQL Server 2005 Performance Tuning for Hardware' started by akaSurreal, Mar 7, 2007.

  1. akaSurreal New Member

    Hello all you SQL Performance gurus out there. I would really appreciate some insights to an issue I am having with write performance.

    System Specs:

    Dell PE2850
    Dual Xeon 3.2G / 8GB RAM
    Dell PowerVault 220S (14 x 36GB 15K U320)
    PERC 4e/DC Controllers (7 drives on each channel)

    I am configured in Raid 10.

    In IO Meter using an 8KB streaming write test, I peak around 175MB/sec with 32 concurrent IOs, which doesn't seem that amazing to begin with, but that's a whole different issue. =) The problem is that whenever I attempt to test write-related operations in SQL, my performance seems horrible.

    The command I am using is "select * into TempTest from myTable"; the table has 46 million rows of string and integer columns. Using Windows Performance Monitor, I am averaging 13.5MB/sec write speed and only a 2-3 queue depth, which is where I believe the problem is. I really need at least 2 IOs per spindle to max out the write speed.

    I guess the obvious question to me, is why doesn't SQL do more concurrent IO's to maximize my speed? This seems like it would fix all my problems. Is there a way to adjust this or instruct SQL to do so?

    Thanks for any help as this has been very confusing to me!

  2. joechang New Member

    there are several things that affect your situation

    for SQL 2000, there is a difference between Std & Ent Ed.
    in a SELECT query, there is a row count threshold somewhere between 20-30,
    below which the SQL engine will issue 1 read at a time
    above which, Ent Ed SQL will issue multiple IO requests
    Std Ed will not issue as many

    the thought is that Ent Ed would be installed on a high performance storage system and Std Ed would not
    and the SQL team did not want to flood a poor storage system with too much IO
    of course, many people installed Ent Ed on a poor storage system
    such that Std Ed would appear to be more responsive to multiple users than Ent Ed.

    in write operations, I am not aware that SQL will use aggressive IO ops
    if you search on this
    you will find that people who need high write throughput do it with multiple concurrent operations

    in your case
    your IO Meter test was a single sequential write,
    in which case, you should not have needed a high queue depth to get a fairly high number, try 1,2,4,8

    in your SQL test, you are doing 1 read from the source,
    1 write to the data destination,
    and another write for the log

    assuming your log is somewhere else,
    you are just doing 1 read and 1 write

    run performance monitor, find out what the avg bytes/read and /write for just the data disk
    i think you will find the performance characteristics are similar to random IO
  3. akaSurreal New Member

    To clarify, we are running 2005 Enterprise.

    in your case
    your IO Meter test was a single sequential write,
    in which case, you should not have needed a high queue depth to get a fairly high number, try 1,2,4,8

    Maybe I should not have needed it, but I do. I get very poor write performance in general compared to what I expected. I max out my read at 450MB/sec (256 depth) but write is only 150MB/sec (32 depth); I can only assume both are bottlenecked by the controller, as the drives individually can easily do 90MB/sec both read and write. I am running the latest drivers and firmware.

    My results are as follows:

    8KB Streaming Write
    Queue Depth | IOPs
    1 | 7,819
    2 | 10,268
    4 | 12,700
    8 | 14,644
    16 | 16,033
    32 | 18,636

    run performance monitor, find out what the avg bytes/read and /write for just the data disk
    i think you will find the performance characteristics are similar to random IO

    Another test I did was just to copy one large file from one place on the drive to another place on the same drive as a similar test, and it performed about 3x faster than SQL. Also, my random IO is much better than 13.5MB/sec too, but it requires a very high depth (256) to achieve.

    I am willing to admit that I am just clueless about something here, but would love to understand why it seems a single SATA drive can outperform 14xRaid10 15K drives in write operations such as these.
  4. joechang New Member

    i seriously doubt your IOMeter 8K random write at low queue is going to be much better than 13MB/sec

    why do you say a single SATA can outperform the RAID array
    definitely not across a broad range of tests

    there is a specific test that SATA does better by default, but it is too complicated to explain again
    i wrote about it somewhere but cannot recall, must be getting senile
  5. akaSurreal New Member

    i seriously doubt your IOMeter 8K random write at low queue is going to be much better than 13MB/sec

    Well I agree, and that's the whole problem. Is there some way to instruct SQL to increase queue depth?
  6. joechang New Member

    you adjust your operations to the characteristics of SQL

    i asked the SQL team once for the ability to hint the queue depth in a query

    they laughed,
  7. akaSurreal New Member

    Ok, well we are actually using SELECT INTO to create subsets of data from our main table so that the many updates do not issue table locks against the main table while we are using it. I do not know of another way to easily implement that with better performance, but would love to hear any alternatives.

    Like for instance, would it be better to Create a table, and then use multiple threads to copy sections of the table over instead of using one SELECT INTO?
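    The split-the-copy idea above could look something like this sketch, which generates one INSERT...SELECT per key range so each range can run on its own connection/thread. The key column "id" and the chunk boundaries are assumptions for illustration, not the actual schema.

```python
# Sketch of the chunked alternative to one big SELECT INTO: pre-create
# TempTest, then copy disjoint key ranges concurrently. The column "id"
# is a hypothetical clustered key; myTable/TempTest come from the thread.
def make_chunk_statements(min_id, max_id, n_chunks):
    step = (max_id - min_id + 1) // n_chunks
    statements = []
    for i in range(n_chunks):
        lo = min_id + i * step
        hi = max_id if i == n_chunks - 1 else lo + step - 1  # last chunk absorbs the remainder
        statements.append(
            f"INSERT INTO TempTest SELECT * FROM myTable "
            f"WHERE id BETWEEN {lo} AND {hi};"
        )
    return statements

for stmt in make_chunk_statements(1, 46_000_000, 4):
    print(stmt)
```

    Each generated statement would be handed to a separate worker; whether this beats a single SELECT INTO depends on how well the controller handles the extra concurrent writes.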

  8. joechang New Member

    try writing a dot net program that inserts/updates 10-30 rows per call
    run multiple threads

    or use Windows ACT or whatever it's called these days
    or a third party tool

    or use my scripts in the top post
    they have been tested for proper performance characteristics
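    The pattern joechang describes (several worker threads, each submitting small batches per call) can be sketched like this. The database call is a stub here so the concurrency structure itself is visible; a real version would execute a batched INSERT/UPDATE on each thread's own connection.

```python
import queue
import threading

BATCH_SIZE = 20                  # joechang suggests 10-30 rows per call
work = queue.Queue()
results = []
results_lock = threading.Lock()

def submit_batch(rows):
    # stand-in for one INSERT/UPDATE round trip of a small batch of rows
    with results_lock:
        results.append(len(rows))

def worker():
    while True:
        batch = work.get()
        if batch is None:        # sentinel: shut this worker down
            break
        submit_batch(batch)

rows = list(range(1000))         # pretend these are 1000 rows to insert
for i in range(0, len(rows), BATCH_SIZE):
    work.put(rows[i:i + BATCH_SIZE])

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for _ in threads:
    work.put(None)               # one sentinel per worker
for t in threads:
    t.join()

print(len(results), sum(results))   # 50 batches, 1000 rows total
```

    With 4 workers in flight at once, the storage sees 4 concurrent write streams instead of 1, which is the point of the suggestion.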
  9. akaSurreal New Member

    I did try the Update Test and scored 5MB/sec write speed, which seemed really bad, but I wasn't sure what that meant in context. I know a friend of mine tried the same update test on a SATA based RAID with the same number of 10K drives and scored 12MB/sec. That is where my conclusion about SATA being faster at lower queue depths comes from.

    I got 821MB/sec on the tablescan with MAXDOP 2 though.
  10. joechang New Member

    it is very important that you understand the difference between random and sequential IO

    if you are talking about sequential, use MB/sec
    if you are talking about random IO, use IOPS (reads or writes /sec)

    ie, forget about MB/sec

    you cannot get 821MB/sec on 2 U320 SCSI channels,
    about 520MB/sec is max

    at low queue, 10K SCSI/SAS is about equiv to 10K SATA in random
    15K SAS/SCSI is 50% better than 10K SAS or SATA

    any discrepancies are probably due to variations in the configuration you did not account for
    keep in mind RAID controllers are kind of tricky
    there are settings that sound good, but kill performance

  11. akaSurreal New Member

    Sorry I forgot to mention that I am using two identical HBAs and powervaults. 28 drives total.

    As for the write speed, his SATA blows me away on sequential, especially at low queue depth.
  12. joechang New Member

    that changes a lot
    i was basing everything on 14 disks, 8K IO, meaning 13MB/s = 1664 IOPS

    run my scripts and post the results (the full output of the select statements)
    i will compare with my systems at 11 disks
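    The 13MB/s = 1664 IOPS figure is just a unit conversion: at an 8K IO size, MB/sec and IOPS are the same measurement in different units (binary MB assumed here).

```python
# joechang's conversion between throughput and IO rate at an 8K IO size
io_size_kb = 8
mb_per_sec = 13
iops = mb_per_sec * 1024 // io_size_kb
print(iops)                      # 1664
```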
  13. akaSurreal New Member

    The 13MB/s is the speed I am getting per drive array, so that # is still valid. The thing that still doesn't add up, though, is why I get 3x this speed when I just copy a file on the same drive. Shouldn't it be comparable, or is it just the way SQL works?

    Test results using both arrays, 28 disks:
    (2 x 14 x Raid10 on 2 x PERC 4e/DC):

    Memory - Random

    Disk - Random


    Update Test

  14. joechang New Member

    ok, it turns out my original update scripts really just collected the read from data and write to log IO
    by having a separate time & IO wrapper around the CHECKPOINT

    i can now get the high queue write you were looking for
    Note the new expanded Results2 table

    only the original test table C2 is not changed
    everything is on the first page of the post


    my checkpoint results
    writes/sec | write MB/sec | ms/IO
    2580 | 134 | 12.2
    3819 | 30 | 67

  15. akaSurreal New Member

    Here they are:

  16. joechang New Member

    your numbers look a little low, but not unreasonable

    write into a tight area at 156MB/sec
    not sequential but not random either

    random write at 3340-3856 IOPS and 59ms latency
    that's something like 250 IOPS per disk (each write goes to 2 disks)
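    A sanity check on the per-disk figure: RAID 10 sends each logical write to 2 disks, so the per-disk rate is total IOPS x 2 / disk count. The 28-disk count (2 x 14 in RAID 10) comes from earlier in the thread.

```python
# per-disk write rate under RAID 10 mirroring
disks = 28
for total_write_iops in (3340, 3856):
    per_disk = total_write_iops * 2 / disks
    print(round(per_disk))       # 239, then 275
```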

    my system, 11 disks RAID 0, 10K
    118MB/sec in the tight test

    3800-3900 in the random write test,
    that's 350+ per disk, which is what it should be

    so i think you probably have RAID controller setting or diskpart alignment giving you slight degradation
    your 15K disks should be faster than mine,
    but the exact percentage is in one of my documents

    i am thinking the SATA disks without NCQ will be below 200 per disk, not sure about with NCQ

  17. akaSurreal New Member

    Seems to me that considering I am using 28x15K disks to your 11x10K, that my performance is way low =)

    What would you suggest looking at? I did not see many settings other than the usual "write-back" on the raid controller.
  18. joechang New Member

    because I am not using RAID
    and you are on RAID 1 or 10

    my 11 disks 10K in a write test
    should be equal to 22 10K disks in RAID 10

    however 22 disks RAID 10 should in theory have 2X the read perf of my 11
    but not all RAID controllers can manage it

    look for something that is Direct IO or Cached IO
    it should be Direct IO
  19. akaSurreal New Member

    I see, that makes a bit more sense now. And yeah, all arrays have always been set to Direct IO.

    I am running Raid10 on each array. (2x14xRaid10)

    Any other ideas? Maybe this is just the extent of the PERC 4e/DC? Are there better U320 controllers?
  20. joechang New Member

    i would first do a test with JBOD
    make 14 individual disks on each,
    make a database with say 27 data files, one on each disk + 1 log file

    do that test
    then make 7 RAID 1 disks on each Array

    see what those yield
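    The 27-data-file layout for the JBOD test could be generated like this sketch: one data file per physical disk plus one log file elsewhere. The mount point paths and database name are hypothetical.

```python
# Build the CREATE DATABASE statement for joechang's JBOD test layout.
# Paths under C:\mnt\diskNN are placeholder mount points, one per disk.
def jbod_database_sql(n_data_files=27):
    files = [
        f"(NAME = data{i:02d}, FILENAME = 'C:\\mnt\\disk{i:02d}\\data{i:02d}.ndf')"
        for i in range(1, n_data_files + 1)
    ]
    return (
        "CREATE DATABASE JbodTest\nON\n    "
        + ",\n    ".join(files)
        + "\nLOG ON (NAME = log1, FILENAME = 'C:\\mnt\\logdisk\\log1.ldf');"
    )

print(jbod_database_sql())
```

    With one file per spindle, SQL Server's proportional fill spreads writes across all disks, which is what the JBOD comparison is meant to measure.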
  21. gordy New Member

    just ran these and posted my results in the sticky thread
  22. akaSurreal New Member


    Here is why I have gotten the impression that SATA is much better than U320 on write speed:

    gordy's Update results:

    D Writes/sec | D Write MB/sec

    My Update Results:

    D Writes/sec | D Write MB/sec

    We are both using 2 x 14 x Raid10 with 2 HBAs. He is using 74GB 10K Raptors. He is getting at worst double me, and at best near 10x me. I am only showing the 0 Rows per call results. In IO Meter his streaming writes are much better than mine at peak, and especially at low queue depth. My random read does perform quite a bit better though, as it should.
  23. akaSurreal New Member


    No comment on this? =)
  24. akaSurreal New Member

    Never mind, I just saw your response to Gordy's post in another forum. I guess we will see when he posts his other results with a longer duration, but I will say that in IOMeter sequential write tests, his setup definitely blew away mine. Is this normal?

    He is also able to get Peak performance on all tests with much lower queue depth. Is this just an advantage of SAS/SATA vs U320 or something wrong with my controller?
  25. joechang New Member

    there might be configuration issues with your setup on the order of 20-40%
    but not 2-10X

    there are time you have to stop looking at just the numbers
    and think about what is actually happening

    i am telling you, there is no way in hell,
    any 10K hard disk (with a rotating platter) can do 2000 nonsequential IOPS,
    so you just need to figure what is actually happening

    note that several aspects of your setup are much better than Gordy's
    why is it you are only unhappy in the areas where yours is less than his?
    did the two of you grow up together, and you always had fancier toys than he did?

    if you don't care about the dell warranty, you could put a 1GB DIMM in your controller
    i think it uses the same Intel IOP 33x controller as the SATA controller

    oops, never mind
    the Intel SRCU42X PCI-X U320 RAID controller, which should be very similar to a PERC 4, both using LSI & Intel components
    has an Intel IOP321 controller (400MHz), for which the spec says up to 1GB DDR 200MHz
    the Areca is IOP332 DDR 333,

    anyways, there were no improved U320 SCSI RAID controllers
    the Dell PERC5/E SAS controller is probably better,

    but at this point, wait until a SAS controller with the new Intel IOP34x line is out, Areca already has one for SATA, I don't know why they won't support SAS

  26. akaSurreal New Member

    This isn't about competition, but real world performance. Knowing my drives are capable of going much faster than the computer / controller seems able to drive them is frustrating to me. I just want to make sure I am getting the best I can out of it. It makes a big difference when something that takes 10 minutes takes 5 when you are doing these operations often. Sure, I can just keep piling up more drive arrays and controllers, but that doesn't seem to be the smartest idea if other technology just does a better job. I have never understood how sequential write performance maxes at 150-175MB/sec when each drive can do 90 on its own and I can hit 475MB/sec on reads, for example.

    The write speed makes a big difference on many of the operations our applications perform. I would be happy if I could get a straight answer, like: yes, you are getting the best you can, and SATA RAID controllers just do a better job at low queue depth operations and have higher peak write performance. I suppose it could just be the extra cache on the controller that makes it appear that way, though?

    P.S. My controller has the IOP 332 chip, it's the same card as LSI MegaRaid 320-2E.
  27. joechang New Member

    sorry, i forgot you said PERC4/E, which does have the IOP33x,
    which is the same controller as the Areca 11xx line

    lets see,
    on 2 U320 channels, i think you will get a max of 260-270MB/sec,
    thats decimal MB/sec, as in million bytes/sec, not binary 1048576 bytes/sec

    so if everything is perfect, you should have got 498MB/sec, not 475
    so something is already less than perfect

    in pure RAID theory, to write 100MB/sec in RAID 10,
    200MB/sec is actually written to disk
    so the max possible on 2 U320 is 248MB/sec
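    The arithmetic behind that ceiling: RAID 10 physically writes every byte twice, so host-visible write bandwidth is half the bus limit. The ~497MB/sec two-channel number is taken from the figures above.

```python
# RAID 10 write bandwidth ceiling on 2 U320 channels
two_channel_limit_mb = 497       # realizable MB/sec on 2 U320 channels
raid10_write_factor = 2          # each logical write becomes 2 physical writes
net_write_mb = two_channel_limit_mb // raid10_write_factor
print(net_write_mb)              # 248
```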

    it does not matter that a disk can do 90,
    you will only net about 35 per disk with 7 disks per SCSI channel
    if you read my old articles on SCSI disk performance
    you will see that i recommend 4-5 disks per channel for peak performance

    also why bother quoting 90 per disk
    the actual disk spec is 90 (96 decimal) on the outer tracks, 58 decimal on the inner tracks
    are you using a raw file system?
    if not, have you noticed where windows puts your big mdf file?
    even if it is the first file on a freshly formatted partition

    this is why i use raw partitions for peak performance testing

    anyways, to understand disk performance,
    first you must understand the difference between sequential and non-sequential IO
    that is why i said if non-sequential, don't even bother talking about MB/sec
    just talk about IOPS

    second, always report disk latency
    if it shows zero, you are not doing non-sequential disk io,
    you cannot compare the results of one test showing 0 latency,
    and another at 10ms
    the first is not actually doing disk io

    it is very difficult to get SQL to generate high write numbers
    my tests were rigged to generate the higher numbers that are possible

    the reasons include the complex interaction between a log write,
    a write to data and each index
    the lack of queue depth control,
    the fact that write ops are single threaded, ie, no parallelism

    you might have also noticed, in my main hw post
    i advocate 4 racks of disks min
    with careful calibration

    i don't care if your db is only 100GB,
    get the 4 racks and 4 controllers
    you will need it to brute force through certain operations
  28. akaSurreal New Member

    Thank you, knowing that Raid10 writing double actually uses up half of the U320 channel bandwidth explains a lot.

    So do you think that's the reason that SATA does better on sequential writing than me, just more available bandwidth?

    And do you have any comment on the queue depth issue? As in, why it seems I need such a high queue depth to max out my speed compared to the Areca SATA setup. This was apparent on all tests: read, write, random.

  29. joechang New Member

    I think you missed the most important point I made on the Areca results

    Gordy's test result showed zero latency for non-sequential writes
    this means the test was not actually writing to disk

    you can get amazing results if you are writing to memory and not disks
    writing to cache hides the disk write time,
    eventually it does have to be written to disk

    I have seen no true sequential write results from you or Gordy
    but if you did want to know
    The max sequential write to 2 U320 channels in RAID 10 is ~248MB/sec

    The max realizable bandwidth over a PCI-X 100 bus might be around 500-600MB/sec
    If the SATA controller has 1GB/s bandwidth to disk and can drive it
    then the SATA or SAS controller might do sequential write at 500MB/s+

    but who cares for SQL
    SQL cannot do pure sequential writes at this level

    stop fretting, chase down the disk IO issues I mentioned

    if you want solid state performance, get a solid state drive
  30. akaSurreal New Member

    I was referring to the IO Meter tests, which I can post if you would like, that show his results peaking at much lower queue depths on all tests.

    As for the disk IO you mentioned, the only thing I am aware of to do at this point is to try striping the drives in different configs to see if it will cause SQL to use more concurrent drive access, which in my mind is related to the above issue, since this would not be necessary if my configuration performed better at lower queue depths. Was there anything else?
  31. joechang New Member

    ok, i forgot you posted that,
    i should have commented on this, but i did not regard it as relevant to your real issue,
    and it is still not

    the IOMeter small block sequential write you show is a little funny

    what should have happened is:
    IOMeter writes 8K
    the controller caches the write,
    immediately acknowledging (to IOMeter) the write as complete but does the actual write on its own time
    IOMeter issues another write and so on
    until a sufficient backlog builds up
    at this point, the controller is issuing multiple small writes which will span all disks so you should get near max performance even at low queue depth

    in your case, you are only getting 8K IOPS at queue depth 1
    it seems like you are in write-through mode
    where the controller must actually write the data to disk before acknowledging its completion to IOMeter
    because it's only a small write, the write only hits 2 disks (both RAID 10 mirror copies)
    hence the 8KB x 8K IOPS = 64MB/sec,
    when queue depth increases, the controller gets enough IO to issue writes to all disks in one shot, for the controller maximum
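    The write-through arithmetic works out like this: at queue depth 1 each 8KB write must hit the disks before the next one is issued, so throughput is just IO size x IOPS. 8192 is the "8K IOPS" figure rounded for the arithmetic (the measured value was 7,819).

```python
# queue-depth-1 throughput under write-through: IO size x IO rate
io_size_kib = 8
iops_at_qd1 = 8192
mib_per_sec = io_size_kib * iops_at_qd1 / 1024
print(mib_per_sec)               # 64.0
```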

    in any case, none of this really matters to your SQL code,
    because SQL write performance is very tricky,

    generating high IOMeter write performance does not necessarily help
    which is what I have been trying to explain to you several times

  32. akaSurreal New Member

    I still believe that Gordy has an advantage on sequential write speed, but after some more tests and such, I am beginning to see your point about it not mattering much when it comes to SQL.
  33. TRACEYSQL New Member

    Would you be so kind to post your IO Scripts that you have been using.
  34. satya Moderator

    If you are looking for a reply from Joe, then it is better to contact him directly by email. If I happen to see him at the PASS event in Denver then I will drop a note though [:)].
