Write Speed Performance – Queue Depth Issue? | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Write Speed Performance – Queue Depth Issue?

Hello all you SQL performance gurus out there. I would really appreciate some insight into an issue I am having with write performance.

System specs:
Dell PE2850
Dual Xeon 3.2GHz / 8GB RAM
Dell PowerVault 220S (14 x 36GB 15K U320)
PERC 4e/DC controllers (7 drives on each channel), configured as RAID 10.

In IOMeter, using an 8KB streaming write test, I peak around 175MB/sec with 32 concurrent IOs, which doesn't seem that amazing to begin with, but that's a whole different issue. =)

The problem is that whenever I test write-heavy operations in SQL, performance seems horrible. The command I am using is "select * into TempTest from myTable"; the source table has 46 million rows of string and integer columns. Watching Windows Performance Monitor, I am averaging 13.5MB/sec write speed and only a queue depth of 2-3, which is where I believe the problem is. I really need at least 2 IOs per spindle to max out the write speed.

The obvious question to me is: why doesn't SQL issue more concurrent IOs to maximize my speed? That seems like it would fix all my problems. Is there a way to adjust this or to instruct SQL to do so? Thanks for any help, as this has been very confusing to me!

Michael
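For reference, here is roughly how I am running and timing the test (a minimal sketch; TempTest is dropped and re-created each run):

-- rough sketch of the test, run from a fresh connection
SET STATISTICS TIME ON;   -- report elapsed and CPU time
SET STATISTICS IO ON;     -- report logical/physical reads per table

-- drop any leftover copy, then re-create it with SELECT INTO
IF OBJECT_ID('dbo.TempTest') IS NOT NULL
    DROP TABLE dbo.TempTest;

SELECT *
INTO   dbo.TempTest
FROM   dbo.myTable;       -- ~46 million rows of strings and integers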
there are several things that affect your situation.

for SQL 2000, there is a difference between Std & Ent Ed in a SELECT query. there is a row count threshold somewhere between 20-30: below it the SQL engine will issue 1 read at a time, above it Ent Ed will issue multiple IO requests, while Std Ed will not issue as many. the thought was that Ent Ed would be installed on a high-performance storage system and Std Ed would not, and the SQL team did not want to flood a poor storage system with too much IO. of course, many people installed Ent Ed on a poor storage system, such that Std Ed would appear to be more responsive to multiple users than Ent Ed.

in write operations, I am not aware that SQL will use aggressive IO ops. if you search on this, you will find that people who need high write throughput do it with multiple concurrent operations.

in your case, your IOMeter test was a single sequential write, in which case you should not have needed a high queue depth to get a fairly high number; try queue depths of 1, 2, 4, 8.

in your SQL test, you are doing 1 read from the source, 1 write to the data destination, and another write for the log. assuming your log is somewhere else, you are just doing 1 read and 1 write on the data disk.

run performance monitor and find out what the avg bytes/read and bytes/write are for just the data disk. i think you will find the performance characteristics are similar to random IO.
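as a rough alternative to Perfmon, you can pull per-file averages from inside SQL with fn_virtualfilestats; this is just a sketch, and the database name is a placeholder:

-- average bytes per read/write for each file of the target database
-- (fn_virtualfilestats exists in SQL 2000 and 2005; -1 means all files)
SELECT  DbId,
        FileId,
        NumberReads,
        BytesRead    / NULLIF(NumberReads,  0) AS AvgBytesPerRead,
        NumberWrites,
        BytesWritten / NULLIF(NumberWrites, 0) AS AvgBytesPerWrite,
        IoStallMS
FROM    ::fn_virtualfilestats(DB_ID('MyDatabase'), -1);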
To clarify, we are running 2005 Enterprise.
quote:
in your case, your IOMeter test was a single sequential write, in which case you should not have needed a high queue depth to get a fairly high number; try queue depths of 1, 2, 4, 8.

Maybe I should not have needed it, but I do. I get very poor write performance in general compared to what I expected. I max out my reads at 450MB/sec (queue depth 256), but writes top out around 150MB/sec (queue depth 32). Both seem to be bottlenecked by the controller, I can only assume, since the drives individually can easily do 90MB/sec for both read and write. I am running the latest drivers and firmware. My results are as follows:

8KB Streaming Write
Queue Depth | IOPS
1           | 7,819
2           | 10,268
4           | 12,700
8           | 14,644
16          | 16,033
32          | 18,636
quote:
run performance monitor and find out what the avg bytes/read and bytes/write are for just the data disk
i think you will find the performance characteristics are similar to random IO

Another test I did was simply to copy one large file from one place on the drive to another place on the same drive as a similar test, and it performed about 3x faster than SQL. Also, my random IO is much better than 13.5MB/sec too, but it requires a very high queue depth (256) to achieve that. I am willing to admit that I am just clueless about something here, but I would love to understand why it seems a single SATA drive can outperform 14 x RAID 10 15K drives in write operations such as these.
i seriously doubt your IOMeter 8K random write at low queue is going to be much better than 13MB/sec. why do you say a single SATA can outperform the RAID array? definitely not across a broad range of tests. there is a specific test that SATA does better on by default, but it is too complicated to explain again. i wrote about it somewhere but cannot recall where; must be getting senile.
quote:
i seriously doubt your IOMeter 8K random write at low queue is going to be much better than 13MB/sec

Well I agree, and that's the whole problem. Is there some way to instruct SQL to increase the queue depth?
no, you adjust your operations to the characteristics of SQL. i asked the SQL team once for the ability to hint the queue depth in a query; they laughed.
Ok, well we are actually using SELECT INTO to create subsets of data from our main table so that the many updates do not issue table locks against the main table while we are using it. I do not know of another way to easily implement that with better performance, but would love to hear of one. For instance, would it be better to create the table first and then use multiple threads to copy sections of the table over, instead of using one SELECT INTO?
try writing a dot net program that inserts/updates 10-30 rows per call and run multiple threads, or use Windows ACT (or whatever it's called these days) or a third-party tool, or use my scripts in the top post; they have been tested for proper performance characteristics. roughly, each thread would run something along the lines of the sketch below.
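as a rough illustration only (not my scripts), each connection copies its own key range into a pre-created table; table and column names here are placeholders:

-- one of several connections; each copies its own contiguous key range
-- (dbo.TempTest must be created ahead of time; dbo.myTable / id are placeholders)
DECLARE @lo INT, @hi INT;
SELECT  @lo = 1, @hi = 1000000;    -- this connection's slice of the key range

INSERT INTO dbo.TempTest
SELECT  *
FROM    dbo.myTable
WHERE   id >= @lo
  AND   id <  @hi;

with several of these running at once, you get the multiple outstanding writes that a single SELECT INTO will not give you.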

I did try the Update Test and scored 5MB/sec write speed, which seemed really bad, but I wasn't sure what that meant in context. I know a friend of mine tried the same update test on a SATA-based RAID with the same number of 10K drives and scored 12MB/sec. That is where my conclusion about SATA being faster at lower queue depths came from. I did get 821MB/sec on the tablescan with MAXDOP 2, though.
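For reference, the MAXDOP limit is just a query hint on the scan, along the lines of this generic example (not Joe's actual script; the table name is ours):

-- generic example of a parallelism-limited table scan
SELECT  COUNT_BIG(*)
FROM    dbo.myTable
OPTION  (MAXDOP 2);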
it is very important that you understand the difference between random and sequential IO. if you are talking about sequential, use MB/sec; if you are talking about random IO, use IOPS (reads or writes per sec), i.e., forget about MB/sec.

you cannot get 821MB/sec on 2 U320 SCSI channels; about 520MB/sec is the max at low queue. 10K SCSI/SAS is about equivalent to 10K SATA in random IO; 15K SAS/SCSI is 50% better than 10K SAS or SATA. any discrepancies are probably due to variations in the configuration you did not account for.

keep in mind RAID controllers are kind of tricky: there are settings that sound good but kill performance.
Sorry, I forgot to mention that I am using two identical HBAs and PowerVaults, 28 drives total. As for the write speed, his SATA setup (the friend's, mentioned above) blows mine away on sequential, especially at low queue depth.
that changes a lot. i was basing everything on 14 disks and 8K IO, meaning 13MB/s = 1664 IOPS. run my scripts and post the results (the full output of the select statements); i will compare with my system at 11 disks.

The 13MB/s is the speed I am getting per drive array, so that number is still valid. The thing that still doesn't add up, though, is why I get 3x this speed when I just copy a file from one place to another on the same drive. Shouldn't it be comparable, or is that just the way SQL works? Test results using both arrays, 28 disks (2 x 14 x RAID 10 on 2 x PERC 4e/DC):

Memory – Random
55,270602,30000,935,9020.066,10,90200.67,0,0,0
55,100487,30000,938,3349.567,30,100487,0,0,0
55,36187,30000,937,1206.233,100,120623.3,0,0,0
55,3669,30000,937,122.3,1000,122300,0,0,0

Disk – Random
56,311,30046,17,10.3508,10,103.508,216.2684,1.689597,4.607571
56,185,30060,15,6.154358,20,123.0872,222.7212,1.74001,4.467363
56,968,30000,37,32.26667,30,968,1286.867,10.05365,6.578899
56,1023,30000,54,34.1,50,1705,1784.567,13.94193,7.022938
56,831,30000,81,27.7,100,2770,2722.1,21.26641,8.662111
56,427,30030,104,14.21911,300,4265.734,4070.23,31.79867,9.346408
56,167,30013,127,5.564256,999.9641,5564.056,5144.337,40.29139,9.784905
56,73,30360,157,2.40448,2999.726,7212.78,6312.022,49.43928,8.871186
56,20,31046,168,0.6442054,9998.25,6440.926,5470.882,42.85677,11.05071

Tablescan
57,1,63033,1206,4194304,32768,519.8547,2931.877,524.901,7.70102
57,1,39796,1505,4194304,32768,823.3994,1679.591,826.5773,21.93953

Update Test
65,39125,30000,751,1304.167,30,39125,0,0,1305.9,5.097786,3162000
65,475,30000,31,15.83333,30,475,824.5,6.441406,15.83333,0.06184896,8.370892
ok, it turns out my original update scripts really just collected the read-from-data and write-to-log IO. by putting a separate time & IO wrapper around the CHECKPOINT, i can now get the high-queue write you were looking for.

note the new expanded Results2 table; only the original test table C2 is not changed. everything is on the first page of the post: http://www.sql-server-performance.com/forum/topic.asp?TOPIC_ID=16995

my checkpoint results:

writes/sec | write MB/sec | ms/IO
2580       | 134          | 12.2
3819       | 30           | 67
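the wrapper is roughly this shape (a simplified sketch, not the exact script from the sticky post):

-- snapshot file IO, force the dirty pages out with CHECKPOINT, snapshot again, diff
DECLARE @t0 DATETIME, @t1 DATETIME;

SELECT  FileId, NumberWrites, BytesWritten, IoStallMS
INTO    #before
FROM    ::fn_virtualfilestats(DB_ID(), -1);   -- -1 = all files of this database

SET @t0 = GETDATE();
CHECKPOINT;        -- flushes the dirty data pages left behind by the update test
SET @t1 = GETDATE();

SELECT  a.FileId,
        a.NumberWrites - b.NumberWrites               AS WritesDuringCheckpoint,
        (a.BytesWritten - b.BytesWritten) / 1048576.0 AS MBWritten,
        a.IoStallMS - b.IoStallMS                     AS IoStallMS,
        DATEDIFF(ms, @t0, @t1)                        AS ElapsedMS
FROM    ::fn_virtualfilestats(DB_ID(), -1) a
JOIN    #before b ON b.FileId = a.FileId;

DROP TABLE #before;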
Here they are:

73,38577,30000,899,1285.9,30,38577,0,0,648.6334,2.532585,0.02913819,638.6,0.04551623
73,0,640,5,0,0,0,7.8125,0.06103516,860.9375,156.6284,12.30214,4.6875,0.3332223
73,937,30016,165,31.21668,30,936.5005,1415.945,11.06207,4.397655,0.01712954,6.562686,26.88566,0.2342004
73,0,8326,59,0,0,0,0,0,3341.821,26.3042,63.39592,0.1201057,350.6494
73,895,30000,253,29.83333,100,2983.333,3052.3,23.84609,29.9,0.3497396,8.092466,0,0
73,0,23030,168,0,0,0,0,0,3744.681,30.11637,58.3351,0.08684324,0
73,473,30000,298,15.76667,299.9979,4729.967,4521.2,35.32188,15.83333,0.53903,8.884939,0,0
73,0,34546,230,0,0,0,0,0,3856.134,31.38642,58.7351,0.05789382,0

your numbers look a little low, but not unreasonable.

write into a tight area at 156MB/sec: not sequential, but not random either. random write at 3340-3856 IOPS and 59ms latency: since each write goes to both disks of a RAID 10 mirror pair, that is roughly 3500 x 2 / 28 = 250 IOPS per disk.

my system, 11 disks RAID 0, 10K: 118MB/sec in the tight test and 3800-3900 in the random write test. that's 350+ per disk, which is what it should be. so i think you probably have a RAID controller setting or diskpart alignment giving you slight degradation. your 15K disks should be faster than mine, but the exact percentage is in one of my documents. i am thinking SATA disks without NCQ will be below 200 per disk; not sure about with NCQ.
It seems to me that, considering I am using 28 x 15K disks to your 11 x 10K, my performance is way low =) What would you suggest looking at? I did not see many settings other than the usual "write-back" on the RAID controller.
because I am not using RAID and you are on RAID 1 or 10, my 11 x 10K disks in a write test should be equal to 22 x 10K disks in RAID 10. however, 22 disks in RAID 10 should in theory have 2X the read performance of my 11, but not all RAID controllers can manage it. look for a setting that is Direct IO or Cached IO; it should be Direct IO.
I see, that makes a bit more sense then. And yeah, all arrays have always been set to Direct IO. I am running RAID 10 on each array (2 x 14 x RAID 10). Any other ideas? Maybe this is just the extent of the PERC 4e/DC? Are there better U320 controllers?
i would first do a test with JBOD: make 14 individual disks on each array, then make a database with, say, 27 data files, one on each disk, plus 1 log file (something like the sketch below), and run the test. then make 7 RAID 1 disks on each array and see what those yield.
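a rough sketch of the JBOD database layout (drive letters and sizes are placeholders, and only 3 of the 27 data files are shown):

-- one data file per JBOD disk, log on its own disk
CREATE DATABASE JbodTest
ON PRIMARY
    ( NAME = JbodTest_01, FILENAME = 'E:\JbodTest_01.mdf', SIZE = 10GB ),
    ( NAME = JbodTest_02, FILENAME = 'F:\JbodTest_02.ndf', SIZE = 10GB ),
    ( NAME = JbodTest_03, FILENAME = 'G:\JbodTest_03.ndf', SIZE = 10GB )
    -- ... one file on each of the remaining JBOD disks ...
LOG ON
    ( NAME = JbodTest_log, FILENAME = 'L:\JbodTest_log.ldf', SIZE = 10GB );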
just ran these and posted my results in the sticky thread
Joe, here is why I have gotten the impression that SATA is much better than U320 on write speed.

Gordy's Update results:
D Writes/sec | D Write MB/sec
 1463.19     | 307.35
28998.60     | 227.16
35002.02     | 275.23
20020.54     | 158.28

My Update results:
D Writes/sec | D Write MB/sec
  860.9375   | 156.6284
 3341.821    |  26.3042
 3744.681    |  30.11637
 3856.134    |  31.38642

We are both using 2 x 14 x RAID 10 with 2 HBAs. He is using 74GB 10K Raptors. He is getting at worst double me, and at best near 10x me. I am only showing the 0-rows-per-call results. In IOMeter his streaming writes are much better than mine at peak, and especially at low queue depth. My random reads do perform quite a bit better, though, as they should.
Joe, No comment on this? =)
Never mind, I just saw your response to Gordy's post in another forum. I guess we will see when he posts his other results with longer duration, but I will say that in IOMeter sequential write tests his setup definitely blew away mine. Is this normal? He is also able to get peak performance on all tests with much lower queue depth. Is this just an advantage of SAS/SATA vs U320, or something wrong with my controller?
there might be configuration issues with your setup, on the order of 20-40%, but not 2-10X. there are times you have to stop looking at just the numbers and think about what is actually happening. i am telling you, there is no way in hell any 10K hard disk (with a rotating platter) can do 2000 nonsequential IOPS, so you just need to figure out what is actually happening. note that several aspects of your setup are much better than Gordy's. why is it you are only unhappy in the areas where yours is less than his? did the two of you grow up together, and you always had fancier toys than he did?

if you don't care about the Dell warranty, you could put a 1GB DIMM in your controller. i think it uses the same Intel IOP 33x controller as the SATA controller... oops, never mind. the Intel SRCU42X PCI-X U320 RAID controller, which should be very similar to a PERC 4 (both using LSI & Intel components), has an Intel IOP321 controller (400MHz), for which the spec says up to 1GB DDR 200MHz; the Areca is IOP332 with DDR 333. anyway, there were no improved U320 SCSI RAID controllers. the Dell PERC5/E SAS controller is probably better, but at this point, wait until a SAS controller with the new Intel IOP34x line is out. Areca already has one for SATA; I don't know why they won't support SAS.

This isn't about competition, but real-world performance. Knowing my drives are capable of going much faster than the computer/controller seems able to drive them is frustrating to me. I just want to make sure I am getting the best I can out of it. It makes a big difference when something that takes 10 minutes instead takes 5 when you are doing these operations often. Sure, I can just keep piling up more drive arrays and controllers, but that doesn't seem like the smartest idea if other technology just does a better job.

I have never understood how sequential write performance maxes out at 150-175MB/sec when each drive can do 90 on its own and I can hit 475MB/sec on reads, for example. The write speed makes a big difference in many of the operations our applications perform. I would be happy if I could get a straight answer, like: yes, you are getting the best you can, and SATA RAID controllers just do a better job at low queue depth operations and have higher peak write performance. I suppose it could just be the extra cache on the controller that makes it appear that way, though?

P.S. My controller has the IOP 332 chip; it's the same card as the LSI MegaRAID 320-2E.
sorry, i forgot you said PERC4/E, which does have the IOP33x, the same controller as the Areca 11xx line.

let's see: on 2 U320 channels, i think you will get a max of 260-270MB/sec, and that's decimal MB/sec, as in million bytes/sec, not binary 1048576 bytes/sec. so if everything were perfect, you should have gotten 498MB/sec, not 475; something is already less than perfect.

in pure RAID theory, to write 100MB/sec in RAID 10, 200MB/sec is actually written to disk (every block goes to both mirrors), so the max possible write on 2 U320 channels is 248MB/sec. it does not matter that a disk can do 90; you will only net about 35 per disk with 7 disks per SCSI channel. if you read my old articles on SCSI disk performance, you will see that i recommend 4-5 disks per channel for peak performance. also, why bother quoting 90 per disk? the actual disk spec is 90 (96 decimal) on the outer tracks and 58 decimal on the inner tracks.
are you using a raw file system? if not, have you noticed where Windows puts your big mdf file, even when it is the first file on a freshly formatted partition? this is why i use raw partitions for peak performance testing.

anyway, to understand disk performance: first, you must understand the difference between sequential and non-sequential IO. that is why i said if it is non-sequential, don't even bother talking about MB/sec; just talk about IOPS. second, always report disk latency. if it shows zero, you are not doing non-sequential disk IO; you cannot compare the results of one test showing 0 latency and another at 10ms, because the first is not actually doing disk IO. finally, it is very difficult to get SQL to generate high write numbers. my tests were rigged to generate the higher numbers that are possible. the reasons include the complex interaction between a log write, a write to data, and a write to each index, the lack of queue depth control, and the fact that write ops are single threaded, i.e., no parallelism.

you might also have noticed, in my main hw post, that i advocate 4 racks of disks minimum, with careful calibration. i don't care if your db is only 100GB; get the 4 racks and 4 controllers. you will need them to brute-force through certain operations.

Thank you. Knowing that RAID 10 writing everything twice actually uses up half of the U320 channel bandwidth explains a lot. So do you think that's the reason SATA does better on sequential writing than I do, just more available bandwidth? And do you have any comment on the queue depth issue, as in why it seems I need such a high queue depth to max my speed out compared to the Areca SATA setup? This was apparent on all tests: read, write, and random. Thanks!
I think you missed the most important point I made on the Areca results: Gordy's test result showed zero latency for non-sequential writes, which means the test was not actually writing to disk. you can get amazing results if you are writing to memory and not disks. writing to cache hides the disk write time, but eventually it does have to be written to disk.

I have seen no true sequential write results from you or Gordy, but if you did want to know: the max sequential write to 2 U320 channels in RAID 10 is ~248MB/sec. the max realizable bandwidth over a PCI-X 100 bus might be around 500-600MB/sec. if the SATA controller has 1GB/s bandwidth to disk and can drive it, then the SATA or SAS controller might do sequential writes at 500MB/s+. but who cares for SQL; SQL cannot do pure sequential writes at this level. stop fretting and chase down the disk IO issues I mentioned. if you want solid state performance, get a solid state drive; try http://www.bitmicro.com/

I was referring to the IOMeter tests, which I can post if you would like, that show his results peak at much lower queue depths on all tests. As for the disk IO issues you mentioned, the only thing I am aware of to do at this point is to try striping the drives in different configurations to see if it will cause SQL to use more concurrent drive access. In my mind that is related to the above issue, since it would not be necessary if my configuration performed better at lower queue depths. Was there anything else?
ok, i forgot you posted that. i should have commented on it, but i did not regard it as relevant to your real issue, and it still is not.

the IOMeter small-block sequential write you show is a little funny. what should have happened is: IOMeter writes 8K; the controller caches the write, immediately acknowledging (to IOMeter) the write as complete, but does the actual write on its own time; IOMeter issues another write, and so on, until a sufficient backlog builds up; at this point the controller is issuing multiple small writes which will span all disks, so you should get near max performance even at low queue depth.

in your case, you are only getting 8K IOPS at queue depth 1. it seems like you are in write-through mode, where the controller must actually write the data to disk before acknowledging its completion to IOMeter. because it is only a small write, the write only hits 2 disks (both RAID 10 mirror copies), hence the 8KB x 8K IOPS = 64MB/sec. when the queue depth increases, the controller gets enough IO to issue writes to all disks in one shot, for the controller maximum.

in any case, none of this really matters to your SQL code, because SQL write performance is very tricky; generating high IOMeter write performance does not necessarily help, which is what I have been trying to explain to you several times.

I still believe that Gordy has an advantage on sequential write speed, but after some more tests and such, I am beginning to see your point about it not mattering much when it comes to SQL.
Would you be so kind as to post the IO scripts you have been using?
Thanks

Tracey
If you are looking for a reply from Joe, it is better to contact him directly by email. If I happen to see him at the PASS event in Denver, I will drop him a note though [:)].
