SQL Server Performance

LiteSpeed performance notes

Discussion in 'Third Party Tools' started by joechang, Oct 14, 2006.

  1. joechang New Member

    Let me start by saying that over the last 2 years, I have provided consulting services, first to Imceda and later to Quest, to improve the performance of their LiteSpeed for SQL Server backup compression product.
    The work I have done on the coding side involved multi-threading, asynchronous operations and the file I/O API.
    Of course, performance testing and analysis was also involved.
    In the spring of 2006, I provided the following results in support of their version 4.6 launch.

    Test system:
    CPU: 4 x 3.6GHz Xeon MP, Hyper-threading enabled
    Chipset: E8500
    Memory: 16GB
    I/O: 3 Dual Channel U320 SCSI RAID Controllers
    Disks: 24 U320 SCSI 73GB, 10K

    Various disk configurations were evaluated.
    So long as the data disks could support the required read rate and the backup disks could support the required write rate, a CPU-bound test result could be achieved,
    that is, the best result possible for the given CPU and chipset combination.
    The test results for Windows Server 2003 64-bit, SQL Server 2005 32 & 64-bit are as follows for data with low, medium and high compressibility:

    Compressibility   Backup rate, 4 threads   Backup rate, 8 threads
    2.7               364MB/sec                514MB/sec
    4.9               562MB/sec                742MB/sec
    9.9               807MB/sec                971MB/sec

    The restore rates ranged from 400-500MB/sec.

    It is not my purpose here to criticize competing products.
    Quest paid me to improve the performance of their product; the other companies did not.
    Of course, if anyone is interested in sponsoring an open test, where each party sends a representative to verify the configuration and settings for maximum performance of their product on a given system with adequate disk performance, I will guarantee for Quest that my work is better than anybody else's (a standard implicit part of my services).

    Just recently, Red-Gate commissioned the Tolly Group to publish a paper comparing the performance of their SQL Server backup compression product against that of the Idera and Quest products.

    The Tolly Group test system was an HP ProLiant ML570G3 with 2 Dual-Core Xeon 2.5GHz processors. Their disk configuration appears to have been 6 x 300GB 10K disks in RAID 10 for data and 3 x 300GB 10K disks in RAID 0 for backup, but the report does not make this clear.

    My test system and the Tolly Group system have different CPUs,
    but, very relevantly, both use the Intel E8500 or E8501 chipset.

    I have a considerable library of performance data (see my Processor Performance papers) that strongly indicates that, in CPU-bound software applications,
    the difference between the two systems is essentially four 3.6GHz P4/SSE3 cores versus four 2.5GHz P4/SSE3 cores.
    The compression algorithms involved are highly CPU intensive and sensitive mostly to frequency, and partly to the chipset (memory latency and bandwidth).
    Other factors, such as single versus dual cores and cache size, are immaterial.

    So the Tolly system should be able to achieve approximately 69.4% of my results on LiteSpeed (2.5GHz/3.6GHz).

    With proper hardware configuration, the Tolly system should show LiteSpeed backup rates of 252MB/sec with 4 threads and 357MB/sec with 8 threads on low compressibility data and higher rates for more compressible data.
    Since the Tolly test was run at default settings (3 threads) the result should have been 189MB/sec (0.75 x 252MB/sec) or better.
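A minimal sketch of the projection arithmetic above, under the post's own assumptions that throughput scales linearly with clock frequency and linearly with thread count up to the 4 physical cores. The variable and function names are illustrative only. Unrounded values differ slightly from the post's figures, which round 252.8 down to 252 before taking 75%.

```python
# Frequency-scaled projection of LiteSpeed backup rates (MB/sec).
# Assumptions taken from the post, not independently verified:
#   - throughput scales linearly with CPU clock frequency
#   - throughput scales linearly with threads up to 4 physical cores

MEASURED_GHZ = 3.6  # 4 x 3.6GHz Xeon MP test system
TARGET_GHZ = 2.5    # Tolly Group system (2 x dual-core 2.5GHz Xeon)

# Measured backup rates on low-compressibility (2.7) data, keyed by thread count.
measured = {4: 364.0, 8: 514.0}


def project(rate_mb_s, from_ghz=MEASURED_GHZ, to_ghz=TARGET_GHZ):
    """Scale a measured rate by the ratio of clock frequencies."""
    return rate_mb_s * to_ghz / from_ghz


projected = {threads: project(rate) for threads, rate in measured.items()}

# Default LiteSpeed setting in the Tolly test was 3 threads: 3/4 of the 4-thread rate.
three_threads = 0.75 * projected[4]

observed = 153.0  # LiteSpeed rate reported by Tolly
shortfall = observed / three_threads

for threads, rate in projected.items():
    print(f"{threads} threads: {rate:.1f} MB/sec projected")
print(f"3 threads: {three_threads:.1f} MB/sec projected")
print(f"observed / projected at 3 threads: {shortfall:.1%}")
```

If the linear-scaling assumptions do not hold on the target hardware, the projection is only an upper-bound estimate.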

    The Tolly-produced test results show 153MB/sec for LiteSpeed. Some information has leaked out that the Tolly test configuration did not actually use 3 disks in RAID 0 for the backup destination with the Quest and Idera products, only for the Red-Gate product. The details do not really matter. What it comes down to is that their report draws conclusions based on results far below what a competent performance expert should have generated.

    Again, it is not my purpose here to advocate Quest LiteSpeed or criticize Red-Gate, but I will say that for anyone interested in hiring an outside firm to produce a performance test report, it is important to find an expert who can produce top quality results.
    In any report with poor hardware configuration details or missing test details (for instance, the CPU load during each test), the flaws are easily exposed, and it will reflect on your reputation if you hire an incompetent firm or one that deliberately twists the results toward a certain conclusion.
    If anyone suspects this has been done to them, I will be happy (on a paid basis) to examine a test report for evidence of serious incompetence or fraud.
  2. Emma Goldstein New Member


    Interesting stuff. Do you have a link to the report so I can make my own mind up?

    - Emma
  3. James Moore New Member

    Hi Joe,

    I didn't feel your claims could go totally unchecked. Red Gate commissioned a report from the Tolly Group to see how our performance stacked up in real world settings. Quest were contacted with notice of the test by the Tolly Group and asked for feedback on how to conduct the test in a fair manner (a request to which Quest responded quite happily by asking that Compression Level 6 be used, and that their results for version 4.6 be discounted in favour of the results achieved with version 4.5).

    After the report was done, Quest sent legal threats to both Red Gate and Tolly. Quest raised no well-founded objections over this period, so Tolly took the decision to republish the report, with some amendments agreed to by both Quest and Red Gate.

    I am slightly concerned that, as an obviously experienced contractor when it comes to performance of applications, especially at the lower levels in both multi-threaded and NUMA machines, you are happy to take such liberties with the statistics you are quoting.

    Firstly, the claim you are making (and the only one which you are perhaps validating in any way, although see below for why I believe your conclusions are inaccurate) is that your compression code should have performed better on Tolly's test system than it did. You set out in the second paragraph that your results require that the operations are not I/O bound in any sense (i.e. you are only really measuring one part of the system). We all know that a backup operation is a pipeline, simplified to the following:

    +------------+ VDI +----------+    +---------+    +-------+
    | SQL Server | --> | Compress | -> | Encrypt | -> | Write |
    +------------+     +----------+    +---------+    +-------+
    Rate: J Mb/s          K Mb/s         L Mb/s        M Mb/s

    Simple queuing theory tells us that the throughput of the system is limited by the slowest link in the system. A prime example of this is the lengthened pipelines in modern processors, where each step does roughly the same amount of work in parallel so that the processor can be clocked at a higher rate than with the old 4-step pipelines.

    You can have the fastest compression algorithms you can imagine, but if your write speed isn't that good then the whole system is bounded by your write speed. I believe here at Red Gate we have a good balance across the system.
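The bottleneck argument above can be sketched as follows: in steady state, a serial pipeline cannot run faster than its slowest stage. The stage names follow the diagram; the rates are hypothetical placeholders chosen for illustration, not measurements from any product.

```python
def bottleneck(stage_rates):
    """Return (stage_name, rate) for the slowest stage of a serial pipeline.

    The steady-state throughput of the whole pipeline is bounded by this rate.
    """
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]


# Hypothetical per-stage rates in MB/sec (illustrative values only).
stages = {
    "sql_server_vdi": 400.0,
    "compress": 250.0,
    "encrypt": 300.0,
    "write": 150.0,
}

slowest, throughput = bottleneck(stages)
print(f"pipeline limited by '{slowest}' at {throughput} MB/sec")

# Doubling the compression rate changes nothing while 'write' is the limit.
faster_compress = dict(stages, compress=500.0)
print(bottleneck(faster_compress))
```

This simple model ignores buffering and the fact that compression reduces the volume the later stages must handle, so it shows only the limiting principle.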

    Secondly, you are equating two dual-core 2.5GHz processors to four 3.6GHz HT-enabled processors by saying they are equal in all but frequency, so that to reach the speed you would expect for a 2.5GHz processor you simply multiply your throughput by 0.694. I disagree with this assertion.

    This does not take into account the effect of your L1 or L2 cache size, chipset, processor affinity of data or the fact that an HT-enabled processor most likely can take advantage of the multiple pipelines available for int and fp processing, whereas a dual core cannot.

    In the next stage of your argument you argue that 3 threads should be able to perform at 75% of four threads, so we should be able to argue that 1 thread will be able to do 25% of 4 threads and 8 threads should be able to perform at 200% of 4 threads?

    In that case I would expect, on your system, to achieve 504MB/s on 8 threads, where you in fact achieve 357MB/s. We both know that each thread and each process is not totally independent; there are factors such as cache invalidation, processor affinity and disk I/O bounds which affect each aspect of the backup.

    I am not quite sure where your next assertion comes from, other than way out of left field, but the same machine and disk configuration was used for each of the backups. I would check your sources, and I am happy to discuss any factual errors you find in the report with the Tolly Group and ask them to correct them, as we gave Quest time to do twice, once after notification and again before publishing.

    The aim of the report was to illustrate the relative performance of third party backup tools in the market today on a system you might find in a medium sized business. The Tolly Group did this: they chose a reasonably common server setup and a publicly available data source (they used the data from Wikipedia), and used each of the tools to the best of their ability. Other than a brief email telling them to look at multiple thread options for all of the backup solutions, and that on a quad core box 3-4 threads would probably perform well, they were left to their own devices.

    The Tolly Group work hard to be impartial and were insistent that all of the results were fair. For reference, the report can be downloaded from the following URL: http://www.red-gate.com/products/SQL_Backup/tolly_report.pdf

    Many thanks,

    - James


    James Moore B.A. (Cantab)
    SQL Backup Lead Developer
    Red Gate Software Ltd
  4. joechang New Member

    What it comes down to is that I have a very extensive library of performance data,
    so I can make my claims with high confidence.
    In any case, the general expectation is that performance does not scale linearly with frequency.
    So going from 2.5 -> 3.6GHz should yield less than a 1.44X performance gain, not more.
    This would make the 2.5GHz LiteSpeed performance higher, not lower.
    However, I happen to know that the compression code scales nearly linearly with frequency.

    The linear scaling from 1 to 4 threads on a system with 4 physical cores holds.
    The Xeon system in question has Hyper-threading capability.
    Scaling to 1 thread per logical processor (2 threads per physical core) is positive but not linear.
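As a rough check of the "positive but not linear" claim, the 4-thread and 8-thread backup rates quoted in the first post can be compared directly. This is only a sketch using those published figures; the names are illustrative, and linear scaling would correspond to a ratio of 2.0.

```python
# Backup rates (MB/sec) from the first post, keyed by compressibility:
# (rate at 4 threads, rate at 8 threads) on the 4 x 3.6GHz Xeon MP system.
rates = {
    2.7: (364.0, 514.0),
    4.9: (562.0, 742.0),
    9.9: (807.0, 971.0),
}

# Ratio of the 8-thread rate to the 4-thread rate for each data set.
# 1.0 would mean no benefit from the second logical processor per core,
# 2.0 would mean perfectly linear scaling.
gains = {ratio: r8 / r4 for ratio, (r4, r8) in rates.items()}

for ratio, gain in gains.items():
    print(f"compressibility {ratio}: 8 threads = {gain:.2f}x the 4-thread rate")
```

On these figures every ratio falls between 1 and 2, so doubling the 4-thread rate to estimate 8 threads overstates what the quoted results show.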

    The fact that you do not know these items, or failed to cite them, shows your knowledge is limited.

    I said the best results are achieved by eliminating the IO bottleneck.
    This is fairly simple to do. The Tolly test system, when configured properly, should have had more than adequate IO bandwidth.
    The Tolly test configuration for LiteSpeed did not.

    I did not say Tolly was not impartial.
    I said either they were not impartial or they are incompetent.
    I cannot make definite conclusions.
  5. joechang New Member

    Some more results,
    all on a database with the TPC-H Lineitem table, no indexes.
    The compression ratio is approximately 2.7 to 1 (live data typically compresses between 3:1 and 5:1).

    2 x 3.2GHz Xeon (P4 core) 275MB/sec
    2 x 2.66GHz Xeon 5150 Dual Core (Core 2 Duo, 4 cores total) 600MB/sec

    And Tolly only managed 153MB/sec with 4 x 2.5GHz Xeon (P4) cores?
