SQL Server Processor Performance, 2006 – Page 6

All of these problems were completely fixed in SQL Server 2005. If that were not enough, yet additional improvements in the TPC-H benchmark scores were made in SQL Server 2005 SP1. The gains in SP1 may have been due to the ability to correlate date-time columns between tables to generate a better execution plan for the TPC-H queries, as opposed to broadly applicable improvements in large query handling and parallel execution plans. In any case, there is a very strong argument for migrating data warehouse applications to SQL Server 2005 without delay.

Table 6 shows some recent TPC-H 100 GB results on SQL Server 2005 x64 Enterprise Edition. The first two systems are the HP ProLiant ML570G4 and ProLiant DL585G1, the third is the Dell PowerEdge 2900. The Opteron result is the RTM version of SQL Server 2005, build 1399. The other two results are on SP1, build 2047. There is a Dell result on the 4 Xeon 7041 that appears to the RTM (15,209 power, 8,740 throughput, and 11,529 QphH). Given that a pre-release of SP1 was available in late 2005, it is unclear why SP1 was not used. It is not certain that SP1 produces performance gains at this level.

Processors	Freq. GHz	Cache	Mem. GB	No. of Disks	Power	Throughput	QphH	Report Date
4 Xeon 7041	3.0	2×2 MB	64	124	19,497	10,405	14,243	5/22/2006
4 Opteron 880	2.4	2×1 MB	64	84	15,610	10,171	12,600	11/4/2005
2 Xeon 5160	2.2	512 KB	48	128	13,052	8,409	10,477	7/10/2006

Table 6: Recent TPC-H 100 GB results on SQL Server 2005 x64 Edition.

Figure 15 below shows the stream 0 (Power) run times for each of the TPC-H queries (lower time is better performance). The HP system Xeon 7041 has better results in all but two queries compared to the Opteron 880.

Figure 14: TPC-H 100 GB Stream 0 (Power) run times by query.

Figure 16 shows the TPC-H 100 GB average time for five simultaneous streams (used to compute the throughput score) for the above system. Here the results are mixed between the Xeon 7041 and the Opteron 880, reflecting the very close overall throughput scores.

Figure 15: TPC-H 100 GB average multi-stream run times by query.

A TPC-H 100 GB scale factor means that the line item table data size is approximately 100 GB. With other tables and all indexes, the total database size is approximately 170 GB. Notice that the many of the 100 GB results are on systems with 64 GB of memory, the same as for the 1000 GB results in the 16 processor Itanium systems. It is unfortunate that there is not a broad set of results for the x86 systems at 300 GB and 1000 GB.

Xeon and Opteron Notes

At this point, it is possible to discuss the Intel Xeon and AMD Opteron performance characteristics with respect to SQL Server. It can certainly be said that AMD made good design decisions in the K7 to K8 generations, including significant innovations in the x86 market. The original K7 was a more advanced architecture than its contemporary, the Pentium III at 250nm, with a slightly deeper pipeline for higher frequency and much improved floating point units. The 250nm Katmai was listed at 9.5 MB transistors, with 2×16 KB L1 caches probably consuming 2 MB transistors. The K7 started at 22 MB transistors, with 2×64 KB L1 caches probably consuming 8 MB transistors. The difference in logic transistor count should be about 2x.

From this point, Intel followed the path of continuing to emphasize processor core complexity (advancing the micro-architecture) to meet Moore’s Law with Willamette probably having on the order of 25 MB logic transistors (44 MB total with 256 KB cache). AMD elected to integrate the memory controller with emphasis on reducing memory latency. One Intel quick reaction design (Timna) essentially attached a north bridge to the Pentium III die, retaining a FSB in silicon with some (but not the best possible) reduction in memory latency.

The deep pipeline nature of the modern micro-processor and the large difference between core clock (<1ns) and off chip memory communication (~100ns) requires that memory accesses be made far in advance of use. One of the supposedly “brilliant” and widely used innovations in software architecture from decades ago was the linked list. This is a construct where a memory structure stores the location of the next structure. So the memory access must complete before an instruction to fetch the next location can be issued. To take full advantage of the Pentium 4 architecture, with its super fast core clock, but long memory access time relative to the core clock, it is necessary to be able to issue the next memory access before the first completes. Some of these issues can be resolved with compiler enhancements, but others would require a complete re-architecture of the software.

Continues…

Xeon and Opteron Notes

Leave a comment Cancel reply