original processor performance article I wrote, with analysis of processor performance results released in the last year. The original article explains many of the details discussed here. Two pieces of new material are presented here. First, the range of results published for the TPC-C benchmark now allows meaningful analysis. Second, additional results on the SPEC CPU 2000 integer benchmark provide a better understanding of the influence of bus speed, cache, and memory latency.
TPC Benchmark Analysis
Table 1 below shows selected TPC-C results for 1 and 2 CPU server systems with Xeon processors. The results were selected from a single vendor (HP/Compaq) and chipset (ServerWorks GC-LE) to compare certain performance aspects. The first two results were published at the same time on the same platform with 1 and 2 Xeon 2.8GHz processors respectively. It might seem that the dual processor system has approximately twice the performance of the single processor system. A closer look shows, however, that the dual processor system was configured with six times more memory and more than three times as many disk drives.
The first system was most probably configured for best price-performance and the second system more for all-out performance. The TPC-C benchmark requires the data set size to scale linearly with the measured performance result within a certain range. Because the data set grows with the result, the amount of memory and the number of disk drives affect how much performance a system can deliver at a given result level. A reasonable like-for-like comparison between two different performance results should therefore have approximately equal memory and disk counts normalized to the performance. For example, if a 10,000 tpm-C result was measured on a system with 4GB memory and 100 15K disk drives, then a comparable 20,000 tpm-C result should have in the range of 8GB memory and 200 15K disk drives. (tpm-C is an abbreviation for transactions per minute on the TPC-C benchmark.)
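The like-for-like rule above amounts to scaling memory and disk count in proportion to throughput. A minimal sketch in Python, assuming strictly linear scaling; the function name `comparable_config` is hypothetical and used only for illustration:

```python
def comparable_config(base_tpmc, base_mem_gb, base_disks, target_tpmc):
    """Scale memory and disk count linearly with the tpm-C result.

    Assumption: strict proportionality, as in the 10,000 vs 20,000
    tpm-C example in the text. Real configurations only need to be
    approximately proportional.
    """
    ratio = target_tpmc / base_tpmc
    return base_mem_gb * ratio, base_disks * ratio


# The example from the text: 10,000 tpm-C with 4GB and 100 disks.
mem_gb, disks = comparable_config(10_000, 4, 100, 20_000)
print(f"{mem_gb:.0f} GB memory, {disks:.0f} disks")
```

For the example in the text this gives 8GB memory and 200 disks.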
# of CPUs | Processor Freq & Cache | tpm-C | $/tpm-C | Mem (GB) | # of Disks
--- | --- | --- | --- | --- | ---
1 | 2.8GHz/512K | 19,526 | $2.25 | 2 | 47
2 | 2.8GHz/512K | 39,007 | $4.72 | 12 | 174
2 | 3.06GHz/512K | 44,942 | $4.90 | 12 | 215
2 | 3.06GHz/1M | 52,468 | $3.82 | 12 | 215
1 | 3.2GHz/1M | 33,873 | $2.40 | 12 | 74
2 | 3.2GHz/1M | 54,097 | $3.77 | 12 | 257
Table 1: Selected TPC-C results for 1 & 2 CPU Xeon processors.
The third and fourth systems both have two 3.06GHz Xeon processors. The third system has the Northwood core with 512K L2 cache. The fourth system has the Gallatin core with 1M L3 cache in addition to the 512K L2 cache. There is a 16.7% performance increase between these two systems, which differ primarily in cache size and configuration. Note that the L3 cache has longer access latency than the L2 cache, so the next generation Prescott core with 1M L2 cache should show a somewhat better gain from cache size alone than the gain seen between the Northwood core with 512K L2 and the Gallatin core with 1M L3.
The fifth and sixth results are for 1 and 2 CPU systems with 3.2GHz Xeon processors and 1M L3 cache. Unlike the first and second results, both of these systems have 12GB memory. The 1 CPU system has a lower tpm-C per memory loading (2,823 tpm-C/GB) than the 2 CPU system (4,508 tpm-C/GB), but the 1 CPU system has a higher disk loading (458 tpm-C/disk) than the 2 CPU system (210 tpm-C/disk). It is uncertain to what degree the lower memory loading offsets the higher disk loading. Here, the performance gain from 1 CPU to 2 CPUs is a factor of 1.6. It is possible that the true 1 to 2 CPU scaling factor would be closer to 1.7 or 1.8 if the memory and disks were configured for equal tpm-C per GB and tpm-C per disk loading.
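The loading figures quoted above follow directly from the two 3.2GHz/1M rows of Table 1. A short Python sketch of the arithmetic; the helper name `loading` is illustrative only:

```python
# (tpm-C, memory in GB, number of disks) from the 3.2GHz/1M rows of Table 1
one_cpu = (33_873, 12, 74)
two_cpu = (54_097, 12, 257)


def loading(tpmc, mem_gb, disks):
    """Return (tpm-C per GB of memory, tpm-C per disk drive)."""
    return tpmc / mem_gb, tpmc / disks


for label, row in (("1 CPU", one_cpu), ("2 CPU", two_cpu)):
    per_gb, per_disk = loading(*row)
    print(f"{label}: {per_gb:,.0f} tpm-C/GB, {per_disk:,.0f} tpm-C/disk")

# Raw 1 to 2 CPU gain, not corrected for the unequal loading
gain = two_cpu[0] / one_cpu[0]
print(f"1 to 2 CPU gain: {gain:.2f}")
```

The raw gain comes out just under 1.6. The sketch does not attempt to correct for the unequal memory and disk loading, which is the uncertainty discussed in the text.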
A complete explanation of all the reasons performance does not scale in direct linear proportion to the number of processors is beyond the scope of a short document, or even a long article. A simple explanation is that there is contention for resources and overhead in coordinating multiple processors. A simple mathematical model for the approximate theoretical performance of a multi-processor system relative to the single processor base is as follows:
Pn / P1 = S ** (log2(n))
Pn is the performance with n processors, P1 is the performance with one processor, S is the scale factor and n is the number of processors. In the ideal situation, S = 2. In practice S is less than 2 and hopefully (but not always) greater than 1. A more transparent interpretation of the above formula is: for every doubling of the number of processors, performance increases by a factor of S.
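The model can be written out as a short Python sketch. The function names `relative_performance` and `scale_factor` are hypothetical; the second function inverts the formula to estimate S from two measured results, generalized (as an assumption) to any pair of processor counts rather than only a 1 CPU base:

```python
import math


def relative_performance(n, s):
    """Pn / P1 = S ** log2(n): performance of n CPUs relative to 1 CPU."""
    return s ** math.log2(n)


def scale_factor(p_base, p_n, n_base, n):
    """Estimate S from two results by inverting the model.

    Assumption: the same S applies to every doubling between
    n_base and n, so S is the geometric mean gain per doubling.
    """
    doublings = math.log2(n / n_base)
    return (p_n / p_base) ** (1 / doublings)


# With S = 1.6, eight CPUs (three doublings) give 1.6 ** 3 times one CPU.
print(f"{relative_performance(8, 1.6):.3f}")

# Ideal scaling: S = 2 means four CPUs deliver four times the performance.
print(f"{relative_performance(4, 2.0):.1f}")
```

With S = 1.6 the 8 CPU system is only about four times the single processor, against a factor of eight in the ideal case.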
Table 2 shows selected performance reports on systems with 2.8GHz Xeon MP/2M processors for the range between 4 and 32 CPUs. The scale factor from 4 to 8 CPUs is 1.54. Beyond 8 CPUs the gains diminish, and the average scale factor over the full 4 to 32 CPU range falls to 1.41. The Intel Xeon 64GB physical memory limitation, as well as the overhead of using AWE memory, probably contributes to the low scale factor.
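These scale factors can be checked against the tpm-C column of Table 2. A minimal sketch, assuming the scale factor between any two results is the geometric mean gain per doubling of the CPU count; the helper `scale_factor` is a hypothetical name:

```python
import math

# (number of CPUs, tpm-C) from Table 2, Xeon MP 2.8GHz/2M
results = [(4, 90_271), (8, 139_154), (16, 190_510), (32, 252_920)]


def scale_factor(n_base, p_base, n, p_n):
    """Geometric mean performance gain per doubling of CPU count."""
    doublings = math.log2(n / n_base)
    return (p_n / p_base) ** (1 / doublings)


# Scale factor for each individual doubling
steps = []
for (n0, p0), (n1, p1) in zip(results, results[1:]):
    s = scale_factor(n0, p0, n1, p1)
    steps.append(s)
    print(f"{n0:>2} to {n1:>2} CPUs: S = {s:.2f}")

# Average scale factor over the whole range
overall = scale_factor(*results[0], *results[-1])
print(f"{results[0][0]} to {results[-1][0]} CPUs overall: S = {overall:.2f}")
```

On the Table 2 data this gives about 1.54 for 4 to 8 CPUs and about 1.41 per doubling averaged over 4 to 32 CPUs, with the individual steps beyond 8 CPUs falling below that average.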
# of CPUs | System | tpm-C | $/tpm-C | Mem (GB) | # of Disks
--- | --- | --- | --- | --- | ---
4 | IBM x445 | 90,271 | $3.97 | 32 | 238
8 | IBM x445 | 139,154 | $5.07 | 64 | 406
16 | IBM x445 | 190,510 | $8.39 | 64 | 532
32 | Unisys ES7000 | 252,920 | $7.22 | 64 | 770
Table 2: Selected TPC-C results for Xeon MP 2.8GHz/2M processors.
Table 3 shows selected performance reports on systems with the Itanium 2 1.5GHz/6M processor for configurations between 4 and 64 CPUs. The scale factor for Itanium 2 is in the range of 1.60, which is considered very good. IBM has not released sufficient information to derive the scaling factor for the Power 4 processor used in the pSeries 690 systems, but it is probably in the range of 1.7, which is exceptionally high. IBM achieves this scaling in part with very wide memory buses, which require expensive multi-chip-module manufacturing techniques. This design choice significantly increases the platform cost of a 4-way system without a comparable performance benefit, but the cost is more than justified by the performance gain in high-end systems with 16 or more CPUs. Other factors include an integrated memory controller and large off-die caches in addition to the on-die cache.
# of CPUs | System | tpm-C | $/tpm-C | Mem (GB) | # of Disks
--- | --- | --- | --- | --- | ---
4 | HP rx5670 | 121,065 | $3.97 | 64 | 448
16 | HP rx8620 | 301,225 | $5.07 | 128 | 898
32 | NEC Express | 577,531 | $8.39 | 512 | 1164
64 | HP Superdome | 786,646 | $7.22 | 512 | 1792
Table 3: Selected TPC-C results for Itanium 2 1.5GHz/6M processors.