SQL Server Processor Performance, 2006
The shrink of the Pentium 4 from 180nm to the 130nm Northwood core netted nearly a 2x gain, far above the 1.4x normally expected. The gain derives from a combination of faster core speed, larger cache, improved compilers, and higher front-side bus bandwidth. Northwood launched in January 2002 and quickly reached 3.06 GHz in November 2002, with a SPEC CPU 2000 integer result of 1,099 on the Intel C/C++ version 6 compiler.
At this point in mid-2002, Intel would normally have introduced a new architecture on the 130nm process to carry on the performance progression, and discontinued attempts to further tweak the Northwood core. However, Intel’s strategy of pursuing separate processor architecture lines for desktop and mobile platforms meant that a new processor design for desktops was scheduled for the 90nm process in the late 2003 to early 2004 time frame instead of the 130nm process in mid-2002. Until the next architecture was ready, Intel managed to tweak the Northwood core for two additional speed bins, reaching 3.4 GHz by early 2004. The MP server derivative of Northwood, Gallatin, with a 2 MB L3 cache in addition to the 512 KB L2, was introduced as the Pentium 4 Extreme Edition and achieved a SPEC CPU 2000 integer base result of 1,701.
The normal Intel schedule would have had the 90nm process ready in late 2003, preferably shrinking an existing 130nm design to better guarantee intercepting the process availability point. In fact, no 90nm processor was ready until early 2004. It is unclear whether this was because no design was ready or the extra time was used to resolve unexpected issues with the 90nm process. The first 90nm processor, Prescott, was a new architecture, also unusual for the Intel pattern.
In a process shrink, it is normally possible to reduce transistor power consumption. This allows both higher frequency operation and more transistors within a given power range. However, the 90nm process had higher leakage current than expected. The result was that the Prescott core only reached 3.8 GHz due to thermal limitations, even though transistor switching speed would have supported much higher frequency operation. Figure 7 shows Northwood (130nm 512 KB L2), Gallatin (130nm 2 MB L3), Prescott (90nm 1 MB L2) and Irwindale (90nm 2 MB L2) component performance.
Figure 7: SPEC CPU 2000 Integer for Pentium 4 on 130nm and 90nm.
The Prescott core achieved a SPEC CPU 2000 integer base result of 1,666 at 3.8 GHz, only 24% over Northwood at 3.4 GHz and slightly below Gallatin. The Irwindale core at 3.8 GHz with 2 MB L2 cache was able to reach 1,833 for a 36% gain over Northwood but only 8% over Gallatin. Prescott under-performed Northwood in gzip, and showed only minor gains in crafty.
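The percentages above can be checked with a few lines of arithmetic. The sketch below uses only the SPEC CPU 2000 integer base results quoted in the text. The Northwood 3.4 GHz score is not stated, so it is back-computed from the quoted gains, and the helper names are illustrative.

```python
# SPEC CPU 2000 integer base results quoted in the text.
GALLATIN = 1701   # 130nm, 2 MB L3
PRESCOTT = 1666   # 90nm, 3.8 GHz, 1 MB L2
IRWINDALE = 1833  # 90nm, 3.8 GHz, 2 MB L2


def percent_gain(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def implied_baseline(score: float, gain_pct: float) -> float:
    """Baseline score implied by `score` being `gain_pct` percent above it."""
    return score / (1.0 + gain_pct / 100.0)


if __name__ == "__main__":
    print(f"Irwindale over Gallatin: {percent_gain(IRWINDALE, GALLATIN):.1f}%")
    print(f"Prescott over Gallatin:  {percent_gain(PRESCOTT, GALLATIN):.1f}%")
    # The Northwood 3.4 GHz score is not given in the text; these values
    # are back-computed from the quoted 24% and 36% gains.
    print(f"Northwood implied by Prescott:  {implied_baseline(PRESCOTT, 24):.0f}")
    print(f"Northwood implied by Irwindale: {implied_baseline(IRWINDALE, 36):.0f}")
```

The two back-computed baselines land within a few points of each other (roughly 1,345), which suggests the quoted 24% and 36% figures refer to the same Northwood 3.4 GHz result.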
Since Prescott encompassed both a new architecture and a process shrink, this was well short of the true goal of doubling Northwood performance. It is possible to estimate the design goals for the 90nm Prescott core had it not been thermally limited. A simple shrink of the Northwood core to 90nm is expected to yield a 30% frequency increase. A full compaction should yield a 50% gain. The increase in pipeline stages from 20 in Willamette/Northwood to 30 in Prescott was probably intended to increase frequency by 50% on the same process. So there is reason to expect that the true goal of Prescott was to nearly double Northwood frequency, to the neighborhood of 6 GHz at 90nm, and to close in on 10 GHz at 65nm, had leakage current not been an issue.
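The estimate can be reproduced by compounding the scaling factors named above. This is a minimal sketch: pairing the 30% simple-shrink gain with the 50% pipeline gain is one reading of the text, and the choice of Northwood baseline (3.06 GHz or 3.4 GHz, both mentioned earlier) is an assumption rather than something the text states. The function name is illustrative.

```python
# Frequency scaling factors named in the text.
SIMPLE_SHRINK = 1.30    # simple shrink to the next process node
FULL_COMPACTION = 1.50  # full compaction to the next process node
DEEPER_PIPELINE = 1.50  # 20 -> 30 pipeline stages on the same process


def project(base_ghz: float, *factors: float) -> float:
    """Compound a base frequency through a sequence of scaling factors."""
    freq = base_ghz
    for factor in factors:
        freq *= factor
    return freq


if __name__ == "__main__":
    combined = SIMPLE_SHRINK * DEEPER_PIPELINE
    print(f"Shrink x pipeline factor: {combined:.2f}x")

    # Assumed baselines: the two Northwood frequencies mentioned in the text.
    for base in (3.06, 3.4):
        at_90nm = project(base, SIMPLE_SHRINK, DEEPER_PIPELINE)
        at_65nm = project(at_90nm, FULL_COMPACTION)
        print(f"Northwood {base} GHz -> {at_90nm:.2f} GHz at 90nm, "
              f"{at_65nm:.2f} GHz at 65nm")
```

Under these assumptions the combined factor is 1.95x, which matches the "nearly double" in the text. A 3.06 GHz baseline lands just under 6 GHz at 90nm, and a 3.4 GHz baseline followed by a further 1.5x at 65nm lands just under 10 GHz.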
The Woodcrest core is derived mostly from the Pentium M, so it is helpful to review the Pentium M processors, Banias, Dothan, and Yonah (under the Core Duo brand). The Banias core has been described as a completely new design by some and as an improved Pentium III by others. Both Pentium II and Pentium III processors represent minor improvements to the Pentium Pro architecture, adding MMX instructions in Pentium II, SSE instructions in the Katmai core Pentium III and a significantly improved on-die L2 cache with Coppermine. There were no significant changes to the core architecture.
It is possible that Banias retained the core architecture of Pentium Pro but made significant design improvements in both performance and power efficiency. Intel documents describe the performance improvements in Banias as advanced branch prediction, micro-ops fusion (decoded x86 instructions paired into a single op), a dedicated stack engine, and the 4x bus from Pentium 4. Dothan added an Enhanced Register Access Manager, Intelligent branch prediction – Advanced Tight Loop Execution, and dual-channel DDR2-533 compared with single-channel DDR-333 for Banias. Yonah's improvements were a dual-core shared L2 cache, SSE, integer division, and the H/W pre-fetcher.
Figure 8 shows the component performance for the Pentium M 1.6 GHz with 2 MB cache on 90nm relative to Pentium III 1.4 GHz/512 KB on 130nm. Unfortunately, no result was listed for the 130nm Pentium M with 1 MB L2 cache. Some of the performance gain is due to frequency (1.4 GHz to 1.6 GHz), cache size (512 KB to 2 MB), compiler (Intel C/C++ version 5.01 to version 9.0), and memory subsystem (single SDRAM 133 MHz to dual DDR2 533 MHz); the remainder comes from architectural differences between the Pentium III and Pentium M.
Figure 8: Pentium M 1.6 GHz/2 MB performance relative to Pentium III 1.4 GHz/512 KB.
The component applications vpr, gcc, mcf, bzip2, and twolf are highly sensitive to cache size, but it is clear that not all of the gains can be attributed to frequency, cache, compiler, and memory improvements. There are definitely substantial performance gains due to improvements in the design or architecture. Intel documents show a 65% gain for mcf from Banias to Dothan at the same frequency.
For some curious reason, there do not appear to be any public Intel documents detailing the number of pipeline stages in Banias. There is reason to believe it may have 12-14 pipeline stages, similar to the Pentium III. The top 130nm Banias frequency was 1.7 GHz. A full compaction of Coppermine to 130nm should have yielded 1.5 GHz. It is possible that Tualatin was an assisted shrink, or that top frequency was not a pressing goal. So 1.7 GHz for a new 130nm design with 12-14 pipeline stages is very reasonable. Note that the Banias 1.7 GHz operated at 1.484 V while Northwood required 1.525 V to reach 3.4 GHz. The design team called Pentium M a new design instead of an improved Pentium III. It is at the least a significant improvement over the Pentium III from the performance perspective, more than enough to constitute one generation.
The 90nm Dothan only reached 2.26 GHz, but this was probably limited by the power envelope for mobile platforms rather than the true limit of the processor core. Of the two 90nm processors, Dothan 2.26 GHz operated at 1.34 V while Prescott operated at 1.425 V. Dothan might have reached over 2.5 GHz if not restricted by the 27 W power envelope. The shrink of Dothan to 65nm might have reached as high as 3.7 GHz based on a 1.5x gain. The top actual Yonah frequency is 2.16 GHz, slightly lower than Dothan, probably to accommodate a 31 W power envelope for dual cores at 1.3 V, compared to 1.40 V for the Cedar Mill desktop processor.
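The frequency and voltage figures above can be put side by side with simple arithmetic. The sketch below applies the 1.5x shrink factor from the text and, as an added assumption not made in the text, uses the textbook approximation that dynamic power scales with V^2 * f to compare operating voltages. It ignores leakage, which the text identifies as the main problem at 90nm, so the ratios are indicative only.

```python
SHRINK_GAIN = 1.5  # 90nm -> 65nm frequency gain used in the text


def shrink(freq_ghz: float, gain: float = SHRINK_GAIN) -> float:
    """Projected frequency after a process shrink with the given gain."""
    return freq_ghz * gain


def v_squared_ratio(v_a: float, v_b: float) -> float:
    """Ratio of dynamic power per unit frequency at v_a versus v_b.

    Assumes dynamic power ~ C * V^2 * f with equal capacitance,
    which is a simplification and ignores leakage.
    """
    return (v_a / v_b) ** 2


if __name__ == "__main__":
    print(f"Dothan 2.26 GHz shrunk to 65nm: {shrink(2.26):.2f} GHz")
    print(f"Dothan 2.5 GHz (unrestricted estimate) shrunk: {shrink(2.5):.2f} GHz")
    print(f"Dothan 1.34 V vs Prescott 1.425 V: {v_squared_ratio(1.34, 1.425):.3f}")
    print(f"Yonah 1.3 V vs Cedar Mill 1.40 V:  {v_squared_ratio(1.3, 1.40):.3f}")
```

The 3.7 GHz figure in the text corresponds to the unrestricted 2.5 GHz estimate scaled by 1.5x (3.75 GHz); the shipping 2.26 GHz part scales to about 3.4 GHz.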
On http://www.extremetech.com/, Conroe is described as having a 14-stage pipeline. It is unclear whether this was inherited from the Banias/Dothan/Yonah line or is a new change. Conroe is four-wide, meaning four instructions can be issued on each clock and four can be retired on each clock. Other enhancements include macro-op fusion, which pairs certain x86 instructions into a single micro-op. Figure 9 compares the component performance of Pentium M 2.26 GHz/2 MB and Woodcrest 2.33 GHz/4 MB.