Server System Architecture, 2007 (Preliminary)
The original article in this series was published in 2002 and covered Pentium III and Pentium III Xeon systems. I had prepared an update in 2005 to cover the NetBurst (Pentium 4) based Xeon and Xeon MP systems. For some reason I never got around to publishing this article, but I did briefly mention some 2005-2006 system architectures in the “System and Storage Configuration for SQL Server” article of 2006. It is now time for an update to the Core 2 and Opteron based systems of 2007 and a look ahead to 2008. This is a preliminary article because information on upcoming systems is scarce. I will make some guesses now, and then update the article when release information becomes available. Then we can all see whether I guessed correctly or not.
The Intel architecture for Symmetric Multi-Processor computer systems dates to the early 1990s. The concept employed a common bus to connect multiple processors and the north bridge (with memory controller). At the time, this struck an excellent balance between performance (multi-processor scaling) and cost of implementation (simplicity). One could argue that the old shared-bus architecture was obsolete by the early 2000s, and that the time had arrived for the high-speed point-to-point simultaneous bi-directional signaling technology of the AMD Opteron architecture. During that period, Intel was too heavily distracted by a sequence of crises on other matters (Rambus, Timna, Itanium, no viable server chipset for the first two NetBurst generations, X64, and the 90 nm process leakage current on top of the third generation NetBurst power issues, to name a few) to contemplate this change. So the bus architecture was extended in Intel systems with various twists, delaying the transition to a point-to-point architecture until the 2008-2009 timeframe with the Nehalem processor core and the Common System Interconnect (CSI).
This article will follow the Intel convention for discussing multi-core processors. What was once a processor is now a socket (even though some Intel processors in the past fit into a slot rather than a socket). A processor will be called dual core (DC), quad core (QC) and so on at the socket level, regardless of how many CPU cores reside on one die. A dual core processor can be two single core die in one (socket) package or a single dual core die. A quad core socket can be two dual core die or a single quad core die. There does not appear to be any indication of a performance difference between two single core die in one socket package and a single dual core die. So the argument that two die in a package do not constitute a true dual or quad core is just silliness with no relevance. Anyone making such an argument should support it with performance analysis; the fact that no performance argument has been made speaks for itself.
The Intel E8500/8501 Chipset
The E8500 chipset, featuring Dual Independent (front-side) Buses (DIB) at 667 MHz, was introduced in early 2005. Prior to this, Intel had failed to produce a viable four socket chipset for the NetBurst processor line (Foster on 180 nm and Gallatin on 130 nm). Even the E7500/7501 for two socket Xeon systems did not garner design wins with the major OEMs. The two initial processors supported by the E8500 were 90 nm NetBurst cores: one with 1 MB L2 cache (Cranford) running at up to 3.66 GHz, and a second with 8 MB L3 cache (Potomac) at 3.33 GHz. The next processor supported was the dual core Xeon 7000 line (Paxville) in 2Q 2006, a 90 nm NetBurst design with 2 MB L2 cache per core, composed of two single core die in one socket. This was followed in 4Q 2006 by the Xeon MP 7100 line (Tulsa), featuring a 65 nm dual core die with a single L3 cache of up to 16 MB shared by the two cores. The Xeon 7000 and 7100 lines supported FSB operation at either 667 MHz or 800 MHz; the faster FSB was added in the E8501 update.
The E8500/8501 showed good performance characteristics with the single core processors (141,504 tpm-C with 4 × 3.6 GHz/1 MB L2 cache). The dual core Xeon 7000 line with 2 MB L2 cache per core did not show good performance relative to the single core (188,761 tpm-C with 4 × 3.0 GHz DC/2 × 2 MB L2 cache). The dual core Xeon 7100 with 16 MB shared L3 cache recaptured the four socket x86 TPC-C performance lead from Opteron with a very impressive result of 318,407 tpm-C, compared with 213,986 for the 4 × 2.6 GHz DC DDR1 Opteron and 262,989 for the 4 × 2.8 GHz DC DDR2 Opteron. Both Opteron systems were configured with 128 GB memory, compared with 64 GB for the Xeon MP system. The Opteron platform retained the four socket lead in TPC-H results. The big cache and Hyper-Threading (HT) make significant contributions to SMP scaling and performance in high call volume database applications (318K tpm-C generates approximately 12,000 calls/sec), but not in very large DSS queries.
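The ~12,000 calls/sec figure can be sanity-checked from the reported result. This is a rough sketch, assuming the standard TPC-C transaction mix in which tpm-C counts only New-Order transactions, which make up 45% of all transactions (roughly one database call per transaction):

```python
# Back-of-envelope check of the ~12,000 calls/sec figure.
# Assumption: tpm-C counts only New-Order transactions, 45% of the TPC-C mix.
tpmC = 318_407            # reported Xeon 7100 four socket result
new_order_share = 0.45    # New-Order fraction of the standard mix

new_order_per_sec = tpmC / 60
total_txn_per_sec = new_order_per_sec / new_order_share
print(round(total_txn_per_sec))   # roughly 12,000 transactions (calls) per second
```

The result lands within a couple of percent of the 12,000 calls/sec quoted above.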
Figure 1 shows the system architecture of a four socket system built around the Intel E8501 chipset with the DIB architecture and Xeon 7100 processors.
Figure 1: Intel E8501 chipset with Xeon 7100 processor (2006).
The DIB concept is not new. It was employed in the ProFusion chipset for the 8-way Pentium III Xeon architecture, developed by a company later acquired by Intel. It was simply not possible to push a single bus shared by four processor sockets and one memory controller beyond the 400 MHz of the previous generation Xeon MP platform (with the ServerWorks GC-HE chipset). The E8500/8501 Memory Controller Hub (MCH) has four Intermediate Memory Interfaces (IMI) supporting a proprietary protocol (possibly a precursor to FB-DIMM) instead of the native DDR2 protocol. Each IMI connects to an External Memory Bridge (XMB), which splits into two DDR2 memory channels (DDR is also supported). The XMB is described as a full memory controller, not just a memory repeater. There are a total of eight DDR2 memory channels. The maximum memory of the E8501 with DDR2 is 64 GB, so it is possible that only two DDR2 DIMMs are configured on each channel.
The E8501 data sheet lists the IMI at 2.67 GB/sec outbound (write) and 5.33 GB/sec inbound (read), the inbound figure corresponding to the bandwidth of two DDR-333 channels, even though DDR2-400 is supported with the 800 MHz FSB. It is unclear whether this is an oversight or the actual value. Nominally, the maximum memory bandwidth is then 21 GB/sec, even though 12.8 GB/sec is the limit of the combined DIB. In any case, it is the memory transaction rate that is important, not the nominal bandwidth.
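The arithmetic behind those two figures can be laid out explicitly. This is a sketch using the datasheet numbers quoted above; the 8-byte bus width is the standard 64-bit FSB data path:

```python
# E8501 nominal bandwidth arithmetic.
# One DDR-333 channel: 333 MT/s x 8 bytes ~= 2.67 GB/s.
imi_inbound = 2 * 2.67          # GB/s per IMI read path (two DDR-333 channels)
num_imi = 4
memory_bw = num_imi * imi_inbound   # ~21.3 GB/s aggregate nominal read bandwidth

fsb_rate = 0.8                  # GT/s (800 MHz FSB)
fsb_width = 8                   # bytes per transfer (64-bit data bus)
dib_bw = 2 * fsb_rate * fsb_width   # 12.8 GB/s for the two buses combined
print(memory_bw, dib_bw)
```

The memory side nominally out-runs the dual buses, which is why the transaction rate, not the headline bandwidth, is the figure that matters.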
The Intel 5000P Chipset and Derivatives
The current Intel two socket chipset is the 5000P introduced in 2Q 2006. The initial processor supported was the dual-core Xeon 5000 series, with two NetBurst 65 nm die in one 771-pin socket. Support for the Xeon 5100 series processor, with the new 65 nm Core 2 architecture and a single dual core die, was added only one month later. Later in 2006, support was extended to the Xeon 5300 series quad core processors, with two 65 nm Core 2 dual core die in a single package.
Figure 2: Intel 5000P chipset and Xeon 5100 processors (2Q 2006).
The 5000P inherits the dual independent bus architecture from the E8500/8501 chipset, and it is clear the 5000P is derived from that design. Both have DIB, four memory channels, and 24 PCI-E lanes; both are manufactured on a 130 nm process in a 1432-pin package. One difference is that the long obsolete HI 1.5 interface to the south bridge for legacy devices has finally been replaced by the new ESI, introduced in desktop chipsets in 2004 as the DMI. Since each bus carries only one processor and the MCH, it is possible to drive FSB operation up to 1333 MHz in the current generation.
A point-to-point link also has two electrical loads, but there are significant differences. One advantage of the Intel 5000P arrangement, with two loads per bus (one processor and one MCH), is that the old bus architecture can be retained. Another is that the bus architecture supports multiple devices on one bus, and two die in one package constitute a single electrical load, while two sockets each with a single die constitute two electrical loads (a capacitance matter). This allowed Intel to simply place two single core die in one socket for a “dual-core” product without actually having to manufacture a new die with two cores, and again a “quad-core” product with two dual core die in one socket.
The advantage of a point-to-point protocol with recent (late 1990s) technology is a much higher signaling rate and simultaneous bi-directional (SBD) transmission. Since the early 2000s, point-to-point signaling technology could support operation in the range of 2.5-3.0 GT/sec, and the next generation will support 5 GT/sec. Compare this with the bus architecture, which reached 1.33 GT/sec in 2006 and is targeting 1.6 GT/sec in late 2007.
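The trade-off between a wide half-duplex bus and a narrow SBD link can be made concrete. This is an illustrative sketch only; the 16-bit link width and 2 GT/s rate are assumed values in the range of a HyperTransport-style link, not figures from any specific product:

```python
# Illustrative comparison: shared half-duplex bus vs. point-to-point
# simultaneous bi-directional (SBD) link. Link width/rate are assumptions.
bus_rate = 1.333      # GT/s (1333 MHz FSB)
bus_width = 8         # bytes (64-bit data bus)
bus_bw = bus_rate * bus_width        # ~10.7 GB/s, one direction at a time

link_rate = 2.0       # GT/s per direction (assumed SBD link speed)
link_width = 2        # bytes (assumed 16-bit link)
link_bw = 2 * link_rate * link_width # 8 GB/s, both directions at once
print(bus_bw, link_bw)
```

A quarter-width link at a higher signaling rate delivers comparable aggregate bandwidth, and each link is dedicated rather than shared among all bus agents.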
The 5000P has three x8 PCI-Express ports and one Enterprise South Bridge Interface (ESI). Each x8 port can be configured as two x4 ports. The arrangement shown above has six x4 ports. It is also possible to configure one x8 port plus four x4 ports, or two x8 ports plus two x4 ports. Most vendors offer a mix of one or two x8 ports and four or two x4 ports. Only HP offers systems based on the 5000P (ML370G5) and the E8501 (ML570G4) with six PCI-E x4 ports. This configuration provides the most PCI-E ports to drive disk IO. It is unclear whether any of the Intel 8033x IOP based SAS adapters can drive enough bandwidth to saturate more than a PCI-E x4 port.
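For reference, the per-port budget behind that saturation question works out as follows, a sketch assuming first-generation PCI Express signaling (2.5 GT/s per lane with 8b/10b encoding):

```python
# PCI Express 1.x lane budget: 2.5 GT/s with 8b/10b encoding leaves
# 2.0 Gbit/s = 250 MB/s of payload bandwidth per lane, per direction.
lane_MBps = 2.5e9 * (8 / 10) / 8 / 1e6   # 250 MB/s per lane
x4_MBps = 4 * lane_MBps                  # 1,000 MB/s per direction
x8_MBps = 8 * lane_MBps                  # 2,000 MB/s per direction
print(lane_MBps, x4_MBps, x8_MBps)
```

So a SAS adapter would need to sustain roughly 1 GB/sec in one direction before an x8 slot offers any benefit over an x4 slot.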
The ESI is described as having extensions to the standard PCI Express specification. A curious point about the 631x ESB and 632x ESB is that they connect to the MCH via not just the ESI but also an additional full x8 PCI-E port. On the downstream side of the ESB are two x4 PCI-E ports along with a plethora of standard IO ports. Note that the sum of the downstream port bandwidth exceeds that of the upstream ports to the MCH. The general observation is that computer system IO traffic is mostly bursty and highly non-uniform, so it is highly unlikely that all devices will consume bandwidth simultaneously. The upstream connection of x8 PCI-E lanes in addition to the ESI port ensures the ability to handle a combination of events on the legacy IO side, while remaining available to provide ports for PCI-E based traffic.