Server System Architecture, 2007 (Preliminary)
Other Intel 5000 Chipsets
There are several variations of the Intel 5000 chipset. The 5000Z has DIB, two memory channels, 16 PCI-E lanes, and the ESI; essentially it is the 5000P with two memory channels and one of the x8 PCI-E ports disabled. The 5000V has DIB, two memory channels, one x8 PCI-E port, and the ESI connection to the ESB. The 5000X, on the other hand, has all the features of the 5000P plus a 16 MB snoop filter, which will be discussed in the next section. It is not clear whether there is really only one 5000 MCH die from which the P, X, V, and Z variations are derived by selectively disabling components.
The Snoop Filter Cache
The 5000X is targeted at workstation applications. The 5000X data sheet describes the Snoop Filter cache in the Chipset Overview section as follows:
One of the architectural enhancements in Intel 5000X chipset is the inclusion of a Snoop Filter to eliminate snoop traffic to the graphics port. Reduction of this traffic results in significant performance increases in graphics intensive applications.
Later in the 5000X data sheet, in the Functional Description chapter:
The Snoop Filter (SF) offers significant performance enhancements on several workstation benchmarks by eliminating traffic on the snooped front-side bus of the processor being snooped. By removing snoops from the snooped bus, the full bandwidth is available for other transactions. Supporting concurrent snoops effectively reduces performance degradation attributable to multiple snoop stalls. See Figure 5-1, “Snoop Filter” on page 302.
The SF is composed of two affinity groups each containing 8K sets of x16-way associative entries. The overall SF size is 16 MB in size. Each affinity group supports a pseudo-LRU replacement algorithm. Lookups are done on a full 32-way per set for hit/miss checks.
As shown in Figure 5-1, the snoop filter is organized in two halves referred to as Affinity Group 1 and Affinity Group 0, or the odd and even snoop filters respectively. Affinity Group 1 is associated with processor 1 and Affinity Group 0 is associated with processor 0. Under normal conditions a snoop is completed with a one snoop stall penalty. When the processors request simultaneous snoops, the first snoop is completed with a one snoop stall penalty and the second snoop requires a two snoop stall penalty.
For the purposes of simultaneous SF access arbitration, processor 0 is given priority over processor 1. Thus simultaneous snoops are resolved with a one snoop stall penalty for processor 0 and a two snoop stall penalty for processor 1.
The SF stores the tags and coherency state information for all cache lines in the system. The SF is used to determine if a cache line associated with an address is cached in the system and where. The coherency protocol engine (CE) accesses the SF to look-up an entry, update/add an entry, or invalidate an entry in the snoop filter.
The SF has the following features:
Snoop Filter tracks a total of 16 MB of cache lines (2^18 L2 lines).
8K sets organized as one interleave via a 2 x 16 Affinity Set-Associativity array.
There are a total of 8K x 2 x 16 = 256K Lines (2^18).
2 x 16 Affinity Set-Associativity will allocate/evict entries within the 16-way corresponding to the assigned affinity group if the SF look up is a miss. Each SF look up will be based on 32-way (2×16 ways) look up.
The array size of the snoop filter RAM is equivalent to 1 MB plus 0.03 MB of Pseudo-Least-Recently-Used (pLRU) RAM.
The Snoop Filter feature list includes several additional items not reproduced here. Refer to the 5000X data sheet for these items.
I was somewhat confused at first on seeing presentations describe the snoop filter as improving performance in workstation applications but not server applications. It is now clear that the more correct interpretation is that the Snoop Filter implementation in the 5000 chipset did significantly improve several workstation benchmarks, but did not show consistent improvement on server benchmarks. From the description of the next generation of chipsets, I am inclined to think that Intel believes the Snoop Filter should improve server performance and is working toward that goal.
Also confusing is the Snoop Filter size. The 5000X is described as having a 16 MB snoop filter cache. This does not mean there is a 16 MB cache in the 5000X for the snoop filter to use; rather, the Snoop Filter can track a total of 16 MB of cache on the processors. The 5000X chipset supports two sockets. For the Xeon 5100 series, there is one 4 MB L2 cache in each socket, and for the Xeon 5300 series, there are two 4 MB L2 caches per socket, so the maximum combined processor cache in the 5000X platform is 16 MB. The cache line on all NetBurst and Core 2 processor lines is 64 bytes, so a 16 MB cache contains 256K cache lines, which the Snoop Filter requires a little over 1 MB of RAM to track.
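The sizing arithmetic above can be checked with a short calculation. The 36-bit tag/state width per entry is my assumption, chosen only to illustrate how the roughly 1 MB tag RAM figure arises; the data sheet does not give the exact tag width:

```python
# Worked check of the snoop filter sizing: 16 MB of tracked cache,
# 64-byte lines, 8K sets x 2 affinity groups x 16 ways.
LINE_SIZE = 64                       # bytes per cache line
TRACKED_CACHE = 16 * 2**20           # 16 MB of total processor cache

lines = TRACKED_CACHE // LINE_SIZE   # cache lines the SF must track
print(lines, lines == 2**18)         # 262144 True (256K lines = 2^18)

sets, ways = 8 * 1024, 2 * 16        # 8K sets, 32 ways total per set
print(sets * ways == lines)          # True: the organization covers all lines

entry_bits = 36                      # assumed tag+state bits per entry
print(lines * entry_bits / 8 / 2**20)  # 1.125 -> a little over 1 MB of RAM
```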
The following information is from various slide presentations at the Intel Developer Forum, Spring 2007. The Snoop Filter is a cache tag structure stored in the chipset that keeps track of the status of cache lines in the processor caches. It contains only the tags and status of cache lines, not the data. It filters all unnecessary snoops to the remote bus, decreasing FSB utilization. Only requests that actually need to be snooped are forwarded to the remote bus: requests for a cache line that could potentially be present in a dirty state on the remote bus, and cache lines that need to be invalidated. It filters all other processor snoops and a large fraction of IO snoops, so IO-bound applications benefit automatically. (There is also a snoop filter in the E8870 SPS, the crossbar of the Itanium 2 chipset.)
The Next Generation Intel Chipsets
Several next generation Intel chipsets have been described in various public forums. One is for a four socket system supporting the Core 2 architecture processors. The NetBurst and Core 2 architectures share a common FSB protocol, allowing a chipset such as the 5000 to support both processor lines. The E8501 chipset can operate Core 2 processors. The two socket Core 2 platform showed good performance scaling from a 3.0 GHz dual core Xeon 5160 on each 1333 MHz FSB to the 2.66 GHz quad core X5355. It is very likely that two dual core Xeon 5100 processors, each with a shared 4 MB L2 cache, on a single 800 MHz FSB (the maximum for three loads) would not scale nearly as well, and certainly not two quad core Xeon 5300 processors. It is also uncertain that a Xeon 5160 could challenge the Xeon 7140, with its 16 MB shared L3 cache, in the four socket E8501 platform. So while the Core 2 and E8501 combination is possible, there is no business reason to pursue it. Hence the next generation four socket Xeon platform needs to feature quad core processors to be performance competitive.
One solution would be to increase the Core 2 on-die cache to the 8-16 MB range and still operate two processor sockets sharing one 800 MHz FSB. It would, however, seem strange explaining to a customer that the high-end Xeon 7300 has two sockets sharing one 800 MHz FSB while the dual socket Xeon 5300 line operates a single socket per bus at 1333 and later 1600 MHz, even though this scenario has been the case for several years now. A last point: the bus architecture originating with the Pentium Pro specified four processors per bus, so two quad cores on one bus might not work, period.
It appears that Intel is now confident in designing chipsets with the multiple independent processor bus architecture (there was a lapse of several years between the ProFusion in 1999 and the E8500 in 2005 when Intel did not have a contemporary DIB chipset). The next generation four socket solution is shown in Figure 3 below.
Figure 3: Intel Clarksboro chipset with Tigerton processors (Q3 2007).
The processor code named Tigerton will be the Xeon 7300 line. The chipset codename is Clarksboro and the platform codename is Caneland. Note the quad independent bus (QID?) architecture. The Tigerton processor is the same 65 nm Core 2 with a 4 MB shared L2 cache per die and two die in one package, as in the Xeon 5300 line. The primary option is a quad core processor; there is also a single die dual core option for applications that require bandwidth but not quite so much processor power. Changing a silicon die, even just to increase the cache size, is not a light undertaking at Intel. This solution allowed the use of an existing processor silicon die while supporting a reasonable 1066 MHz FSB. It is unclear why Intel did not elect to use the 1333 MHz FSB available in the 5000P chipset. Also unclear is the 64 MB Snoop Filter: even with four quad core Penryn processors, the total cache is 48 MB. Is there an undisclosed 45 nm Penryn with an 8 MB L2 cache?
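The cache totals in question can be tallied with a trivial sketch; the 8 MB figure in the last line is purely hypothetical, matching the speculation above:

```python
# Total processor cache the chipset snoop filter must cover:
# sockets x dice per package x shared L2 per die (all figures in MB).
def total_cache_mb(sockets, dice_per_pkg, l2_per_die_mb):
    return sockets * dice_per_pkg * l2_per_die_mb

print(total_cache_mb(4, 2, 4))  # four quad core Tigerton (4 MB L2/die): 32
print(total_cache_mb(4, 2, 6))  # four quad core Penryn (6 MB L2/die): 48
print(total_cache_mb(4, 2, 8))  # hypothetical 8 MB L2 die: 64, matching the SF
```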
There are also four memory channels with 8 FB-DIMMs per channel. Perhaps there is an XMB type memory controller, or perhaps FB-DIMM allows eight devices on a single channel. The maximum memory configuration is 256 GB with 8 GB DIMMs in 32 sockets. The Clarksboro chipset is listed as operating with 533 and 667 MHz FB-DIMM. Unless each memory channel actually has more bandwidth than a single FB-DIMM channel, the Clarksboro chipset has the same memory bandwidth as the 5000P.
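As a back-of-envelope check, assume each FB-DIMM channel's useful bandwidth is bounded by its DDR2 DRAM side (a simplification; the FB-DIMM serial link is actually asymmetric between reads and writes):

```python
# DDR2 bandwidth per channel: transfer rate (MT/s) x 8 bytes per transfer.
def ddr2_channel_gb_s(mt_s):
    return mt_s * 8 / 1000.0  # decimal GB/s

per_channel = ddr2_channel_gb_s(667)
print(round(per_channel, 2))      # ~5.34 GB/s per channel at 667 MT/s
print(round(4 * per_channel, 2))  # four channels: ~21.3 GB/s, same as 5000P
```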
In any case, this is clearly the end of the line for FSB based multi-processor systems. The bandwidth per pin advantage of a point-to-point protocol is required to support multi-processor (socket) systems in future generations.
There are two distinct dual socket chipsets described for the upcoming Penryn processor (the Core 2 architecture on 45 nm, plus some additional enhancements and a larger 6 MB shared L2 cache; Penryn enhancements relevant to server applications include faster OS primitives for spinlocks, interrupt masking, and time stamps). The first, shown below, is the chipset codenamed Seaburg with platform codename Stoakley.
Figure 4: Intel Stoakley platform/Seaburg chipset (est. 2H 2007).
The FSB frequency is now up to 1600 MHz. The Seaburg MCH has a snoop filter for 24 MB of cache, so it will support four die with 6 MB cache each. Memory remains 533 and 667 MHz FB-DIMM, but 800 MHz later would not be unexpected. Maximum memory will become 128 GB with 16 x 8 GB DIMMs. The interesting new feature is PCI-Express Generation 2. The Seaburg PCI-E IO can be configured as 44 Generation 1 lanes or 2 x16 Generation 2 lanes, plus the connections for the ESB. PCI-E Generation 1 is the original simultaneous bi-directional 2.5 GT/sec; Generation 2 is 5.0 GT/sec. It is not clear whether the 2 x16 PCI-E Gen 2 lanes can be configured as 8 x4 Gen 2 lanes, or if Gen 2 is (initially) exclusively for graphics.
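The two IO configurations can be compared directly. The per-lane rates below follow from the signaling rates and 8b/10b encoding (10 bits on the wire per data byte):

```python
# Usable PCI-E bandwidth per lane, per direction:
# signaling rate (GT/s) x 8/10 (8b/10b encoding) / 8 bits per byte.
def pcie_lane_gb_s(gt_s):
    return gt_s * 0.8 / 8

gen1 = pcie_lane_gb_s(2.5)   # 0.25 GB/s per lane
gen2 = pcie_lane_gb_s(5.0)   # 0.50 GB/s per lane
print(44 * gen1)             # 44 Gen 1 lanes: 11.0 GB/s per direction
print(2 * 16 * gen2)         # 2 x16 Gen 2:    16.0 GB/s per direction
```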
The significant increase in both IO bandwidth and PCI-E slots is highly appreciated. Hopefully there will be powerful SAS controllers that can be matched with the PCI-E Gen 2 lanes. The preferred configuration depends on the available SAS controller. If only the old IOP 8033x is available, then ten x4 PCI-E Gen 1 slots is probably best. If the new Intel IOP 8134x can drive a full x8 PCI-E port, then five x8 PCI-E Gen 1 slots is a good choice. If a new IOP controller is available with PCI-E Gen 2, then x4 Gen 2 slots is the choice.
The second two socket chipset is codenamed San Clemente, with platform codename Cranberry Lake. There are two DDR2 memory channels, although a later version supporting DDR3 should be expected. There is a configurable set of PCI-E ports and the ESI to connect to the desktop ICH9 south bridge. This combination is targeted at lower cost, lower power, high-density two socket systems, similar to the 5000Z and 5000V.
Figure 5: Cranberry Lake platform with San Clemente chipset (est. 3Q 2007).
For unspecified reasons, it was determined that the low cost platform should use the same DDR2/3 memory solution as desktop platforms instead of FB-DIMM. Hopefully there will be a DDR3-1600 option for this platform. There were no details on the PCI-E configuration, but given the workstation interest in this platform, two x16 PCI-E Gen 2 ports and other combinations are probable.