Server System Architecture, 2007 (Preliminary)
AMD Opteron Platform
The Opteron server platforms were introduced in 2003, with more of the major hardware vendors adopting them by 2004. The Opteron processor featured three 16-bit HyperTransport (HT) links, each of which can be configured as two 8-bit links, and two DDR1 memory channels. The platform included an IO bridge from HT to PCI-X and other legacy IO devices. Dual Core Opteron processors became available in mid-2005. In early 2007, the Dual Core Opteron processors were updated to DDR2 memory.
The Opteron processor variants were determined by the number of HT links enabled. A single socket platform would only need one HT link for IO. In a two socket system, each processor requires one HT link to connect to the other processor, and the second HT link to connect to an IO bridge. In a four socket system, three HT links on each processor are required, two to connect to other processors and one for IO, as shown in Figure 6 below.
Figure 6: Opteron four socket platform with DDR2 and PCI-E (1Q 2007).
Note that two of the four Opteron processors do not connect the third HT link to an IO bridge. No vendor, to my knowledge, mixes two and three HT port Opteron processors in a system. The system shown above features 20 PCI-E lanes on each HT port; one HT port is configured as 1×8 plus 3×4 and the other as 2×8 plus 1×4 PCI-E slots. The downstream IO bandwidth on each HT port therefore adds up to 5 GB/sec for the PCI-E lanes, plus the bandwidth for the PCI-X and other common IO devices. Again, this is in line with the assertion that IO bandwidth is highly non-uniform temporally, and it is definitely desirable to over-configure the downstream bandwidth with a higher number of available slots. It should be up to the intelligent system administrator to correctly place IO adapters where concurrent IO traffic is required.
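As a rough check on the lane arithmetic above, here is a minimal Python sketch. It assumes first generation PCI-E at roughly 250 MB/sec per lane in each direction, and it uses the 4 GB/sec unidirectional figure for a 16-bit HT link at 2 GT/sec cited later in this article. The names port_a and port_b are hypothetical labels for the two HT ports.

```python
# Assumption: PCI-E Gen 1 delivers roughly 250 MB/sec per lane, per direction.
PCIE_GEN1_MB_PER_LANE = 250

# 16-bit HT link at 2 GT/sec, one direction (figure taken from the text).
HT_UNIDIRECTIONAL_GB = 4.0


def slot_lanes(slots):
    """Total lanes for a list of (slot_count, lanes_per_slot) pairs."""
    return sum(count * width for count, width in slots)


def downstream_gb_per_sec(slots):
    """Aggregate downstream PCI-E bandwidth in GB/sec, one direction."""
    return slot_lanes(slots) * PCIE_GEN1_MB_PER_LANE / 1000


# Hypothetical labels for the two HT ports described above.
port_a = [(1, 8), (3, 4)]   # 1 x8 plus 3 x4
port_b = [(2, 8), (1, 4)]   # 2 x8 plus 1 x4

for name, slots in (("port A", port_a), ("port B", port_b)):
    lanes = slot_lanes(slots)
    gb = downstream_gb_per_sec(slots)
    print(f"{name}: {lanes} lanes, {gb:.1f} GB/sec downstream "
          f"behind a {HT_UNIDIRECTIONAL_GB:.1f} GB/sec HT link")
```

Both ports come out at 20 lanes, so the slots are over-configured relative to the HT link feeding them, which is the intent described above.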
There are two major elements of the Opteron processor that distinguish its performance and scaling characteristics. One is the integrated memory controller. The second is the point-to-point protocol in the HT links. In the old bus architecture, a memory request is sent over the front-side bus (FSB) to a separate memory controller chip before it is issued to the actual memory silicon. The reverse path is followed for the return trip. For the last ten years, the processor core has operated at a much higher frequency than the bus. Note also that for the Xeon 5100 line, the FSB operates at 1333 MHz for data transfers, but the address rate is 667 MHz, compared with the top Core 2 processor frequency of 3 GHz. On a memory request, there is a delay to synchronize with the next FSB address cycle.
While transistors on recent generation process technology can switch at very high speeds, every time a signal has to be sent off-chip, there are significant delays because of the steps necessary to boost the current drive of the signal to the magnitude required for off-chip communication compared to intra-chip communication. So the Opteron integrated memory controller reduces latency on both the bus synchronization and the extra off-chip communication time.
In the multi-processor Opteron configuration shown in Figure 6, there are three possible memory access times. In the AMD NUMA notation, memory directly attached to the processor issuing the request is called a 0-hop access. Memory on an adjacent processor is called a 1-hop access. Memory on the far processor is a 2-hop access. It would seem that the non-local 1-hop memory access covers the same distance as the Intel arrangement of processor over FSB to memory controller. However, the lower synchronization delay on HT compared with FSB favors Opteron.
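The hop counts can be illustrated with a short breadth-first search. This is only a sketch: the socket numbering is hypothetical, and the square layout is an assumption inferred from each processor using two HT links to reach other processors in the four socket system.

```python
from collections import deque

# Hypothetical socket numbering. Each socket links to two neighbors,
# forming a square (ring) of four processors.
LINKS = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}


def hops_from(source, links=LINKS):
    """Breadth-first search returning {socket: hop count} from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


for socket in sorted(LINKS):
    print(socket, sorted(hops_from(socket).values()))
```

Under this layout every socket sees one 0-hop, two 1-hop, and one 2-hop memory node.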
Concerning memory performance and multi-processor scaling, the number of memory channels scales with the number of processors (sockets) in the Opteron platform. This is currently two memory channels per processor, meaning four memory channels in a two socket system and eight in a four socket system. The current Intel systems have four memory channels in both the two and four socket systems (depending on how the XMB is counted). The previous generation Intel two socket system had two memory channels. So this would favor Opteron at four sockets. Most marketing material from AMD emphasizes the bandwidth of Opteron systems. For many server applications, it is the memory transaction performance that is more important, not the bandwidth. While in the Opteron case bandwidth and memory channels do scale together, it is still more correct to put the emphasis on the number of memory channels, not the bandwidth number.
The Opteron processor features first level 64 KB Instruction and 64 KB Data caches, and a unified 1 MB L2 cache. The relatively low latency to memory allows Opteron to function well with a smaller on-die cache than comparable generation Intel processors. The Dual Core Opteron has an independent 1 MB L2 cache for each core, compared with the Intel Core 2, which has a single 4 MB L2 cache shared by the two cores. It is unclear if one of the two arrangements has a meaningful advantage over the other.
The current generation Opteron processors feature HT operating at 2 GT/sec (that is, two giga-transfers per second). Since the full 16-bit HT link is 2 bytes wide, the transfer rate is 4 GB/sec in each direction. As this is a point-to-point protocol, it is possible to send traffic in both directions simultaneously, making the full bandwidth 8 GB/sec per 16-bit HT link and 24 GB/sec over three HT links on each processor. I prefer to cite unidirectional bandwidth in discussing disk IO because the bulk of the transfer is frequently in one direction at any given time.
There is occasionally some confusion in HT operating frequency. At the current 2 GT/sec, my understanding is the clock is 1 GHz, with a transfer occurring on both clock edges, making for 2 GT/sec. When the Opteron platform was introduced, the operating speed may have been 1.6 GT/sec. The next generation HT to be introduced in late 2007 or early 2008 can operate at 5.2 GT/sec with backward compatibility for the lower frequencies.
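The clock and bandwidth arithmetic in the two paragraphs above can be written out as a small Python sketch. The function names are illustrative only, and the double data rate relationship between clock and transfer rate is my understanding as stated above, not a quoted specification.

```python
def transfer_rate_gt(clock_ghz, transfers_per_clock=2):
    """Transfers per second (GT/sec), with one transfer on each clock edge."""
    return clock_ghz * transfers_per_clock


def ht_bandwidth_gb(rate_gt, width_bits=16, links=1, bidirectional=False):
    """HT bandwidth in GB/sec for a given transfer rate and link width."""
    bytes_per_transfer = width_bits / 8
    directions = 2 if bidirectional else 1
    return rate_gt * bytes_per_transfer * directions * links


rate = transfer_rate_gt(1.0)  # 1 GHz clock, both edges
print("transfer rate:", rate, "GT/sec")
print("16-bit link, one direction:", ht_bandwidth_gb(rate), "GB/sec")
print("16-bit link, both directions:",
      ht_bandwidth_gb(rate, bidirectional=True), "GB/sec")
print("three links, both directions:",
      ht_bandwidth_gb(rate, links=3, bidirectional=True), "GB/sec")
print("8-bit link, one direction:",
      ht_bandwidth_gb(rate, width_bits=8), "GB/sec")
```

Passing a rate of 5.2 to the same function gives the corresponding figures for the next generation HT links.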
Next Generation Opteron
The next generation Opteron processor, codename Barcelona, features a Quad Core die, dedicated 512 KB L2 cache for each core, a shared 2 MB L3 cache, micro-architectural enhancements over the K8, memory controller improvements, and four HT 3.0 links. The Barcelona cache arrangement is interesting. The L2 cache dedicated to each core is reduced to 512 KB from the previous generation of 1 MB, and there is now a 2 MB shared L3 cache. All L2 and L3 caches are exclusive. Any memory address can be in only one of the four dedicated L2 or the one common L3 caches. So there is effectively 4 MB of L2/L3 cache on the die spread across the five different pools. Intel did not like exclusive caches, preferring an inclusive arrangement where any memory in L2 cache must also be in L3 cache. This requires that the L3 be much larger than the L2 to be effective. The implication is that certain SKUs, for example one with a 512 KB L2 and 1 MB L3 did not really benefit from the L3.
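To make the exclusive versus inclusive comparison concrete, here is a minimal sketch of the capacity arithmetic. It is a simplification: it only counts distinct cache lines that can be resident at once and ignores associativity and replacement policy. The single core count used for the 512 KB L2 / 1 MB L3 SKU is an assumption for illustration.

```python
def exclusive_capacity_kb(l2_kb, cores, l3_kb):
    """Exclusive hierarchy: a line lives in only one cache, so sizes add."""
    return l2_kb * cores + l3_kb


def inclusive_capacity_kb(l2_kb, cores, l3_kb):
    """Inclusive hierarchy: every L2 line is also held in L3, so the
    number of distinct lines on die is bounded by the L3 size."""
    return l3_kb


def inclusive_l3_beyond_l2_kb(l2_kb, cores, l3_kb):
    """L3 space left for lines that are not already held in some L2."""
    return max(l3_kb - l2_kb * cores, 0)


# Barcelona as described above: four 512 KB L2 caches plus a 2 MB L3.
print("Barcelona exclusive:", exclusive_capacity_kb(512, 4, 2048), "KB")

# Inclusive SKU from the text; single core is assumed here.
print("inclusive 512 KB L2 / 1 MB L3:",
      inclusive_capacity_kb(512, 1, 1024), "KB total,",
      inclusive_l3_beyond_l2_kb(512, 1, 1024), "KB beyond L2")
```

In the inclusive case only half of the 1 MB L3 holds anything the L2 does not already have, which is why the L3 needs to be much larger than the L2 to pay off.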
The Barcelona information available states socket compatibility with current generation Opteron platforms. This means that memory remains two DDR2 channels per socket at 533 and 667 MHz, and the three HT links operate at 2 GT/sec. Does the current 1207 pin Socket-F actually support all four HT links? Or are the current Opteron platforms only capable of operating three HT links, and a new socket (platform) is required to support four HT links? The documents on the HyperTransport Web site describe four and eight socket systems where each processor is directly connected to all other processors. In the four socket system, the connections use 16-bit HT links, leaving one 16-bit HT link available for IO on each socket.
In the eight socket system, processors are connected with 8-bit HT links, leaving one 8-bit HT link available for IO per socket.
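The link budget behind these two configurations can be sketched as follows. This assumes four 16-bit HT links per processor, each splittable into two 8-bit links, and one link reserved for IO per socket, as described above. The function name is hypothetical, and mixed-width configurations are ignored for simplicity.

```python
FULL_LINKS = 4  # 16-bit HT links per processor, each splittable into two 8-bit


def link_plan(sockets, io_links=1):
    """Return (links per processor, link width in bits) for a fully
    connected system where every processor links to every other one."""
    needed = (sockets - 1) + io_links
    if needed <= FULL_LINKS:
        return needed, 16
    if needed <= FULL_LINKS * 2:
        return needed, 8
    raise ValueError("not enough HT links for a fully connected system")


def processor_to_processor_links(sockets):
    """Number of point-to-point links in a fully connected system."""
    return sockets * (sockets - 1) // 2


for n in (4, 8):
    links, width = link_plan(n)
    print(f"{n} sockets: {links} x {width}-bit links per processor, "
          f"{processor_to_processor_links(n)} processor-to-processor links")
```

With every processor one hop from every other, the 2-hop access seen in the current four socket layout disappears, at the cost of narrower links in the eight socket case.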
The Barcelona HT links can operate at 5.2 GT/sec. So a new platform with the 5.2 GT/sec links should have better performance characteristics, especially in multi-processor scaling. It is probable that a later platform modification would have DDR3 memory channels. FB-DIMM is discussed as a possibility, but given the continued use of DDR2/3 in desktop systems, it is unlikely AMD would need to transition the Opteron platform to FB-DIMM.
The 4×16/8×8 HT configuration along with the effort to make HT 3.0 an IO protocol will allow interesting possibilities in SAN architecture. The primary emphasis on HT connected IO devices will probably be very high-end graphics and special co-processors. Both HT 3.0 and PCI-Express Generation 2 will operate at 5+ GT/s. PCI-E is an IO oriented protocol, while HT has protocols for processor-to-processor communication, and there was a much greater emphasis on low latency operation in HT compared with PCI-E. This is why the primary emphasis should be high-end graphics and co-processors. But there could be opportunities for SAN interconnects as well.
Now a SAN is really just a computer system that serves LUNs, similar to a file server that makes a file system available to other network computers. The primary connection technology is Fibre Channel (FC). A SAN can also operate over any network protocol, Gigabit Ethernet + TCP/IP for example. iSCSI was added to make this more efficient. FC, however, is not really up to the task of supporting high bandwidth large block sequential data operations (a table scan or even a database backup). It would be possible to directly connect a server to a SAN over HT. An HT adapter is not even needed. The server and SAN simply connect with an HT cable. On the SAN side, there would probably be an HT to PCI-E bridge followed by PCI-E to SAS or FC adapters. So all this could be built without a special adapter, beyond HT based server and SAN systems.
Eight Socket Systems
Previous generation eight socket systems were NUMA systems, that is, Non-Uniform Memory Architecture (usually cache-coherent, or ccNUMA). Now strictly speaking, even a two socket Opteron is NUMA. The difference is that the older eight socket systems had a very large difference in memory access times between remote nodes and the local node. Local node memory access might be 150 ns while remote node access could be over 300 ns. Opteron platforms might have memory access times of 60 ns for local, 90 ns for 1-hop, and 120 ns for 2-hop, which does not show adverse NUMA platform effects.
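A rough way to compare the two cases is to look at the absolute remote access penalty and at an average access time. The sketch below uses the illustrative latencies from the paragraph above. The assumptions are mine: memory accesses are spread evenly across all four sockets, and the four socket layout gives one local, two 1-hop, and one 2-hop node from any socket.

```python
# Illustrative Opteron latencies from the text, keyed by hop count (ns).
OPTERON_NS = {0: 60, 1: 90, 2: 120}

# Assumed hop counts from any one socket in a four socket square layout.
HOP_COUNTS = [0, 1, 1, 2]

# Illustrative older eight socket NUMA figures from the text (ns).
OLD_LOCAL_NS, OLD_REMOTE_NS = 150, 300


def average_latency_ns(hop_counts, latency_ns):
    """Mean access time if accesses are spread evenly over all nodes."""
    return sum(latency_ns[h] for h in hop_counts) / len(hop_counts)


def worst_penalty_ns(local_ns, remote_ns):
    """Extra time for the farthest remote access compared with local."""
    return remote_ns - local_ns


print("Opteron average:", average_latency_ns(HOP_COUNTS, OPTERON_NS), "ns")
print("Opteron worst penalty:",
      worst_penalty_ns(OPTERON_NS[0], OPTERON_NS[2]), "ns")
print("Older NUMA worst penalty:",
      worst_penalty_ns(OLD_LOCAL_NS, OLD_REMOTE_NS), "ns or more")
```

Even the evenly spread Opteron average stays below the local access time quoted for the older systems.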
Most IT shops and ISVs were completely unaware of the special precautions required to architect a database application scalable on NUMA platforms. Several important major ISV applications even have severe negative scaling on NUMA systems, due to key design decisions based on experience with non-NUMA systems or even strictly on theoretical principles completely disconnected and at odds with real platform characteristics. The consequence was that most database applications behaved poorly on NUMA systems relative to a four socket SMP system, whether the DBA knew it or not. So naturally I am very curious to find out whether important Line of Business database applications will scale well on the eight socket Barcelona HT 3.0 platform.
Opteron platforms may have a setting for NUMA or SUMA memory organization. I will try to discuss this in a later revision.
Server System Architecture Summary
The 2006-2007 platforms from both Intel and AMD offer major advances in processor, memory, and IO capability over previous generation platforms. Shops that operate a large number of servers should give serious consideration to replacing older platforms. Significant reductions in the number of servers, the floor space required, and power consumption can be achieved. The floor space savings are especially important if the space is rented from a hosting company.
Should a DBA upgrade to solve performance problems? Any serious problems in the design of a database application and the way it interacts with SQL Server should always be addressed first. After that, balance the cost and benefit of continued software performance tuning with a hardware upgrade. If you have an older (single core) NUMA system, consider stepping down to a two socket quad core or a four socket dual core, or step down when the four socket quad core is available. There is a strong likelihood that the newer smaller platform will have better performance than the older system and will not have adverse NUMA behaviors.
Is either the AMD or the Intel platform better than the other? As of May 2007, at the two socket level, the Intel Quad-Core Xeon 5300 line on the 5000P chipset offers the best performance with good memory and IO capability. At the four socket level, performance is mixed between the Opteron 8200 and the Xeon 7100, with large query DW favoring AMD and high call volume OLTP favoring Intel. However, the Opteron will probably run many applications equally well or better without special tuning skills, and has better memory and IO configurations.
The next generation platform competition will begin soon. In the absence of hard information, I am going to speculate that at four sockets, Barcelona will have the advantage over the 65 nm Tigerton processor and Clarksboro chipset. This is based on the two socket quad core Xeon X5355 having comparable performance to the four socket dual core Opteron 2.8 GHz. Projecting forward, the Xeon 7300/Clarksboro combination gets the snoop filter plus 256 GB memory, up from 64 GB. The Barcelona transition benefits from micro-architecture enhancements. The 45 nm Penryn processor with Seaburg chipset will have the advantage at two sockets. This is based on the architecture enhancements in Penryn and the frequency headroom of the 45 nm process. A four socket Barcelona advantage would put pressure on Intel to make the 45 nm processor available with the Clarksboro chipset sooner rather than later. Will AMD elevate the competition to eight socket platforms?