One of the most difficult database operations to scale up on high-end server systems is the network call. In fact, without special precautions, a high call volume database application can exhibit severe negative scaling on large systems. This problem is suspected to be more pronounced on NUMA systems than on basic SMP systems: a NUMA system with 16 processors can have significantly lower throughput capability than an otherwise comparable 4-way SMP system. Many SQL Server performance experts have long advocated consolidating calls to SQL Server because the call overhead is high relative to the cost of simple SQL operations, such as a single-row index seek. There seems to be little awareness that this overhead can escalate dramatically relative to the cost of SQL operations on NUMA systems with high processor counts. Given the number of important applications that exhibit the high call volume characteristic, it is rather curious that there is essentially no meaningful public explanation of the special precautions (read: undocumented features) needed to avoid negative scaling on large systems for SQL Server 2000.
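The consolidation point can be illustrated with any client-server database API. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server round trips (the table and data are hypothetical; with an in-process engine there is no network, but against a real client-server database each execute call below would be a separate round trip):

```python
import sqlite3

# In-process sqlite3 stands in for a networked database; against a real
# client-server RDBMS each execute() below would cost a network round trip.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, float(i) * 10) for i in range(1, 101)])

wanted = list(range(1, 101))

# Chatty pattern: one call (round trip) per single-row index seek.
rows_chatty = [conn.execute("SELECT amount FROM orders WHERE id = ?",
                            (i,)).fetchone()[0] for i in wanted]

# Consolidated pattern: one call retrieves the entire set.
placeholders = ",".join("?" * len(wanted))
rows_batched = [r[0] for r in conn.execute(
    f"SELECT amount FROM orders WHERE id IN ({placeholders}) ORDER BY id",
    wanted)]

assert rows_chatty == rows_batched  # same result, 100x fewer calls
```

The SQL work per row is nearly identical in both patterns; what the consolidated form eliminates is the per-call overhead, which is precisely the component that degrades on large NUMA systems.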
Three possible explanations for failing to consolidate calls in current generation applications are ignorance, ineptitude, and mainframe heritage. The third item is exempt from the first two classifications because mainframe systems may have been deliberately designed to sustain high call volume operations. One example is SAP, which appears to select a block of rows from one table, then retrieve one row per call from another table. The queries do not even employ a join, as if the application architecture pre-dates relational databases.
Microsoft has announced features in SQL Server 2005 to handle high call volumes. Some of this is documented in sections describing NUMA features, including the ability to better utilize NUMA system memory organization and the ability to map client connections to NUMA nodes by network port. The second item may be critical in achieving high call volume capability. There is some documentation mentioning the ability of SQL Server 2000 to optimize for NUMA memory organization, but no apparent documentation regarding client connection and CPU group affinity.
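As I recall, the SQL Server 2005 documentation exposes this port-to-node mapping through the TCP port configuration: a bitmask of NUMA nodes can be appended in brackets after the port number. A fragment of the form below (the specific ports and masks are illustrative, not a recommendation) would direct connections arriving on one port to one node and connections on a second port to another:

```
TCP Port: 1433[0x1],1434[0x2]
```

Here connections to port 1433 would be affinitized to the first NUMA node and connections to port 1434 to the second; consult the SQL Server 2005 documentation for the exact syntax and semantics before relying on this.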
One might conclude that this capability is absent from SQL Server 2000. However, it would be very difficult for SQL Server 2000 to reach the high scores achieved in the published TPC-C benchmarks on NUMA systems with 16 or more processors without it. All of these systems use the VI protocol and VI-capable Fibre Channel adapters for client connections. This by itself may help performance relative to the TCP/IP protocol over Gigabit Ethernet, but is unlikely to be sufficient. An examination of the TPC-C full disclosure reports shows configuration settings that appear to be connection affinity masks by network port. If this were important only for the TPC-C benchmark, additional documentation might not be of much interest, but there are important actual customer applications that may depend on this capability, so proper documentation from Microsoft or the NUMA system vendors would be very helpful.
NUMA-Based Server Systems
Current generation systems with more than four processors are frequently NUMA (Non-Uniform Memory Architecture) systems. The path from processor to memory has different latencies depending on whether the processor and memory reside in the same node or in different nodes. (Some older 8-way Pentium III Xeon systems are not NUMA.) Three NUMA chipsets are relevant to Itanium 2 server systems in Windows environments: the Intel E8870 chipset, an HP chipset for the Integrity and Superdome lines, and a third chipset for NEC systems. Unisys and IBM each have a NUMA chipset for the Intel Xeon MP processors; Unisys uses the Intel E8870 chipset for their Itanium systems. Each of the above NUMA systems supports four processors per cell.
All multi-processor Opteron systems are technically NUMA, in that each processor has a direct connection both to memory and to the other processors, so latency differs between directly attached local memory and non-local memory reached in one or more hops through other processors. However, the other NUMA chipsets have a much larger difference between local and non-local memory latency, typically in the range of 120-150ns for local memory and 250-300ns for non-local memory. A 2-way Opteron system has a direct-attach memory latency on the order of 60ns and a one-hop non-local memory latency of 90ns (see AMD documentation for accurate values).
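A rough blended-latency calculation shows why the local/remote gap matters. With memory interleaved evenly across four nodes, only one access in four lands on local memory (the latency figures are the midpoints of the representative ranges quoted above, not measurements):

```python
# Representative latencies (ns) from the ranges above; not measured values.
local_ns, remote_ns = 135.0, 275.0   # midpoints of 120-150 and 250-300
nodes = 4

# With a simple interleave, 1/nodes of accesses hit local memory
# and (nodes-1)/nodes of accesses go to remote nodes.
blended = local_ns / nodes + remote_ns * (nodes - 1) / nodes
print(f"blended latency ~{blended:.0f}ns vs {local_ns:.0f}ns all-local")
```

So on a 4-node system with a simple interleave, the average memory latency sits much closer to the remote figure than to the local one, which is the motivation for node-local memory allocation.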
Going forward with the trend to multi-core processors, the term processor should be replaced by socket or core as appropriate, to distinguish between the package that plugs into a motherboard and the entity that appears as a single physical processor to the operating system.
Figure 1 shows a NUMA system based on the Intel E8870 chipset using the SPS module. Memory attaches to the DMH. The short path to local memory is from processor to the local SNC to the DMH to memory and back. The longer path to non-local memory is from processor to the local SNC to the SPS to the remote SNC to the remote DMH to memory and back.
There are some references to the way the Windows operating system organizes memory on these systems. The common organization is a simple interleave: cache line 1 is spread across the 4 DMH memory hubs of the first SNC, the second line is on the second SNC, and so on to the fourth SNC; line 5 wraps back to the first SNC. An alternative option is for processors located in one SNC to allocate memory from the local SNC. There is a SQL Server startup trace flag T825 referenced as the NUMA option, but there does not appear to be any public documentation on it.
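The interleave described above reduces to a round-robin mapping from cache-line index to SNC node, which can be sketched as follows (the four-node count and one-line-per-node granularity are taken from the description; other configurations would differ):

```python
def snc_for_line(line_index: int, nodes: int = 4) -> int:
    """With a simple interleave, consecutive cache lines rotate
    round-robin across the SNC nodes (0-based line index)."""
    return line_index % nodes

# Lines 0-3 land on SNC 0 through SNC 3; line 4 wraps back to SNC 0,
# matching the text's 1-based description of line 5 returning to the
# first SNC.
assert [snc_for_line(i) for i in range(5)] == [0, 1, 2, 3, 0]
```

Under this mapping a processor streaming sequential data touches every node in turn, which is what the node-local allocation alternative is meant to avoid.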