High Call Volume SQL Server Applications on NUMA Systems
The Microsoft Platform SDK documentation mentions two aspects of NUMA support. One involves scheduling threads on processors in the same cell as the memory required. The other allows a thread to allocate memory from the same node.
Figure 1 Intel E870 chipset for 16-way Itanium 2 System
High Call Volume Load Tests
A single call from an application (on a different system) to the SQL Server system may consist of one or more network packet transmissions and one or more receive packets, depending on the amount of data sent in each direction per call, the network packet size, and window size. For this reason, a call to SQL Server should not be referred to as a network round-trip. It is not possible to determine the number of round-trips from the network interface performance counters, which only tracks packet and byte count. SQL Server calls can be measured in Performance Monitor by the object: ‘SQL Server: SQL Statistics’, the counter: ‘Batch Request/sec’. In SQL Profiler, this is measured by the Stored Procedure event ‘RPC:Completed’ and the TSQL event ‘SQL:BatchCompleted’. It is implied that in order to generate a high call volume, the vast majority of calls must be relatively inexpensive and quick to execute.
To explore the characteristics of SQL Server behavior in high call volume applications, a series of tests were conducted on a range of platforms from mid-2004 to present. The test stored procedure generates a single random number, performs another arithmetic operation, and returns the SPID. There are no table operations. An otherwise equivalent stored procedure that performs a single row index seek from a table entirely in the buffer cache is only slightly more expensive than the no-op call (in the range of 15%).
The load generator system is either a 2-way Xeon or a 4 x 2.7GHz Xeon MP. All network traffic runs over a single Gigabit Ethernet network. The network controllers employed include the Broadcom BCM570x, the Intel PRO/1000MT and the Intel PRO/1000XT. There were some performance improvements from the early Windows 2000 drivers to the more recent drivers, but no significant performance differences between driver versions released after 2004 have been observed. The operating system is Windows Server 2003, with hot-fixes current at the time of the test. The SQL Server 2000 versions ranged from SP3, hot-fix build 911, 1027 and SP4.
Figure 2 RPC performance for various 2-way platforms
Figure 2 shows the calls per second performance for various 2-way platforms with between 1 – 20 client connections. The first system is a 2-way 3.06GHz Xeon, 512K cache, 533MHz FSB on the ServerWorks GC-LE chipset. The second system is a 2×3.2GHz Xeon, 1M L2 cache, 800MHz FSB on the Intel E7520 chipset. The third system is a 2×2.2GHz Opteron system. The network controller is either the Broadcom or Intel PRO/1000MT. Data is listed in Appendix A.
Figure 3 compares the 4-way 2.7GHz Xeon MP and 16-way 3.0GHz Xeon MP call volume performance. The 4-way system can drive more than 2 times more calls per second than the 16-way system. There are good examples of SQL Server queries emphasizing data intensive operations rather than network round-trips that show excellent scaling on the 16-way system. So there are no fundamental problems with NUMA systems, only the problem of matching the software stack to the hardware characteristics.
Figure 3 Client Connections