High Call Volume SQL Server Applications on NUMA Systems
The Microsoft KB article on the Interrupt Affinity tool (support.microsoft.com/default.aspx?scid=KB;EN-US;Q252867) states that Windows 2000 assigns interrupts to any available processor, and that performance may improve if each network adapter is assigned to a specific processor. It is possible that Windows Server 2003 changed the default behavior as suggested and assigns each interrupt to a specific processor. Excerpts from this KB article are in Appendix B.
Figure 4 shows the individual CPU utilization from Windows Task Manager on a Unisys 16-way Xeon MP system running Windows Server 2003 while sustaining 17K calls per second. Note that CPU 10 (counting up from 0) is at nearly 100% utilization. It is suspected that this is the processor handling the network interrupt, but the steps necessary to prove this were not conducted. There was no disk activity in this test, no other processes were running, and nothing else was generating network activity. If this interpretation is correct, then the call handling capability of the 16-way system is saturated even though the other processors are not close to fully loaded. An actual production server (16-way Itanium 2) running SAP exhibited essentially the same characteristics shown in Figure 4. Applying any additional network traffic to the connection handling SQL Server calls degraded call volume performance, but generating traffic on a different network connection not used by the active SQL Server clients did not degrade performance.
Figure 4 16 x 3.0GHz Xeon MP system at 17K SQL Server RPC calls/sec
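To put the saturated processor in perspective, the following Python sketch works out how much time and how many clock cycles a single fully busy 3.0GHz processor has per call at 17K calls per second. It rests on the unproven interpretation above, namely that one processor does some work for every call; the function name per_call_budget is purely illustrative.

```python
# Back-of-the-envelope budget for a single saturated processor.
# Assumption (not proven in the test): one CPU does some work for
# every call and is 100% busy at the sustained call volume.

CALLS_PER_SEC = 17_000   # sustained call volume reported for Figure 4
CLOCK_HZ = 3.0e9         # 3.0GHz Xeon MP


def per_call_budget(calls_per_sec: float, clock_hz: float) -> tuple[float, float]:
    """Return (microseconds, clock cycles) available per call on one fully busy CPU."""
    seconds = 1.0 / calls_per_sec
    return seconds * 1e6, seconds * clock_hz


if __name__ == "__main__":
    usec, cycles = per_call_budget(CALLS_PER_SEC, CLOCK_HZ)
    print(f"{usec:.1f} us per call, about {cycles:,.0f} cycles")
```

Under these assumptions the budget is a little under 59 microseconds per call, on the order of 176,000 clock cycles. If the saturated processor really is the bottleneck, call volume cannot rise unless the work done on that one processor per call shrinks or is spread across more processors.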
It is possible that distributing the network interrupt over more processors could improve call volume performance. It could also be speculated that excluding one or more processors from the SQL Server process affinity and binding the network interrupt to the excluded processor(s) might help, but the net gain is not clear. Another point to note is that the CPU cost per call on the 16-way system is much higher than that of the 4-way system. So even if the CPU load could be evenly distributed, the performance with all 16 processors saturated may be no better than the 4-way call volume performance. It could be that there is substantial cost in having one processor handle the interrupt and then hand off the call to a SQL Server thread running on a processor in a different node.
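The idea of excluding processors from the SQL Server process affinity comes down to a bitmask with one bit per processor, which is the form the SQL Server affinity mask configuration option takes. The Python sketch below computes such a mask for a 16-way system. CPU 10 is excluded only because it is the saturated processor in Figure 4; whether it actually handles the network interrupt was not proven, and the helper name affinity_mask is hypothetical.

```python
def affinity_mask(total_cpus: int, excluded: set[int]) -> int:
    """Bitmask with one bit per CPU (bit 0 = CPU 0), with the excluded CPUs cleared."""
    for cpu in excluded:
        if not 0 <= cpu < total_cpus:
            raise ValueError(f"CPU {cpu} is outside 0..{total_cpus - 1}")
    mask = (1 << total_cpus) - 1      # start with every CPU enabled
    for cpu in excluded:
        mask &= ~(1 << cpu)           # clear the bit for each excluded CPU
    return mask


if __name__ == "__main__":
    # Illustrative choice: keep SQL Server off CPU 10 on a 16-way system.
    mask = affinity_mask(16, {10})
    print(f"decimal {mask}, hex 0x{mask:04X}, binary {mask:016b}")
```

The mask only keeps SQL Server threads off the chosen processor. Binding the network interrupt to that processor is a separate step performed with the Interrupt Affinity tool, and as noted above the net gain of the combination is not clear.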
Figure 5 shows the call volume scaling characteristics on an 8-way Itanium 2 system (HP rx8620, 1.5GHz processors). There are 4 processors in each of 2 cells. The call volume test was conducted with the system booted to 1, 2, 4, and 8 processors using the NUMPROC option in the EFI OS loader (equivalent to the boot.ini file in 32-bit systems).
Figure 5 Call volume performance for HP rx8620 booted to 1, 2, 4 and 8 processors.
It was not determined in the 2 and 4 CPU tests whether all processors were in a common cell. Call volume scaling shows only marginal improvement from 1 to 2 processors (13.5K to 16.5K calls per second), no gain from 2 to 4 processors, and some degradation from 4 to 8 processors. It is possible that one or both of the 2 & 4 processor tests.
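The 1 and 2 processor figures quoted above can be expressed as a speedup and a per-processor scaling efficiency. The Python sketch below uses only those two reported values (13.5K and 16.5K calls per second); the function name scaling is illustrative, and the 4 and 8 processor points are left out because exact values are not stated in the text.

```python
def scaling(baseline_calls: float, calls: float, cpus: int) -> tuple[float, float]:
    """Return (speedup over the 1-CPU baseline, efficiency = speedup / cpus)."""
    speedup = calls / baseline_calls
    return speedup, speedup / cpus


if __name__ == "__main__":
    # Call volumes (calls/sec) stated in the text for 1 and 2 processors.
    results = {1: 13_500, 2: 16_500}
    baseline = results[1]
    for cpus, calls in results.items():
        speedup, efficiency = scaling(baseline, calls, cpus)
        print(f"{cpus} CPU: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Going from 1 to 2 processors gives a speedup of about 1.22, so the second processor adds roughly 22% rather than anything close to 100%. The same function can be applied to values read off Figure 5 for the 4 and 8 processor runs.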