How to Do SQL Server Performance Trend Analysis Part 4: Storing Performance Data in SQL Server

This is part four of a four-part tutorial. This part discusses how to interpret the most common NT Server and SQL Server Performance Monitor counters.

Introduction

In the previous three parts of this tutorial, you learned how to use Performance Monitor to collect SQL Server-related performance data, store it in SQL Server, and how to perform trend analysis on it. Now it is time to take a look at how to interpret the data you have collected.

For the purposes of this article, the Performance Monitor data can be in any form, whether it is being viewed with Performance Monitor itself, Microsoft Excel, or any other program. All you need is access to the results of your data. How you interpret your data does not depend on the format in which it is viewed.

While our main focus on this article is on SQL Server performance, we must keep in mind that the operating system (whether it is Windows NT Server 4.0 or Windows 2000) is closely intertwined with SQL Server, especially when it comes to Performance Monitor counters. In fact, when monitoring SQL Server performance, I tend to monitor more operating system performance counters than SQL Server counters. This is because many of the most important performance counters are a part of the operating system, not SQL Server. Keep this in mind as you read about how to interpret the various performance counters later in this article.

As you may know, Windows NT Server 4.0 and Widows 2000 have over 350 performance monitor counters available. SQL Server 7.0 and SQL Server 2000 both have over 100 performance monitor counters available. As might well imagine, the number of potential performance-related counters is overwhelming. Fortunately, the huge number of performance monitor counters available are seldom used. Generally, I monitor about a dozen counters on a regular basis, and only use some of the more obscure ones when researching specific performance-related problems. In this article, I will focus on the key performance counters only.

Focus on the Big Picture

My goal when I monitor SQL Server with Performance Monitor is to get the big picture, not the details. If the big picture indicates that I need to drill down for more detail, that is OK. But until then, I have better things to do with my time. The big picture allows me to see how my SQL Servers are doing, and that is why I only focus on about a dozen Performance Monitor counters, which are discussed below.

Like most people, I use Performance Monitor to help identify potential performance bottlenecks, which are usually categorized into one of these five area:

CPU: SQL Server can’t do its job if it has run out of CPU cycles, so monitoring this potential bottleneck is important.
Memory: If you want maximum SQL Server performance, your server must not have a memory bottleneck. Sure, the operating system can page if there is not enough physical RAM in the server, but do you really want to want that long?
I/O: Of all the potential performance bottlenecks, disk I/O is the probably hardest one to correct. And like lack of physical RAM, it can significantly affect SQL Server’s performance.
Network: This is generally the least bothersome bottleneck, as most servers today can hardly fill a 100Mbs network connection, no matter how hard they try.
SQL Server Specific: This includes a variety of SQL Server counters that you can monitor to help identify several potential SQL Server performance-related problems.

In the next several sections we will be taking a look at each of the above bottlenecks, and examining the various Performance Monitor counters we can use to help identify them. We will also be taking a look at how to interpret their results, and take a look at how to resolve these various bottlenecks.

CPU Performance Monitor Counters

Measuring the CPU activity of your SQL Server is a key way to identify potential CPU bottlenecks. The Process Object: % Processor Time counter is available for each CPU (instance), and measures the utilization of each individual CPU. While viewing CPU activity for each of the CPUs in your server can be useful, I generally prefer to monitor the total CPU activity for the server, using another counter, described in the next paragraph. If your server has multiple CPUs and you use the above counter to watch each one, you will notice that how busy each CPU is varies considerably. Don’t worry about this. While the operating system does its best to evenly spread the processing load over each CPU, the reality is that this is an imperfect process and different CPUs will be busier than others.

The System Object: % Total Processor Time counter measures the average of all the CPUs in your server. This is the key counter to watch for CPU utilization. If the % Total Processor Time counter exceeds 80% for continuous periods (over 10 minutes or so), then you may have a CPU bottleneck on your server. Occasional spikes of 100% are nothing to worry about and are normal for most SQL Servers.

This counter is also useful when performing trend analysis. For example, if you notice that while the % Total Processor Time counter is well within limits now, but you notice that is increasing each month, month after month, this is a good clue that you may eventually run out of CPU cycles on your server. If you recognize this problem now, then you can better plan for the future.

While the % Total Processor Time counter is important, I don’t like to rely on just a single counter to let me know if a server has a bottleneck or not. Another valuable indicator of CPU performance is the System Object: Processor Queue Length. If the Processor Queue Length exceeds 2 per CPU for continuous periods (over 10 minutes or so), then you probably have a CPU bottleneck. For example, if you have 4 CPUs in your server, the Processor Queue Length should not exceed a total of 8 for the entire server.

Use both the Processor Queue Length and the % Total Process Time counters together to determine if you have a CPU bottleneck. If both indicators are exceeding their recommended amounts during the same continuous time periods, you can be assured there is a CPU bottleneck.

If the Processor Queue Length regularly exceeds the recommended maximum, but the CPU utilization is not correspondingly as high (which is typical), then consider reducing the SQL Server “max worker threads” configuration setting. It is possible the reason that the Processor Queue Length is high is because there are an excess number of worker threads waiting to take their turn. By reducing the number of “maximum worker threads”, what you are doing is forcing thread pooling to kick in (if it hasn’t already), or to take greater advantage of thread pooling.

If your SQL Server is experiencing CPU bottlenecks, consider these possible solutions:

Get faster CPUs, or add additional CPUs.
Get CPUs with a larger L2 cache.
Tune your application so that it doesn’t have to access the disk as often. For example, add indexes so table scans aren’t needed, normalize your database to eliminate redundant data, etc.)
Tune your queries to reduce the CPU load.
Move some of the processing load to another SQL Server.
Consider turning on Windows NT fibers. [7.0, 2000 only]
Be sure that both OLTP and OLAP queries are not being run on the same server. These different database applications should be performed on separate servers.

I/O Performance Counters

If your I/O subsystem is working efficiently, then each time SQL Server wants to write or read data, it can without waiting. But if the load on the server is too great, then reads and writes will have to wait, each taking their turn. This can significantly reduce SQL Server’s performance.

The best way to monitor this is to use the PhysicalDisk Object: Avg. Disk Queue Length to monitor each disk array in your server. If the Avg. Disk Queue Length exceeds 2 for continuous periods (over 10 minutes or so) for each disk drive in an array, then you probably have an I/O bottleneck for that array. You will need to calculate this figure because Performance Monitor does not know how many physical drives are in arrays.

For example, if you have an array of 6 physical disks, and the Avg. Disk Queue Length is 10 for a particular array, then the actual Avg. Disk Queue Length for each drive is 1.66 (10/6=1.66), which is well within the recommended 2 per physical disk.

The Physical Disk Object: % Disk Time counter is another handy tool for several reasons. This counter measures how busy a physical array is (not a logical partition or individual disks in an array). It provides a good relative measure of how busy your arrays are, and over a period of time, can be used to determine if I/O needs are your server are increasing, indicating a potential need for more I/O capacity in the near future.

As a rule of thumb, the % Disk Time counter should run less than 55%. If this counter exceeds 55% for continuous periods (over 10 minutes or so), then your SQL Server may be experiencing an I/O bottleneck. If you suspect a physical disk bottleneck, you may also want to monitor the Physical Disk Object: % Disk Read Time counter and the Physical Disk Object: % Disk Write Time counter in order to help determine if the I/O bottleneck is being mostly caused by reads or writes.

Also, this counter is a good indicator of how busy each array on your server is. By monitoring each array, you can tell how well balanced your I/O is over each array. Ideally, you want to distribute the I/O load of SQL Server as evenly as possible over your arrays, and this counter will tell you how successful you have been doing this. If you find that one array is much busier than another array, you may want to consider moving one or more files from the busy array to the less busy array.

Before using these counters for Window NT Server 4.0, be sure to manually turn it on by going to the NT Command Prompt and entering the following: “diskperf -y”, and then rebooting your server. This is required to turn on the disk counters on for the first time.

If your SQL Server is experiencing I/O bottlenecks, consider these possible solutions:

Add additional physical RAM so that your server can go to RAM instead of the I/O system to access data.
If you are not already, use RAID level 5 or RAID level 10 for your arrays. RAID level 10 is the fastest RAID level you can choose that supports redundancy.
Add more physical drives to the current arrays. This helps to boost both read and write access times. But don’t add more drives to the array than your I/O controller can support.
Replace your current hard drives with faster drives.
Add faster or additional I/O controllers. Consider adding more cache (of possible) to your current controllers.
Tune your application so that it doesn’t have to access the disk as often. For example, add indexes so table scans aren’t needed, normalize your database to eliminate redundant data, etc.)
Move database or transaction log files from busy arrays to less busy arrays.
Store your databases and transaction log files on a SAN (storage area network).
Used partitioned views and federated servers to distribute workload (2000 only).

Continues…

Leave a comment Cancel reply