So, you are the Database Administrator in your company and your boss wants you to “have a look” at a SQL Server that you would be managing from now on. You didn’t know about it before, somebody else had installed it before you came on board, but now your work is to make sure it’s running as smoothly as possible.
This is a fairly common scenarios in a DBA’s work life, but you may be wondering about your next step. Where do you start? What do you do with the server? How do you find out what’s good and bad about it?
Understanding the Server and Environment
The first thing you should do is to perform an “audit” on the server. By “audit”, I mean “taking a good look” at the server and SQL Server’s health – like your boss said. This is a necessary step, in fact the first step in successfully managing the system. Establish a baseline of the server and it becomes clear if you need to do anything.
Purpose of the Server
First of all, try to understand the purpose of the server. What application data is it hosting? Is it a SharePoint back-end for a Project Management application? Is it a stand-by server used for high availability? Is it a stand-alone reporting server for month-end reporting or is it a big, beefy OLTP sever thousands of people are accessing every day? Knowing your server starts with the step of knowing its purpose.
Did the server go through an “integration” process before it was handed over to you? By “integration process” I mean a step-by-step method where the server’s maintenance requirements have been agreed upon. Did your DBA team or the IT department undertake the maintenance task of the server as part of a formal handover process, or has it been dumped on you out of the blue? You need to know this because if it has been formally handed to you or your team, other people are also responsible for its maintenance and you have helping hands available. If not, you are the only one who knows about it and you may need to get other teams involved for its future maintenance.
This principle still holds if you are a paid external consultant doing an audit for a client. You need to let your client know they have a server that needs multiple teams’ involvement.
Production vs. Non-production System
Is it a production box or a non-production system? Non-production systems can include development, testing, QA or training servers.
Why would you want to find out this? Well, all your production servers could all be following an SOE (Standard Operating Environment). This can include operating system, patch level, disk layout, memory etc. If this new system is something different, you would probably want to take note and try to standardize it later.
How important is the server? Is it a mission-critical 24×7 system? Or is this something your company can live without for a few hours or few days? Is there any SLA (Service Level Agreements) attached with it? What’s the SLA?
Is there any governance plan or responsibility matrix defined for it? Is there an escalation process defined for incidents or problems – who do you contact when there is a crisis?
Business and Application Owners
Who owns the server? In other words, who’s your client? Who pays for it? In management terms, who is the “business owner” of the server? This can be a team internal to your organization (e.g. finance, HR, design, production, marketing) or this can be an individual – like a client rep from a company. In most cases you will be dealing with one or two individuals from your client side. Whoever it is, you need to know your contact. You need to build a professional relationship with them from the very beginning.
Why? Because when the time comes, it will be them you would be answering to. It’s them you need to talk to when your server has crashed and you are trying to recover it. It’s them you need to make aware about the risks of no applying the latest service pack.
Knowing the Technical Teams
Know the reps from all the other teams who could be “managing” this server. Typically, this would be a coordinated effort of various roles and teams within IT. Infrastructure or Hosting Services would be looking after the server’s physical health, SAN storage, virtualisation, general server administration and so on. Networking teams would be inevitably involved in configuring access points like load balancers and accelerators. Firewalls, antivirus and security access monitoring would be in the realms of network security team. The applications team could very well have one or two senior developers acting as de-facto administrators of the system.
I find it essential that every DBA grows a friendly and professional relationship with his colleagues from other teams. That’s because it’s them you would be taking some of your demands to and it’s you they would need to consult before making any changes to the server. You could be sharing the same office floor or sitting in opposite sides of the globe but the same principle applies.
Technical Audit: Overall Server Health
Now we start looking at the technical side of the server. In the following sections I have listed a number of health checks that you should consider. Some of these are SQL Server-based; others are from a Windows perspective. For some of the audits you should have full admin access to the system. If you don’t have that access, you need to talk to the server administrators.
Accessing the System
How do you access the DB server? Is it in the same subnet or a different one? Can you access it by name or do you have to use an IP address? Can you access it directly from your desktop with Management Studio or do you have to remotely log in? What type of remote connection should you use: Windows RDP or other tools like VNC? What account do you use: your Windows credentials or a dedicated account? Does your DBA account have full admin access to the server and does it have sysadmin rights to the SQL instance?
Physical vs. Virtual Servers
Is the machine a physical server or a VM? Or is this a clustered environment?
If it’s a physical machine, you may want to know if it has some sort of protection against hardware failure like cluster, Always-On or other DR mechanism. Is the hardware under manufacturer warranty or is it running with ageing parts that can fail any time?
If this is a clustered environment, check the failover cluster administrator applet; see what resources are clustered, check if there are any errors logged.
If this is a virtual machine, see if it has been assigned enough capacity and processing power.
You can find this kind of information with the help of your server infrastructure team. That’s why it makes sense to be in good terms with them.
How are the actual physical resources looking like? Is the CPU under continuous load (85% and above)? Is the memory under continuous pressure with very little free amount? How much RAM is installed? Is it something low like 2GB or is it in somewhat respectable region like 8 GB or more? How much disk space is available? Is there any drive running out of space with less than 10% free?
Check the Local Administrators group in Windows. Does it have only generic service accounts or does it include other people too? Is there any AD group listed there?
Generally, only a handful of accounts should have local administrator rights to the server. If you think there are more accounts listed than there should be, take note. You need to talk to your server administrators about it.
Third-Party Tools and Apps
Is there any application or server installed in this machine other than SQL? What are those applications? Ideally a DB server should not have any other applications installed in it. If there are, you need to know why they have been installed there.
Is there any shared folder in any drive? What accounts have permissions to those shared folders? What accounts have security access there? It’s not uncommon to see the “Everyone” group having full access to shared folders. If that’s the case here, you need to take note and discuss with your app owners and server administrators. This is a security risk.
Is there any 32-bit or 64-bit ODBC connection configured in the machine? What data sources do they point to? Can you test the ODBC connectivity successfully?
Is there any Windows scheduled tasks in this server? Are these jobs enabled? Are they succeeding? If they are failing, check any output log files or event log to investigate the root cause: perhaps it’s because a network folder can’t be reached or a .BAT file is missing?
Antivirus Exclusions Rules
Is the server running any anti-virus software (e.g. McAfee)? What are the scanning rules? Database files are usually always open and AV software is typically configured to exclude them from scanning. Exception to this can cause data files being locked for extended periods of time, resulting in significant I/O delays and high CPU usage. Check if typical database files like .mdf, .ndf, .ldf, .trn., and bak are excluded from virus scanning.
Is the server monitored? Are there any triggers and alerts for trapping critical system events? What events are monitored? What monitoring tool is used and how is it configured? What’s the method of notifying server administrators (e-mail or text message)? Are you going to be notified as a DBA when SQL Service stops unexpectedly or a disk drive runs out of space?
Technical Audit: SQL Server Health
Now, with the general server audit over and done with, let’s start with the SQL part.
SQL Server Components Installed
What components of SQL Server suite is installed in the machine? Is it only SQL Service or is it Analysis Service or Reporting Service as well? Is there any other sub-component like Master Data Services running?
Sometimes you will find systems where SSAS or SSRS has been installed but doesn’t host any databases or reports. You need to find out if the unused services can be stopped and disabled.
Check if there is more than one SQL instance running in the machine. Is any instance stopped or disabled?
SQL Server Version, Edition and Service Pack
What’s the version and edition of SQL Server running in this machine? Is it running an out-of-support version? May be you can recommend upgrading it later. What service pack is installed here? Is it still under Microsoft mainstream support? If not, why can’t the service pack be upgraded?
What service account is SQL Server running under? Has someone left it with the default Local System account (or worse, someone’s Windows AD account) or is it running under a dedicated service account? Take note if it’s not running under a service account because this is not a good practice.
SQL Server Configuration Properties
Check the SQL Server configuration parameters and different facets. You can find the information from SQL Server Properties window, command output of sp_configure or the Configuration Manager applet.
Is the maximum server memory assigned a specific value or is it accessing the entire available RAM? If it’s accessing all the available memory, you need to take note and come back later to change the default value.
Is xp_cmdshell enabled? In most cases this is considered a major security loophole; in your server instance this could be necessary for legacy code to run.
Is Remote DAC enabled? You may want to enable it later so you have a backdoor to connect to the server when it’s unresponsive.
Is OLE automation enabled? What’s the fill factor? How about Max Degree of Parallelism (MAXDOP)? If MAXDOP has a value other than zero, you need to find why it has been changed from the default setting.
What about the network protocols for SQL Server? Is Named Pipes enabled for this instance or is it only TCP/IP? What’s the port number for SQL? For a default instance, is this running under port 1433 or is there a custom port defined? Is there any alias defined for client side? What are those aliases?
Is there any start-up trace flags defined for SQL Service? You can find this from the Configuration Manager’s service property.
Error Logs and Event Logs
Have a look at the SQL Server error log. Look through at least last seven days’ data. Is there any error or warning repeatedly being posted? Backup failures, databases running out of space, SQL service stopping unexpectedly, I/O delays taking more than fifteen seconds, deadlocks, tempdb filling up and other warnings should be noted: These are probably the first things you need to fix in this server.
Similarly, have a look at the SQL Server Agent log. Look for warnings and critical errors.
And while you are checking the error logs, see if they are configured properly. Ideally, an error log should cycle (start a new log) every day and there should be at least seven days’ logs present. Is the server configured to refresh the logs?
Next, check the Windows event logs: namely the Application, System and Security logs. Is there any critical error happening in recent times?
Database File Locations
As a DBA, you should follow a consistent layout for your database servers’ disks. Personally, I use the following convention:
For operating system files and virtual memory space, use C:\ drive. Use D:\ drive for SQL Server and other application binaries (Program Files and Program Files x86)). This is also used for any shared data folders. I designate E:\ drive for only database data files and L:\ drive for only transaction log files. F:\ drive is used for database backup files and finally T:\ drive is dedicated to tempdb data and log files.
Now the separation of data and log files may seem to be a thing of the past in recent days as more and more servers are virtualised and shared storage is used to host data. I still follow this layout for two reasons.
First, it keeps things simpler and neater. You know where your files are. Secondly, there is less chance of disks running out of space: your data files are not competing for space with the backup files and your tempdb has its own dedicated space to grow.
However, it’s not uncommon to see SQL installations where binaries and data files have all been placed under C:\. To mitigate this, you can potentially move the databases to a new drive and also change the default data and log file paths for SQL Server.
Before running any checks on the databases, have an initial look: is there any database in suspect, shutdown, offline, single user or read-only mode? If there is, why is it so? Check with the applications team.
Next, find the recovery modes of the databases. Is there any database in simple recovery mode? Why so?
Database File Properties
What’s the data and log file growth property for each database? Is the data or log restricted to a predefined size or are they allowed growing indefinitely? How much do they grow by?
By default, data files grow by 1 MB and log files by 10% of their size. I always change this because it can cause two types of problems. The first one is related to fragmentation and I/O delays. Every time the data file needs to grow the database storage engine will need to request 1 MB space from the operating system. Any I/O operation will then have to wait till the storage has been assigned. For a busy system the data file could be expanding quite frequently as inserts and updates are happening. Collectively all these micro expansions would create a lot of back-and-forth operations between SQL and Windows, resulting in index and data page fragmentation.
The second problem is with log files growing by a percentage. As an example, a 4 GB log file would grow by 400 MB when an expansion happens. That’s almost half a GB disk space gone in one swoop. The larger the log file, the larger would be the space assigned.
So what do you do? As a rule of thumb, log files should be one fourth or half the size of data files. So first of all, make sure the log files have been shrunk if they have a lot of empty space available. You can easily find this information from the database properties and Management Studio reports. Next, change the file growth properties of both types of files. For example, data files can grow by 512, 256 or 128 MB while log files can grow by say, 128 or 64 MB. This way, you know the files will grow by a definite number and not by some arbitrary value.
Other Database Properties
What’s the compatibility mode of each database? Is it the latest version of SQL or is it from an older version? This can happen if the database has been upgraded to a newer version.
How are the statistics maintained? Are the auto-create and auto-update statistics properties enabled for each database?
And what’s the page verify option? Is it set to checksum or torn page detection? Again, the latter can happen if the database has been upgraded from a previous version.
Is service broker enabled for any of the databases? Is partitioning being used? Has Transparent Data Encryption (TDE) been enabled? Run queries to find the relevant information from sys.databases or sys.dm_database_encryption_keys views.
And while you are at it, check if any database has a full text catalogue defined. If there is, look for schedule jobs that update it periodically.
Now check if the databases are backed up regularly. When was the last time they were fully backed up? When was the last time their transaction logs were backed up? If the answer to either of these questions is “never”, you have a problem at hand: you need to start backing up the databases.
Look for maintenance plans or customised jobs that back up the databases. Check if these jobs are enabled and succeeding. If there are no SQL jobs but databases are still showing recent backup dates then it could be happening through a third-party tool. Check with your application owners or server administrators.
Check the tempdb database. How many data files has it got? Is the number of data files same as the number of logical CPUs shown in Task Manager? If not, the tempdb is not optimally configured for I/O. Check the data file sizes. Are they all of the same size? Are they big enough? It’s not uncommon to see a high-transaction OLTP system having a 4 GB tempdb database. Increase the data file sizes and the number of data files as necessary.
Master Database Objects
Check the master database for non-system, user-generated objects like tables, stored procedures or functions. This is bad programming practice but unfortunately many shops have them and you can’t do much to change because clearing the master database would mean significant changes in application. All you can do is to talk to your application team so they stop doing this and consider re-architecting the application.
Now it’s time to check the physical health of the databases. Run the DBCC CHECKDB command against each of the databases and watch for any error messages. Sometimes the command will fail when the database has been upgraded from a previous version. Run the DBCC UPDATEUSAGE command and then run CHECKDB again.
If there are databases failing CHECKDB with a minimum option of REPAIR_ALLOW_DATA_LOSS, you need to talk to your application owners. The database has physical consistency errors and you need to understand what your client wants you to do. You can restore the database from a recent good backup if one is available.
Check the index fragmentation in each database. You can run a query involving dynamic performance function (sys.dm_db_index_physical_stats) and look through the average fragmentation in percent column for values above 30. Alternatively, run a Management Studio Report called Index Physical Statistics. If there are tables with more than 30% fragmentation, you need to consider rebuilding those indexes.
Database Maintenance Plans can automate some DBA tasks like auto-updating statistics or shrinking database files. Check the instance and see if Maintenance Plans have been created. Mainly you would want to see if scheduled jobs are shirking database files, updating statistics, reorganizing and rebuilding indexes, running database consistency checks and so on. If there are no such jobs, take note; maintenance plans need to be created. Similarly, investigate the root cause if the jobs are present but failing.
SQL Server Agent Jobs
Jobs are integral part of many SQL Server applications. They can perform scheduled repetitive tasks and implement custom business logic. We talked about Maintenance Plan jobs in the previous section; in some systems these are all you would find. Other systems will have dozens of custom jobs running in all different schedules. As a DBA, you don’t need to know what these jobs are doing; you need to know two things only.
Firstly, what jobs are failing and what’s the root cause of the failure? Is it because a database or stored procedure is missing? Or is it because the job’s owner doesn’t have sufficient privileges or is it due to a primary key violation? How long the job’s been failing? If it’s been failing for a long time and nobody has reported any error, ask if the job can be disabled.
Secondly, check the jobs’ owners. Are these people’s Windows AD accounts? There is a potential problem here. People’s AD accounts can have full rights in the network and the job would successfully run as long as the account is active. But once the person creating the job leaves the company, his account would probably be deactivated, disabled or deleted. The job would then start to fail. That’s why it’s best to run jobs under SQL Server Agent service account or another application-specific system account: accounts that don’t get deleted or locked out.
As you finish your audit of the jobs, take a quick look and see if any operators, proxies or credentials have been defined for the server.
What linked servers have been created in this SQL instance? Are they working? Can you access the remote data sources by expanding the “linked servers” node? What security credentials are they using?
Is database mail enabled and configured in this server? Check the database mail log for errors. Send a test e-mail from SQL to your own e-mail address to check the functionality.
Is the server configured for high availability?
We touched the topic of checking Windows clustering error messages from the Failover Cluster Administrator applet. What about Always-On? Is there an availability group defined? Check the status and see if the node is healthy.
Similarly, check the status of log shipping or mirroring if they have been configured. You can run a Management Studio report for transaction log shipping status. Successfully mirrored databases will have a status of “Synchronized” or “Synchronizing” written beside the database name; you can also use the database Mirroring Monitor for this purpose.
Is the server replicating data with another system? Is it a publisher or a subscriber? Launch the Replication Monitor and see if any replications are failing or have expired. You need to reinitialize failed or expired replications.
Other SQL Server Technologies
There can be a number of newer technologies enabled in the instance. We will briefly touch on those.
Is Resource Governor enabled? What Resource Pools have been defined? Can you manually evaluate them and see the results?
Similarly, is Policy-based Management enabled? Is there any customized, non-system specific policy defined? Can you evaluate them?
Is there any server-based audits defined? What are the specifications for those audits?
Has Management Data Warehouse (MDW) been enabled? Are the Data Collectors running? Are the SQL Server jobs collecting data for MDW running successfully or are they failing? Can you connect to the MDW and view reports?
This is a topic that can take an entire article by itself. Here are few things you can check in your SQL Server.
First of all, who are the members of the sysadmin server role? Accounts under this role have full system-wide privileges and naturally you want to restrict that. Ideally, only DBA accounts and perhaps some application service accounts should be members of this role. If that’s not the case, take note. You need to talk to your app owners about potential security threat and plan for hardening security.
Similarly, if you have Reporting Service or Analysis Service running in the machine, find out who are the system administrators for those instances.
Who are the database owners in each SQL database? Again, a database owner has full access in the database it owns. Is any owner a user account? If so, why the user needs unrestricted access to the database?
Is guest account enabled in any of the databases? Talk with your database developers and designers to see why this has been left enabled.
Is the server running under Windows authentication or Mixed-mode authentication? If in mixed-mode, has the sa account been configured with a password policy? Is there a large number of SQL logins listed? Run a query to find if they also have password polices and expiration rules enabled.
For further details on security audit, you can go tho this link:
This is the first installation of a three-part series I wrote for sql-server-performance.com
This is by no means an exhaustive list, but this will at least give you a starting point. And of course each server instance is different, so not every point described here is applicable for every server.
What you can do though is to create a generic template that includes all the questions and checkboxes and create a new document from it each time you run an audit. You can create such simple templates using Excel spreadsheets or simple Word forms.
When you have finished your investigation and documented your findings, categorize the problem areas by severity levels. You can perhaps start with three generic levels of importance: high, medium and low. Backups not happening for any database – that’s a critical problem. Tempdb not configured properly – perhaps that can be classified as medium. You can also color code the severity levels; red for critical problems, orange for medium intensity problems and yellow for low importance ones.
If your team management is waiting for a report from your audits, you should also create another document. This will be a high-level, bird’s eye view of the detailed audit and preferably a presentation file or a short Word document. Managers are busy people like you and they are interested about only the most pressing matters. Your high-level presentation should highlight the most critical findings, list your recommendations for resolving those issues and also mention any risks or costs associated with the proposed solutions.]]>