SQL Server Audit Walkthrough
So, you are the Database Administrator in your company and
your boss wants you to “have a look” at a SQL Server that you would be managing
from now on. You didn’t know about it before, somebody else had installed it
before you came on board, but now your work is to make sure it’s running as
smoothly as possible.
This is a fairly common scenarios in a DBA’s work life, but
you may be wondering about your next step. Where do you start? What do you do
with the server? How do you find out what’s good and bad about it?
Understanding the Server and Environment
The first thing you should do is to perform an “audit”
on the server. By “audit”, I mean “taking a good look” at the server and SQL Server’s
health – like your boss said. This is a necessary step, in fact the first step
in successfully managing the system. Establish a baseline of the server and it
becomes clear if you need to do anything.
Purpose of the Server
First of all, try to understand the purpose of the server.
What application data is it hosting? Is it a SharePoint back-end for a Project
Management application? Is it a stand-by server used for high availability? Is
it a stand-alone reporting server for month-end reporting or is it a big, beefy
OLTP sever thousands of people are accessing every day? Knowing your server
starts with the step of knowing its purpose.
Did the server go through an “integration” process before
it was handed over to you? By “integration process” I mean a step-by-step
method where the server’s maintenance requirements have been agreed upon. Did
your DBA team or the IT department undertake the maintenance task of the server
as part of a formal handover process, or has it been dumped on you out of the
blue? You need to know this because if it has been formally handed to you or
your team, other people are also responsible for its maintenance and you have
helping hands available. If not, you are the only one who knows about it and
you may need to get other teams involved for its future maintenance.
This principle still holds if you are a paid external
consultant doing an audit for a client. You need to let your client know they
have a server that needs multiple teams’ involvement.
Production vs. Non-production System
Is it a production box or a non-production system?
Non-production systems can include development, testing, QA or training servers.
Why would you want to find out this? Well, all your
production servers could all be following an SOE (Standard Operating Environment).
This can include operating system, patch level, disk layout, memory etc. If
this new system is something different, you would probably want to take note
and try to standardize it later.
How important is the server? Is it a mission-critical
24×7 system? Or is this something your company can live without for a few hours
or few days? Is there any SLA (Service Level Agreements) attached with it? What’s
Is there any governance plan or responsibility matrix
defined for it? Is there an escalation process defined for incidents or
problems – who do you contact when there is a crisis?
Business and Application Owners
Who owns the server? In other words, who’s your client?
Who pays for it? In management terms, who is the “business owner” of the
server? This can be a team internal to your organization (e.g. finance, HR,
design, production, marketing) or this can be an individual – like a client rep
from a company. In most cases you will be dealing with one or two individuals
from your client side. Whoever it is, you need to know your contact. You need
to build a professional relationship with them from the very beginning.
Why? Because when the time comes, it will be them you
would be answering to. It’s them you need to talk to when your server has
crashed and you are trying to recover it. It’s them you need to make aware about
the risks of no applying the latest service pack.
Knowing the Technical Teams
Know the reps from all the other teams who could be
“managing” this server. Typically, this would be a coordinated effort of
various roles and teams within IT. Infrastructure or Hosting Services would be
looking after the server’s physical health, SAN storage, virtualisation, general
server administration and so on. Networking teams would be inevitably involved
in configuring access points like load balancers and accelerators. Firewalls,
antivirus and security access monitoring would be in the realms of network
security team. The applications team could very well have one or two senior
developers acting as de-facto administrators of the system.
I find it essential that every DBA grows a friendly and
professional relationship with his colleagues from other teams. That’s because
it’s them you would be taking some of your demands to and it’s you they would
need to consult before making any changes to the server. You could be sharing
the same office floor or sitting in opposite sides of the globe but the same
Technical Audit: Overall Server Health
Now we start looking at the technical side of the server.
In the following sections I have listed a number of health checks that you
should consider. Some of these are SQL Server-based; others are from a Windows
perspective. For some of the audits you should have full admin access to the system.
If you don’t have that access, you need to talk to the server administrators.
Accessing the System
How do you access the DB server? Is it in the same subnet
or a different one? Can you access it by name or do you have to use an IP
address? Can you access it directly from your desktop with Management Studio or
do you have to remotely log in? What type of remote connection should you use: Windows
RDP or other tools like VNC? What account do you use: your Windows credentials
or a dedicated account? Does your DBA account have full admin access to the
server and does it have sysadmin rights to the SQL instance?
Physical vs. Virtual Servers
Is the machine a physical server or a VM? Or is this a
If it’s a physical machine, you may want to know if it
has some sort of protection against hardware failure like cluster, Always-On or
other DR mechanism. Is the hardware under manufacturer warranty or is it
running with ageing parts that can fail any time?
If this is a clustered environment, check the failover
cluster administrator applet; see what resources are clustered, check if there
are any errors logged.
If this is a virtual machine, see if it has been assigned
enough capacity and processing power.
You can find this kind of information with the help of
your server infrastructure team. That’s why it makes sense to be in good terms with
How are the actual physical resources looking like? Is
the CPU under continuous load (85% and above)? Is the memory under continuous
pressure with very little free amount? How much RAM is installed? Is it
something low like 2GB or is it in somewhat respectable region like 8 GB or
more? How much disk space is available? Is there any drive running out of space
with less than 10% free?
Check the Local Administrators group in Windows. Does it have
only generic service accounts or does it include other people too? Is there any
AD group listed there?
Generally, only a handful of accounts should have local
administrator rights to the server. If you think there are more accounts listed
than there should be, take note. You need to talk to your server administrators
Third-Party Tools and Apps
Is there any application or server installed in this
machine other than SQL? What are those applications? Ideally a DB server should
not have any other applications installed in it. If there are, you need to know
why they have been installed there.
Is there any shared folder in any drive? What accounts
have permissions to those shared folders? What accounts have security access
there? It’s not uncommon to see the “Everyone” group having full access to
shared folders. If that’s the case here, you need to take note and discuss with
your app owners and server administrators. This is a security risk.
Is there any 32-bit or 64-bit ODBC connection configured
in the machine? What data sources do they point to? Can you test the ODBC connectivity
Is there any Windows scheduled tasks in this server? Are
these jobs enabled? Are they succeeding? If they are failing, check any output
log files or event log to investigate the root cause: perhaps it’s because a
network folder can’t be reached or a .BAT file is missing?
Antivirus Exclusions Rules
Is the server running any anti-virus software (e.g.
McAfee)? What are the scanning rules? Database files are usually always open
and AV software is typically configured to exclude them from scanning.
Exception to this can cause data files being locked for extended periods of
time, resulting in significant I/O delays and high CPU usage. Check if typical
database files like .mdf, .ndf, .ldf, .trn., and bak are excluded from virus
Is the server monitored? Are there any triggers and
alerts for trapping critical system events? What events are monitored? What
monitoring tool is used and how is it configured? What’s the method of
notifying server administrators (e-mail or text message)? Are you going to be
notified as a DBA when SQL Service stops unexpectedly or a disk drive runs out
Technical Audit: SQL Server Health
Now, with the general server audit over and done with,
let’s start with the SQL part.
SQL Server Components Installed
What components of SQL Server suite is installed in the
machine? Is it only SQL Service or is it Analysis Service or Reporting Service
as well? Is there any other sub-component like Master Data Services running?
Sometimes you will find systems where SSAS or SSRS has
been installed but doesn’t host any databases or reports. You need to find out
if the unused services can be stopped and disabled.
Check if there is more than one SQL instance running in
the machine. Is any instance stopped or disabled?
SQL Server Version, Edition and Service Pack
What’s the version and edition of SQL Server running in
this machine? Is it running an out-of-support version? May be you can recommend
upgrading it later. What service pack is installed here? Is it still under
Microsoft mainstream support? If not, why can’t the service pack be upgraded?
What service account is SQL Server running under? Has
someone left it with the default Local System account (or worse, someone’s
Windows AD account) or is it running under a dedicated service account? Take
note if it’s not running under a service account because this is not a good
SQL Server Configuration Properties
Check the SQL Server configuration parameters and different
facets. You can find the information from SQL Server Properties window, command
output of sp_configure or the Configuration Manager applet.
Is the maximum server memory assigned a specific value or
is it accessing the entire available RAM? If it’s accessing all the available
memory, you need to take note and come back later to change the default value.
Is xp_cmdshell enabled? In most cases this is considered
a major security loophole; in your server instance this could be necessary for
legacy code to run.
Is Remote DAC enabled? You may want to enable it later so
you have a backdoor to connect to the server when it’s unresponsive.
Is OLE automation enabled? What’s the fill factor? How
about Max Degree of Parallelism (MAXDOP)? If MAXDOP has a value other than
zero, you need to find why it has been changed from the default setting.
What about the network protocols for SQL Server? Is Named
Pipes enabled for this instance or is it only TCP/IP? What’s the port number
for SQL? For a default instance, is this running under port 1433 or is there a
custom port defined? Is there any alias defined for client side? What are those
Is there any start-up trace flags defined for SQL
Service? You can find this from the Configuration Manager’s service property.
Error Logs and Event Logs
Have a look at the SQL Server error log. Look through at
least last seven days’ data. Is there any error or warning repeatedly being posted?
Backup failures, databases running out of space, SQL service stopping
unexpectedly, I/O delays taking more than fifteen seconds, deadlocks, tempdb
filling up and other warnings should be noted: These are probably the first
things you need to fix in this server.
Similarly, have a look at the SQL Server Agent log. Look
for warnings and critical errors.
And while you are checking the error logs, see if they
are configured properly. Ideally, an error log should cycle (start a new log)
every day and there should be at least seven days’ logs present. Is the server
configured to refresh the logs?
Next, check the Windows event logs: namely the
Application, System and Security logs. Is there any critical error happening in
Database File Locations
As a DBA, you should follow a consistent layout for your
database servers’ disks. Personally, I use the following convention:
For operating system files and virtual memory space, use
C:\ drive. Use D:\ drive for SQL Server and other application binaries (Program
Files and Program Files x86)). This is also used for any shared data folders. I
designate E:\ drive for only database data files and L:\ drive for only transaction
log files. F:\ drive is used for database backup files and finally T:\ drive is
dedicated to tempdb data and log files.
Now the separation of data and log files may seem to be a
thing of the past in recent days as more and more servers are virtualised and
shared storage is used to host data. I still follow this layout for two
First, it keeps things simpler and neater. You know where
your files are. Secondly, there is less chance of disks running out of space: your
data files are not competing for space with the backup files and your tempdb
has its own dedicated space to grow.
However, it’s not uncommon to see SQL installations where
binaries and data files have all been placed under C:\. To mitigate this, you
can potentially move the databases to a new drive and also change the default
data and log file paths for SQL Server.
Before running any checks on the databases, have an
initial look: is there any database in suspect, shutdown, offline, single user
or read-only mode? If there is, why is it so? Check with the applications team.
Next, find the recovery modes of the databases. Is there
any database in simple recovery mode? Why so?
Database File Properties
What’s the data and log file growth property for each
database? Is the data or log restricted to a predefined size or are they
allowed growing indefinitely? How much do they grow by?
By default, data files grow by 1 MB and log files by 10%
of their size. I always change this because it can cause two types of problems.
The first one is related to fragmentation and I/O delays. Every time the data
file needs to grow the database storage engine will need to request 1 MB space
from the operating system. Any I/O operation will then have to wait till the
storage has been assigned. For a busy system the data file could be expanding quite
frequently as inserts and updates are happening. Collectively all these micro expansions
would create a lot of back-and-forth operations between SQL and Windows,
resulting in index and data page fragmentation.
The second problem is with log files growing by a
percentage. As an example, a 4 GB log file would grow by 400 MB when an
expansion happens. That’s almost half a GB disk space gone in one swoop. The
larger the log file, the larger would be the space assigned.
So what do you do? As a rule of thumb, log files should be
one fourth or half the size of data files. So first of all, make sure the log
files have been shrunk if they have a lot of empty space available. You can
easily find this information from the database properties and Management Studio
reports. Next, change the file growth properties of both types of files. For
example, data files can grow by 512, 256 or 128 MB while log files can grow by
say, 128 or 64 MB. This way, you know the files will grow by a definite number
and not by some arbitrary value.
Other Database Properties
What’s the compatibility mode of each database? Is it the
latest version of SQL or is it from an older version? This can happen if the database
has been upgraded to a newer version.
How are the statistics maintained? Are the auto-create
and auto-update statistics properties enabled for each database?
And what’s the page verify option? Is it set to checksum
or torn page detection? Again, the latter can happen if the database has been
upgraded from a previous version.
Is service broker enabled for any of the databases? Is
partitioning being used? Has Transparent Data Encryption (TDE) been enabled?
Run queries to find the relevant information from sys.databases or sys.dm_database_encryption_keys
And while you are at it, check if any database has a full
text catalogue defined. If there is, look for schedule jobs that update it
Now check if the databases are backed up regularly. When was
the last time they were fully backed up? When was the last time their
transaction logs were backed up? If the answer to either of these questions is
“never”, you have a problem at hand: you need to start backing up the
Look for maintenance plans or customised jobs that back
up the databases. Check if these jobs are enabled and succeeding. If there are
no SQL jobs but databases are still showing recent backup dates then it could
be happening through a third-party tool. Check with your application owners or
Check the tempdb database. How many data files has it
got? Is the number of data files same as the number of logical CPUs shown in Task
Manager? If not, the tempdb is not optimally configured for I/O. Check the data
file sizes. Are they all of the same size? Are they big enough? It’s not
uncommon to see a high-transaction OLTP system having a 4 GB tempdb database.
Increase the data file sizes and the number of data files as necessary.
Master Database Objects
Check the master database for non-system, user-generated
objects like tables, stored procedures or functions. This is bad programming
practice but unfortunately many shops have them and you can’t do much to change
because clearing the master database would mean significant changes in application.
All you can do is to talk to your application team so they stop doing this and
consider re-architecting the application.
Now it’s time to check the physical health of the databases.
Run the DBCC CHECKDB command against each of the databases and watch for any
error messages. Sometimes the command will fail when the database has been
upgraded from a previous version. Run the DBCC UPDATEUSAGE command and then run
If there are databases failing CHECKDB with a minimum
option of REPAIR_ALLOW_DATA_LOSS, you need to talk to your application owners.
The database has physical consistency errors and you need to understand what
your client wants you to do. You can restore the database from a recent good backup
if one is available.
Check the index fragmentation in each database. You can
run a query involving dynamic performance function
(sys.dm_db_index_physical_stats) and look through the average fragmentation in
percent column for values above 30. Alternatively, run a Management Studio
Report called Index Physical Statistics. If there are tables with more than 30%
fragmentation, you need to consider rebuilding those indexes.
Database Maintenance Plans can automate some DBA tasks
like auto-updating statistics or shrinking database files. Check the instance and
see if Maintenance Plans have been created. Mainly you would want to see if scheduled
jobs are shirking database files, updating statistics, reorganizing and
rebuilding indexes, running database consistency checks and so on. If there are
no such jobs, take note; maintenance plans need to be created. Similarly, investigate
the root cause if the jobs are present but failing.
SQL Server Agent Jobs
Jobs are integral part of many SQL Server applications.
They can perform scheduled repetitive tasks and implement custom business
logic. We talked about Maintenance Plan jobs in the previous section; in some
systems these are all you would find. Other systems will have dozens of custom
jobs running in all different schedules. As a DBA, you don’t need to know what
these jobs are doing; you need to know two things only.
Firstly, what jobs are failing and what’s the root cause
of the failure? Is it because a database or stored procedure is missing? Or is
it because the job’s owner doesn’t have sufficient privileges or is it due to a
primary key violation? How long the job’s been failing? If it’s been failing
for a long time and nobody has reported any error, ask if the job can be
Secondly, check the jobs’ owners. Are these people’s Windows
AD accounts? There is a potential problem here. People’s AD accounts can have full
rights in the network and the job would successfully run as long as the account
is active. But once the person creating the job leaves the company, his account
would probably be deactivated, disabled or deleted. The job would then start to
fail. That’s why it’s best to run jobs under SQL Server Agent service account
or another application-specific system account: accounts that don’t get deleted
or locked out.
As you finish your audit of the jobs, take a quick look
and see if any operators, proxies or credentials have been defined for the
What linked servers have been created in this SQL instance?
Are they working? Can you access the remote data sources by expanding the “linked
servers” node? What security credentials are they using?
Is database mail enabled and configured in this server?
Check the database mail log for errors. Send a test e-mail from SQL to your own
e-mail address to check the functionality.
Is the server configured for high availability?
We touched the topic of checking Windows clustering error
messages from the Failover Cluster Administrator applet. What about Always-On?
Is there an availability group defined? Check the status and see if the node is
Similarly, check the status of log shipping or mirroring
if they have been configured. You can run a Management Studio report for
transaction log shipping status. Successfully mirrored databases will have a
status of “Synchronized” or “Synchronizing” written beside the database name;
you can also use the database Mirroring Monitor for this purpose.
Is the server replicating data with another system? Is it
a publisher or a subscriber? Launch the Replication Monitor and see if any
replications are failing or have expired. You need to reinitialize failed or
Other SQL Server Technologies
There can be a number of newer technologies enabled in
the instance. We will briefly touch on those.
Is Resource Governor enabled? What Resource Pools have
been defined? Can you manually evaluate them and see the results?
Similarly, is Policy-based Management enabled? Is there
any customized, non-system specific policy defined? Can you evaluate them?
Is there any server-based audits defined? What are the
specifications for those audits?
Has Management Data Warehouse (MDW) been enabled? Are the
Data Collectors running? Are the SQL Server jobs collecting data for MDW
running successfully or are they failing? Can you connect to the MDW and view
This is a topic that can take an entire article by
itself. Here are few things you can check in your SQL Server.
First of all, who are the members of the sysadmin server
role? Accounts under this role have full system-wide privileges and naturally
you want to restrict that. Ideally, only DBA accounts and perhaps some application
service accounts should be members of this role. If that’s not the case, take
note. You need to talk to your app owners about potential security threat and
plan for hardening security.
Similarly, if you have Reporting Service or Analysis
Service running in the machine, find out who are the system administrators for
Who are the database owners in each SQL database? Again,
a database owner has full access in the database it owns. Is any owner a user
account? If so, why the user needs unrestricted access to the database?
Is guest account enabled in any of the databases? Talk
with your database developers and designers to see why this has been left
Is the server running under Windows authentication or Mixed-mode
authentication? If in mixed-mode, has the sa account been configured with a password
policy? Is there a large number of SQL logins listed? Run a query to find if
they also have password polices and expiration rules enabled.
For further details on security audit, you can go tho
This is the first installation of a three-part series I
wrote for sql-server-performance.com
This is by no means an exhaustive list, but this will at
least give you a starting point. And of course each server instance is
different, so not every point described here is applicable for every server.
What you can do though is to create a generic template that
includes all the questions and checkboxes and create a new document from it
each time you run an audit. You can create such simple templates using Excel
spreadsheets or simple Word forms.
When you have finished your investigation and documented
your findings, categorize the problem areas by severity levels. You can perhaps
start with three generic levels of importance: high, medium and low. Backups
not happening for any database – that’s a critical problem. Tempdb not
configured properly – perhaps that can be classified as medium. You can also
color code the severity levels; red for critical problems, orange for medium
intensity problems and yellow for low importance ones.
If your team management is waiting for a report from your
audits, you should also create another document. This will be a high-level,
bird’s eye view of the detailed audit and preferably a presentation file or a
short Word document. Managers are busy people like you and they are interested about
only the most pressing matters. Your high-level presentation should highlight
the most critical findings, list your recommendations for resolving those
issues and also mention any risks or costs associated with the proposed solutions.