Monitoring Transactional Replication – The Distribution Queue
Note : The scripts for this article can be downloaded
the distribution queue is an important link in the chain of transactions
delivery over a replicated topology. A chain that includes the log reader agent
scanning the published database’s transaction log for commands to be moved to
the distribution database from which a distribution agent delivers the commands
to the subscription database typically located on the subscriber server.
upcoming articles I will be examining other key links in that chain but
for now let’s focus on the distribution database and see what information we
can get there and how we can use that information for monitoring transactional
The bellow SQL
statement returns valuable information on the current status of the distribution
database in terms of commands delivery. It allows us to see the number of commands
delivered and the number of commands not yet delivered at the article level per
USE distribution; SET NOCOUNT ON; SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; SELECT DISTINCT @@SERVERNAME –,a.article_id ,a.Article ,p.Publication ,SUBSTRING(agents.[name], 16, 35) AS [Name] –,s.agent_id ,s.UndelivCmdsInDistDB ,s.DelivCmdsInDistDB –,UndelivCmdsInDistDB + DelivCmdsInDistDB AS TaotalTrans FROM dbo.MSdistribution_status AS s INNER JOIN dbo.MSdistribution_agents AS agents ON agents.[id] = s.agent_id INNER JOIN dbo.MSpublications AS p ON p.publication = agents.publication INNER JOIN dbo.MSarticles AS a ON a.article_id = s.article_id and p.publication_id = a.publication_id WHERE 1=1 AND s.UndelivCmdsInDistDB <> 0 AND agents.subscriber_db NOT LIKE ‘virtual’ –AND p.Publisher_db = ” –AND a.Article LIKE ” –AND p.Publication IN (”) –AND s.agent_id NOT IN () ORDER BY s.UndelivCmdsInDistDB DESC OPTION(RECOMPILE);
This allows us to see if a specific article is slow on delivery per a
given subscriber or to all subscribers (if are more than one subscriber defined).
stored procedure is built on top of the above SELECT statement and is designed
to be executed by a monitoring tool.
procedure accepts three input parameters: @article, @publication and @threshold
and returns the number of commands that had crossed the predefined threshold for
the specific article and had not yet been delivered to the subscriber.
the procedure by your monitoring tool it is recommended that you are familiar
with the environment so that false alarms are minimized.
starting with a single article that is highly active, that is an article that gets
frequent modifications (INSERT, UPDATE or DELETE). This allows you to come up
with a figure that can be used as a threshold indicating the number of commands
that you do not expect to be reached at the distribution queue and having
reached that threshold you will know something is not working as expected.
If you find
it hard to come up with a correct representing figure I suggest you gather some
statistics first by running the procedure in the form of INSERT…EXEC every 5
minutes or so for a week or two and save the results to a table which will
allow you to see the MIN, MAX and AVG number of transactions in the queue and
also you will be able to see if the pick numbers are related to a specific time
in the day.
figure out your correct threshold you can execute the stored procedure every 5
minutes or so from your monitoring tool while passing the @threshold parameter
a value that you do not expect to be reached as follows:
EXEC distribution.dbo.MonitorReplicationArticleCommands @article = N’Spot’, @publication = N’FX’, @threshold = 3000;
example is based on the fact that under normal conditions the number of commands
in the queue for the given article is around 1000. Since the article is
frequently modified If there is a problem it is only a matter of a short time
until the threshold is crossed and some action is triggered. I do not use a
lower threshold to avoid false alarms.
that you now have the infrastructure needed to be notified if a specific
article has a queue in the distribution database exceeding your predefined
procedure does not return an empty result set a notification email should be
sent and you should perform a manual check to understand the reason for the
growth in queue or alternately, automate that check by executing a set of
actions from your monitoring tool.
There are various
reasons for the queue to grow and I may delve into that in greater depth in the
future but for now I will mention the most common reasons I think of at this
1. A wait
for resource (a blocking lock) at the subscriber slows down or prevents the
commands to be applied at the subscription database.
network connectivity between the distributor and the subscriber is slow or down
for some time.
3. A high
number of DML commands applied at the publisher within a short time period.
even if you publish several articles but only monitor the most suitable
candidate in this way, many of the problems detected via this article can be
relevant for the entire publication. In the above three reasons that I mention,
number 2 is a common source of failure for all articles.
this query and other queries I added several indexes on tables MSdistribution_history,
MSpublications and MSarticles in the distribution database (see
attached script Distribution_Indexes.sql).
Note that I
do not add indexes on MSrepl_commands table even though we can sure
benefit from some indexes on that table while reading data but these typically
lead to deadlocks when modifying data.
to our discussion so far I provide a stored procedure named MonitorReplicationCommands
(built on the code used in sp_tablesize)
that returns the number of rows and size in MB of a any given table that is
passed as a parameter to the @table input variable received by the procedure.
you to monitor MSrepl_commands table which is the primary table along
with MSrepl_transactions that serve as the distribution queue for commands to
be delivered to subscribers.
the size of this table provides a more generic view on the queue and can be
used in the same manner as the procedure MonitorReplicationArticleCommands
but at the entire queue level instead of the table level.