Monitoring Transactional Replication – The Distribution Queue

Note : The scripts for this article can be downloaded here.

Monitoring the distribution queue is an important link in the chain of transactions delivery over a replicated topology. A chain that includes the log reader agent scanning the published database’s transaction log for commands to be moved to the distribution database from which a distribution agent delivers the commands to the subscription database typically located on the subscriber server.

In an upcoming articles I will be examining other key links in that chain but for now let’s focus on the distribution database and see what information we can get there and how we can use that information for monitoring transactional replication.

The bellow SQL statement returns valuable information on the current status of the distribution database in terms of commands delivery. It allows us to see the number of commands delivered and the number of commands not yet delivered at the article level per each subscriber.

USE distribution;
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT DISTINCT
       @@SERVERNAME
      –,a.article_id
      ,a.Article
      ,p.Publication
      ,SUBSTRING(agents.[name], 16, 35) AS [Name]
      –,s.agent_id
      ,s.UndelivCmdsInDistDB
      ,s.DelivCmdsInDistDB
–,UndelivCmdsInDistDB + DelivCmdsInDistDB  AS TaotalTrans
FROM dbo.MSdistribution_status AS s
INNER JOIN dbo.MSdistribution_agents AS agents ON agents.[id] = s.agent_id
INNER JOIN dbo.MSpublications AS p ON p.publication = agents.publication
INNER JOIN dbo.MSarticles AS a ON a.article_id = s.article_id and p.publication_id = a.publication_id
WHERE 1=1 AND s.UndelivCmdsInDistDB <> 0 AND agents.subscriber_db NOT LIKE ‘virtual’
–AND p.Publisher_db = ”
–AND a.Article LIKE  ”
–AND p.Publication IN (”)
–AND s.agent_id NOT IN ()
ORDER BY s.UndelivCmdsInDistDB DESC
OPTION(RECOMPILE);

This allows us to see if a specific article is slow on delivery per a given subscriber or to all subscribers (if are more than one subscriber defined).

The MonitorReplicationArticleCommands stored procedure is built on top of the above SELECT statement and is designed to be executed by a monitoring tool.

The procedure accepts three input parameters: @article, @publication and @threshold and returns the number of commands that had crossed the predefined threshold for the specific article and had not yet been delivered to the subscriber.

Before implementing the procedure by your monitoring tool it is recommended that you are familiar with the environment so that false alarms are minimized.

I recommend starting with a single article that is highly active, that is an article that gets frequent modifications (INSERT, UPDATE or DELETE). This allows you to come up with a figure that can be used as a threshold indicating the number of commands that you do not expect to be reached at the distribution queue and having reached that threshold you will know something is not working as expected.

If you find it hard to come up with a correct representing figure I suggest you gather some statistics first by running the procedure in the form of INSERT…EXEC every 5 minutes or so for a week or two and save the results to a table which will allow you to see the MIN, MAX and AVG number of transactions in the queue and also you will be able to see if the pick numbers are related to a specific time in the day.

Once you figure out your correct threshold you can execute the stored procedure every 5 minutes or so from your monitoring tool while passing the @threshold parameter a value that you do not expect to be reached as follows:

EXEC distribution.dbo.MonitorReplicationArticleCommands @article = N’Spot’, @publication = N’FX’, @threshold = 3000;

The above example is based on the fact that under normal conditions the number of commands in the queue for the given article is around 1000. Since the article is frequently modified If there is a problem it is only a matter of a short time until the threshold is crossed and some action is triggered. I do not use a lower threshold to avoid false alarms.

Having done that you now have the infrastructure needed to be notified if a specific article has a queue in the distribution database exceeding your predefined threshold.

If the procedure does not return an empty result set a notification email should be sent and you should perform a manual check to understand the reason for the growth in queue or alternately, automate that check by executing a set of actions from your monitoring tool.

There are various reasons for the queue to grow and I may delve into that in greater depth in the future but for now I will mention the most common reasons I think of at this time.

1. A wait for resource (a blocking lock) at the subscriber slows down or prevents the commands to be applied at the subscription database.

2. The network connectivity between the distributor and the subscriber is slow or down for some time.

3. A high number of DML commands applied at the publisher within a short time period.

Note that even if you publish several articles but only monitor the most suitable candidate in this way, many of the problems detected via this article can be relevant for the entire publication. In the above three reasons that I mention, number 2 is a common source of failure for all articles.

To support this query and other queries I added several indexes on tables MSdistribution_history, MSpublications and MSarticles in the distribution database (see attached script Distribution_Indexes.sql).

Note that I do not add indexes on MSrepl_commands table even though we can sure benefit from some indexes on that table while reading data but these typically lead to deadlocks when modifying data.

In addition to our discussion so far I provide a stored procedure named MonitorReplicationCommands (built on the code used in sp_tablesize) that returns the number of rows and size in MB of a any given table that is passed as a parameter to the @table input variable received by the procedure.

It allows you to monitor MSrepl_commands table which is the primary table along with MSrepl_transactions that serve as the distribution queue for commands to be delivered to subscribers.

Monitoring the size of this table provides a more generic view on the queue and can be used in the same manner as the procedure MonitorReplicationArticleCommands but at the entire queue level instead of the table level.

]]>

Leave a comment

Your email address will not be published.