Replicating A Volume Of Large Data via Transactional Replication

During weekend
maintainence, members of the support team executed an UPDATE statement against
the database on the OLTP Server. This database was a part of Transactional
Replication and the database on the subscriber was used by the SSRS reports.
The UPDATE statement impacted around 3500000 records. Once the UPDATE
statement was executed, the Replication procedure came to a halt and the error
message shown in the View Synchronization Agent tab was
Timeout Expired. At this stage ensuring that I have around 12 hours left with
me for the users in Australia to come ONLINE, I decided to rebuild
the Replication procedure again but it was a long process. It takes around 20 hrs
to rebuild the entire procedure because the size of the database
is 140 GB and we have 100 GB of indexes on the Reporting database which
are placed on a separate SAN of RAID 10 for better performance of the SSRS
reports. Even though I rebuilt the entire procedure again, we still had an
outage for a period of around 10 hrs. I decided to work on this case in much
more depth and try to find a more efficient solution.

If a T-SQL is executed against the
publisher database which performs a large Insert/Update or a Delete operation
for e.g. in our case the number of records impacted were around 3500000,
following error can occur:

The process could not execute 'sp_replcmds' on 'xxxxxx'. 
Status: 2, code: 0, text: 'Timeout expired'
A time out occurred while waiting for memory resources to execute the query.
Agent 'xxx' is retrying after an error. 0 retries attempted. See agent job history in the Jobs folder for more details.
The step was cancelled (stopped) as the result of a stop job request
   

The Replication topology works according to
the values which have been set for the particular agent. There are 5 types of
agents which the Transactional Replication procedure uses:

  • Distribution Agents
  • Merge Agents
  • Log Reader Agents
  • Snapshot Agents
  • Queue Reader Agents

In order to view the same, please perform the following steps:

  1. Connect to SQL Server Management Studio on
    the OLTP instance
  2. Right Click on the Replication Node in the
    Object Explorer in the SQL Server Management Studio and click on Launch
    Replication Monitor, as below.

  3. Right Click on the Publisher Name and Select
    Agent Profiles, as below.

  4. You should be able to view the below screen.

  5. Each of these Agents has a Default Agent Profile
    associated with it. Let’s have a look at the Default Agent Profile for the
    Distribution Agent as shown in the screen capture below:

  6. The important here is the Query Timeout parameter which is 1800 seconds (30 minutes)
  7. For Merge Agent the Query Timeout value is 300
    seconds as shown below:

    For all the other Agent Profiles it is 1800
    seconds

Thus, if the operation associated
with these particular agents is not completed within the time specified
in the Query Timeout box, the Replication procedure will halt with the error
Query Timeout Expired and it will terminate.

In order to overcome this, instead of using
the Default Agent profile for the corresponding Replication Agent, I decided to
create a new profile for each of them. Please note that you cannot modify the
default agent profile as it is generated by the system.

I created a new profile named Satnam_Profile
for each of the Replication Agents and specified the Query Timeout value as
65533 seconds which is the maximum value, as shown below:

Once done for a particular Agent profile, I
then clicked on Change Existing Agents and then Clicked OK as shown in the
screen capture below.

 This procedure was repeated for each of
the Replication Agent profiles.

Once this change was made during the next
maintainence window when a heavy DELETE statement was executed against
the OLTP database which impacted around 1500000 records, the
following observations were made.

  • On the Publisher the deletion script
    took 7 min and 16 seconds to delete 1500000 records
  • The Log Reader Agent took around
    8-10 minutes to capture the data
  • On the subscriber, the synchronisation
    Agent took around 12-15 minutes  to replicate the data successfully without any
    issues

Therefore with the above changes
the Replication procedure worked without any issues.

This procedure will also ensure that if you
are performing an ONLINE Rebuilding and Update Statistics operation against a
very busy database the
Replication procedure will not be halted. Sometimes it has been
observed that few indexes are so big in size such that they never get rebuilt
within the specified query timeout error time as 30 minutes due to which the
data doesn’t gets delivered onto the subscriber properly.

This was one such approach which I used to
ensure that even though I am performing a huge DML operation against a
particular database, the Transactional Replication procedure shall not get
impacted at all. If there are any suggestions, feedback then please do let
us know. Alternatively you can email me on singhsatnam84@yahoo.com.




Array

No comments yet... Be the first to leave a reply!