Replicating A Volume Of Large Data via Transactional Replication

During weekend maintainence, members of the support team executed an UPDATE statement against the database on the OLTP Server. This database was a part of Transactional Replication and the database on the subscriber was used by the SSRS reports. The UPDATE statement impacted around 3500000 records. Once the UPDATE statement was executed, the Replication procedure came to a halt and the error message shown in the View Synchronization Agent tab was Timeout Expired. At this stage ensuring that I have around 12 hours left with me for the users in Australia to come ONLINE, I decided to rebuild the Replication procedure again but it was a long process. It takes around 20 hrs to rebuild the entire procedure because the size of the database is 140 GB and we have 100 GB of indexes on the Reporting database which are placed on a separate SAN of RAID 10 for better performance of the SSRS reports. Even though I rebuilt the entire procedure again, we still had an outage for a period of around 10 hrs. I decided to work on this case in much more depth and try to find a more efficient solution.

If a T-SQL is executed against the publisher database which performs a large Insert/Update or a Delete operation for e.g. in our case the number of records impacted were around 3500000, following error can occur:

The process could not execute 'sp_replcmds' on 'xxxxxx'.
Status: 2, code: 0, text: 'Timeout expired'
A time out occurred while waiting for memory resources to execute the query.
Agent 'xxx' is retrying after an error. 0 retries attempted. See agent job history in the Jobs folder for more details.
The step was cancelled (stopped) as the result of a stop job request
   

The Replication topology works according to the values which have been set for the particular agent. There are 5 types of agents which the Transactional Replication procedure uses:

  • Distribution Agents
  • Merge Agents
  • Log Reader Agents
  • Snapshot Agents
  • Queue Reader Agents

In order to view the same, please perform the following steps:

  1. Connect to SQL Server Management Studio on the OLTP instance
  2. Right Click on the Replication Node in the Object Explorer in the SQL Server Management Studio and click on Launch Replication Monitor, as below.

  3. Right Click on the Publisher Name and Select Agent Profiles, as below.

  4. You should be able to view the below screen.

  5. Each of these Agents has a Default Agent Profile associated with it. Let’s have a look at the Default Agent Profile for the Distribution Agent as shown in the screen capture below:

  6. The important here is the Query Timeout parameter which is 1800 seconds (30 minutes)
  7. For Merge Agent the Query Timeout value is 300 seconds as shown below:


    For all the other Agent Profiles it is 1800 seconds

Thus, if the operation associated with these particular agents is not completed within the time specified in the Query Timeout box, the Replication procedure will halt with the error Query Timeout Expired and it will terminate.

In order to overcome this, instead of using the Default Agent profile for the corresponding Replication Agent, I decided to create a new profile for each of them. Please note that you cannot modify the default agent profile as it is generated by the system.

I created a new profile named Satnam_Profile for each of the Replication Agents and specified the Query Timeout value as 65533 seconds which is the maximum value, as shown below:

Once done for a particular Agent profile, I then clicked on Change Existing Agents and then Clicked OK as shown in the screen capture below.

 This procedure was repeated for each of the Replication Agent profiles.

Once this change was made during the next maintainence window when a heavy DELETE statement was executed against the OLTP database which impacted around 1500000 records, the following observations were made.

  • On the Publisher the deletion script took 7 min and 16 seconds to delete 1500000 records
  • The Log Reader Agent took around 8-10 minutes to capture the data
  • On the subscriber, the synchronisation Agent took around 12-15 minutes  to replicate the data successfully without any issues

Therefore with the above changes the Replication procedure worked without any issues.

This procedure will also ensure that if you are performing an ONLINE Rebuilding and Update Statistics operation against a very busy database the Replication procedure will not be halted. Sometimes it has been observed that few indexes are so big in size such that they never get rebuilt within the specified query timeout error time as 30 minutes due to which the data doesn’t gets delivered onto the subscriber properly.

This was one such approach which I used to ensure that even though I am performing a huge DML operation against a particular database, the Transactional Replication procedure shall not get impacted at all. If there are any suggestions, feedback then please do let us know. Alternatively you can email me on singhsatnam84@yahoo.com.

]]>

Leave a comment

Your email address will not be published.