Replicating A Large Volume Of Data via Transactional Replication

During weekend maintenance, members of the support team executed an UPDATE statement against the database on the OLTP server. This database was part of a Transactional Replication setup, and the database on the subscriber was used by the SSRS reports. The UPDATE statement impacted around 3,500,000 records. Once the UPDATE statement was executed, the replication process came to a halt, and the error message shown in the View Synchronization Agent tab was "Timeout Expired".

At that stage, with around 12 hours left before the users in Australia came online, I decided to rebuild the replication setup, but that is a long process: it takes around 20 hours because the database is 140 GB in size and we have 100 GB of indexes on the reporting database, which are placed on a separate RAID 10 SAN for better performance of the SSRS reports. Even though I rebuilt the entire setup, we still had an outage of around 10 hours. I therefore decided to look into this case in much more depth and find a more efficient solution.

If a T-SQL statement that performs a large INSERT, UPDATE or DELETE operation is executed against the publisher database (in our case the number of records impacted was around 3,500,000), the following error can occur:

The process could not execute 'sp_replcmds' on 'xxxxxx'.
Status: 2, code: 0, text: 'Timeout expired'
A time out occurred while waiting for memory resources to execute the query.
Agent 'xxx' is retrying after an error. 0 retries attempted. See agent job history in the Jobs folder for more details.
The step was cancelled (stopped) as the result of a stop job request
   

The replication topology works according to the values that have been set for the particular agent's profile. There are five types of replication agents:

  • Distribution Agents
  • Merge Agents
  • Log Reader Agents
  • Snapshot Agents
  • Queue Reader Agents

To view these profiles, perform the following steps:

  1. Connect to SQL Server Management Studio on the OLTP instance
  2. Right-click the Replication node in Object Explorer and click Launch Replication Monitor, as shown below.

  3. Right-click the publisher name and select Agent Profiles, as shown below.

  4. You should be able to view the below screen.

  5. Each of these Agents has a Default Agent Profile associated with it. Let’s have a look at the Default Agent Profile for the Distribution Agent as shown in the screen capture below:

  6. The important parameter here is Query Timeout, which is set to 1800 seconds (30 minutes).
  7. For the Merge Agent, the Query Timeout value is 300 seconds, as shown below:


    For all the other Agent Profiles it is 1800 seconds

Thus, if the operation associated with a particular agent does not complete within the time specified in the Query Timeout box, the agent terminates and the replication process halts with the error "Query timeout expired".
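
The same values can also be inspected with T-SQL on the Distributor. Below is a minimal sketch, assuming the standard replication metadata tables in msdb (MSagent_profiles and MSagent_parameters), which lists the Query Timeout value held by every agent profile:

    -- Run on the Distributor: list the QueryTimeout setting of every agent profile.
    -- agent_type: 1 = Snapshot, 2 = Log Reader, 3 = Distribution, 4 = Merge, 9 = Queue Reader
    SELECT  p.profile_name,
            p.agent_type,
            pa.value AS query_timeout_seconds
    FROM    msdb.dbo.MSagent_profiles   AS p
    JOIN    msdb.dbo.MSagent_parameters AS pa
            ON pa.profile_id = p.profile_id
    WHERE   pa.parameter_name = N'QueryTimeout'
    ORDER BY p.agent_type, p.profile_name;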

To overcome this, instead of using the default agent profile for each replication agent, I decided to create a new profile for each of them. Please note that you cannot modify a default agent profile, as it is generated by the system.

I created a new profile named Satnam_Profile for each of the replication agents and specified the Query Timeout value as 65533 seconds, which is the maximum value, as shown below:

Once done for a particular agent profile, I selected Change Existing Agents and then clicked OK, as shown in the screen capture below.

 This procedure was repeated for each of the Replication Agent profiles.
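
The same change can also be scripted rather than made through the dialog. The sketch below uses the documented sp_add_agent_profile and sp_add_agent_parameter procedures on the Distributor; the profile name and the 65533-second value follow the article, everything else is illustrative:

    -- Run on the Distributor (msdb). Creates a custom Distribution Agent profile
    -- with a larger QueryTimeout; repeat with @agent_type = 1, 2, 4 and 9 for the other agents.
    DECLARE @profile_id INT;

    EXEC msdb.dbo.sp_add_agent_profile
            @profile_id   = @profile_id OUTPUT,
            @profile_name = N'Satnam_Profile',
            @agent_type   = 3,      -- 3 = Distribution Agent
            @description  = N'Custom profile with a larger query timeout';

    EXEC msdb.dbo.sp_add_agent_parameter
            @profile_id      = @profile_id,
            @parameter_name  = N'QueryTimeout',
            @parameter_value = N'65533';

Assigning the new profile to the existing agents was still done through the Change Existing Agents option described above.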

Once this change was made, a heavy DELETE statement impacting around 1,500,000 records was executed against the OLTP database during the next maintenance window, and the following observations were made:

  • On the publisher, the deletion script took 7 minutes and 16 seconds to delete the 1,500,000 records
  • The Log Reader Agent took around 8-10 minutes to capture the data
  • On the subscriber, the synchronisation agent took around 12-15 minutes to replicate the data successfully without any issues

Therefore, with the above changes, the replication process worked without any issues.
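
One way to verify latency figures like these, which I did not use at the time but which is a standard technique, is to post a tracer token against the publication (the publication name below is hypothetical) and read the resulting latency from the Tracer Tokens tab in Replication Monitor:

    -- Run in the published database on the publisher; the publication name is made up.
    DECLARE @token_id INT;

    EXEC sp_posttracertoken
            @publication      = N'Pub_OLTP',
            @tracer_token_id  = @token_id OUTPUT;

    -- Publisher-to-distributor and distributor-to-subscriber latency for this token
    -- can then be viewed on the Tracer Tokens tab in Replication Monitor.
    SELECT @token_id AS posted_tracer_token_id;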

This change will also ensure that the replication process is not halted when you perform an online index rebuild or an UPDATE STATISTICS operation against a very busy database. It has sometimes been observed that some indexes are so large that they never finish rebuilding within the specified 30-minute query timeout, because of which the data does not get delivered to the subscriber properly.
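
For reference, a long-running maintenance command of this kind looks roughly like the sketch below (the table and index names are invented for illustration); it is exactly this sort of statement on a published table that can run past the default 1800-second limit:

    -- Hypothetical table and index names. An online rebuild keeps the table available,
    -- but on a very large index it can easily exceed the default 1800-second QueryTimeout.
    ALTER INDEX IX_Orders_OrderDate
        ON dbo.Orders
        REBUILD WITH (ONLINE = ON);

    UPDATE STATISTICS dbo.Orders WITH FULLSCAN;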

This was one approach I used to ensure that even though I am performing a huge DML operation against a particular database, the Transactional Replication procedure shall not get impacted at all. If you have any suggestions or feedback, please do let us know. Alternatively, you can email me at singhsatnam84@yahoo.com.





5 Responses to “Replicating A Large Volume Of Data via Transactional Replication”

  1. from 10 hours outage to 30 mins
    Awesome ..
    Too good way..

  2. A consideration for the future is to add stored procedures which do the updates to the publication, and use those. It doesn't matter how many rows the proc updates; only the command exec sp_????? is replicated.

    Inserts are generally not as much of an issue, as the distribution agent can be configured with -CommitBatchThreshold and -CommitBatchSize to reduce or increase the number of commands which are replicated at once.

    There is also the relatively hidden parameter -SubscriptionStreams which can improve delivery.

  3. I think I should also point out that in most cases I don’t agree with this statement.
    “huge DML operation against a particular database, the Transactional Replication procedure shall not get impacted at all”

    It's true that replication would continue to function in a fashion, but it will massively increase latency, changing what is effectively near real-time replication into a 20-minute delay. Any subsequent commands cannot be delivered until this batch update has been applied to the subscriber.

  4. Satnam,

    A better way to do this, if it is a repetitive task, is to have the update statement written in a stored proc and replicate the stored procedure with its execution; this will not impact anything in replication.
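
For readers unfamiliar with the technique suggested in the two comments above, a minimal sketch follows. The publication and procedure names are made up; publishing the procedure as a 'proc exec' article means only the single EXEC call is replicated, not the individual row changes it makes:

    -- Run in the published database; all names are hypothetical.
    -- With @type = N'proc exec', the execution of the procedure is replicated,
    -- so a 3,500,000-row UPDATE travels to the subscriber as one command.
    EXEC sp_addarticle
            @publication   = N'Pub_OLTP',
            @article       = N'usp_ArchiveOrders',
            @source_owner  = N'dbo',
            @source_object = N'usp_ArchiveOrders',
            @type          = N'proc exec';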

  5. Is a delete of 1,500,000 rows equivalent to an update of 3,500,000 rows?

    If not, then the times wouldn't be comparable.

    It would be interesting to know how long the update of the same 3,500,000 rows took under the new scenario.

    Increasing the QueryTimeOut doesn’t really cut down the latency of large transactions. Oh, it does take care of your timeouts, so it can let things run. (see: http://www.replicationanswers.com/TransactionsInTRSQL2005.asp )

    There are all kinds of settings, for example the SubscriptionStreams setting, that you could try to see if adjusting them improves performance in the distributor-to-subscriber traffic.

    There are things like using partitioning with replication, too. Done correctly, this could allow you to effectively truncate a large table via partition switching. Much faster than a delete.
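
As a rough illustration of the partition-switching idea mentioned above (all object names are invented; the staging table must have an identical structure and sit on the same filegroup as the partition being switched out, and on a published table the publication also has to allow partition switching), the "delete" becomes a metadata-only switch followed by a truncate:

    -- Hypothetical objects: dbo.Orders is partitioned, dbo.Orders_Staging matches its structure.
    -- The switch is a metadata-only operation, so it is near-instant regardless of row count.
    ALTER TABLE dbo.Orders
        SWITCH PARTITION 3 TO dbo.Orders_Staging;

    TRUNCATE TABLE dbo.Orders_Staging;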
