Replicating A Volume Of Large Data via Transactional Replication
maintainence, members of the support team executed an UPDATE statement against
the database on the OLTP Server. This database was a part of Transactional
Replication and the database on the subscriber was used by the SSRS reports.
The UPDATE statement impacted around 3500000 records. Once the UPDATE
statement was executed, the Replication procedure came to a halt and the error
message shown in the View Synchronization Agent tab was
Timeout Expired. At this stage ensuring that I have around 12 hours left with
me for the users in Australia to come ONLINE, I decided to rebuild
the Replication procedure again but it was a long process. It takes around 20 hrs
to rebuild the entire procedure because the size of the database
is 140 GB and we have 100 GB of indexes on the Reporting database which
are placed on a separate SAN of RAID 10 for better performance of the SSRS
reports. Even though I rebuilt the entire procedure again, we still had an
outage for a period of around 10 hrs. I decided to work on this case in much
more depth and try to find a more efficient solution.
If a T-SQL is executed against the publisher database which performs a large Insert/Update or a Delete operation for e.g. in our case the number of records impacted were around 3500000, following error can occur:
The process could not execute 'sp_replcmds' on 'xxxxxx'. Status: 2, code: 0, text: 'Timeout expired' A time out occurred while waiting for memory resources to execute the query. Agent 'xxx' is retrying after an error. 0 retries attempted. See agent job history in the Jobs folder for more details. The step was cancelled (stopped) as the result of a stop job request
The Replication topology works according to the values which have been set for the particular agent. There are 5 types of agents which the Transactional Replication procedure uses:
- Distribution Agents
- Merge Agents
- Log Reader Agents
- Snapshot Agents
- Queue Reader Agents
In order to view the same, please perform the following steps:
- Connect to SQL Server Management Studio on the OLTP instance
- Right Click on the Replication Node in the
Object Explorer in the SQL Server Management Studio and click on Launch
Replication Monitor, as below.
- Right Click on the Publisher Name and Select
Agent Profiles, as below.
- You should be able to view the below screen.
- Each of these Agents has a Default Agent Profile
associated with it. Let’s have a look at the Default Agent Profile for the
Distribution Agent as shown in the screen capture below:
- The important here is the Query Timeout parameter which is 1800 seconds (30 minutes)
- For Merge Agent the Query Timeout value is 300
seconds as shown below:
For all the other Agent Profiles it is 1800 seconds
Thus, if the operation associated
with these particular agents is not completed within the time specified
in the Query Timeout box, the Replication procedure will halt with the error
Query Timeout Expired and it will terminate.
In order to overcome this, instead of using the Default Agent profile for the corresponding Replication Agent, I decided to create a new profile for each of them. Please note that you cannot modify the default agent profile as it is generated by the system.
I created a new profile named Satnam_Profile for each of the Replication Agents and specified the Query Timeout value as 65533 seconds which is the maximum value, as shown below:
Once done for a particular Agent profile, I then clicked on Change Existing Agents and then Clicked OK as shown in the screen capture below.
This procedure was repeated for each of the Replication Agent profiles.
Once this change was made during the next maintainence window when a heavy DELETE statement was executed against the OLTP database which impacted around 1500000 records, the following observations were made.
- On the Publisher the deletion script took 7 min and 16 seconds to delete 1500000 records
- The Log Reader Agent took around 8-10 minutes to capture the data
- On the subscriber, the synchronisation Agent took around 12-15 minutes to replicate the data successfully without any issues
Therefore with the above changes the Replication procedure worked without any issues.
This procedure will also ensure that if you are performing an ONLINE Rebuilding and Update Statistics operation against a very busy database the Replication procedure will not be halted. Sometimes it has been observed that few indexes are so big in size such that they never get rebuilt within the specified query timeout error time as 30 minutes due to which the data doesn’t gets delivered onto the subscriber properly.
This was one such approach which I used to ensure that even though I am performing a huge DML operation against a particular database, the Transactional Replication procedure shall not get impacted at all. If there are any suggestions, feedback then please do let us know. Alternatively you can email me on email@example.com.