I had tried to run a delete on a table base on rows in another for archiving but the statement ran for over 3 hours and eventually failed because ran out of space in trans log. I am deleting 13.5 million rows from a table that has 22 million total row, 48 columns wide. This is the statement used: delete from <database>.dbo.<table> where id in (select id from <database_arch>.dbo.<table>) when I select on the same criteria ( select * from <database>.dbo.<table> where id in (select id from <database_arch>.dbo.<table>)) the results are returned in 22 and a half min. The size of the data being deleted is 4.6GB but when i was doing the delete the transaction log grew to over 60GB before running out of space. How much overhead is there for statements written to the trans log? I suppose I could change the recovery model to simple to avoid the translog running out of space but why would it grow that big to begin with? And why did it take so long? Would using join instead of subquery increase performance? Any comment to help me understand what happened would be great. Thanks.
That is what happened when you use full recovery model. Change to simple to do that and back to full after delete.
Deletion in batch / recovery modes / etc as stated are useful... mass deletions suck since they're record by record. Another tip: drop the indexes on the table beforehand so the DELETE doesn't have to update those... you'd pretty much have to rebuild them afterwards anyway so you're not losing much. Alternatively, use a SELECT INTO to a scratch table to save the rows you do want to keep, truncate or outright drop the original one, and copy the data back / rename, and finally recreate the indexes.
I don't think using a join would impact greatly (if at all) on the performance of your delete operation. Check the estimated execution plans for the subquery or join. You might be surprised to find that they are not very different. After the temporary change to a simple recovery model, It would help to do smaller delete operations on the table. Try something like: CREATE PROC DeleteTable @MaxID int AS DELETE from <database>.dbo.<table> WHERE id in (select id from <database_arch>.dbo.<table> WHERE id<@MAxID) RETURN You would then call the procedure many times over with increasing values of the @MaxID parameter: EXEC DeleteTable 500000 GO EXEC DeleteTable 1000000 GO EXEC DeleteTable 1500000 GO EXEC DeleteTable 2000000 GO
Do not change recovery model! Do not drop indexes! Deleting by batches if best one to go and intermittent transaction log backup will also ensure the transaction log size, in this case you coudl run this as a scheduled job for better performance as running from a client connection will have network latency. If you drop indexes and run the delete it will run quickly, but think about recreating the same index on the number of rows and during this time the table(s) will not be available to the application and you must ensure to reindex once the process is finished in order to cache the best plan. (Nathan, good to see you here again)
Thanks for all the suggestions. What I have decided with is to do batch deletions of 10000 at a time stopping every million to do trans log backup. would the following query be sufficient? declare @i int set @i = 0 while @i<1000000 BEGIN delete top (10000) from <database>.dbo.<table> where id in (select id from <database_arch>.dbo.<table>) IF @@rowcount < 10000 BREAK set @i = @i +10000 END
Sounds good, if the current transaction log can cope up such load, use DBCC SQLPERF(LOGSPACE) to see how the space is utilized on log side.