How to Delete SQL Server Database Duplicate Rows from a Table Using a Specified Column List and No Temp Tables

As much as we try and prevent it, duplicate data still finds its way into corporate data systems, but the good news is that purging duplicate rows can be a relatively simple task.

Prevention, through the use of good match code algorithms and constraints, will stop most duplicate data from entering your databases, but there are times in which duplicate data will still land in your data tables. Merging data from disparate systems is one way that duplicate data can enter a table. Inheriting a poorly designed database is another way in which you may receive duplicate data. (This author has inherited several poorly data systems.) The good news is that there is a relatively easy method of resolving the problem, but first a bit of history.

Years ago I worked for Microrim Incorporated, makers of R:BASE, a desktop relational database system. Among its many unique features, R:BASE had an extension to the SQL DELETE syntax as follows: DELETE DUPLICATES FROM tablename. This syntax would delete all duplicate rows from a table, leaving only one instance of each row. The operation was thorough, but very slow because the database engine had to compare every column value for a given row with every column value for every other row in the table. To help improve performance, an enhancement was later added to the DELETE syntax: DELETE DUPLICATES FROM tablename USING columnlist.

There are many obvious benefits gained by allowing developers to specify the list of columns to use for duplicate comparison, performance being the greatest. For example, suppose you have a table called CustomerTable and you determine that identical data in the columns CompanyName, FirstName, LastName and Address1 constitutes a duplicate.

Build a multi-column index using these four columns and then run the delete command: DELETE DUPLICATES FROM CustomerTable USING CompanyName, FirstName, LastName, Address1. When I started working with SQL Server I discovered, much to my dismay, that it had no comparable feature. We can, however, simulate the same action using some TRANSACT-SQL.

This script uses a CURSOR in conjunction with a GROUP BY colname HAVING COUNT(*) clause to build the list of duplicates. Choosing the list of columns to compare for duplicates is all that is required. We will use the same column list mentioned in the previous example: CompanyName, FirstName, LastName and Address1. In other words, if the combined value of CompanyName, FirstName, LastName and Address is identical in two or more rows, remove all but one copy of the row. I have included the COUNT function in the select list to capture the number of duplicates for a given row set. We will need this value for the DELETE clause. Here is our complete DECLARE CURSOR statement:

DECLARE DelDupe CURSOR FOR SELECT COUNT(*) AS Amount, vchFirstName, vchLastName, vchAddress1 FROM TestTable GROUP BY vchFirstName,vchLastName,vchAddress1 HAVING COUNT(*) > 1

Now that we’ve defined the duplicate row criteria and built the data set we’ll use the ROWCOUNT function and build a dynamic SQL DELETE statement.

As a reminder, our goal here is to retain one instance of each duplicate row, removing all other occurrences. The count function was included to capture the number of times a given data set repeated. If we subtract 1 from the count and then set the ROWCOUNT to this value then we will have the number of rows to delete for a given set of duplicates. We next need to convert our ROWCOUNT value to a character data type so that we can build it into our dynamic SQL statement. Here is the code to do that:

SELECT @iCount = @iCount – 1 SELECT @chCount = CONVERT(char(3),@iCount)

Now we can build our dynamic ROWCOUNT and DELETE statement:

SELECT @nvchCommand = N’SET ROWCOUNT ‘ + @chCount + ‘DELETE TestTable ‘ + ‘ WHERE vchCompanyName = ‘ + CHAR(34) + @vchCompanyName + CHAR(34) + ‘ AND vchFirstName = ‘ + CHAR(34) + @vchFirstName + CHAR(34) + ‘ AND vchLastName = ‘ + CHAR (34) + @vchLastName + CHAR(34) + ‘ AND vchAddress1 = ‘ + CHAR(34) + @vchAddress1 + CHAR(34)

And finally we can execute the statement:

EXEC sp_executesql @nvchCommand

The rest of the script is standard error checking and all the necessary code for looping through the CURSOR.

Continues…

Leave a comment Cancel reply