SQL Server Performance

Script that generates delete duplications query.

Discussion in 'Contribute Your Performance and Clustering Tips' started by bambola, Jul 9, 2003.

  1. bambola New Member

    This script will generate a query to delete duplications. Currently it is working on table with an IDENTITY column, but it should work with any other unique value. @uniqueColumns is the column that should be unique (would change it to accept a list of columns)


    DECLARE
    @dbName sysname
    ,@tableName sysname
    ,@uniqueColumns varchar(1000)

    SELECT
    @dbName = 'northwind'
    ,@tableName = 'categories'
    ,@uniqueColumns = 'categoryname'

    SET NOCOUNT ON

    -- Checking Database name parameter
    IF NOT EXISTS
    (
    SELECT NULL
    FROM INFORMATION_SCHEMA.SCHEMATA
    WHERE CATALOG_NAME NOT IN ('master','model','msdb','tempdb')
    AND CATALOG_NAME = @dbName
    )
    BEGIN
    PRINT 'Database "' + @dbname + '" does not exist on this server.' + CHAR(13)
    END

    DECLARE @tableID int
    SELECT @tableID = OBJECT_ID(@tableName)

    -- Checking table name parameter
    IF NOT EXISTS
    (
    SELECT NULL
    FROM INFORMATION_SCHEMA.TABLES
    WHERE OBJECTPROPERTY(@tableID,N'IsUserTable') = 1
    AND OBJECTPROPERTY(@tableID, N'IsMSShipped') = 0
    AND TABLE_NAME = @tableName
    )
    BEGIN
    PRINT 'A user table "' + UPPER(@tableName) + '" does not exist in database "'
    + UPPER(@dbName) + '".' + CHAR(13)
    END
    DECLARE @IdentityColumn sysname
    ,@ColumnList varchar(4000)

    SELECT @IdentityColumn = COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_CATALOG = @dbName
    AND TABLE_NAME = @tableName
    AND DATA_TYPE = 'INT'
    AND COLUMNPROPERTY(@tableId,COLUMN_NAME,'IsIdentity') = 1

    IF @@ROWCOUNT = 0
    BEGIN
    PRINT 'Table "' + @tableName + '" has no indetity column.'
    END

    SELECT @ColumnList = COALESCE(@ColumnList + ', ','') + COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_CATALOG = @dbName
    AND TABLE_NAME = @tableName

    DECLARE @SqlString varchar(8000)
    SET @SqlString = ''
    SELECT @SqlString =
    '
    SELECT ' + @columnList + '
    FROM ' + @tableName + '
    WHERE EXISTS
    (
    SELECT NULL
    FROM ' + @tableName + ' a WITH (NOLOCK)
    WHERE ' + @tableName + '.' + @uniqueColumns + ' = ' + ' a.' + @uniqueColumns + '
    AND ' + @tableName + '.' + @IdentityColumn + ' < a.' + @IdentityColumn + '
    )


    -- To delete replace the first 2 lines with:
    -- DELETE FROM ' + @tableName + '

    -- To select all NON dulpicated rows change WHERE EXISTS to
    -- WHERE NOT EXISTS
    '

    PRINT @SqlString
    EDIT: as suggested by Adriaan to remove the HAVING clause.
    Bambola.
  2. gaurav_bindlish New Member

    The string can be formed in this way as well...

    SET ROWCOUNT 1

    SELECT @SqlString =
    '
    DELETE
    FROM ' + @tableName + '
    WHERE COUNT(*) > 1
    GROUP BY ' + @tableName + '.' + @uniqueColumns

    SET ROWCOUNT 0
    Since rowcount will delete just 1 row, so this will be useful when there will be only one duplicate row. Else the script will need to execute over and over again.

    Gaurav
  3. bambola New Member

    I think not.
    Did you try to run it? It should give you an error about not being able to use
    agg. function in the where clause.
    And I did not understand the use of ROWCOUNT 1. Basically the way it is written,
    the query will simply delete one row from the table. Not one for each duplicated row.

    Bambola.

  4. bambola New Member

    I will try to explain better.


    SET NOCOUNT ON
    -- Creating table Categories_dups
    CREATE TABLE [Categories_dups] (
    [CategoryName] [nvarchar] (15) NOT NULL ,
    [Description] [ntext] NULL ,
    [Picture] [image] NULL ,
    )
    GO
    -- pupulating table with distinct value
    INSERT INTO categories_dups
    SELECT [CategoryName], [Description], [Picture]
    FROM categories
    GO
    -- inserting duplicated rows.
    INSERT INTO categories_dups
    SELECT TOP 5 [CategoryName], [Description], [Picture]
    FROM categories
    ORDER BY NEWID()
    GO
    INSERT INTO categories_dups
    SELECT TOP 5 [CategoryName], [Description], [Picture]
    FROM categories
    ORDER BY NEWID()
    GO
    -- Adding an IDENTITY to table
    ALTER TABLE categories_dups
    ADD CategoryID INT IDENTITY (1,1)

    GO

    -- running the query generated by the script above.
    -- This are all the duplicated rows.
    -- if you change EXISTS to NOT EXSITS you will see all distinct rows.

    -- Basically, this query is comparing the table to itself (self join). It is selecting
    -- all the rows that has the same value in @uniqueColumns in both tables, but with different
    -- IDENTITY values, and getting only one of them.

    SELECT CategoryName, Description, Picture, CategoryID
    FROM categories_dups
    WHERE EXISTS
    (
    SELECT NULL
    FROM categories_dups a WITH (NOLOCK)
    WHERE categories_dups.categoryname = a.categoryname
    AND categories_dups.CategoryID < a.CategoryID
    )

    -- Cleanup
    -- DROP TABLE Categories_dups

    I hope it is more clear now.

    I posted it in the wrong forum. It should be moved to "Contribute Your SQL Server Scripts". Sorry!

    Bambola.
  5. Adriaan New Member

    If there is a unique identifier in the table (every table should have one)
    then we can simply use a DELETE query with an EXISTS clause, like this:

    DELETE FROM TestTable
    WHERE EXISTS
    (SELECT TMP.ID
    FROM TestTable AS TMP
    WHERE TMP.vchFirstName = TestTable.vchFirstName
    AND TMP.vchLastName = TestTable.vchLastName
    AND TMP.vchAddress1 = TestTable.vchAddress1
    AND TMP.ID < TestTable.ID)

    If you want to retain only the last inserted duplicate instead of the first
    one, change < to >.

    More fun:
    If you want to retain the duplicates and delete the unique records, use NOT
    EXISTS with <, > or <>.
    If you want to retain only the unique records and delete all duplicates, use
    EXISTS with <>.

    The following script creates a simple table, inserts some rows ('AAAA' 3x,
    'AAAD' 2x) and then deletes the duplicates except the first one that was
    entered:

    ------------------------------------------
    SET NOCOUNT ON

    CREATE TABLE #TestTable
    (TestID INT IDENTITY PRIMARY KEY,
    TestString VARCHAR(4))

    INSERT INTO #TestTable (TestString) VALUES ('AAAA')
    INSERT INTO #TestTable (TestString) VALUES ('AAAA')
    INSERT INTO #TestTable (TestString) VALUES ('AAAB')
    INSERT INTO #TestTable (TestString) VALUES ('AAAC')
    INSERT INTO #TestTable (TestString) VALUES ('AAAD')
    INSERT INTO #TestTable (TestString) VALUES ('AAAA')
    INSERT INTO #TestTable (TestString) VALUES ('AAAD')

    DELETE FROM #TESTTABLE
    WHERE EXISTS
    (SELECT TMP.TestID
    FROM #TestTable AS TMP
    WHERE TMP.TestString = #TESTTABLE.TestString
    AND TMP.TestID < #TESTTABLE.TestID)

    SELECT * FROM #TESTTABLE

    DROP TABLE #TESTTABLE
    ------------------------------------------
  6. bambola New Member

    This is exactly what I am doing! <img src='/community/emoticons/emotion-1.gif' alt=':)' /><br />I'll use your table for example. <br /><pre><br />CREATE TABLE TestTable<br />(TestID INT IDENTITY PRIMARY KEY,<br />TestString VARCHAR(4))<br /><br />INSERT INTO TestTable (TestString) VALUES ('AAAA')<br />INSERT INTO TestTable (TestString) VALUES ('AAAA')<br />INSERT INTO TestTable (TestString) VALUES ('AAAB')<br />INSERT INTO TestTable (TestString) VALUES ('AAAC')<br />INSERT INTO TestTable (TestString) VALUES ('AAAD')<br />INSERT INTO TestTable (TestString) VALUES ('AAAA')<br />INSERT INTO TestTable (TestString) VALUES ('AAAD')<br /><br />Now if I run the script above, changing the parameters to:<br /><br />SELECT @dbName = 'database_name' -- change to your database name<br /> , @tableName = 'TestTable'<br /> , @uniqueColumns = 'TestString'<br /><br />I will have the following output:<br /><br /> SELECT TestID, TestString<br /> FROM TestTable<br /> WHERE EXISTS<br /> (<br /> SELECT NULL<br /> FROM TestTable a WITH (NOLOCK)<br /> WHERE TestTable.TestString = a.TestString<br /> HAVING TestTable.TestID &lt; MAX(a.TestID)<br /> )<br /><br /> -- To delete replace the first 2 lines with:<br /> -- DELETE FROM TestTable<br /><br /> -- To select all NON dulpicated rows change WHERE EXISTS to <br /> -- WHERE NOT EXISTS<br /></pre><br />It looks like the query you suggested <img src='/community/emoticons/emotion-1.gif' alt=':)' /><br />You see what you end up writing when you are so LAZY??? [<img src='/community/emoticons/emotion-2.gif' alt=':D' />]<br /><br />Bambola.<br />
  7. Adriaan New Member

    Oooops, didn't mean to invade ...

    I was actually looking for a place to respond to the article featured on the start page, "How to Delete SQL Server Database Duplicate Rows ...", which used a cursor instead of being set-based. The thread title suggested this was a spin-off, so I replied after just glancing over the posts.

    So you can use the HAVING clause and the MAX() function without a GROUP BY - neat trick - but if EXISTS is so fast because it stops when the first match is found, how would calculating an aggregate improve performance?

    Adriaan
  8. Adriaan New Member

    I've just had Query Analyzer estimate the execution plan for the two queries in a batch, and yours contributes easily twice as much to the query costs ...
  9. bambola New Member

    <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote">You see what you end up writing when you are so LAZY??? <hr height="1" noshade id="quote"></blockquote id="quote"></font id="quote"><br />With the line above I was OBVIOUSLY refering to myself!<br />should have been "You see what you end up writing when you are as LAZY as me..."<br />Sorry if it sounded different <img src='/community/emoticons/emotion-1.gif' alt=':)' /><br /><br />Bambola.<br /><br />
  10. bambola New Member

    You are 100% right!
    There's no need to use HAVING. I'll correct my initial post.

    Bambola.
  11. Adriaan New Member

    With these temperatures we're having I'm not calling anyone lazy. But if you don't mind I'll just nod off before I start testing all your other posts <img src='/community/emoticons/emotion-5.gif' alt=';-)' /><br /><br />Adriaan
  12. bambola New Member

    GO ahead, be my guest.
    we might need to start another forum for that though, and I'm not sure the server would
    be able to handle the load... LOL

    Bambola.

Share This Page