SQL Server Performance

Duplicated rows.

Discussion in 'T-SQL Performance Tuning for Developers' started by alex, Sep 15, 2004.

  1. alex New Member

    Hello, friends.
    You have been always such a great help.
    Please, help me with another problem.

    I have a table1(value1,value2,value3,date). Combination of (value1,value2) is a primary key, but it wasn#%92t set up. I have duplication of combination (value1,value2) records with value3 and date being different. I would like to keep only one record and delete the second one with the earlier date.
    Table1 look like it:

    a b c date
    a b c1 date1 --should be deleted
    a1 b1 c date
    a1 b1 c1 date1 --should be deleted
    a2 b2 c date
    a2 b2 c1 date ----should be deleted

    Thanks again for your help.
    Alex.


  2. derrickleggett New Member

    How do you know which one of the records should be deleted. Is it random, or is there some logic you can apply to value3 and date to select the appropriate record?

    MeanOldDBA
    derrickleggett@hotmail.com

    When life gives you a lemon, fire the DBA.
  3. Adriaan New Member

    From the sampe data, it looks as if (value1, value2, value3) could also be the primary key.

    In any case, you need a field that is guaranteed to have a different value on different rows with the same (value1, value2) key, in order to remove the duplicates in one go. You could add an IDENTITY column for that purpose - I've named it NEW_ID in the query below. You can drop de NEW_ID afterwards if you want to.

    Do this on a copy of the data first and check the results, before committing.

    DELETE FROM table1
    WHERE EXISTS
    (SELECT TMP.value1 FROM table1 AS TMP
    WHERE TMP.value1 = table1.value1
    AND TMP.value2 = table1.value2
    AND TMP.NEW_ID <> table1.NEW_ID
    AND TMP.date <= table1.date)
  4. Adriaan New Member

    Derrick,

    I think the point was to delete all records that have the same (value1, value2) key, keeping only the one with the earliest date value.

    You need the unique identifier to avoid deleting all rows with (value1, value2) that have the earliest date value.
  5. derrickleggett New Member

    Yeah, I just read that. Sorry about missing that one the first time through.

    MeanOldDBA
    derrickleggett@hotmail.com

    When life gives you a lemon, fire the DBA.
  6. sundeip New Member

    U Can Try This Script Also. If u dont want to create identity column in Your table.

    /*
    Drop Table TmpA
    Create Table TmpA (Id Varchar(2),Id1 Varchar(2),Col1 Varchar(2),Dat DateTime)
    Insert Into TmpA Select 'a', 'b', 'c', GetDate()-1
    Insert Into TmpA Select 'a', 'b', 'c1', GetDate()-2 --should be deleted
    Insert Into TmpA Select 'a1', 'b1', 'c', GetDate()-3
    Insert Into TmpA Select 'a1', 'b1', 'c1', GetDate()-4 --should be deleted
    Insert Into TmpA Select 'a2', 'b2', 'c', GetDate()-5
    Insert Into TmpA Select 'a2', 'b2', 'c1', GetDate()-5 ----should be deleted
    */
    --Select Id,ID1,Sum(1) From TmpA Group By Id,ID1 Having Sum(1)>1
    --select * from TmpA

    Declare @ID Varchar(2)
    Declare @ID1 Varchar(2)
    Declare @IDx Varchar(2)
    Declare @IDx1 Varchar(2)
    Declare @IsFirstRecord Varchar(1)
    Declare Xyz Cursor For Select Id,ID1 From TmpA Group By Id,ID1 Having Sum(1)>1
    OPEN Xyz
    --Gets The Primary Key Field To Filter the Duplicate Records
    FETCH NEXT FROM Xyz Into @ID,@ID1
    WHILE @@FETCH_STATUS = 0
    Begin
    Select @IsFirstRecord=1 -- 1 Indicates First Record / 0 - Dup records
    Print @ID + @ID1
    Print @IsFirstRecord

    --Generate Statement 2 Delet Data
    Declare Abc Cursor OPTIMISTIC For Select ID,ID1 From TMPA Where Id=@ID And ID1= @ID1
    OPEN abc
    FETCH NEXT FROM abc Into @Idx,@Idx1
    WHILE @@FETCH_STATUS = 0
    BEGIN
    If @IsFirstRecord = 1
    Begin
    Set @IsFirstRecord=0
    FETCH NEXT FROM abc Into @Idx,@Idx1
    End
    Else
    Begin
    Delete From TmpA Where Current Of Abc
    FETCH NEXT FROM abc Into @Idx,@Idx1
    Print @Idx + @Idx1
    Print 'Del'
    End


    END
    CLOSE abc
    DEALLOCATE abc
    --Gets The Next Primary Key Field To Filter the Duplicate Records
    FETCH NEXT FROM Xyz Into @ID,@ID1

    End
    CLOSE Xyz
    DEALLOCATE Xyz


    Thanks
    Sandy
  7. Adriaan New Member

    Sandy,

    From the looks of it, your script gets the job done, and without adding a field. But as I said, if there is another field value3 that makes (value1, value2, value3) a unique key, then you could compare TMP.value3 <> table1.value3 and get the same results - for much less programming. If the table holds a significant number of rows then this is probably a lot faster too.

    I use cursors only when there is no other way around it.
  8. FrankKalis Moderator

    If I understand this correct, there is absolutely no need for a cursor. Something like this should also get the job done. I assume Adriaan meant the same.


    Create Table TmpA (Id Varchar(2),Id1 Varchar(2),Col1 Varchar(2),Dat DateTime)
    Insert Into TmpA Select 'a', 'b', 'c', GetDate()-1
    Insert Into TmpA Select 'a', 'b', 'c1', GetDate()-2 --should be deleted
    Insert Into TmpA Select 'a1', 'b1', 'c', GetDate()-3
    Insert Into TmpA Select 'a1', 'b1', 'c1', GetDate()-4 --should be deleted
    Insert Into TmpA Select 'a2', 'b2', 'c', GetDate()-5
    Insert Into TmpA Select 'a2', 'b2', 'c1', GetDate()-6 ----should be deleted
    -- the 5 in Sandy's script is likely to be a typo, right?

    SELECT * FROM tmpA

    DELETE FROM tmpA
    WHERE EXISTS
    (SELECT * FROM tmpA AS T
    WHERE T.id = tmpA.id
    AND T.id1 = tmpA.Id1
    AND T.Dat < tmpA.Dat)
    SELECT * FROM tmpA
    drop table tmpa

    Id Id1 Col1 Dat
    ---- ---- ---- ------------------------------------------------------
    a b c 2004-09-14 15:42:17.697
    a b c1 2004-09-13 15:42:17.707
    a1 b1 c 2004-09-12 15:42:17.707
    a1 b1 c1 2004-09-11 15:42:17.707
    a2 b2 c 2004-09-10 15:42:17.707
    a2 b2 c1 2004-09-09 15:42:17.707

    (6 row(s) affected)


    (3 row(s) affected)

    Id Id1 Col1 Dat
    ---- ---- ---- ------------------------------------------------------
    a b c 2004-09-14 15:42:17.697
    a1 b1 c 2004-09-12 15:42:17.707
    a2 b2 c 2004-09-10 15:42:17.707

    (3 row(s) affected)

    --
    --Frank
    http://www.insidesql.de
  9. Adriaan New Member

    Frank,<br /><br />Basically correct, but if there are two records with the same (id, id1) key that have the same date value, then neither one is deleted. That's why you need both the third field for the key and the &gt;= operator (instead of &gt<img src='/community/emoticons/emotion-5.gif' alt=';)' />.<br /><br />Adriaan
  10. FrankKalis Moderator

    quote:
    I have a table1(value1,value2,value3,date). Combination of (value1,value2) is a primary key, but it wasn#%92t set up. I have duplication of combination (value1,value2) records with value3 and date being different. I would like to keep only one record and delete the second one with the earlier date.
    One might assume from reading this, that your point cannot happen. Otherwise the question was imprecise.

    --
    --Frank
    http://www.insidesql.de
  11. Adriaan New Member

    Cannot happen? The problem is that we don't know if his real data contains duplicates on (Id, Id1, Dat). If that is the case, then neither of those rows will get deleted (because of the < operator). And then he still has duplicate rows for (Id, Id1) ...
  12. FrankKalis Moderator

    Adriaan,<br />have you read what I quoted from the original question and marked in bold?<br />From what I read there, I assumed that the combination of id, id1 <b>can</b> have duplicated, while the combination of id, id1 and dat <b>cannot</b> have duplicates. <br />Of course, I might be wrong on this [<img src='/community/emoticons/emotion-1.gif' alt=':)' />]<br /><br />-----------------------<br />--Frank<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />-----------------------<br />
  13. sundeip New Member

    Frank & Adrian,

    Thanks for the Inputs,Which helped me a lot.
    But I Used Cursor Just Bcoz , If Alex wants to add his some more where clauses Or Conditions .

    Thanks
    Sandy



  14. Adriaan New Member

    Frank,

    As you said, the original question wasn't really clear - he says "value3 and date being different", which suggests that only the combination of all four fields (value1, value2, value3, date) is unique.

    We just don't know for sure if (value1, value2, date) is unique. If it is, then we can use your query.
  15. FrankKalis Moderator

    Yeah, who knows in the end. ..and the ultimate answer is 42, iirc.
    Anyway, alex should now have some ideas how to deal with the problem.

    How do you say "Friede, Freude, Eierkuchen" in Dutch?


    -----------------------
    --Frank
    http://www.insidesql.de
    -----------------------
  16. Adriaan New Member

    "Vrede, vreugde, eierkoeken" - hm, has a nice ring to it! There has to be a similar saying in Dutch, just cannot think of one at the moment (you know I can't think straight when I'm angry [<img src='/community/emoticons/emotion-6.gif' alt=':(' />!][}<img src='/community/emoticons/emotion-1.gif' alt=':)' />][<img src='/community/emoticons/emotion-5.gif' alt=';)' />][<img src='/community/emoticons/emotion-2.gif' alt=':D' />]).
  17. FrankKalis Moderator

    Hopefully it wasn't me who upset you [<img src='/community/emoticons/emotion-2.gif' alt=':D' />]<br />Anyway, I like the sequence of smilies [<img src='/community/emoticons/emotion-1.gif' alt=':)' />]<br /><br />Oh my, doing time waiting to go home today...<br /><br />-----------------------<br />--Frank<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />-----------------------<br />
  18. FrankKalis Moderator

    Hey, here's a new one from my smilies collection<br /><img src='http://www.insidesql.de/images/PcPunch.gif' border='0' /><br /><br />-----------------------<br />--Frank<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />-----------------------<br />
  19. alex New Member

    Thank you, friends - you are all geniuses!

    Alex.
  20. Madhivanan Moderator

    Hi FrankKalis, I used the following query


    select t1.* from tmpa t1,tmpa t2 where t1.id=t2.id and t1.id1 = t2.Id1 and t2.dat<t1.dat
    which produced the required result

    Which is best in performance?

    Madhivanan

Share This Page