Duplicated rows. | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Duplicated rows.

Hello, friends.
You have been always such a great help.
Please, help me with another problem. I have a table1(value1,value2,value3,date). Combination of (value1,value2) is a primary key, but it wasn#%92t set up. I have duplication of combination (value1,value2) records with value3 and date being different. I would like to keep only one record and delete the second one with the earlier date.
Table1 look like it: a b c date
a b c1 date1 –should be deleted
a1 b1 c date
a1 b1 c1 date1 –should be deleted
a2 b2 c date
a2 b2 c1 date —-should be deleted Thanks again for your help.
Alex.

How do you know which one of the records should be deleted. Is it random, or is there some logic you can apply to value3 and date to select the appropriate record? MeanOldDBA
[email protected] When life gives you a lemon, fire the DBA.
From the sampe data, it looks as if (value1, value2, value3) could also be the primary key. In any case, you need a field that is guaranteed to have a different value on different rows with the same (value1, value2) key, in order to remove the duplicates in one go. You could add an IDENTITY column for that purpose – I’ve named it NEW_ID in the query below. You can drop de NEW_ID afterwards if you want to. Do this on a copy of the data first and check the results, before committing. DELETE FROM table1
WHERE EXISTS
(SELECT TMP.value1 FROM table1 AS TMP
WHERE TMP.value1 = table1.value1
AND TMP.value2 = table1.value2
AND TMP.NEW_ID <> table1.NEW_ID
AND TMP.date <= table1.date)

Derrick, I think the point was to delete all records that have the same (value1, value2) key, keeping only the one with the earliest date value. You need the unique identifier to avoid deleting all rows with (value1, value2) that have the earliest date value.
Yeah, I just read that. Sorry about missing that one the first time through. MeanOldDBA
[email protected] When life gives you a lemon, fire the DBA.
U Can Try This Script Also. If u dont want to create identity column in Your table. /*
Drop Table TmpA
Create Table TmpA (Id Varchar(2),Id1 Varchar(2),Col1 Varchar(2),Dat DateTime)
Insert Into TmpA Select ‘a’, ‘b’, ‘c’, GetDate()-1
Insert Into TmpA Select ‘a’, ‘b’, ‘c1’, GetDate()-2 –should be deleted
Insert Into TmpA Select ‘a1’, ‘b1’, ‘c’, GetDate()-3
Insert Into TmpA Select ‘a1’, ‘b1’, ‘c1’, GetDate()-4 –should be deleted
Insert Into TmpA Select ‘a2’, ‘b2’, ‘c’, GetDate()-5
Insert Into TmpA Select ‘a2’, ‘b2’, ‘c1’, GetDate()-5 —-should be deleted
*/
–Select Id,ID1,Sum(1) From TmpA Group By Id,ID1 Having Sum(1)>1
–select * from TmpA Declare @ID Varchar(2)
Declare @ID1 Varchar(2)
Declare @IDx Varchar(2)
Declare @IDx1 Varchar(2)
Declare @IsFirstRecord Varchar(1)
Declare Xyz Cursor For Select Id,ID1 From TmpA Group By Id,ID1 Having Sum(1)>1
OPEN Xyz
–Gets The Primary Key Field To Filter the Duplicate Records
FETCH NEXT FROM Xyz Into @ID,@ID1
WHILE @@FETCH_STATUS = 0
Begin
Select @IsFirstRecord=1 — 1 Indicates First Record / 0 – Dup records
Print @ID + @ID1
Print @IsFirstRecord –Generate Statement 2 Delet Data
Declare Abc Cursor OPTIMISTIC For Select ID,ID1 From TMPA Where [email protected] And ID1= @ID1
OPEN abc
FETCH NEXT FROM abc Into @Idx,@Idx1
WHILE @@FETCH_STATUS = 0
BEGIN
If @IsFirstRecord = 1
Begin
Set @IsFirstRecord=0
FETCH NEXT FROM abc Into @Idx,@Idx1
End
Else
Begin
Delete From TmpA Where Current Of Abc
FETCH NEXT FROM abc Into @Idx,@Idx1
Print @Idx + @Idx1
Print ‘Del’
End
END
CLOSE abc
DEALLOCATE abc
–Gets The Next Primary Key Field To Filter the Duplicate Records
FETCH NEXT FROM Xyz Into @ID,@ID1 End
CLOSE Xyz
DEALLOCATE Xyz
Thanks
Sandy

Sandy, From the looks of it, your script gets the job done, and without adding a field. But as I said, if there is another field value3 that makes (value1, value2, value3) a unique key, then you could compare TMP.value3 <> table1.value3 and get the same results – for much less programming. If the table holds a significant number of rows then this is probably a lot faster too. I use cursors only when there is no other way around it.
If I understand this correct, there is absolutely no need for a cursor. Something like this should also get the job done. I assume Adriaan meant the same.
Create Table TmpA (Id Varchar(2),Id1 Varchar(2),Col1 Varchar(2),Dat DateTime)
Insert Into TmpA Select ‘a’, ‘b’, ‘c’, GetDate()-1
Insert Into TmpA Select ‘a’, ‘b’, ‘c1’, GetDate()-2 –should be deleted
Insert Into TmpA Select ‘a1’, ‘b1’, ‘c’, GetDate()-3
Insert Into TmpA Select ‘a1’, ‘b1’, ‘c1’, GetDate()-4 –should be deleted
Insert Into TmpA Select ‘a2’, ‘b2’, ‘c’, GetDate()-5
Insert Into TmpA Select ‘a2’, ‘b2’, ‘c1’, GetDate()-6 —-should be deleted
— the 5 in Sandy’s script is likely to be a typo, right? SELECT * FROM tmpA DELETE FROM tmpA
WHERE EXISTS
(SELECT * FROM tmpA AS T
WHERE T.id = tmpA.id
AND T.id1 = tmpA.Id1
AND T.Dat < tmpA.Dat)
SELECT * FROM tmpA
drop table tmpa Id Id1 Col1 Dat
—- —- —- ——————————————————
a b c 2004-09-14 15:42:17.697
a b c1 2004-09-13 15:42:17.707
a1 b1 c 2004-09-12 15:42:17.707
a1 b1 c1 2004-09-11 15:42:17.707
a2 b2 c 2004-09-10 15:42:17.707
a2 b2 c1 2004-09-09 15:42:17.707 (6 row(s) affected)
(3 row(s) affected) Id Id1 Col1 Dat
—- —- —- ——————————————————
a b c 2004-09-14 15:42:17.697
a1 b1 c 2004-09-12 15:42:17.707
a2 b2 c 2004-09-10 15:42:17.707 (3 row(s) affected) —
–Frank
http://www.insidesql.de

Frank,<br /><br />Basically correct, but if there are two records with the same (id, id1) key that have the same date value, then neither one is deleted. That’s why you need both the third field for the key and the &gt;= operator (instead of &gt<img src=’/community/emoticons/emotion-5.gif’ alt=’;)‘ />.<br /><br />Adriaan
quote:
I have a table1(value1,value2,value3,date). Combination of (value1,value2) is a primary key, but it wasn#%92t set up. I have duplication of combination (value1,value2) records with value3 and date being different. I would like to keep only one record and delete the second one with the earlier date.
One might assume from reading this, that your point cannot happen. Otherwise the question was imprecise. —
–Frank
http://www.insidesql.de

Cannot happen? The problem is that we don’t know if his real data contains duplicates on (Id, Id1, Dat). If that is the case, then neither of those rows will get deleted (because of the < operator). And then he still has duplicate rows for (Id, Id1) …
Adriaan,<br />have you read what I quoted from the original question and marked in bold?<br />From what I read there, I assumed that the combination of id, id1 <b>can</b> have duplicated, while the combination of id, id1 and dat <b>cannot</b> have duplicates. <br />Of course, I might be wrong on this [<img src=’/community/emoticons/emotion-1.gif’ alt=’:)‘ />]<br /><br />———————–<br />–Frank<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />———————–<br />
Frank & Adrian, Thanks for the Inputs,Which helped me a lot.
But I Used Cursor Just Bcoz , If Alex wants to add his some more where clauses Or Conditions . Thanks
Sandy
Frank, As you said, the original question wasn’t really clear – he says "value3 and date being different", which suggests that only the combination of all four fields (value1, value2, value3, date) is unique. We just don’t know for sure if (value1, value2, date) is unique. If it is, then we can use your query.
Yeah, who knows in the end. ..and the ultimate answer is 42, iirc.
Anyway, alex should now have some ideas how to deal with the problem. How do you say "Friede, Freude, Eierkuchen" in Dutch?
———————–
–Frank
http://www.insidesql.de
———————–

"Vrede, vreugde, eierkoeken" – hm, has a nice ring to it! There has to be a similar saying in Dutch, just cannot think of one at the moment (you know I can’t think straight when I’m angry [<img src=’/community/emoticons/emotion-6.gif’ alt=’:(‘ />!][}<img src=’/community/emoticons/emotion-1.gif’ alt=’:)‘ />][<img src=’/community/emoticons/emotion-5.gif’ alt=’;)‘ />][<img src=’/community/emoticons/emotion-2.gif’ alt=’:D‘ />]).
Hopefully it wasn’t me who upset you [<img src=’/community/emoticons/emotion-2.gif’ alt=’:D‘ />]<br />Anyway, I like the sequence of smilies [<img src=’/community/emoticons/emotion-1.gif’ alt=’:)‘ />]<br /><br />Oh my, doing time waiting to go home today…<br /><br />———————–<br />–Frank<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />———————–<br />
Hey, here’s a new one from my smilies collection<br /><img src=’http://www.insidesql.de/images/PcPunch.gif’ border=’0′ /><br /><br />———————–<br />–Frank<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />———————–<br />
Thank you, friends – you are all geniuses! Alex.
Hi FrankKalis, I used the following query
select t1.* from tmpa t1,tmpa t2 where t1.id=t2.id and t1.id1 = t2.Id1 and t2.dat<t1.dat
which produced the required result Which is best in performance? Madhivanan
]]>