Update/delete large table with duplicated records | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Update/delete large table with duplicated records

Hello all.
I wanted to flag ducplicated records in
a large table (more than 100.000 records)
so I tried to use the next script. The problem is that it took a large amount of
times. Any ideas to optimize it? Is there another way to flag or delete duplicated
records? Thanks Update Large_table
set field_to_Flag= 1
WHERE
EXISTS (
SELECT
NULL
FROM Large_table b
WHERE Large_table.Field1 = Large_table.Field1
AND b.Field2 = Large_table.Field2
AND b.Field3 = Large_table.Field3 GROUP BY b.Field1, b.Field2, b.Field3
HAVING Large_table.id < MAX(b.id)
)
Have you viewed an excution plan of the query to see what is going on internally? For example, perhaps adding one or more indexes might help this query. An execution plan will tell you is indexes are being used or not. ——————
Brad M. McGehee
Webmaster
SQL-Server-Performance.Com
This table dot not have index at all.
I just added unique identifier field.
Will you be more explicit about your
solution?

Another precision:
If I add indexes on this table,
the main software(3rd party) won’t work.
Im not sure I understand why the app wouldnt work. Adding an non-constrained index should be transparent to any application using the table, unless the table is being recreated etc. Also, the index is there to improve performance, not to prevent duplicates. If you cant create any indices at all, Id guess youre going to struggle to improve performance much. Alternatively, perhaps you could create a brand new table with the same schema, with an ‘ignore duplicates’ index on Field1,Field2,Field3 and then do a select into from your existing table into the new one. This would be ok for a one-off process but not really something I would want to do regularly.
Without indexes, the only option SQL Server has is to perform index scans, which often hurt performance. Like Chappy says, I don’t see how adding an index will prevent third-party software from working. We are always adding indexes databases accessed by third party software and I have never seen a problem. Can you be more specific? ——————
Brad M. McGehee
Webmaster
SQL-Server-Performance.Com
I have had more luck with large tables not using EXISTS in the where clause! This type of query tends to run more effiently and quicker. UPDATE LARGE_TABLE
SET field_to_flag = 1
FROM LARGE_TABLE
JOIN ( SELECT LARGE_TABLE.Field1,
LARGE_TABLE.Field2,
LARGE_TABLE.Field3,
MAX(LARGE_TABLE.id) as ID
FROM LARGE_TABLE
GROUP BY
LARGE_TABLE.Field1,
LARGE_TABLE.Field2,
LARGE_TABLE.Field3 ) AS A ON A.Field1 = LARGE_TABLE.Field1
AND A.Field2 = LARGE_TABLE.Field2
AND A.Field3 = LARGE_TABLE.Field3
WHERE LARGE_TABLE.id <> A.ID
To identify duplicate rows: Select count(distinct(field1, field2,field3) from large_table having count(distinct(field1, field2,field3) > 1 If you only have a few you can run this query and then take corrective action on a row by row basis
Or if you want to know which values are duplicated, run select column1 from table
group by column1
having count(column1) > 1 Tom Pullen
DBA, Oxfam GB
]]>