Hello all. I wanted to flag ducplicated records in a large table (more than 100.000 records) so I tried to use the next script. The problem is that it took a large amount of times. Any ideas to optimize it? Is there another way to flag or delete duplicated records? Thanks Update Large_table set field_to_Flag= 1 WHERE EXISTS ( SELECT NULL FROM Large_table b WHERE Large_table.Field1 = Large_table.Field1 AND b.Field2 = Large_table.Field2 AND b.Field3 = Large_table.Field3 GROUP BY b.Field1, b.Field2, b.Field3 HAVING Large_table.id < MAX(b.id) )
Have you viewed an excution plan of the query to see what is going on internally? For example, perhaps adding one or more indexes might help this query. An execution plan will tell you is indexes are being used or not. ------------------ Brad M. McGehee Webmaster SQL-Server-Performance.Com
This table dot not have index at all. I just added unique identifier field. Will you be more explicit about your solution?
Im not sure I understand why the app wouldnt work. Adding an non-constrained index should be transparent to any application using the table, unless the table is being recreated etc. Also, the index is there to improve performance, not to prevent duplicates. If you cant create any indices at all, Id guess youre going to struggle to improve performance much. Alternatively, perhaps you could create a brand new table with the same schema, with an 'ignore duplicates' index on Field1,Field2,Field3 and then do a select into from your existing table into the new one. This would be ok for a one-off process but not really something I would want to do regularly.
Without indexes, the only option SQL Server has is to perform index scans, which often hurt performance. Like Chappy says, I don't see how adding an index will prevent third-party software from working. We are always adding indexes databases accessed by third party software and I have never seen a problem. Can you be more specific? ------------------ Brad M. McGehee Webmaster SQL-Server-Performance.Com
I have had more luck with large tables not using EXISTS in the where clause! This type of query tends to run more effiently and quicker. UPDATE LARGE_TABLE SET field_to_flag = 1 FROM LARGE_TABLE JOIN ( SELECT LARGE_TABLE.Field1, LARGE_TABLE.Field2, LARGE_TABLE.Field3, MAX(LARGE_TABLE.id) as ID FROM LARGE_TABLE GROUP BY LARGE_TABLE.Field1, LARGE_TABLE.Field2, LARGE_TABLE.Field3 ) AS A ON A.Field1 = LARGE_TABLE.Field1 AND A.Field2 = LARGE_TABLE.Field2 AND A.Field3 = LARGE_TABLE.Field3 WHERE LARGE_TABLE.id <> A.ID
To identify duplicate rows: Select count(distinct(field1, field2,field3) from large_table having count(distinct(field1, field2,field3) > 1 If you only have a few you can run this query and then take corrective action on a row by row basis
Or if you want to know which values are duplicated, run select column1 from table group by column1 having count(column1) > 1 Tom Pullen DBA, Oxfam GB