SQL Server Performance Forum – Threads Archive
Not your typical duplicate rows question

Okay, so I'll concede that the best method of eliminating duplicate rows is to not allow them in the first place. That sounds dandy, but I'm not sure it's entirely possible (or very efficient) in my case. Here's the situation: I receive text files from an external source on a daily basis. Some are 200 – 300 records, some are 200,000+. Sometimes these files contain duplicate records. Sometimes the duplicates are within a single file; sometimes there is a record in "today's" file that was also in a file sent 2, 3 or 10 days ago.

My current process imports the entire file using BULK INSERT and assigns a unique key to each record as it imports. So I have a bunch of rows with unique keys but duplicate data values. I found a method that uses the MAX() or MIN() functions on the unique key to remove the rows with duplicate data values. Unfortunately, my unique keys are GUIDs (which, of course, you can't use in aggregate functions). I currently have a stored procedure that uses a cursor to examine the table row by row and eliminate duplicates. This process takes about 2 hours when the table has 14 million records. After reading so much about the evils of cursors, I'm wondering if there isn't a better way.

Initial thoughts:

1. Put a constraint on the table and don't allow duplicates in the first place. The downside is that I assume the BULK INSERT will run much slower. I'd also like to know when I get duplicates so I can report them back to our customers.

2. Add an autoincrement field to the table. Then I could allow duplicates and use the non-cursor-based method of identifying and removing them.

When importing 14 million records into a table, which would be less time- and resource-intensive: having a table constraint to disallow duplicates in the first place, or importing without a constraint and then removing duplicates later?
If the answer is to allow duplicates and remove them later, would adding an autoincrement field and avoiding a cursor-based solution really improve performance that much? We're talking 100,000 or so duplicates in a 14-million-row table. I guess the obvious answer is to just try the different methods, which I will, but I thought I'd run it by the experts at the same time.
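For what it's worth, option 2 can be done without a cursor. A rough sketch, assuming a hypothetical table named Import with an IDENTITY column RowId added and the data columns called Field1 and Field2 (your actual table and column names will differ):

```sql
-- Set-based duplicate removal: for each group of rows with identical
-- data values, keep the row with the lowest RowId and delete the rest.
DELETE i
FROM Import AS i
INNER JOIN (
    SELECT Field1, Field2, MIN(RowId) AS KeepId
    FROM Import
    GROUP BY Field1, Field2
) AS k
  ON  i.Field1 = k.Field1
  AND i.Field2 = k.Field2
  AND i.RowId  > k.KeepId;
```

Because MIN() works on an integer IDENTITY column (unlike on GUIDs in SQL Server 2000), this sidesteps the aggregate-function limitation entirely, and a single joined DELETE like this generally runs far faster than a row-by-row cursor.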
Why not insert those rows into a temp table, eliminate the duplicates from that table, and finally use DTS to import into the final table? That way you avoid the slow performance of cursors. Satya SKJ
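A minimal sketch of that staging approach, assuming a hypothetical staging table named Staging loaded by BULK INSERT, with data columns Field1 and Field2:

```sql
-- Collapse within-file duplicates into a deduplicated work table.
SELECT DISTINCT Field1, Field2
INTO #Staging_Dedup
FROM Staging;

-- Optionally, report which rows were duplicated (and how many times)
-- before discarding the extra copies.
SELECT Field1, Field2, COUNT(*) AS Copies
FROM Staging
GROUP BY Field1, Field2
HAVING COUNT(*) > 1;
```

The GROUP BY/HAVING query also gives you the duplicate report the original poster wanted to send back to customers.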
This posting is provided “AS IS” with no rights for the sake of knowledge sharing.
Thanks for the reply. As I stated in the original post, duplicate rows may arrive several days apart. Your method would eliminate duplicates within a single day's file, but it would still allow duplicate rows across data received on different days.
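The staging idea can still handle cross-day duplicates if the insert into the final table checks against what is already there. A sketch, assuming hypothetical tables Staging (today's deduplicated file) and Final (the 14-million-row history table), each with data columns Field1 and Field2:

```sql
-- Insert only rows not already present in the history table.
INSERT INTO Final (Field1, Field2)
SELECT s.Field1, s.Field2
FROM Staging AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM Final AS f
    WHERE f.Field1 = s.Field1
      AND f.Field2 = s.Field2
);

-- Rows rejected because they duplicated an earlier day's data,
-- available for reporting back to the customer.
SELECT s.Field1, s.Field2
FROM Staging AS s
WHERE EXISTS (
    SELECT 1
    FROM Final AS f
    WHERE f.Field1 = s.Field1
      AND f.Field2 = s.Field2
);
```

For this to perform well against 14 million rows, the Final table would need an index on the compared columns; with that in place, the per-file NOT EXISTS check is typically much cheaper than deduplicating the whole table after the fact.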