SQL Server Performance Forum – Threads Archive

duplicate rows but no key on the tables

Dear All, I have a table with 10 billion records but there are no key on it. I cannot build a key on it as it is the data source. However, the data source exits the duplicated rows. I have used the DTS to transform the data into a new table and delete the duplicated rows. As there are 10 billion records, i need to divide it into 3 parts and also the process lasts for 6 hours each part. I want to ask is there any other good methods to slove my problem?? Thx
Esther

I think you can go for bulk copy method to insert large nuber of rows.
It will be quicker.
I feel the method you are following is good.
create new table, select distinct rows and then rename the table.

Are you planning to create a PRIMARY KEY on the table after you’ve completed this operation? —
Frank Kalis
Microsoft SQL Server MVP
http://www.insidesql.de
Ich unterstÃ¼tze PASS Deutschland e.V. http://www.sqlpass.de)

Yes, i have created a PK for the table. is there any good suggest for me to improve the process? Thx Esther

Insert rows in batches of 5000 rows or so, to keep log size under control. For each 5000 rows insert only rows that don’t exist (or update existing ones, whatever you think is better). One way to do it would be something like:
set @lastID = 0
—
while @lastID is not null begin
select @LastID = max(id) from (select top 5000 from sourceTable where id > @lastID) as t
insert into destinationTable (…)
select top 5000 …
from sourceTable st
left join destinationTable dt on dt.id = st.id
where id > @lastID
and dt.id is null
end
You need to have clustered index on id on both source and destination tables.

create an identity cloumn on existing one table and delete duplicate record. Consider case: suppose, there is a table Emp having three columns Empname, Address, State.
Add one more field EmpId as identity field. Assume that this table has 10 billion record.
Now delete by using following sql statement delete Emp where empId not in (select min(empId) from Emp group by empname,address,state). After that u can transfer ur data either using DTS or creating Sp for inserting in batch.
Limitaion of batch u can defined.
skumar

]]>