SQL Server Performance Forum – Threads Archive
Need Help Randomizing Customer DataHi, I have a quick question. I have a SQL Server database with customer data in 1 table. I want to use this ‘live’ data for a testing database, but need a way to randomize the customer info. I figured out ways to do it for SSN# and telephone #, as well as address, but I am stuck on the customer names themselves. The customer’s names are split between two columns, first_name and last_name within the same table. All I really want to do is update all the customer records with random combinations of the existing first and last names in the database so our ‘testers’ do not have access to real people names and their addresses/ssn/etc. Anyone know of a way to do this? So far, I have all the data imported into a temp table and was able to change randomize everything but the names. Thanks in advance
Here’s a starting point:
INSERT INTO customers ( firstName, lastName )
SELECT TOP 100000 first.name, last.name
FROM ( SELECT TOP 1000 [name] FROM sampdatafnames ORDER BY NEWID() ) [first]
CROSS JOIN ( SELECT TOP 1000 [name] FROM sampdatalnames ORDER BY NEWID() ) [last]
ORDER BY NEWID() I have used this to make random sample customers from a table of first names and a table of last names. You could make the source tables using something like
select distinct firstname into sampdatafnames from customer
Or this might be even closer (convert from select to update) — warning, it might be slow on a big table because of the cross join:
declare @maxid int
set @maxid = (select max(customerid) from customers)
select * from customers
inner join (
SELECT top( @maxid )first.firstname, last.lastname,
ROW_NUMBER() over ( order by newid() ) as customerid
FROM ( SELECT firstname FROM customers ) [first]
CROSS JOIN ( SELECT lastname FROM customers ) [last]
) randNames on customers.customerid = randnames.customerid
Here is another way of purely randomizing the last names.
Each time this script is run it will always yield random last names, thus making it unpredictable for testers. DECLARE @LastNameList TABLE
ListId INT IDENTITY(1,1),
) DECLARE @MaxListId INT
DECLARE @counter1 INT
INSERT INTO @LastNameList
SELECT CustomerId, LastName
FROM Customer SELECT @MaxListId = MAX(ListId) FROM @LastNameList SET @counter1 = 1 DECLARE @counter2 INT
SET @counter2 = 0 WHILE @counter2 in (0,1,2) OR @counter2 >= @MaxListId
SET @counter2 = RAND()*10
END DECLARE @NewLastName VARCHAR(10)
DECLARE @CurrentCustomerId INT WHILE @counter1 <= @MaxListId
BEGIN SELECT @CurrentCustomerId = CustomerId
WHERE ListId = @counter1 SELECT @NewLastName = LastName
WHERE ListId = @counter2 SET @counter1 = @counter1 + 1 IF @counter2 >= @MaxListId
SET @counter2 = 0
SET @counter2 = @counter2 + 1 UPDATE Customer
SET LastName = @NewLastName
CustomerId = @CurrentCustomerId
END Let me know if you have queries.
Thanks and Regards,
Replace with NEWID() – this returns a GUID like ‘D2D9B2CD-8246-4E1B-85BA-1348BF23F264’. Each call to NEWID() will return a completely different GUID. If you include NEWID() in any query, the GUID will be different on each row.
Hey, thanks for the help people. I was finally able to get my script to work somewhat, on small record sizes of 0-30,000. But the problem is, my database table that I am working on has 170,000 records and the script just keeps running. I am not really sure what to do or how to get it to finish without going on forever. Can anyone see anything sticking out at them that would make this run forever?<br /><br />Ill post my code below. Before this part of the code executes, I run a statement to select all the records into my temp table. Then I run this part to randomize the last names:<br /><br />CREATE TABLE #LastNameList <br />(<br />ListId INT IDENTITY(1,1),<br />patient_id char(<img src=’/community/emoticons/emotion-11.gif’ alt=’8)’ />,<br />last_name VARCHAR(20)<br />)<br /><br />DECLARE @MaxListId INT<br />DECLARE @counter1 INT<br />DECLARE @addy INT<br />DECLARE @mainstreet CHAR<br /><br /><br /><br />INSERT INTO #LastNameList <br />SELECT patient_id, last_name<br />FROM #temppat1<br /><br />SET @counter1 = 1<br />SET @addy = 100<br />SET @mainstreet = ‘Main Street'<br /><br />SELECT @MaxListId = MAX(ListId) FROM #LastNameList <br /><br />DECLARE @counter2 INT<br />SET @counter2 = 0<br /><br />WHILE @counter2 in (0,1,2) OR @counter2 >= @MaxListId <br />BEGIN<br />SET @counter2 = RAND()*10<br />END<br /><br />DECLARE @NewLastName VARCHAR(10)<br />DECLARE @CurrentPatientId INT<br /><br />WHILE @counter1 <= @MaxListId<br />BEGIN<br /><br />SELECT @CurrentPatientId = patient_id<br />FROM #LastNameList <br />WHERE ListId = @counter1<br /><br /><br /><br /><br />SELECT @NewLastName = last_name<br />FROM #LastNameList <br />WHERE ListId = @counter2<br /><br />SET @counter1 = @counter1 + 1<br />SET @addy = @addy + 1<br /><br />IF @counter2 >= @MaxListId<br />BEGIN<br />SET @counter2 = 0<br />END <br />SET @counter2 = @counter2 + 1<br /><br />UPDATE #temppat1<br />SET last_name = @NewLastName,<br />address = rtrim(convert(char(20),@addy)) + ‘ ‘ + ‘Main Street'<br />WHERE <br />patient_id= @CurrentPatientId<br />END
Why not use this: UPDATE MyTable
SET FullName = NEWID() or UPDATE MyTable
SET FullName = LEFT(NEWID(), 20) … using whatever number of characters you need for the column.
A hunch why it’s slow: you are doing an update per record instead of a single update for all records. Very often a single SQL statement is faster than a loop that processes one record at a time. What you are trying to do can be done in a single SQL statement, and doesn’t need a long script (see my post above) or a loop.
You can use merrillaldrich or Adrian’s approach for a faster approach.
Else in the srambling approach via loop, you could define appropriate indexes on the temp tables.
Thanks and Regards,