SQL Server Performance

Need Help Randomizing Customer Data

Discussion in 'SQL Server 2005 T-SQL Performance Tuning' started by upstart, Jun 21, 2007.

  1. upstart New Member

    Hi,

    I have a quick question.

    I have a SQL Server database with customer data in 1 table. I want to use this 'live' data for a testing database, but need a way to randomize the customer info. I figured out ways to do it for SSN# and telephone #, as well as address, but I am stuck on the customer names themselves.

    The customer's names are split between two columns, first_name and last_name within the same table. All I really want to do is update all the customer records with random combinations of the existing first and last names in the database so our 'testers' do not have access to real people names and their addresses/ssn/etc.

    Anyone know of a way to do this? So far, I have all the data imported into a temp table and was able to change randomize everything but the names.

    Thanks in advance
  2. merrillaldrich New Member

    Here's a starting point:


    INSERT INTO customers ( firstName, lastName )
    SELECT TOP 100000 first.name, last.name
    FROM ( SELECT TOP 1000 [name] FROM sampdatafnames ORDER BY NEWID() ) [first]
    CROSS JOIN ( SELECT TOP 1000 [name] FROM sampdatalnames ORDER BY NEWID() ) [last]
    ORDER BY NEWID()

    I have used this to make random sample customers from a table of first names and a table of last names. You could make the source tables using something like


    select distinct firstname into sampdatafnames from customer

  3. merrillaldrich New Member

    Or this might be even closer (convert from select to update) -- warning, it might be slow on a big table because of the cross join:


    declare @maxid int
    set @maxid = (select max(customerid) from customers)
    select * from customers
    inner join (
    SELECT top( @maxid )first.firstname, last.lastname,
    ROW_NUMBER() over ( order by newid() ) as customerid
    FROM ( SELECT firstname FROM customers ) [first]
    CROSS JOIN ( SELECT lastname FROM customers ) [last]
    ) randNames on customers.customerid = randnames.customerid

  4. bhushank21 New Member

    Here is another way of purely randomizing the last names.
    Each time this script is run it will always yield random last names, thus making it unpredictable for testers.

    DECLARE @LastNameList TABLE
    (
    ListId INT IDENTITY(1,1),
    CustomerId INT,
    LastName VARCHAR(10)
    )

    DECLARE @MaxListId INT
    DECLARE @counter1 INT


    INSERT INTO @LastNameList
    SELECT CustomerId, LastName
    FROM Customer

    SELECT @MaxListId = MAX(ListId) FROM @LastNameList

    SET @counter1 = 1

    DECLARE @counter2 INT
    SET @counter2 = 0

    WHILE @counter2 in (0,1,2) OR @counter2 >= @MaxListId
    BEGIN
    SET @counter2 = RAND()*10
    END

    DECLARE @NewLastName VARCHAR(10)
    DECLARE @CurrentCustomerId INT

    WHILE @counter1 <= @MaxListId
    BEGIN

    SELECT @CurrentCustomerId = CustomerId
    FROM @LastNameList
    WHERE ListId = @counter1

    SELECT @NewLastName = LastName
    FROM @LastNameList
    WHERE ListId = @counter2

    SET @counter1 = @counter1 + 1

    IF @counter2 >= @MaxListId
    BEGIN
    SET @counter2 = 0
    END
    SET @counter2 = @counter2 + 1

    UPDATE Customer
    SET LastName = @NewLastName
    WHERE
    CustomerId = @CurrentCustomerId
    END

    Let me know if you have queries.


    Thanks and Regards,
    Bhushan
    http://www.metacafe.com/watch/365213/wonder_what_you_are
  5. Adriaan New Member

    Replace with NEWID() - this returns a GUID like 'D2D9B2CD-8246-4E1B-85BA-1348BF23F264'.

    Each call to NEWID() will return a completely different GUID. If you include NEWID() in any query, the GUID will be different on each row.
  6. upstart New Member

    Hey, thanks for the help people. I was finally able to get my script to work somewhat, on small record sizes of 0-30,000. But the problem is, my database table that I am working on has 170,000 records and the script just keeps running. I am not really sure what to do or how to get it to finish without going on forever. Can anyone see anything sticking out at them that would make this run forever?<br /><br />Ill post my code below. Before this part of the code executes, I run a statement to select all the records into my temp table. Then I run this part to randomize the last names:<br /><br />CREATE TABLE #LastNameList <br />(<br />ListId INT IDENTITY(1,1),<br />patient_id char(<img src='/community/emoticons/emotion-11.gif' alt='8)' />,<br />last_name VARCHAR(20)<br />)<br /><br />DECLARE @MaxListId INT<br />DECLARE @counter1 INT<br />DECLARE @addy INT<br />DECLARE @mainstreet CHAR<br /><br /><br /><br />INSERT INTO #LastNameList <br />SELECT patient_id, last_name<br />FROM #temppat1<br /><br />SET @counter1 = 1<br />SET @addy = 100<br />SET @mainstreet = 'Main Street'<br /><br />SELECT @MaxListId = MAX(ListId) FROM #LastNameList <br /><br />DECLARE @counter2 INT<br />SET @counter2 = 0<br /><br />WHILE @counter2 in (0,1,2) OR @counter2 &gt;= @MaxListId <br />BEGIN<br />SET @counter2 = RAND()*10<br />END<br /><br />DECLARE @NewLastName VARCHAR(10)<br />DECLARE @CurrentPatientId INT<br /><br />WHILE @counter1 &lt;= @MaxListId<br />BEGIN<br /><br />SELECT @CurrentPatientId = patient_id<br />FROM #LastNameList <br />WHERE ListId = @counter1<br /><br /><br /><br /><br />SELECT @NewLastName = last_name<br />FROM #LastNameList <br />WHERE ListId = @counter2<br /><br />SET @counter1 = @counter1 + 1<br />SET @addy = @addy + 1<br /><br />IF @counter2 &gt;= @MaxListId<br />BEGIN<br />SET @counter2 = 0<br />END <br />SET @counter2 = @counter2 + 1<br /><br />UPDATE #temppat1<br />SET last_name = @NewLastName,<br />address = rtrim(convert(char(20),@addy)) + ' ' + 'Main Street'<br />WHERE <br />patient_id= @CurrentPatientId<br />END
  7. Adriaan New Member

    Why not use this:

    UPDATE MyTable
    SET FullName = NEWID()

    or

    UPDATE MyTable
    SET FullName = LEFT(NEWID(), 20)

    ... using whatever number of characters you need for the column.
  8. merrillaldrich New Member

    A hunch why it's slow: you are doing an update per record instead of a single update for all records. Very often a single SQL statement is faster than a loop that processes one record at a time.

    What you are trying to do can be done in a single SQL statement, and doesn't need a long script (see my post above) or a loop.
  9. bhushank21 New Member

    You can use merrillaldrich or Adrian's approach for a faster approach.
    Else in the srambling approach via loop, you could define appropriate indexes on the temp tables.


    Thanks and Regards,
    Bhushan
    http://www.metacafe.com/watch/365213/wonder_what_you_are

Share This Page