SQL Server Performance

Update 1 field from related table - large tables - crazy query plan!

Discussion in 'SQL Server 2005 T-SQL Performance Tuning' started by cboling, Aug 8, 2007.

  1. cboling New Member

    I have two tables:
    Table 1
    Name: Account
    Rows: 40 million
    Clustered Index: AccountID
    Non-Clustered Index: (none)

    Table 2
    Name: Cider
    Rows: 1 billion
    Clustered Index: [compound]
    Non-Clustered Index: AccountID

    Common field: AccountID
    Account (1) <--> Cider (0-Many) (some cider records have NULL AccountID)
    I just recently created the Cider table for reporting purposes. It's a beast, composed of data from several tables. Took over 20 hours to build. That turned out to be the easy part...
    I added a single field to Cider. I want to populate it from Account.
    The obvious query to do this was:
    update Cider set Cider.EntryDate = Account.EntryDate from Cider inner join Account on Cider.AccountID=Account.AccountID
    or, even (I tried this later; the query plan is about the same):
    update Cider set Cider.EntryDate = (select Account.EntryDate from Account where Account.AccountID = Cider.AccountID)
    or similar.
    How would *I* do this? Well, I'd go down the Cider table row by row, and for each row, look up the Account record, grab the field, and update it.
    That's not how SQL Server would do it, though. The query plan looks insane, with an estimated cost of 1.2 million: it joins the two tables, does two different sorts on the whole mess, and THEN applies the results back to the Cider table.
    I tried running it. After 45 hours (around the 25-hour mark the disk activity suggested it had reached the update stage) I finally canceled it, but then it got stuck in an endless loop: it would spend a full day rolling the transaction back and act like everything was okay, but if you restarted SQL Server (e.g. for a reboot) it would start the rollback all over again, as if it had never truncated the log. I finally ended up restoring the database from backup (a whole adventure in itself -- gunzip doesn't handle files over 100 GB very well under Windows...)
    Anyway, so I started looking at possible ways of hinting and coaxing it into doing something smarter. After looking closer at the plan, I *finally* realized that the Account.EntryDate field is indexed, so it was trying to use that index. I tested with a non-indexed field, and MAGIC! It chose a clustered scan on the Account table, eliminated a lot of the extra work, and the cost dropped to 66,000. A bit more reasonable, anyway. I also tried specifying the hint INDEX(1) on Account; it obeyed the hint, but still insisted on being "stupid" and ordering things wrong for the update, and went back to the 1.2 million cost.
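    For what it's worth, the row-by-row plan I described can also be requested directly with join and order hints rather than index hints -- a rough sketch, not something I've proven out on tables this size:

    ```sql
    -- Force a nested-loops join driven by Cider, serially:
    -- for each Cider row, seek into Account by AccountID.
    UPDATE c
    SET c.EntryDate = a.EntryDate
    FROM Cider AS c
    INNER LOOP JOIN Account AS a
        ON c.AccountID = a.AccountID
    OPTION (FORCE ORDER, MAXDOP 1);
    ```

    FORCE ORDER keeps Cider as the outer table and MAXDOP 1 suppresses the parallel plans; whether the optimizer still inserts the extra sorts for the update is another question.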
    And, of course, if I open it up to parallelism, then it'll try to do all sorts of funky stuff, none of which looks likely to make it run faster....

    This should be a dirt simple operation. I'm tempted to do it the "hard" way, i.e. create a cursor on the Cider table, look up each related record, etc. That can't be the best way to do it, but I can't seem to come up with any way to make SQL Server do it without elaborately shooting itself in the foot. Efficiency doesn't matter so much when it's 1,000 rows, but at 1,000,000,000 it becomes rather critical....
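    The "hard" way I'm describing would look roughly like this (assuming, hypothetically, that AccountID is an int) -- slow, but at least predictable:

    ```sql
    DECLARE @id int;

    -- Updatable cursor over Cider; for each row, look up the
    -- matching Account record and copy its EntryDate across.
    DECLARE cur CURSOR LOCAL FOR
        SELECT AccountID FROM Cider
        FOR UPDATE OF EntryDate;

    OPEN cur;
    FETCH NEXT FROM cur INTO @id;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE Cider
        SET EntryDate = (SELECT a.EntryDate FROM Account AS a
                         WHERE a.AccountID = @id)
        WHERE CURRENT OF cur;

        FETCH NEXT FROM cur INTO @id;
    END

    CLOSE cur;
    DEALLOCATE cur;
    ```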
    Any ideas? This one really has me scratching my head.

  2. Adriaan New Member

    You do realize that you're creating redundant data? You might as well join the Account table to your reporting table based on AccountID, and read that date from Account.
    Once you have a huge table, you don't want to update all rows in one go. A better approach is to update in batches, e.g. with SET ROWCOUNT:

    SET ROWCOUNT 10000

    WHILE 1 = 1
    BEGIN
        UPDATE Cider SET EntryDate = Account.EntryDate
        FROM Cider INNER JOIN Account ON Cider.AccountID = Account.AccountID
        WHERE Cider.EntryDate IS NULL AND Account.EntryDate IS NOT NULL

        IF @@ROWCOUNT = 0 BREAK
    END

    SET ROWCOUNT 0

    The WHERE clause makes each pass pick up only rows that haven't been updated yet, and the Account.EntryDate IS NOT NULL filter keeps accounts with a NULL date from making the loop run forever.

    If this is not a live database, you might also switch to a different recovery model to reduce the impact that the transaction log has on any large data manipulation.
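    Concretely, something along these lines (the database name CiderDB is a placeholder). Note that the UPDATE itself is still fully logged; what SIMPLE recovery buys you is that the log can truncate at checkpoints between batches instead of growing for the whole run:

    ```sql
    -- Hypothetical database name; adjust to suit.
    ALTER DATABASE CiderDB SET RECOVERY SIMPLE;

    -- ... run the batched update here ...

    ALTER DATABASE CiderDB SET RECOVERY FULL;
    -- Take a full backup afterward to restart the log backup chain.
    ```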
  3. thomas New Member

    Updating, as you're finding out, is a hugely expensive operation. The way to have done this would have been to have had the column there in the first place and populated it during the initial 20-hour load (hey, isn't hindsight a great thing!?).
    You could try batching the update up into more manageable chunks using a WHILE loop and see if you can make progress that way.
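    One way to drive such a WHILE loop is by ranges of the key rather than by row count -- a rough sketch, assuming (hypothetically) that AccountID is an int, stepping through 100,000 accounts per pass:

    ```sql
    DECLARE @lo int, @step int;
    SELECT @lo = MIN(AccountID) FROM Account;
    SET @step = 100000;

    WHILE @lo IS NOT NULL
    BEGIN
        -- Each pass touches only the Cider rows for one slice of accounts.
        UPDATE c
        SET c.EntryDate = a.EntryDate
        FROM Cider AS c
        INNER JOIN Account AS a ON c.AccountID = a.AccountID
        WHERE a.AccountID >= @lo AND a.AccountID < @lo + @step;

        -- Advance to the next populated range of AccountIDs.
        SELECT @lo = MIN(AccountID) FROM Account
        WHERE AccountID >= @lo + @step;
    END
    ```

    Range-driven batches keep each transaction small without relying on a NULL check to find unprocessed rows.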
