SQL Server Performance

SSIS Package performance tuning - 5 tables between 5000 and 15 million rows

Discussion in 'SQL Server 2005 Integration Services' started by SQL_Guess, Feb 28, 2008.

  1. SQL_Guess New Member

    The challenge: I have to extract and convert data between 2 SQL Server systems - only 4 tables on the source system, 8 tables on the target system. Source tables have between 5,000 and 16,000,000 rows. For most of the tables (for example Customer, which goes into 4 target tables), there will be 1 row in the target tables for each row in the mapped source table - so my 13.5M customer rows will end up as around 40M rows across the 4 target tables. So far, so good. But this is a 24x7 online retail web site, and to get the data across as a clean process, we need the smallest possible duration.

    I have progressed on the customer migration, and am testing in a test environment (2 x dual-core HT processors, 4 GB RAM) against 2.15 million rows. The live environment is likely to be 4 x dual-core with 8-16 GB RAM.

    I am trying to optimize the extract data flow, and have read the SSISperfTuning doc. I am now trying to put that into practice.
    I have a row size of approx 340 bytes, so based on that and my test volume of 2.15 million rows, I work out around 700 MB of RAM required to buffer the data. That is a factor of 7 greater than the 100 MB maximum buffer space for a data flow, which, it seems, means I should divide the base DefaultBufferMaxRows (10,000) by 7 and go down to around 1,400 rows?
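    For reference, the arithmetic behind those figures worked as a quick query (the 340-byte row size and 2.15 million rows come from the paragraph above; the 10,000-row figure is the stock DefaultBufferMaxRows default - these are estimates, not measurements):

        -- Back-of-envelope buffer arithmetic based on the figures above (estimates only)
        SELECT
            2150000 * 340 / 1048576.0   AS TotalDataMB,        -- ~697 MB if the whole test set sat in memory at once
            10000   * 340 / 1048576.0   AS BufferMBAtDefaults  -- ~3.2 MB held by one buffer at DefaultBufferMaxRows = 10000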

    I see a LOT of the following messages in the progress output when running with default settings:
    [DTS.Pipeline] Information: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 30 buffers were considered and 30 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.

    The design of the data flow at the moment is:


                                    +--> target table 1
    SOURCE SP ---- MULTICAST -------+--> target table 2
                                    +--> target table 3
                                    +--> target table 4

    Any thoughts on buffer tweaking, corrections to my assumptions, or other hints/techniques?
  2. martins New Member

    Hi,
    The best way to extract data is to stage it first, meaning that you insert the data as-is from your source into a staging table. You can then perform all transformation processes (cleaning data, looking up foreign keys, etc.) by stored procedure on the staging table before inserting into your target tables - see the sketch below.
    Considering that your target server is 24/7, doing it this way means your final insert/update into the target tables can be done in smaller batches to reduce impact (if necessary).
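    Just to illustrate the shape of that pattern (all server, schema, table and column names below are made up for the example - they are not from your system, and the linked-server syntax just stands in for whatever extract mechanism you use):

        -- 1) Land the source rows as-is in staging (no transformation yet)
        INSERT INTO Staging.Customer (CustomerID, Name, Email, CountryCode)
        SELECT CustomerID, Name, Email, CountryCode
        FROM   LIVESERVER.SourceDB.dbo.Customer;   -- in practice this is the SSIS extract

        -- 2) Transform in place on the staging table (cleaning, foreign-key lookups)
        UPDATE s
        SET    s.CountryKey = c.CountryKey
        FROM   Staging.Customer AS s
        JOIN   dbo.Country      AS c ON c.CountryCode = s.CountryCode;

        -- 3) Load the target table(s) from staging (can be batched - see the later posts)
        INSERT INTO dbo.Customer (CustomerID, Name, Email, CountryKey)
        SELECT CustomerID, Name, Email, CountryKey
        FROM   Staging.Customer;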
    Hope this helps.
  3. SQL_Guess New Member

    Hi Martins (and how are you doing?)...
    We are staging the data - I just didn't describe that above because the post was getting so long.
    The initial extract of the tables from source to staging, in testing, processes about 1 million rows a minute, with the 4 extracts running in parallel. On those transfers (straight bulk transfer, no transformations, to exactly the same schema), I am also seeing the pipeline buffer 'informationals' - the process still runs without error, but those messages imply to me that a tuning exercise there would be useful.
    Those tuning lessons would then also be applicable to the SSIS package doing the transform and load, since I am seeing the buffer pipeline messages there too. The SP I listed above is extracting data from the staging tables.
    "complete design"
    Package 1
    LIVE server table 1--------------------> conversion server table 1
    LIVE server table 2--------------------> conversion server table 2
    LIVE server table 3--------------------> conversion server table 3
    LIVE server table 4--------------------> conversion server table 4
    Package 2
    as described in the initial post (I've only listed the section I've worked on - there is more to Package 2 than just an SP --> multicast --> 4 tables, naturally).
  4. martins New Member

    No, it's going well man... very well [:D]
    Ok cool...the fact that you are staging the data is a good start. What I would propose then (with the limited knowledge I have of your system) is to add an identity column to your staging table if you do not already have one.
    You will then be able to easily add a while loop to your stored proc (as a last step, after all transformations) that inserts/updates a limited number of records at a time. You can even delete data out of the staging table as soon as it has been inserted/updated in the target.
    Doing it this way, inserting say 1,000 records at a time, will ensure that locks are not held for too long and do not stall other processes - see the sketch below.
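    A rough illustration of that loop, assuming an identity column Id on the staging table and a batch size of 1,000 (table and column names are placeholders):

        -- Batched load from staging into the target, driven by the identity column (sketch only)
        DECLARE @BatchSize int, @LastId int, @MaxId int;
        SET @BatchSize = 1000;
        SET @LastId    = 0;
        SELECT @MaxId = MAX(Id) FROM Staging.Customer;

        WHILE @LastId < @MaxId
        BEGIN
            -- Insert the next slice of staged rows into the target table
            INSERT INTO dbo.Customer (CustomerID, Name, Email, CountryKey)
            SELECT CustomerID, Name, Email, CountryKey
            FROM   Staging.Customer
            WHERE  Id >  @LastId
              AND  Id <= @LastId + @BatchSize;

            -- Optionally clear the batch out of staging once it is safely in the target
            -- DELETE FROM Staging.Customer WHERE Id > @LastId AND Id <= @LastId + @BatchSize;

            SET @LastId = @LastId + @BatchSize;
        END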
    Hope this helps...
  5. martins New Member

    Just one more suggestion...ensure that your staging tables are on the target server.
  6. SQL_Guess New Member

    I'll try and sanitize an image in the documentation and upload it - but at the moment, deadlines and managers are shouting :)
    Yeah - I'm very familiar with batching techniques (and the benefits in reducing contention on systems), but thanks for the thought.
    I probably didn't state this well, but the migration target is not a live system yet - it is the source system that is 24x7 - so based on that, my load will be a bulk load. The performance driver is that after I complete the bulk of the conversion process, the new encryption routines have to process RBAR (row by agonizing row) to create the new encrypted values - the widely varying estimates at the moment indicate 5-10 hours for that, and our target system is supposed to be up 12 hours after I start the extraction process, so I have a 2-hour window that may still shrink.
    The target system will be our 'new world' OLTP, and what I am doing is the bulk load of the customer base data (which should really have been done by the project months ago - I am part of the live DB team, but suddenly, 3 weeks to go-live, someone realized that they've only covered migrating changes to the old system - no one had done anything about getting the customer base across). Ah well, no point in complaining about that.
    I had envisaged doing the meat of the conversion on a standalone 3rd server, but moving the staging environment onto the target system does mean less network traffic - so that's a good idea. I will need to get my space utilization estimates done so I can include that and ensure the server has the capacity, but realistically I don't see that being a problem.
  7. martins New Member

    Well, it sounds like you're on top of things. We are currently loading 1GB of data into our warehouse daily, and it takes about 2hrs...so there is hope :)
    Good luck.
  8. SQL_Guess New Member

    Did you do any other configuration tuning on the buffers for the data flows?
    I've been playing with DefaultBufferMaxRows and DefaultBufferSize, trying to reduce the number of 'buffer manager' DTS pipeline information messages, but so far haven't been hugely successful.
    thanks for the responses - I appreciate it
  9. martins New Member

    We use SQL 2000, and I think the default buffer size in DTS is 1000 records. I must say that it has worked very well at the default, and if I remember correctly it only slowed down when upped to over 5000. That being said, I must also add that we load from flat files, but even with huge files (6 million rows) it has worked well.
    My advice would be to stick with the defaults if you haven't really seen a great improvement after playing...they usually perform well enough under normal circumstances.
    One other thing you can do to ensure good performance of the extract process is to make sure there are no indexes on the staging tables if you are going to transfer a lot of data. I usually have only the identity column as a clustered index, and will only build non-clustered indexes within the process if they are really needed - see the sketch below.
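    For illustration, that kind of staging layout might look like this (the Staging schema, table, columns and index names are all made up for the example):

        -- Staging table: identity clustered index only, nothing else during the bulk load
        CREATE TABLE Staging.Customer
        (
            Id          int IDENTITY(1,1) NOT NULL,
            CustomerID  int           NOT NULL,
            Name        nvarchar(100) NULL,
            Email       nvarchar(200) NULL,
            CountryCode char(2)       NULL,
            CountryKey  int           NULL,
            CONSTRAINT PK_Staging_Customer PRIMARY KEY CLUSTERED (Id)
        );

        -- Only if a later transformation really needs it, build a non-clustered index mid-process
        CREATE NONCLUSTERED INDEX IX_Staging_Customer_CountryCode
            ON Staging.Customer (CountryCode);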
    Hope this helps.
  10. martins New Member

    Sorry for giving you lots of bits and pieces, but every time I finish my response I think of something else...
    Another thing that would benefit you is to have a separate staging database (if you can). Your OLTP database probably has its recovery model set to "Full", which will be bad for your big transfers. Rather create a separate db for staging with recovery model "Simple" or "Bulk-Logged", which will also help with managing it differently - something like the sketch below.
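    As a minimal sketch (the database name is a placeholder, and file/size settings are left at defaults for brevity):

        -- Separate staging database so the big transfers are minimally logged
        CREATE DATABASE StagingDB;

        -- SIMPLE (or BULK_LOGGED) keeps the transaction log from ballooning during bulk loads
        ALTER DATABASE StagingDB SET RECOVERY SIMPLE;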
  11. SQL_Guess New Member

    No problem - each little bit or piece might be the one that helps reduce my problems.
    The staging database is already in simple mode, and is a separate database.
    The target database will be full, but I am considering, for the bulk load of base data, setting the target system to simple mode, removing any 'OLTP performance indexes', running the load, then adding the performance indexes back in, changing the recovery model to full, and doing a backup... roughly the sequence sketched below.
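    Roughly, that sequence might look like the following (database, table, index and backup path names are placeholders; the real index list would come from the actual schema):

        -- 1) Switch the target to SIMPLE for the duration of the bulk load
        ALTER DATABASE TargetDB SET RECOVERY SIMPLE;

        -- 2) Drop the OLTP performance indexes that would slow the load
        DROP INDEX IX_Customer_Email ON dbo.Customer;

        -- 3) Run the bulk load (the SSIS packages / batched inserts discussed above)

        -- 4) Recreate the performance indexes once the data is in
        CREATE NONCLUSTERED INDEX IX_Customer_Email ON dbo.Customer (Email);

        -- 5) Back to FULL recovery, then a full backup to restart the log chain
        ALTER DATABASE TargetDB SET RECOVERY FULL;
        BACKUP DATABASE TargetDB TO DISK = 'E:\Backups\TargetDB_post_load.bak';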
  12. martins New Member

    good idea
  13. atulmar New Member

    I have approximately the same configuration, and network speed is good (100 Mbps LAN, with the servers in the same domain/location).
    I have 12 tables (in most of the tables the data is just denormalized; some tables have more than 200 columns and the minimum number of columns is 14, so the row size varies between 100 bytes and 6+ KB).
    There are extract-into-staging and load-into-destination processes. Same case as here: the data to extract is smaller than the destination (after load). The space taken by these tables is around 30+ GB at least - I have not measured it exactly, but saw the used space in the database increase after a run.
    From the source I need to extract around 14+ million rows, which are inserted into multiple destination tables as 55+ million rows. This process takes around 4.5-6 hours to complete. In it I am using many lookups that are caching millions of records. In my extraction phase there are many source tables (20+) that are joined to extract the data, and a lot of transformation (done using sprocs in staging) to make it ready for the destination. I am using fast load everywhere with a 10K batch size.
    The only thing I have not used here is the Multicast transformation, and I don't get these buffer warnings. I load the tables separately even when the source is the same, and it runs faster than loading the tables from the same source with a multicast, because of the row-size difference between the tables.
  14. SQL_Guess New Member

    Well, it's been a few months since I posted this - we've gone live with this, and in the end, I had so many changes in the requirements it was scary (and yet, typical for this client).
    Eventually, we sourced from 4 transactionally replicated tables, which meant I was finally able to remove any impact on our source system, and had the control needed to get a point-in-time view (the 4 tables are transactionally consistent at source).
    In the final live migration, we had just over 40 million rows in the source system, and those ended up creating a little over 200 million rows in the target system. End to end, the processing (Extract, Transform and Load) with staging of the data took just under 3 hours, which was quite an achievement for us, and significantly within the time I was allocated.
    It was quite gratifying to see that the basic principles we discussed here applied quite well in SSIS in the end, and while I still think there was some room to optimize the buffer caching for some of the data flows, the final result was considered the most successful part of the project, which is nice :D
  15. satya Moderator

    Appreciate your feedback on the solution - that really helps other users having a similar issue.
    There are a few features within SSIS you can take advantage of, such as caching, with which the performance of the Data Flow can be optimized, provided the source data is stable.
  16. martins New Member

    Yeah, thanks for the feedback and well done on a very successful implementation [:)]
