SSIS New Features in SQL Server 2008 – Part 1
The previous articles in this series covered SSIS architecture, SSIS buffer management, the types of transformation and how they relate to execution trees, parallel processing, and considerations for running SSIS on 64-bit computers. Understanding the SSIS architecture and its inner workings helps in developing better-performing packages, troubleshooting SSIS issues, and optimisation of package performance.
SQL Server 2008 Integration Services provides a flexible, fast and scalable architecture that enables effective data integration in current business environments, through a number of new features and enhancements that improve performance and increase developer and administrator productivity. Improvements range from architectural changes that better support package development and execution, to new SSIS Designer tasks and components that extend SSIS capabilities and provide more effective data integration.
This article series will discuss the following new features in SSIS 2008:
- Lookup Enhancement
- Data Profiling Task
- Pipeline Memory Limiter
- Pipeline Performance – Multicast Transform and Threading in SQL Server 2008
- VSTA support for Script Task and Script Component
- Import & Export Wizard Enhancements
- Brief explanation of leveraging SQL Server 2008 features in SSIS viz. CDC and MERGE statement
Throughout this series, “SSIS 2008” refers to the version of SSIS that ships with SQL Server 2008, while “SSIS 2005” refers to the version that ships with SQL Server 2005.
Lookup Transformation Enhancement
Before covering the enhancements to the Lookup transformation, below is a brief explanation of what the Lookup transformation is and how it worked in SQL Server 2005. If the Lookup transformation in SSIS 2005 is already familiar, skip ahead to the next article, on the Lookup transformation in SSIS 2008.
Lookup Transformation in SSIS 2005
The Lookup transformation performs lookups by joining data in input columns with columns in a reference dataset/table, using an equi-join: each row in the transformation input must match at least one row from the reference dataset. The reference dataset can be an existing table or view, a new table, or the result of a SQL statement. If an input row has no matching entry in the reference dataset, no join occurs and no values are returned from the reference dataset. By default this is treated as an error and the transformation fails, unless it is configured to ignore errors or redirect error rows to the error output. If there are multiple matches in the reference table, the lookup returns only the first match, based on the lookup query.
- The join can be a composite join, where multiple columns in the transformation input are joined to multiple columns in the reference dataset.
- The lookups performed by the Lookup transformation are case sensitive.
The Lookup transformation uses an OLE DB connection manager to connect to the database that contains the data that is the source of the reference dataset.
The Lookup transformation has the following inputs and outputs:
- Input – the rows to be matched against the reference table.
- Match output – handles the rows in the transformation input that match at least one entry in the reference dataset.
- Error output – handles the rows in the transformation input that could not be matched with the reference table.
By default, the Lookup transformation treats rows without matching entries as errors.
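The routing logic described above can be sketched in Python. This is an illustrative model of the behaviour, not actual SSIS code; the `lookup` function, the `orders`/`products` sample data and the column names are all hypothetical. The reference dataset is modelled as a dictionary keyed on the join column, and each input row goes to either the match output or the error output.

```python
# Illustrative model (not SSIS internals) of the Lookup transformation's
# routing: input rows either match the reference dataset via a
# case-sensitive equi-join on a key column, or go to the error output.

def lookup(input_rows, reference_rows, key):
    # Build a lookup table keyed on the join column; on duplicate keys,
    # keep only the first row, mirroring "first match wins".
    ref = {}
    for row in reference_rows:
        ref.setdefault(row[key], row)

    match_output, error_output = [], []
    for row in input_rows:
        hit = ref.get(row[key])  # exact-match equi-join
        if hit is None:
            error_output.append(row)             # no match: error output
        else:
            match_output.append({**row, **hit})  # merge reference columns
    return match_output, error_output

orders = [{"product_id": 1, "qty": 5}, {"product_id": 99, "qty": 2}]
products = [{"product_id": 1, "name": "Widget"}]
matched, errors = lookup(orders, products, "product_id")
```

Here the unmatched order (`product_id` 99) lands on the error output, just as an unconfigured Lookup transformation would fail it or redirect it.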
To optimize performance, the Lookup transformation can cache the reference table in memory, depending on the caching mode being used. Part or all of the reference dataset can be cached, by specifying a caching SQL statement, to improve performance. The caching modes are discussed below:
Full Cache/Pre-Cache – By default, the lookup uses full cache mode. In this mode, the database is queried once, during the pre-execute phase of the data flow, and the entire reference set is pulled into memory; afterwards, SQL is not used at all to process this data. This approach uses a lot of memory, and while SQL indexes are no longer needed, having enough RAM to hold all the reference data is crucial. It also adds start-up time to the data flow, because the entire cache is loaded before any rows are read from the data flow source(s). The trade-off is that lookup operations are very fast during execution. Note that the lookup will not swap memory out to disk, so the data flow will fail if memory runs out.
When to use this cache mode:
- When accessing a large portion of the reference dataset
- When a small reference dataset is being used
- When the database is remote or under heavy load, and you want to reduce the number of queries sent to the server
Keys to using this cache mode:
- Ensure that there is enough memory to fit the cache
- Ensure that there’s no need to pick up any changes made to the reference dataset
- Since the lookup query is executed before the data flow begins, any changes made to the reference table during the data flow execution will not be reflected in the cache
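The full-cache pattern can be sketched as follows. This is a hedged illustration of the behaviour using Python and an in-memory SQLite database standing in for the reference source; the table and column names are invented for the example. The key point is that the database is queried exactly once, before any rows flow, and every lookup afterwards is served from memory.

```python
# Sketch of full-cache behaviour (illustrative, not SSIS internals):
# the reference dataset is queried once, before any input rows flow,
# and every subsequent lookup is served purely from memory.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])

# Pre-execute phase: pull the entire reference set into memory once.
cache = {pid: name for pid, name in conn.execute("SELECT id, name FROM product")}
conn.close()  # the database is not needed again during execution

# Execution phase: lookups never touch the database.
names = [cache.get(pid) for pid in (1, 2, 1, 3)]  # key 3 has no match
```

Closing the connection before processing a single row mirrors why full cache cannot see changes made to the reference table while the data flow runs.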
Partial Cache – In this mode, the lookup cache starts off empty at the beginning of the data flow. When a new row comes in, the lookup transform checks its cache for matching values. If no match is found, it queries the database; if a match is found in the database, the values are cached so they can be used the next time a matching row comes in. Since no caching is done during the pre-execute phase, the start-up time with partial cache mode is lower than with a full cache. However, lookup operations are slower, because the database is queried more often, and a SQL index on the reference table is required for this mode to perform well.
When to use this cache mode:
- When processing a small number of rows and it’s not worth the time to populate the full cache
- When using a large reference dataset
- When the data flow is adding new rows to the reference table
- When you want to limit the size of the reference set by parameterizing the lookup query with values from the data flow
Keys to using this cache mode:
- Ensure that the cache size setting is large enough
- Use the Miss Cache appropriately (applicable in SQL Server 2008, discussed later)
- If the cache size isn’t large enough to hold all the rows, sort the input on the lookup index columns if possible, so that rows with the same key arrive together
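The partial-cache pattern can be sketched like this. Again, this is an illustrative model rather than SSIS internals, with SQLite standing in for the reference database and invented table names; the `db_hits` counter only exists to show how caching reduces round trips. The cache starts empty, a miss triggers a database query, and the result is cached for subsequent rows with the same key.

```python
# Sketch of partial-cache behaviour (illustrative, not SSIS internals):
# the cache starts empty, the database is queried on a cache miss, and
# the result is kept for subsequent rows with the same key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])

cache = {}
db_hits = 0  # counts how often the database is actually queried

def partial_cache_lookup(pid):
    global db_hits
    if pid in cache:                 # cache hit: no database round trip
        return cache[pid]
    db_hits += 1                     # cache miss: query the database
    row = conn.execute("SELECT name FROM product WHERE id = ?",
                       (pid,)).fetchone()
    cache[pid] = row[0] if row else None
    return cache[pid]

names = [partial_cache_lookup(pid) for pid in (1, 1, 2, 1)]
# Four lookups, but only two database queries (for keys 1 and 2).
```

This also shows why a repeating key distribution favours partial caching: the more often keys recur, the fewer queries reach the server.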
No Cache – As the name implies, in this mode the lookup transform does not maintain a lookup cache (with the exception of the most recent match, since that memory has already been allocated). In most situations this means the database will be hit for every row, and, as with partial cache mode, a SQL index on the reference table is needed for this to perform well.
When to use this cache mode:
- When processing a small number of rows
- When lookup key values rarely repeat, so a cache would provide little benefit
- When the reference table is changing (inserts, updates, deletes)
- When experiencing severe memory limitations
Keys to using this cache mode
- Ensure that the partial cache mode isn’t the better choice
- A Lookup transformation that has been configured to use partial or no caching will fail if a lookup operation matches columns that contain null values. If full caching is used, which supports lookup operations on null values, the lookup operation succeeds.
- With full caching, the lookup will not swap memory out to disk, so the data flow will fail if memory runs out. With partial caching, by contrast, once the cache fills up (reaches its size limit), the lookup transformation starts dropping the least recently seen rows from the cache to accommodate new rows.
- Integration Services and SQL Server differ in the way they compare strings: SQL Server string comparisons are typically case insensitive (depending on collation), whereas the SSIS Lookup component performs case-sensitive, binary comparisons. This mismatch between the database collation and the binary comparison used by the lookup can produce unexpected non-matches.
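The case-sensitivity pitfall above can be demonstrated with a small sketch. This is an analogy, not SSIS code: a Python dictionary membership test stands in for the lookup's binary comparison, and an uppercased comparison stands in for a case-insensitive collation.

```python
# Illustrative demo of the case-sensitivity pitfall: an in-memory,
# binary comparison (like the SSIS lookup cache) distinguishes
# "ABC" from "abc", whereas a case-insensitive database collation
# would treat them as equal.
reference = {"ABC": 1}

# How the lookup cache compares: exact bytes, so "abc" does not match.
binary_match = "abc" in reference

# How a case-insensitive collation effectively compares: fold case first.
collation_match = "abc".upper() in {k.upper() for k in reference}
```

A common remedy in SSIS is to apply `UPPER()` (or `LOWER()`) to both the input column and the key column in the reference query, so both sides are compared in a single case.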
There are several scenarios where the Lookup transformation is useful. Consider a sales table containing sales transactions for customers from all over the world. To extract sales records only for customers in a specific region (US, EMEA, ASIA PACIFIC, etc.), the sales table can be used as the input to the Lookup transformation and a region-specific list of customers as the reference table. The output will contain, as expected, sales transactions only for customers of that region. The example below demonstrates this in action.
Create a simple package with a data flow task that pulls order records from the source for the selected product only. The package will look something like this; here the input to the Lookup transformation is the order table, whereas the product table is used as the reference table:
To configure the Lookup transformation, right-click it and click Edit; the Lookup Transformation Editor appears, as shown below. It has three tabs. The first, “Reference Table”, is where the source of the reference table is specified, along with whether the data will come from a table/view or from a SQL query.
The middle tab, “Columns”, specifies the mapping of column(s) between the source table and the reference table that is used to match records. Here, output column(s) are selected from the reference table. The image below shows the column “Name” selected to be output from the Lookup transformation as a new column, “NewName”.
The last tab, “Advanced”, specifies whether to use the Partial-Cache or No-Cache feature (by default, “Enable memory restriction” is not checked, so full caching is used). It also specifies the cache size when running on a 32-bit machine and on a 64-bit machine (the default value is 5 MB).
In the final result of running the package, 542 rows flow from the source into the Lookup transformation as input; 124 rows match the reference table, and the remaining 418 rows, which could not be matched, are sent to the Lookup error output path.
The next article in this series will examine the changes to the Lookup Transformation in SSIS 2008.