This article is part 2 of a 4-part series that introduces SQL Server Integration Services (SSIS). It shows how to use the Data Flow Task in SSIS.
The Data Flow Task (DFT), using the SSIS pipeline engine, manages the flow of data from the data source adapters to the data destination adapters and lets the user perform the transformations needed to clean and modify the data.

Note:
The Data Flow Task (DFT) is not a separate component outside the Control Flow; it can only be placed inside a Control Flow. A separate heading/section is provided for this information to give it greater emphasis, as it is the most important task in SSIS. A DFT can include multiple data flows. If a task copies several sets of data, and the order in which the data is copied is not significant, it can be more convenient to include multiple data flows in a single DFT. In the first image below, the DFT has one data flow with two transformations before writing to the destination, whereas in the second image the DFT has two data flows: the first with two transformations and the second with three.
A transformation changes the data to a desired format. It performs modifications through a variety of operations, such as aggregation (e.g. averages or sums), merging (of multiple input data sets), distribution (to different outputs), data type conversion, or reference table lookups (using exact or fuzzy comparisons). Some of the built-in transformations are described below:
- Derived Column Transformation – creates new column values by applying expressions to transformation input columns. The result can be added as a new column or inserted into an existing column as a replacement value.
- Lookup Transformation – performs lookups by joining data in input columns with columns in a reference dataset. Typically used when working with transaction records and retrieving related values from a master (reference) data set.
- Union All Transformation – combines multiple inputs into a single output, the equivalent of a UNION ALL over the multiple result sets.
- Merge Transformation – combines two sorted datasets into a single sorted dataset; it is similar to the Union All transformation. Use the Union All transformation instead of the Merge transformation if the inputs are not sorted, the combined output does not need to be sorted, or the transformation has more than two inputs.
- Merge Join Transformation – provides an output that is generated by joining two sorted datasets using a FULL, LEFT, or INNER join.
- Conditional Split Transformation – routes data rows to different outputs depending on the content of the data. The implementation of the Conditional Split transformation is similar to a CASE decision structure in a programming language: the transformation evaluates expressions and, based on the results, directs each data row to the specified output. The transformation also provides a default output, so a row that matches no expression is directed to the default output.
- Multicast Transformation – distributes its input to one or more outputs. This transformation is similar to the Conditional Split transformation in that both direct an input to multiple outputs. The difference is that the Multicast transformation directs every row to every output, whereas the Conditional Split transformation directs each row to a single output.
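The routing difference between Conditional Split and Multicast can be sketched conceptually in Python. This is only an illustration of the semantics, not SSIS code; the row data, predicates, and output names below are invented for the example:

```python
def conditional_split(rows, cases, default_name="Default"):
    """Route each row to exactly one output: the first case whose
    predicate matches, or the default output if none match."""
    outputs = {name: [] for name, _ in cases}
    outputs[default_name] = []
    for row in rows:
        for name, predicate in cases:
            if predicate(row):
                outputs[name].append(row)
                break
        else:
            outputs[default_name].append(row)
    return outputs

def multicast(rows, output_names):
    """Copy every row to every output."""
    return {name: list(rows) for name in output_names}

orders = [{"amount": 500}, {"amount": 50}]
split = conditional_split(orders, [("Large", lambda r: r["amount"] >= 100)])
# each order lands in exactly one output: "Large" or "Default"
copies = multicast(orders, ["Archive", "Audit"])
# both outputs receive both orders
```

The `for`/`else` in `conditional_split` mirrors the CASE-like evaluation: the first matching expression wins, and only unmatched rows fall through to the default output.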
Several built-in transformations are available in SSIS Designer. If required, these transformations can be extended, and custom transformations can be written.
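To make the sorted-input requirement of the Merge Join transformation described above concrete, here is a minimal Python sketch of an inner join over two datasets pre-sorted on the join key. This is a conceptual illustration only; the datasets and column names are invented, and in SSIS this work is done by the pipeline engine, not user code:

```python
def merge_join_inner(left, right, key):
    """Inner join of two datasets that are pre-sorted on `key`,
    mirroring the Merge Join transformation's sorted-input requirement."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # pair the current left row with the run of matching right rows
            j_start = j
            while j < len(right) and right[j][key] == lk:
                result.append({**left[i], **right[j]})
                j += 1
            i += 1
            if i < len(left) and left[i][key] == lk:
                j = j_start  # the next left row matches the same right run
    return result

customers = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]  # sorted on id
invoices = [{"id": 2, "qty": 5}, {"id": 3, "qty": 7}]         # sorted on id
joined = merge_join_inner(customers, invoices, "id")
```

Because both inputs are already ordered on the key, the join needs only a single forward pass over each dataset, which is why the Merge Join transformation insists on sorted inputs.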
Data Paths connect data flow components inside a DFT. Though a Data Path looks like the Precedence Constraint of the Control Flow, it is not the same: a Data Path shows the flow of data from one DFT component to another, whereas a Precedence Constraint shows the control flow, or ordinal relationship, between control flow tasks. A Data Path carries the metadata of the data flowing through it, for example column information such as name, type, and size. While debugging, a data viewer can be attached to a Data Path to see the data flowing through it.

Note:
The data viewer shows the data of one buffer at a time; clicking the next button shows the data from the next buffer. SSIS buffer management is discussed in the next article, “SQL Server Integration Services – An Inside View”.

Data Source Adapters
Data Source Adapters, or simply Source Adapters, facilitate the retrieval of data from various data sources. They use connection managers, which in turn use different data providers (for example flat file, OLE DB, or .NET Framework data providers) to connect to heterogeneous sources.

Data Destination Adapters
Data Destination Adapters, or simply Destination Adapters, load the output of the data flow into target stores such as flat files, databases, or in-memory ADO recordsets. Similar to Source Adapters, they use connection managers, which in turn use different data providers (for example flat file, OLE DB, or .NET Framework data providers) to connect to heterogeneous destinations. In the example below, the OLE DB destination is used to discuss the properties/settings in detail, as it is one of the most commonly used.
- Data Access Mode – allows definition of the method used to upload data to the destination. The fast load option uses the BULK INSERT statement instead of the INSERT statement.
- Keep Identity – if selected, the identity values of the source are preserved and uploaded into the destination table; otherwise the destination table creates its own identity values if there is an identity-type column.
- Keep Nulls – if selected, the NULL values of the source are preserved and uploaded into the destination table. If not selected, and a destination column has a default constraint defined while a NULL value comes from the source for that column, the default value is inserted into the destination table instead.
- Table Lock – if selected, a table-level lock (TABLOCK) is acquired on the table during the data upload. This is recommended if the table is not being used by any other application at the time of the data upload, as it removes the overhead of lock escalation.
- Check Constraints – if selected, the pipeline engine checks the incoming data against the table constraints and fails if any are violated. The recommendation is to uncheck this setting if constraint checking is not required, as this improves performance.
- Rows per batch – a blank text box indicates the default value of -1, in which case all incoming rows are treated as a single batch. A specified positive integer N directs the pipeline engine to break the incoming rows into multiple chunks of N rows, so the number of rows in a batch can be explicitly defined.
- Maximum insert commit size – the batch size that the OLE DB destination tries to commit during fast load operations; it operates on chunks of data as they are inserted into the destination. If a value is provided for this property, the destination commits rows in batches that are the smaller of the Maximum insert commit size and the number of rows remaining in the buffer currently being processed.
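The effect of a commit-size setting can be pictured as simple chunking: rows are committed in batches no larger than the configured maximum, with a final partial batch for whatever remains. A rough Python sketch of that behavior (illustrative only, not the OLE DB destination's internals; the row values are made up):

```python
def commit_batches(rows, max_commit_size):
    """Yield batches of at most `max_commit_size` rows, mimicking how a
    destination commits either the configured batch size or whatever
    rows remain, whichever is smaller."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == max_commit_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch of remaining rows

batches = list(commit_batches(range(10), 4))
# 10 rows with a commit size of 4 produce batches of 4, 4, and 2 rows
```

Smaller batches mean more frequent commits and less memory held per commit, which is why tuning this value matters for large loads.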
It is good practice to set values for the above two settings, as a very large batch, or leaving the default values, will negatively affect memory performance, especially in tempdb. It is recommended to test the scenario and specify optimum values for these settings depending on the environment and load. The next article in this series will examine the Import and Export Wizard in SQL Server Integration Services.