SQL Server Integration Services an Introduction – Part 1
SQL Server Integration Services (SSIS) is a platform for building high performance data integration and workflow solutions. It allows creation of packages or SSIS packages which are made up of tasks that can move data from source to destination and alter it if required.
SSIS is basically an ETL (Extraction, Transformation, and Load) tool whose main purpose is to do extraction, transformation and loading of data but it can be used for several other purposes for example, to automate maintenance of SQL Server databases, update multidimensional cube data etc as well.
SSIS is a component of SQL Server 2005/2008 and is the successor of DTS (Data Transformation Services) which had been in SQL Server 7.0/2000. From an end-user perspective DTS and SSIS may appear similar, however they are actually quite different. SSIS has been completely written from scratch and overcomes several limitations of DTS. Though the list of differences between DTS and SSIS is quite large, something of note is that the internal architecture of SSIS is completely different from DTS. It has segregated the Data Flow Engine from the Control Flow Engine or SSIS Runtime Engine and hence improves the performance by a significant amount. Additional details about the architecture and internals of SSIS will be covered in the next article series “SQL Server Integration Services – An inside View”.
Note: In this article, when referring to “SSIS 2008” it is the SSIS version that comes with SQL Server 2008 whereas “SSIS 2005” refers to the SSIS version that comes with SQL Server 2005.
Creating an SSIS Package
There are three different ways to create SSIS packages, each is described below.
- Import and Export Wizard – Though one of the simplest way to create an SSIS package, it has very limited capability. No kind of transformation can be defined using this method. Though with SSIS 2008, there is an option to include a Data Conversion Transformation if there is a mismatch in data type between source and destination. The wizard is primarily used for simple data transfer from source to destination. For further details, refer to “Import and Export” section later in this article.
- The SSIS Designer – The SSIS Designer is hosted inside the Business Intelligence Development Studio (BIDS) as part of an Integration Services project. It is a graphical tool that can be used to create and maintain Integration Services packages. It has a toolbox which contains the various items needed for Control Flow, Data Flow Task as well as tasks needed for maintenance plans. The number of tasks in SSIS is much larger than what was available in the DTS. For more details, refer to “SSIS Designer” section later in this article.
- SSIS API Programming – SSIS provides API object model, which can be used in a variety of programming languages to create SSIS package programmatically. For more details refer to “SSIS API Programming” section later in this article.
As described in the introduction, SSIS creates packages which are composed of tasks that can move data from source to destination, and if necessary transform it. Within SSIS package the workflow can be defined, the SSIS runtime engine ensures the tasks inside the package are executed according to the workflow. Following is a description of the different tasks/components/executables of a package.
A package is a collection of tasks which are executed in an orderly fashion by SSIS runtime engine. It is an XML file, which can be saved on SQL Server or on a file system. A package can be executed by SQL Server Agent Job, DTEXEC command (a command line utility bundled with SSIS to execute a package; another similar utility DTEXECUI, has a GUI), from BIDS environment or by calling one package by another package (achieves modular approach). You can use DTUTIL utility to move package from file system to SQL Server or vice versa. Alternatively the undocumented sp_dts_getpackage/sp_ssis_getpackage and sp_dts_putpackage/sp_ssis_putpackage stored procedures which reside in msdb system database can be used.
Handles the main workflow of the package and determines processing sequence within the package. It consists of containers, different kinds of work flow tasks and precedence constraints.
Control Flow Tasks
A task is an individual unit of work. SSIS provides several inbuilt control flow tasks which perform a variety of workflow actions. They provide functionality to the package in much the same way that a method does in programming language. All the inbuilt tasks are operational task except Data Flow Task. Though there are several dozen inbuilt tasks for use, if required they can be extended and custom tasks can be written using VB/C# etc.
Containers group a variety of package components (including other containers), affect their scope, sequence of execution and mutual interaction. They are used to create logical groups of tasks. There are four types of containers in SSIS listed below:
- Task Host Containers – Default container, every task falls into it.
- Sequence Containers – Defines a subset of the overall package control flow.
- For Loop Containers – Defines a repeating control flow in a package.
- ForEach Loop Containers – Loops for collection, enumerates through a collection for example it will be used when each record of a record-set needs to be processed.
Precedence constraints link the items in a package into a logical flow and specify the conditions upon which the items are executed. It provides an ordinal relationship between various items in the package; which helps manage the order the tasks will execute. It directs the order of task execution and defines links among containers and tasks; evaluates conditions that determine the sequence in which they are processed. More specifically, they provide transition from one task or container to another.
The condition can either be Constraint or Expression or both. The constraint can be Success (Green Line), Failure (Red Line) and Complete (Blue Line). The package in the image below shows Script Task 1 will be executed only if the Execute SQL Task completed successfully; Script Task 2 will be executed irrespective of whether the Execute SQL Task completed successfully or failed; Script Task 3 will be executed only if the Execute SQL Task failed.
Apart from the above discussed constraints, conditions can also be defined as an expression with precedence constraints that is evaluated at runtime; depending on its value the transition is decided. In the image below, After Task A, Task B will be executed if the value of X >= Z or Task C will be executed if the value of X < Z. Constraint and expression can also be combined in a single condition with either AND or OR operator.
The concept of a variable in SSIS is same as the variables in any other programming language. It provides temporary storage for parameters whose values can change from one package execution to another, accommodating package reusability. It is used to dynamically configure a package at runtime. For example, to execute the same T-SQL statement or a script against a different set of connections. Depending on the place where a variable has been defined, its scope varies. Variables can be declared at package, container, task or handlers level.
In SSIS, there are two types of variables – System (pre-defined) variables whose values are set by SSIS (ErrorCode, ErrorDescription, MachineName, PackageName, StartTime etc.) and cannot be changed; User variables, created as required at the time of package development, can be assigned a value of the corresponding type.
Note: An exception applies here, there is a system variable called “Propagate” whose value can be changed from its default value TRUE to FALSE to stop event bubbling from a task to its parent and grand-parent. The next article in this series “SQL Server Integration Services – Features and Properties” discusses the Propagate variable in more detail.
A connection manager is a logical representation of a connection. SSIS provides different types of connection managers which use different data providers and enable packages to connect to a variety of data sources and servers.
A package can have multiple instances of connection managers and one connection manager can be used by multiple tasks in the package.
Some examples of connection managers are:
- ADO Connection Manager – Connects to ActiveX Data Objects (ADO) objects.
- ADO.NET Connection Manager – Connects to a data source by using a .NET provider.
- OLEDB Connection Manager – Connects to a data source by using an OLE DB provider.
- Flat File Connection Manager – Connect to data in a single flat file.
- FTP Connection Manager – Connect to an FTP server.
By default, every task that uses a connection manager during execution opens a connection, performs the operation and closes the connection before moving to the next task. Each task has its own connection. Consider a scenario where there are three tasks in package and they use the same connection manager; during runtime there would be three connections open and closed at the source. However, ideally all three tasks would be executed in a single connection; so that only one connection is open to the source irrespective of how many tasks use the connection. The RetainSameConnection property on the OLE DB Connection Manager enables multiple tasks to run in a single connection if the RetainSameConnection property equals TRUE.
The next article in this series will look at Data Flow in SQL Server Integration Services.