SQL Server Integration Services (SSIS) is a Business Intelligence tool which can be used by database developers or administrators to perform Extract, Transform & Load (ETL) operations. In my previous article entitled Using Web Service and XML Task in SQL Server 2008 Integration Services I discussed how to use the Web Service Task and XML Task which are available in SQL Server 2008 Integration Services. In this article I will examine the use of the Data Profiling Task and FTP Task.
Data Preparation Tasks in SSIS
The data preparation tasks within SSIS can be used to retrieve data or to validate the quality of data. The different types of data preparation tasks which are available in SSIS are the File System Task, Web Service Task, XML Task, Data Profiling Task and FTP Task. If you are new to SQL Server Integration Services I would recommend you to start with my previous article entitled SQL Server 2008 Integration Services Tasks to understand different types of Integration Services Tasks which are available in SQL Server 2008 Integration Services.
Overview of the Data Profiling Task
The Data Profiling Task is a new task which was introduced with SQL Server 2008 Integration Services. This task can be used by SSIS developers to easily and effectively profile through source system to understand the quality of data before it can be finally loaded to a data warehouse using the ETL process. There are eight inbuilt data profiles available within Data Profiling Task. Among the eight profiles available, five can be used to analyze individual columns and the remaining three can be used to analyze multiple columns or relationships existing between columns and tables. The Data Profiling Task internally uses the ADO.Net connection to connect with SQL Server. The Data Profiling Task returns an XML output which can either be saved as an XML file or to an SSIS variable. If you are storing the returning XML output to an XML file you can view the XML file at a later stage using Data Profiler Viewer utility which is available at “C:Program FilesMicrosoft SQL Server100DTSBinnDataProfileViewer.exe”.
Below are the five profiles which can be used to analyze individual columns.
- Column Length Distribution
- Column Pattern
- Column NULL Ratio
- Column Statistics
- Column Value Distribution
Below are the three profiles which can be used to analysis multiple columns or relationships between columns and tables.
- Candidate Key
- Functional Dependency
- Value Inclusion
Worked example using the Data Profiling Task
We will run through an example using “Column NULL Ratio” to profile the dbo.DimCustomer table of AdventureWorksDW2008 database. Using Column Null Ratio you can identify the percentage of Null Values within selected columns.
1. Create a new SQL Server Integration Services Project and rename the default package DataProfilingTaskExample.dtsx
2. Double click the DataProfilingTaskExample.dtsx package to open it in design mode.
3. Create a new folder named DataProfiling within the C Drive (the XML output from the Data Profiling Task will be saved in this folder).
4. Right click within the Connection Managers window and select New File Connections…. This will open up File Connection Manager Editor and provide the details as shown in the snippet below and click OK to save the File Connection.
5. Drag and drop the Data Profiling Task from the toolbox to the Control Flow window and rename it “DimCustomer – Data Profiling Task” , then double click the Data Profiling Task to open the editor window as shown in the snippet below.
Within the Data Profiling Task Editor, click on General tab and provide the details as shown in the snippet above.
6. Click the Quick Profile… button which will open the “Single Table Quick Profile Form” dialog where you can quickly configure the Data Profiling Task. Create an ADO.Net Connection and then choose the table or view against which you want to use “Column Null Ration Profile”. Finally, click OK to save the changes.
7. Select the Profile Request Page on the left of the panel to view detailed information related to the Column Null Ration Profile, in this screen you can even add any new Profiles or make changes to the default settings for the current profile. Click OK to save the changes to the Data Profiling Task.
8. Finally, to execute the package, right click the DataProfilingTaskExample.dtsx package from the solution explorer and select the Execute Package option from the drop down list. Once the package has successfully executed you will be able to see the below screen within the control flow designer.
9. Once the package has successfully executed you will be able to see theDimCustomerDataProfilingResults.XML files within the C:DataProfiling folder. This file can be viewed using the Data Profiler Viewer utility which is available at “C:Program FilesMicrosoft SQL Server100DTSBinnDataProfileViewer.exe”