SSIS New Features in SQL Server 2008 – Part 3

One common approach to defining data quality is data profiling. A data profile is a collection of aggregate statistics about data that may include the number of rows in the Customer table, the number of distinct values in the State column, the number of null or missing values in the Zip column, the distribution of values in the City column, the strength of the functional dependency of the State column on the Zip column—that is, the state should always be the same for a given zip value etc.

SQL Server 2008 SSIS introduces the Data Profiling task in its toolbox, providing data profiling functionality inside the process of extracting, transforming, and loading data. By using the Data Profiling task, analysis of source data can be conducted more effectively, better understanding of source data and prevention of data quality problems before introduced into the data warehouse. Results of this analysis generate XML reports that can be saved to an SSIS variable or a file that can be viewed using the Data Profile Viewer tool. Data quality assessments can be performed on an ad hoc basis, the data quality process can also be automated by integrating quality assessment into the ETL process itself.

One important caveat about this task is that it works only with SQL Server sources (2000 or higher version).

The Data Profiling Task can compute eight different data profiles. Five of these profiles analyse individual columns, and the remaining three analyse multiple columns or relationships between columns and tables; for more details about each profile refer to BOL/MSDN. This example will discuss only two profiles as shown below:

Column NULL Ratio Profile – Reports the percentage of null values in the selected column. This profile helps identify problems in data, such as an unexpectedly high ratio of null values in a column. For example, profile a Zip Code/Postal Code column and discover an unacceptably high percentage of missing codes.

Column Pattern Profile – Reports a set of regular expressions that cover the specified percentage of values in a string column. This profile helps identify problems in data, such as string that are not valid. This profile can also suggest regular expressions that can be used in the future to validate new values. For example, a pattern profile of a United States Zip Code column might produce the regular expressions: d{5}-d{4}, d{5}, and d{9}. If other regular expressions are seen, data likely contains values that are not valid or in an incorrect format.

Drag a Data Profiling Task from the toolbox and drop it on Control Flow. Right click on the task and select edit, a Data Profiling Task Editor will come up (as shown below), it has three pages, on the General page specify the profile result destination type i.e. either File connection Manager or SSIS Variable.


Quick Profile Button configures the task quickly. A Quick Profile profiles a table or view by using all the default profiles and default settings. In the image below, only two profiles have been selected, as discussed above, out of the eight profiles.

Click on Profile Requests Page to see all the added profiles and set its properties, can also add another profile from this page, in the image below DimCustomer table of AdventureWorksDW database has been selected to profile. 


Run the package, as a file is the selected destination type, the result will be saved in an XML file on the file system.

Data Profile Viewer
Data Profile Viewer is stand-alone tool which is used to view and analyse the result of profiling. It uses multiple panes to display the profiles requested and the computed results, with optional details and drilldown capability as shown below:


Select the different profiles from the left side tree view and corresponding result will be displayed on the right side. For example the above image shows “Column Null Ratio Profiles” selected and corresponding result is displayed on the right side. In the image below “Column Pattern Profiles” is selected and on the right top side, the EmailAddress column is selected, in the bottom its result is being displayed.

To open the Data Profile Viewer, in the folder, C:Program FilesMicrosoft SQL Server100DTSBinn, run DataProfileViewer.exe.

The next article in this series will explore SSIS Pipeline Performance.


No comments yet... Be the first to leave a reply!