Know your Data with Data Profiling

Introduction

According to the “Data Quality and the Bottom Line: Achieving Business Success through a Commitment to High Quality Data.” The Data Warehousing Institute. Report Series 2002, data quality problems cost U.S. businesses more than $600 billion per year.

Why such a cost? If your data is not up to quality, your KPIs and data mining predictions will be incorrect and could lead poor business decisions.

Thus, you need to profile your data to highlight if your data does not meet your profile data and you can then either reject, update or flag the data.

Data Profile Techniques

The most common data profiling techniques in the industry are currently:

  • Distinct lengths of string values in a column and the percentage of rows in the table that each length represents. In some cases data length should be a defined value. For example, zip codes, telephone numbers has defined lengths.
  • Null value percentage. In some cases, you are allowed to have null values but there should be a limit. For example, it is not always an available telephone number, but a high percentage of null values for this data type may indicate poor data quality.
  • Percentage of regular expressions. Telephone numbers are the good candidates for this  technique. This will identify what percentages of values you have in different formats. In case of the telephone numbers, you can get formats like, (999)999-9999 or 999999999 or 999-999-999 or [999]-999-9999.
  • Distinct values. This technique will identify your data values pattern. For example, one bank branch can see from which part of the world their customers are coming from. If  data sets are varying a lot by region they flag that a data set for future intervention.

This information can also be used to create indexes especially a filtered index depending on the values of the data.

  • Statistical values.This is one of the common if not  the most common in profiling technique. This technique will have Minimum, maximum, average, and standard deviation for numeric columns and minimum and maximum for date/time. From this you can analyze what range your age column should be and what should be your birth date range.
  • Candidate Key Columns. In many legacy databases, there are cases where you don’t have any key columns. You can use this technique to find  what is the candidate key column out of the existing columns.
  • Dependency Values Columns. There are instances where you can derive one column from another column. For example, bonus and salary are related.

Using SQL Server Integration Services

Prior to SQL Server 2008, there were no tailor made profiler options in SSIS. Most BI professional used their own scripts or  third party tools. But with SQL Server 2008 you have the control flow task named  Data Profiling Task.

After dragging and droping the control to the control flow, let us see what features available with this are.

First double click the control and you will see following screen.

Here, you need to define the destination file where your profiler results will be written. There is no compulsory requirement for the file extension to be xml but since the output is in xml format I have set this  to xml.

In the above window, click the Quick Profile button and you will be taken to following dialog:

Here you have to select the connection which can only be an ADO.NET connection. Then you can select either a table or a view and then select all the profiles you would like to analyze.

Continues…

Leave a comment

Your email address will not be published.