Know your Data with Data Profiling
According to the “Data Quality and the Bottom Line:
Achieving Business Success through a Commitment to High Quality Data.” The Data
Warehousing Institute. Report Series 2002, data quality problems cost U.S. businesses more than $600 billion per year.
Why such a cost? If your data is not up to quality,
your KPIs and data mining predictions will be incorrect and could lead poor business
Thus, you need to profile your data to highlight if your data does not meet your profile data
and you can
then either reject, update or flag the data.
Data Profile Techniques
The most common data profiling techniques in the industry are currently:
- Distinct lengths of string values in a column and the
percentage of rows in the table that each length represents. In some cases data length should be a defined value. For example, zip codes, telephone
numbers has defined lengths.
- Null value percentage. In some cases, you are allowed to
have null values but there should be a limit. For example, it is not always
an available telephone number, but a high percentage of null values for this
data type may indicate poor data quality.
- Percentage of regular expressions. Telephone numbers are the good
candidates for this technique. This will identify what percentages of
values you have in different formats. In case of the telephone numbers, you can
get formats like, (999)999-9999 or 999999999 or 999-999-999 or -999-9999.
- Distinct values. This technique will identify
your data values pattern. For example, one bank branch can see from which part
of the world their customers are coming from. If data sets are varying a lot by
region they flag that a data set for future intervention.
This information can also be
used to create indexes especially a filtered index depending on the
values of the data.
- Statistical values.This is one of the common if not the most common in profiling technique. This technique will have Minimum,
maximum, average, and standard deviation for numeric columns and minimum and
maximum for date/time. From this you can analyze what
range your age column should be and what should be your birth date range.
- Candidate Key Columns. In many legacy databases, there are
cases where you don’t have any key columns. You can use this technique to find what is the candidate key column out of
the existing columns.
- Dependency Values Columns. There are instances where you can
derive one column from another column. For example, bonus and salary are
Using SQL Server Integration
Prior to SQL Server 2008, there were
no tailor made profiler options in SSIS. Most BI professional used their own
scripts or third party tools. But with SQL Server 2008 you have the control
flow task named Data Profiling Task.
After dragging and droping the control to
the control flow, let us see what features available with this are.
First double click the control and
you will see following screen.
Here, you need to define the destination file where your
profiler results will be written. There is no compulsory requirement for the
file extension to be xml but since the output is in xml format I have set this to
In the above window, click the Quick Profile button and you will be taken to
Here you have to select the connection which can only be an
ADO.NET connection. Then you can select either a table or a view and then
select all the profiles you would like to analyze.