Know your Data with Data Profiling

Introduction

According to the “Data Quality and the Bottom Line:
Achieving Business Success through a Commitment to High Quality Data.” The Data
Warehousing Institute. Report Series 2002, data quality problems cost U.S. businesses more than $600 billion per year.

Why such a cost? If your data is not up to quality,
your KPIs and data mining predictions will be incorrect and could lead poor business
decisions.

Thus, you need to profile your data to highlight if your data does not meet your profile data
and you can
then either reject, update or flag the data.

Data Profile Techniques

The most common data profiling techniques in the industry are currently:

  • Distinct lengths of string values in a column and the
    percentage of rows in the table that each length represents. In some cases data length should be a defined value. For example, zip codes, telephone
    numbers has defined lengths.
  • Null value percentage. In some cases, you are allowed to
    have null values but there should be a limit. For example, it is not always
    an available telephone number, but a high percentage of null values for this
    data type may indicate poor data quality.
  • Percentage of regular expressions. Telephone numbers are the good
    candidates for this  technique. This will identify what percentages of
    values you have in different formats. In case of the telephone numbers, you can
    get formats like, (999)999-9999 or 999999999 or 999-999-999 or [999]-999-9999.
  • Distinct values. This technique will identify
    your data values pattern. For example, one bank branch can see from which part
    of the world their customers are coming from. If  data sets are varying a lot by
    region they flag that a data set for future intervention.

This information can also be
used to create indexes especially a filtered index depending on the
values of the data.

  • Statistical values.This is one of the common if not  the most common in profiling technique. This technique will have Minimum,
    maximum, average, and standard deviation for numeric columns and minimum and
    maximum for date/time. From this you can analyze what
    range your age column should be and what should be your birth date range.
  • Candidate Key Columns. In many legacy databases, there are
    cases where you don’t have any key columns. You can use this technique to find  what is the candidate key column out of
    the existing columns.
  • Dependency Values Columns. There are instances where you can
    derive one column from another column. For example, bonus and salary are
    related.

Using SQL Server Integration
Services

Prior to SQL Server 2008, there were
no tailor made profiler options in SSIS. Most BI professional used their own
scripts or  third party tools. But with SQL Server 2008 you have the control
flow task named  Data Profiling Task.

After dragging and droping the control to
the control flow, let us see what features available with this are.

First double click the control and
you will see following screen.

Here, you need to define the destination file where your
profiler results will be written. There is no compulsory requirement for the
file extension to be xml but since the output is in xml format I have set this  to
xml.

In the above window, click the Quick Profile button and you will be taken to
following dialog:

Here you have to select the connection which can only be an
ADO.NET connection. Then you can select either a table or a view and then
select all the profiles you would like to analyze.

Continues…

Pages: 1 2 3




Array

No comments yet... Be the first to leave a reply!

Software Reviews | Book Reviews | FAQs | Tips | Articles | Performance Tuning | Audit | BI | Clustering | Developer | Reporting | DBA | ASP.NET Ado | Views tips | | Developer FAQs | Replication Tips | OS Tips | Misc Tips | Index Tuning Tips | Hints Tips | High Availability Tips | Hardware Tips | ETL Tips | Components Tips | Configuration Tips | App Dev Tips | OLAP Tips | Admin Tips | Software Reviews | Error | Clustering FAQs | Performance Tuning FAQs | DBA FAQs |