Know your Data with Data Profiling

Below is the output for the column statistics profiler which only provides Minimum, Maximum, Mean or Average and Standard Deviations.
From the user point view, they would prefer to have additional statistics such as Variance.

Below is the Column Value distribution profiler.

Pattern profiling is very important since it will identify
what patterns you have and simple coding will not be enough to identify the
patterns of your data.

Above is the pattern of the product number where there are
three types of patterns. If there are different patterns in large percentages you could stop the data loading.

Constraints

  • You cannot have customized profiler techniques hence you
    have to live with what SSIS profiler ships with.
  • You need to use SQL Server 2000 or a later version. In case
    you need to use other sources, you need to use SQL Server as a staging
    environment (where you insert to a SQL Server table temporary) and perform profiling on staging data.
  • There is no direct way to verify your profile. For
    example, let’s say you want to stop data being inserted if null value
    percentage is more than 70% in a column. For this, you need to have some knowledge about XML and scripting which we will discuss in later article.
  • You cannot profile data which are flowing from the data flow.
    You have to insert data to a temporary table and then get that data.

What next

This article describes what
the data profiling control task offers. We will next discuss how
you can use this profiler output along with the script component. Also, since this
control at a basic level, we will explore some other tools available in the
market. 

Also, if you are wish to have any
example related to profiling which are not discussed in this article please
feel free to send your thoughts to dineshasanka@dbfriend.net

]]>

Leave a comment

Your email address will not be published.