Know your Data with Data Profiling
Below is the output for the column statistics profiler which only provides Minimum, Maximum, Mean or Average and Standard Deviations. From the user point view, they would prefer to have additional statistics such as Variance.
Below is the Column Value distribution profiler.
Pattern profiling is very important since it will identify what patterns you have and simple coding will not be enough to identify the patterns of your data.
Above is the pattern of the product number where there are three types of patterns. If there are different patterns in large percentages you could stop the data loading.
- You cannot have customized profiler techniques hence you have to live with what SSIS profiler ships with.
- You need to use SQL Server 2000 or a later version. In case you need to use other sources, you need to use SQL Server as a staging environment (where you insert to a SQL Server table temporary) and perform profiling on staging data.
- There is no direct way to verify your profile. For example, let’s say you want to stop data being inserted if null value percentage is more than 70% in a column. For this, you need to have some knowledge about XML and scripting which we will discuss in later article.
- You cannot profile data which are flowing from the data flow. You have to insert data to a temporary table and then get that data.
This article describes what the data profiling control task offers. We will next discuss how you can use this profiler output along with the script component. Also, since this control at a basic level, we will explore some other tools available in the market.
Also, if you are wish to have any example related to profiling which are not discussed in this article please feel free to send your thoughts to firstname.lastname@example.org