Data Cleansing with SSIS

Next is Advanced tab.


The Advanced tab contains the settings for controlling the fuzzy logic algorithms that assign groupings to text in the input stream. You can set the names of the three additional fields that will be added automatically to the output of this transformation. These fileds are named “_key_out”, “_key_in” and “_score” by default. As the rows flow through the pipeline, Fuzzy Grouping assigns an ID to each of them, which is the “_key_in”. When Fuzzy Grouping groups a collection of rows, it determines which row should be the representative. It then assigns all the “_key_out” columns of the rows in the group to the “_key_in” value of the representative. As a result, if you want to write only the representative rows to your output, you can filter the Fuzzy Grouping output through a conditional split transform that selects only rows where “_key_in” equals “_key_out”.

The token delimiters can also be set if, for example you don’t want the comparison process to break incoming string by . (dot) or spaces. The default for this setting is all common delimiters.

To visualize what is happening from a fuzzy lookup grouping, I have added two data grid views, one before and one after the fuzzy group.


We had three non-matching customers which are shown in the top data viewer grid.

The next data grid is the data grid is generated after the Fuzzy Group. You can see that “_key_in” is unique. The “_key_out” value for customer names Matthew Harris and Matthew Harri is same. This is our target to eliminate duplicates.

The next step is to use a conditional split to identify the unique records.


As we need to identify the unique records the above condition will be sufficient. However, if you need to log duplicate records you can insert them into a table using the Not equal condition.

With that we have completed data cleaning using SSIS, the following diagram is the entire package. Please note that I have not included DateSK lookup.


Conclusion

In the real-world, data is dirty and cleaning will take consume a lot of resource. Using SQL Server Integration Service’s new tools, fuzzy lookup and fuzzy grouping, you have the higher degree of control over the cleaning process.

]]>

Leave a comment

Your email address will not be published.