Site sponsored by: Idera The gold standard of SQL Server performance monitoring & diagnostics.
SQL Server Performance

  • Home
  • Articles
  • Forums
  • Tips
  • Quiz
  • FAQ's
  • Blogs
  • Software
  • Books
  • About Us
RSS Feeds
Sign in | Join


Article Topics

All Articles
Performance Tuning
Audit
Business Intelligence
Clustering
Reporting Services
Developer
General DBA
ASP.NET / ADO.NET

Write for Us

Share your SQL Server knowledge with others and raise your profile in the community More...
Latest Articles

System Data Collection Reports
Recover Data Using Database Snapshots
Analyze and Fix Index Fragmentation in SQL Server 2008
Powerful Geographical Visualisations made easy with SQL 2008 Spatial (Part 2) ...

More     
 
Latest FAQ's

How to alter a User Defined Data Type?
How to unzip a File in SSIS?
How to view previous query plans?
ALTER TABLE SWITCH statement failed because the object '%.*ls' is not ...

More     
   
Latest Software Reviews

Spotlight on ApexSQL Doc 2008
ApexSQL Enforce
Embarcadero Change Manager
SQL Server DBA Dashboard

More     

articles >> general dba >> Data Cleansing with SSIS

Data Cleansing with SSIS

By : Dinesh Asanka
Sep 15, 2007

Page 3 / 3

Next is Advanced tab.


The Advanced tab contains the settings for controlling the fuzzy logic algorithms that assign groupings to text in the input stream. You can set the names of the three additional fields that will be added automatically to the output of this transformation. These fileds are named “_key_out”, “_key_in” and “_score” by default. As the rows flow through the pipeline, Fuzzy Grouping assigns an ID to each of them, which is the “_key_in”. When Fuzzy Grouping groups a collection of rows, it determines which row should be the representative. It then assigns all the “_key_out” columns of the rows in the group to the “_key_in” value of the representative. As a result, if you want to write only the representative rows to your output, you can filter the Fuzzy Grouping output through a conditional split transform that selects only rows where “_key_in” equals “_key_out”.

The token delimiters can also be set if, for example you don't want the comparison process to break incoming string by . (dot) or spaces. The default for this setting is all common delimiters.

To visualize what is happening from a fuzzy lookup grouping, I have added two data grid views, one before and one after the fuzzy group.


We had three non-matching customers which are shown in the top data viewer grid.

The next data grid is the data grid is generated after the Fuzzy Group. You can see that “_key_in” is unique. The “_key_out” value for customer names Matthew Harris and Matthew Harri is same. This is our target to eliminate duplicates.

The next step is to use a conditional split to identify the unique records.


As we need to identify the unique records the above condition will be sufficient. However, if you need to log duplicate records you can insert them into a table using the Not equal condition.

With that we have completed data cleaning using SSIS, the following diagram is the entire package. Please note that I have not included DateSK lookup.


Conclusion

In the real-world, data is dirty and cleaning will take consume a lot of resource. Using SQL Server Integration Service’s new tools, fuzzy lookup and fuzzy grouping, you have the higher degree of control over the cleaning process.


<< Prev Page         








Home | Peformance Articles | Audit Articles | Business Intelligence Articles | Clustering Articles | Developer Articles | Reporting Services Articles | DBA Articles | ASP.NET / ADO.NET Articles | DBA FAQ's | Developer Peformance FAQ's | DBA Peformance FAQ's | Developer FAQ's | Clustering FAQ's | Error Messages | Audit Tool Reviews | Backup Tool Reviews | Coding Tool Reviews | Compare Tool Reviews | Documentation Tool Reviews | Design Tool Reviews | Monitoring Tool Reviews | Log Tool Reviews | Reporting Tool Reviews | Clustering Tool Reviews | Security Tool Reviews | Change Management Tool Reviews | Remote Access Tool Reviews | Book Reviews | Security Tool Reviews | QDPMA Performance Tuning | ADO.NET / ASP.NET | Administration | Analysis/OLAP Services | Application Development | Configuration | Components | ETL | Hardware | High Availability | Hints | Index | Misc | Operating Systems | Performance Tuning | Replication | T-SQL | Views


              © 1999-2008 by T10 Media. All rights reserved