Site sponsored by: Idera Try Idera’s new SQL admin toolset
SQL Server Performance

  • Home
  • Articles
  • Forums
  • Tips
  • Quiz
  • FAQ's
  • Blogs
  • Software
  • Books
  • About Us
RSS Feeds
Sign in | Join


Article Topics

All Articles
Performance Tuning
Audit
Business Intelligence
Clustering
Reporting Services
Developer
General DBA
ASP.NET / ADO.NET

Write for Us

Share you SQL Server knowledge with others and raise your profile in the community More...
Latest Articles

Capture DDL Changes using Change Data Capture with SQL Server 2008 ...
Business Intelligence in Collaborative Planning, Forecasting and Replenishment
Inside SQL Server Cluster Setup and Troubleshooting Techniques - Part I ...
Configure and Manage Policy Based Management in SQL Server 2008 ...

More     
 
Latest FAQ's

Cannot Start SQL Server Service
Users are able to connect to report manager but not able ...
Errors when SQL Server Snapshot Replication is Running
How to Display Server Name or IP Address in a Reporting ...

More     
   
Latest Software Reviews

Spotlight on ApexSQL Doc 2008
ApexSQL Enforce
Embarcadero Change Manager
SQL Server DBA Dashboard

More     

articles >> general dba >> Data Cleansing with SSIS

Data Cleansing with SSIS

By : Dinesh Asanka
Sep 15, 2007

Page 2 / 3

The next question is how to identify the mis-spelled customers. For this, a fuzzy lookup can be used. Fuzzy Lookup uses the Error-Tolerant Index (ETI) to find matching rows in the reference table. Each record in the reference table is broken up into words (also known as tokens), and the ETI keeps track of all the places in the reference table where a particular token occurs. If you take the name as a example, if your reference data contains John Sward, the ETI will contain entries for John and Sward. In addition, Fuzzy Lookup indexes substrings, known as q-grams, so that it can better match records that contain errors. The more unique tokens and the more rows in the reference table, the more entries and longer the occurrence lists in the ETI. The ETI will be roughly as big as your reference table.

You can find all the details of fuzzy lookup ad how it works at http://msdn2.microsoft.com/en-us/library/ms345128.aspx.

The following screen will appear after selecting properties of fuzzy logic. Here, you can specify the reference table in which you have “clean” data. You have the option of saving the index into a table so that it uses that table in future without re-building the. In this way you can improve the performance of fuzzy logic.

From the columns tab you can define the columns you need to match.

The Advanced tab is an important tab when defining your fuzzy logic.


In case of dirty data, you may get multiple records. So you can define number of records that should return from the fuzzy lookup. Note that increasing the value of this property may increase the time it takes to process each input row.

Next is allocation of the Similarity threshold. Allocation of this totally depends upon your data.

Back to the SSIS package again. The three records will now be applied to fuzzy logic. To analyse, I have added a data viewer to the package just after the fuzzy logic. Below is the grid view:


You can see that two additional columns have been introduced namely _Similarity and _Confidence. As SSIS performs the fuzzy lookup, similarity and confidence indexes display the quality of the data match. The similarity thresholds indicate how closely the input data resembles its proposed match. A similarity value of one indicates an identical match. The closer the value is to one, the closer the match. To complement the similarity threshold, confidence describes the level of certainty that SSIS has about the match.

So selecting correct similarity and confidence totally depends on your application and dirtiness of your data. Before selecting a value better to have an idea of dirtiness of your data.

By using a conditional split transformer you can perform any actions you need.


In the above conditional data split we have defined accepting data _Similarity>0.65 and _Confidence >0.9 data. Data which does not fall into this category but has _Similarity>0.5 and _Confidence >0.8 are likely to be matched (although these matches may need manual intervention). It is best to log these into a screening table. At the end of the process, you can send a notification or generate a report using the Send Mail Task.

Then what about the data which does not fall into either of these categories? Those may be new customers. Normal practise, however, would be to log that data to a table and then later to manually deal with them. However, we will assume that in this case we are going to insert that record with new customer record.

You can insert a customer record and perform another lookup to identify customerSK for the new customer.

Even though it is simple lookup as we performed earlier, there is an important point to remember. In a Lookup there is a property called CacheType which is Full by default.

The type of caching can be,

full - the complete reference dataset is read before the input is processed.

partial - the transformation; limits the size of the cache to a specified size

none - each lookup results in a new query to the database

As you can see when the full option is set, the dataset is read before the input is processed. As we have used Customer lookup previously, when the next lookup is performed it uses the previously cached one. As we have inserted a new customer record, that record will not be retrieved by the subsequent lookup when the cache type is set to Full.

The Full cache type has the potential to give your package a performance boost, especially if the values you need to lookup are often repeated in the input. However, in this case we have to assign none so that it will perform a new query against the database.

Fuzzy Grouping

When standardized reference data is not available, fuzzy grouping develops a master set of unique records from a data set containing a combination of unique and duplicate records. For example, you may receive your customer data from different data sources.

If we consider our previous example, when we are trying to insert new customer records, there can be duplicated records. So we have to get the unique record rather than inserting duplicate records. To do this, you will need to drag a Fuzzy Group before the Insert New Customer Records transformation.

Configuration a Fuzzy group is a simple task.

The first tab, Connection Manager is used to assign the connection. The next tab, columns is to set the columns that you need to group.


You will need to select the columns that you will group. There are two parameters which are important when configuring fuzzy grouping - Numerals and Comparison Flags.


The numerals option allows confuguration of the significance of numbers in the input stream when grouping text logically. The options are to consider leading, trailing , leading and trailing, or neither leading nor trailing numbers significant. This option would need to be considered when comparing address or similar types of information. Comparison flags have a few more options for grouping.


<< Prev Page     Next Page>>    








Home | Peformance Articles | Audit Articles | Business Intelligence Articles | Clustering Articles | Developer Articles | Reporting Services Articles | DBA Articles | ASP.NET / ADO.NET Articles | DBA FAQ's | Developer Peformance FAQ's | DBA Peformance FAQ's | Developer FAQ's | Clustering FAQ's | Error Messages | Audit Tool Reviews | Backup Tool Reviews | Coding Tool Reviews | Compare Tool Reviews | Documentation Tool Reviews | Design Tool Reviews | Monitoring Tool Reviews | Log Tool Reviews | Reporting Tool Reviews | Clustering Tool Reviews | Security Tool Reviews | Change Management Tool Reviews | Remote Access Tool Reviews | Book Reviews | Security Tool Reviews | QDPMA Performance Tuning | ADO.NET / ASP.NET | Administration | Analysis/OLAP Services | Application Development | Configuration | Components | ETL | Hardware | High Availability | Hints | Index | Misc | Operating Systems | Performance Tuning | Replication | T-SQL | Views


              © 1999-2008 by T10 Media. All rights reserved