Data Cleansing with SSIS
The next question is how to identify the mis-spelled customers. For this, a fuzzy lookup can be used. Fuzzy Lookup uses the Error-Tolerant Index (ETI) to find matching rows in the reference table. Each record in the reference table is broken up into words (also known as tokens), and the ETI keeps track of all the places in the reference table where a particular token occurs. If you take the name as a example, if your reference data contains John Sward, the ETI will contain entries for John and Sward. In addition, Fuzzy Lookup indexes substrings, known as q-grams, so that it can better match records that contain errors. The more unique tokens and the more rows in the reference table, the more entries and longer the occurrence lists in the ETI. The ETI will be roughly as big as your reference table.
You can find all the details of fuzzy lookup ad how it works at http://msdn2.microsoft.com/en-us/library/ms345128.aspx.
The following screen will appear after selecting properties of fuzzy logic. Here, you can specify the reference table in which you have “clean” data. You have the option of saving the index into a table so that it uses that table in future without re-building the. In this way you can improve the performance of fuzzy logic.
From the columns tab you can define the columns you need to match.
The Advanced tab is an important tab when defining your fuzzy logic.
In case of dirty data, you may get multiple records. So you can define number of records that should return from the fuzzy lookup. Note that increasing the value of this property may increase the time it takes to process each input row.
Next is allocation of the Similarity threshold. Allocation of this totally depends upon your data.
Back to the SSIS package again. The three records will now be applied to fuzzy logic. To analyse, I have added a data viewer to the package just after the fuzzy logic. Below is the grid view:
You can see that two additional columns have been introduced namely _Similarity and _Confidence. As SSIS performs the fuzzy lookup, similarity and confidence indexes display the quality of the data match. The similarity thresholds indicate how closely the input data resembles its proposed match. A similarity value of one indicates an identical match. The closer the value is to one, the closer the match. To complement the similarity threshold, confidence describes the level of certainty that SSIS has about the match.
So selecting correct similarity and confidence totally depends on your application and dirtiness of your data. Before selecting a value better to have an idea of dirtiness of your data.
By using a conditional split transformer you can perform any actions you need.
In the above conditional data split we have defined accepting data _Similarity>0.65 and _Confidence >0.9 data. Data which does not fall into this category but has _Similarity>0.5 and _Confidence >0.8 are likely to be matched (although these matches may need manual intervention). It is best to log these into a screening table. At the end of the process, you can send a notification or generate a report using the Send Mail Task.
Then what about the data which does not fall into either of these categories? Those may be new customers. Normal practise, however, would be to log that data to a table and then later to manually deal with them. However, we will assume that in this case we are going to insert that record with new customer record.
You can insert a customer record and perform another lookup to identify customerSK for the new customer.
Even though it is simple lookup as we performed earlier, there is an important point to remember. In a Lookup there is a property called CacheType which is Full by default.
The type of caching can be,
full – the complete reference dataset is read before the input is processed.
partial – the transformation; limits the size of the cache to a specified size
none – each lookup results in a new query to the database
As you can see when the full option is set, the dataset is read before the input is processed. As we have used Customer lookup previously, when the next lookup is performed it uses the previously cached one. As we have inserted a new customer record, that record will not be retrieved by the subsequent lookup when the cache type is set to Full.
The Full cache type has the potential to give your package a performance boost, especially if the values you need to lookup are often repeated in the input. However, in this case we have to assign none so that it will perform a new query against the database.
When standardized reference data is not available, fuzzy grouping develops a master set of unique records from a data set containing a combination of unique and duplicate records. For example, you may receive your customer data from different data sources.
If we consider our previous example, when we are trying to insert new customer records, there can be duplicated records. So we have to get the unique record rather than inserting duplicate records. To do this, you will need to drag a Fuzzy Group before the Insert New Customer Records transformation.
Configuration a Fuzzy group is a simple task.
The first tab, Connection Manager is used to assign the connection. The next tab, columns is to set the columns that you need to group.
You will need to select the columns that you will group. There are two parameters which are important when configuring fuzzy grouping – Numerals and Comparison Flags.
The numerals option allows confuguration of the significance of numbers in the input stream when grouping text logically. The options are to consider leading, trailing , leading and trailing, or neither leading nor trailing numbers significant. This option would need to be considered when comparing address or similar types of information. Comparison flags have a few more options for grouping.