Data cleansing (or ‘data scrubbing’) is detecting and then correcting or removing corrupt or inaccurate records from a record set.
After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been caused by different data dictionary definitions of similar entities in different stores, or caused by user entry errors or data which was corrupted in transmission or storage. Preprocessing the data will also guarantee that it is unambiguous, correct, and complete.
The actual process of data cleansing may involve removing typos or validating and correcting values against a known list of entities. The validation may be strict such as rejecting any address that does not have a valid ZIP code or fuzzy such as correcting records that partially match existing, known records. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data.
Example
Let’s examine a real-world scenario. In case of a data warehouse, let us assume that we are going to load invoices of customers’ into a database. We will have the following data which we need to load to our data warehouse database..
Date, Product Code, Product Description, Customer Name , Quantity and Unit Price are the data that we are receiving as a data source. Let us assume that, that we have them in a text file. In the data warehouse database, we will have four tables namely dimCustomer, dimDate, dimProduct and fctSales and those will have following relationship.
So we have task of updating these tables with the data we are receiving from the text file. CustomerSK needs to be identified by performing a lookup on the dimcustomer table by using a customer name while dateSK needs to be identified by doing a lookup to dimdate with date filed. Similarly, productSK needs to be identified from dimproduct and you have product code and product name to find the productSK.
We will start the task by creating a SQL Server Integration Service (SSIS) project and then add a data flow task to the control flow. In the added data flow task, we will start the operations by adding a flat file source and pointing the csv file to the flat file source.
Using Lookup
The next step is to identify the relevant productSK for each record. We have two parameters to find the correct productSK – product code and product name. We can first use product code to lookup dimproduct table and if that fails we can use product name to get the productSK.
The above package diagram is the method of identifying the productSK from the dimproduct table. It starts with 12 records. When it passes through the first lookup,(Lookup by Product code), it found 9 records which matches a record in the product table. Below image is the lookup property screen for lookup by product code.
Productcode is used as the join column to get the productSK. In the reference table tab the reference table is dimProduct. You can use a table or a view as a reference, you can even use a query for a reference. For the optimal performance it is best to have an index on the column which you are going to lookup.
From the above lookup there are three rows which couldn’t find matching records from the dimproduct table for the product code. Those three records are passed to another lookup. This time product name is used for column lookup. This lookup is implemented in the same manner as the product code lookup.
After this look up you can see that another two records were found. These two records and the previous nine records were taken into one stream by using a Union data flow task. However, there is another record which does not match either its product code or product name. This record is logged to another table with a new column introduced from the No Product Code derived column control. The purpose of this is to identify or correct this record manually and reload into the system at a later time.
To reduce error records, use as many as parameters as possible. In this case, to find the productSK, we have use two parameters – product code and product name.
Fuzzy Lookup
After finding the productSK the next task is to find customerSK. The below diagram shows the implementation of the method for finding customerSK which will be explained in detail.
We have one parameter to identify the customerSK, which is customer name. As with the previous case, we start the process by a simple lookup. You can see that out of 11 records , 8 records were matched against the customer name. However, there are three records which didn’t match.
There are two reasons for this:
- no customer record exists.
- customer name is miss-spelled in the source.
Jonh Sward, Smith Warren and Matthew Harris are the customers which are not matched with the dimcustomer table. If you analysed the customer table there is no record for Matthew Harris. Nevertheless, there are records in the dimcustomer table for John Sward and Smith Allen, which are closely match with missing records.
So you can see that Matthew Harris falls into the first category which is no record in the customer table and Jonh may be spelled wrongly instead of John which is falls into the second category. However, it is not certain whether the Smith Warren a new customer or if it is a mis-spelling of Smith Allen.