CDC and Data Warehouse

Introduction
In a data warehouse, the most challenging, and therefore the most interesting, part is the ETL (Extract, Transform and Load) process. The challenge arises because you have to work with different databases that you did not design.
Most of the time, you need to update your OLAP system with the data changes from the OLTP environment. You cannot simply truncate and reload the OLAP table, because doing so causes two problems:

  1. Truncating and reloading consumes a lot of time if the table is huge.
  2. Most OLAP designs have surrogate keys (SKs) that are identity columns. Truncating and reloading the table will change the surrogate keys. If the surrogate keys change, you need to update the SKs in all the related fact tables, which is a tedious task.

Need for CDC
Change Data Capture, also known as CDC, is a set of software design patterns that lets you track the data that has changed in a database so that actions can be taken using the changed data. In previous versions of SQL Server there was no straightforward way to capture changed data. Developers had to use triggers to capture the changes, and triggers add overhead to the database system. Alternatively, you could use third-party tools to read the SQL Server log. SQL Server 2008 introduced a new feature called Change Data Capture, which makes it easy to capture incremental data changes. CDC collects all the data changes in a table. It also has a feature called net changes, which is tailor-made for data warehousing implementations. We will look at the net changes feature shortly.

Net Change Feature
Let me explain this with an example for a single record:

The above table shows how the record with ID = 1 has changed over time. Rec # 1 shows the inserted values, while Rec # 2 shows an update to the Date of Join field. Similarly, a later record shows another update, this time to the Location field. Taking all three operations together, the net change is a single record, shown in the second table.

For a Type 1 Slowly Changing Dimension (SCD), what you need is the net change record in the second table. If you don't have this record, you need to derive it from the three records in the first table, and you have to apply them in the exact order in which they occurred. With SSIS, this is a bit difficult. When you split the data flow into separate insert, update and delete operations, they run in different threads. This means an update can run before the insert, in which case the update fails or has no effect because the record does not exist yet.

In CDC, there is an option to get the net change records. However, you have to enable it when you configure CDC for the table. You also need a primary key, or a business key in data warehousing terms, to enable net changes.
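To make the idea of a net change concrete, here is a minimal Python sketch of how three ordered change rows for one key fold into a single net row. It is only an illustration of the concept, not how SQL Server implements it. The row layout, the `seq` ordering column, the `net_changes` function and all sample values are assumptions made up for this example.

```python
# Hypothetical change rows for record ID = 1 (all values are invented).
# "seq" stands for the order in which the changes occurred.
changes = [
    {"seq": 1, "op": "insert", "id": 1,
     "values": {"date_of_join": "2008-01-01", "location": "Location A"}},
    {"seq": 2, "op": "update", "id": 1,
     "values": {"date_of_join": "2008-02-01"}},
    {"seq": 3, "op": "update", "id": 1,
     "values": {"location": "Location B"}},
]


def net_changes(rows):
    """Fold change rows into one net row per key, applying them in seq order.

    A key mapped to None means the net effect for that key is a delete.
    """
    net = {}
    for row in sorted(rows, key=lambda r: r["seq"]):
        key = row["id"]
        if row["op"] == "delete":
            net[key] = None
        elif row["op"] == "insert" or net.get(key) is None:
            # First change seen for this key: start from its values.
            net[key] = dict(row["values"])
        else:
            # Later update: overwrite only the columns that changed.
            net[key].update(row["values"])
    return net


result = net_changes(changes)
print(result[1])
```

Because the rows are sorted by `seq` before they are applied, the result does not depend on the order in which the rows arrive. That is exactly the guarantee that is hard to get when inserts and updates run in separate threads.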

Continues…
