CDC and Data Warehouse
In a data warehouses, you have separate dimensions for dates as they are used for drill downs. Therefore we need to have datejoinsk in the dimcustomer that references the dimdate dimension.
Now the task is to update dimcustomer with incremental update.
As we are using CDC for this, we need to enable the CDC for the tblcustomer table in the OLTP system. Before enabling the table for CDC, you need to enable the database for CDC, which is done using the following script.
–Enabling Databases for CDC
After that you can enable table for the CDC using following script.
@source_schema = N’dbo’
, @source_name = N’tblCustomer’
, @role_name = N’cdc_Admn’
For net change support in CDC, you need to specify the unique key. In the case you have not specified a unique key, it will use the existing primary key as the default unique key. If you don’t have a primary key, you must specify the unique otherwise, the above script will fail.
After executing the above script, you can see that an additional table-value function named fn_cdc_get_net_changes_dbo_tblCustomer will be created.
After enabling CDC, let us do some changes to the existing records.
Insert into tblCustomer (Name,AddressI,AddressII,AddressIII,DateJoin,Location) Values (‘Phil Sinclair’,’34 Bellew Streat’,'Scarbrough’,'POB 8976′,’2006-11-28′,’YO’)
Delete From tblCustomer Where ID = 5
Update tblCustomer Set AddressIII = ‘POB 1974′ Where ID = 7
Update tblCustomer Set DateJoin = ’2007-12-13′ Where ID = 8
Update tblCustomer Set DateJoin = ’2007-01-13′Where ID = 11
The next task is to create a table to store the last extracted date and time. By saving the last extracted date and time, you can execute the data extraction package at any time, several times per day or one time per week as you wish.
CREATE TABLE LastUpdateDateTime
The above script creates the LastUpdateDateTime table where you can save records for all the other tables.
Insert into LastUpdateDateTime values (‘Customer’,Getdate()-1)
The above script will insert records to the customer table.
The next is to create a table value function to return the net changes to records after the last extraction.
CREATE function CDC.uf_Customer (
returns @Customer table (
[__$operation] int, ID int,Name varchar(50),AddressI varchar(50),AddressII varchar(50),AddressIII varchar(50),DateJoin smalldatetime,Location varchar(15)
DECLARE @begin_time DATETIME
SELECT @min_lsn = sys.fn_cdc_get_min_lsn(‘dbo_tblCustomer’)
SELECT @begin_time = LastUpdateDateTime From lastUpdateDatetime WHERE [table] = ‘Customer’
IF @begin_time IS NULL
SET @from_lsn = @min_lsn
select @from_lsn = sys.fn_cdc_increment_lsn(sys.fn_cdc_map_time_to_lsn(‘largest less than or equal’,@begin_time))
IF @min_lsn > @from_lsn OR @from_lsn IS NULL
SET @from_lsn = @min_lsn
SELECT @to_lsn = sys.fn_cdc_get_max_lsn()
SET @row_filter =’all’
INSERT INTO @customer
,ID ,Name ,AddressI ,AddressII ,AddressIII ,DateJoin ,Location
If you execute this function now, you will get following results.
In this image, the most important row is row number 4. Row number 4 was created from one insert and one update. If you look closely you can see that both changes are incorporated in the function.
The next task is to create a SQL Server Integration package. The followings are the steps you need to carry out to build the SSIS package.
- Create a new SSIS project and add a new SSIS package.
- Drag and drop Data Flow task to Control flow and double click the Data Flow Task.
- Create a OLEDB connection which points to the source database. (Example: InvoiceSystem).
- Add an OLE DB Source to the Data Flow Task. Select SQL Command and Data access method and enter SELECT * FROM CDC.uf_Customer()
- Connect the output of the OLE DB Source to a multicast. One of the multicast output is set to update LastUpdateDateTime table. Other is used for normal processing.
- Lookup is used to get the datesk for the relevant DateJoin.
- Next Condition Split used split the data flow into inserts, deletes and updates using _$operation field. Relevant path is attached to the operation you need.
The following is the image of the data flow of the SSIS package:
You can schedule the SSIS package by using SQL Agent Job.