CDC and Data Warehouse

In a data warehouses, you have separate dimensions for dates as they are used for drill downs. Therefore we need to have datejoinsk in the dimcustomer that references the dimdate dimension.

Now the task is to update dimcustomer with incremental update.

Using CDC
As we are using CDC for this, we need to enable the CDC for the tblcustomer table in the OLTP system. Before enabling the table for CDC, you need to enable the database for CDC, which is done using the following script.

–Enabling Databases for CDC
EXEC sys.sp_cdc_enable_db
After that you can enable table for the CDC using following script.
EXECUTE sys.sp_cdc_enable_table
    @source_schema = N’dbo’
  , @source_name = N’tblCustomer’
  , @role_name = N’cdc_Admn’

For net change support in CDC, you need to specify the unique key. In the case you have not specified a unique key, it will use the existing primary key as the default unique key. If you don’t have a primary key, you must specify the unique otherwise, the above script will fail.

After executing the above script, you can see that an additional table-value function named fn_cdc_get_net_changes_dbo_tblCustomer will be created.

After enabling CDC, let us do some changes to the existing records.

Insert into tblCustomer (Name,AddressI,AddressII,AddressIII,DateJoin,Location) Values (‘Phil Sinclair’,’34 Bellew Streat’,’Scarbrough’,’POB 8976′,’2006-11-28′,’YO’)
Delete From tblCustomer Where ID = 5
Update tblCustomer Set AddressIII = ‘POB 1974′ Where ID = 7
Update tblCustomer Set DateJoin = ’2007-12-13′ Where ID = 8
Update tblCustomer Set DateJoin = ’2007-01-13′Where ID = 11

The next task is to create a table to store the last extracted date and time. By saving the last extracted date and time, you can execute the data extraction package at any time, several times per day or one time per week as you wish.

CREATE TABLE LastUpdateDateTime
([Table] Varchar(50),
LastUpdateDateTime SmallDateTime)

The above script creates the LastUpdateDateTime table where you can save records for all the other tables.

Insert into LastUpdateDateTime values (‘Customer’,Getdate()-1)

The above script will insert records to the customer table.

The next is to create a table value function to return the net changes to records after the last extraction.

CREATE function CDC.uf_Customer (
returns @Customer table (
     [__$operation] int, ID int,Name varchar(50),AddressI varchar(50),AddressII varchar(50),AddressIII varchar(50),DateJoin smalldatetime,Location varchar(15)
) as
        ,@from_lsn BINARY(10)
        ,@min_lsn BINARY(10)
        ,@to_lsn BINARY(10)
        ,@row_filter NVARCHAR(30)
SELECT @min_lsn = sys.fn_cdc_get_min_lsn(‘dbo_tblCustomer’)
SELECT  @begin_time = LastUpdateDateTime From lastUpdateDatetime WHERE [table] = ‘Customer’
IF @begin_time IS NULL
    SET @from_lsn = @min_lsn
 select @from_lsn = sys.fn_cdc_increment_lsn(sys.fn_cdc_map_time_to_lsn(‘largest less than or equal’,@begin_time))
IF @min_lsn > @from_lsn OR @from_lsn IS NULL
   SET @from_lsn = @min_lsn
SELECT @to_lsn = sys.fn_cdc_get_max_lsn()
SET @row_filter =’all’
INSERT INTO @customer
SELECT [__$operation]
,ID ,Name ,AddressI ,AddressII ,AddressIII ,DateJoin ,Location
 FROM cdc.fn_cdc_get_net_changes_dbo_tblCustomer
(@from_lsn, @to_lsn,@row_filter)

If you execute this function now, you will get following results.

In this image, the most important row is row number 4. Row number 4 was created from one insert and one update. If you look closely you can see that both changes are incorporated in the function.

SSIS Package
The next task is to create a SQL Server Integration package. The followings are the steps you need to carry out to build the SSIS package.

  1. Create a new SSIS project and add a new SSIS package.
  2. Drag and drop Data Flow task to Control flow and double click the Data Flow Task.
  3. Create a OLEDB connection which points to the source database. (Example: InvoiceSystem).
  4. Add an OLE DB Source to the Data Flow Task. Select SQL Command and Data access method and enter SELECT * FROM CDC.uf_Customer()
  5. Connect the output of the OLE DB Source to a multicast. One of the multicast output is set to update LastUpdateDateTime table. Other is used for normal processing.
  6. Lookup is used to get the datesk for the relevant DateJoin.
  7. Next Condition Split used split the data flow into inserts, deletes and updates using _$operation field. Relevant path is attached to the operation you need.

The following is the image of the data flow of the SSIS package:

You can schedule the SSIS package by using SQL Agent Job.

Pages: 1 2 3


No comments yet... Be the first to leave a reply!