IN2BI, Microsoft Business Intelligence
data, stories & insights
rss | email | twitter

Change Data Capture (CDC) and BIML

Change Data Capture (CDC) is an enterprise feature of SQL Server since version 2008. 12 It's a non not very intrusive way to collect only the changes in selected tables of your source system.3 And thus providing a way to effectively handle your data warehouse load.

Integration Services (SSIS) and CDC

To handle these changes you can roll your own SSIS-CDC system where you choose where you save the state of the rows that you handled.4  Since 2012 it's more convenient to use the new CDC components that Microsoft shipped with SQL Server 2012.

There are three CDC components:

  • CDC Control task is used to control the life cycle of change data capture (CDC) packages. It handles the management of Log Sequence Number (LSN) ranges that are processed in a run of a CDC package.
  • CDC Source, reads a range of change data from change tables and delivers the changes downstream to other SSIS components.
  • CDC Splitter, splits a single flow of rows from a CDC source data flow into different flows of rows for Insert, Update and Delete operations

BIML

These new components are implemented as custom components and aren't natively handled in BIML. So to use them in a BIML solution you have to do some extra work.

Whenever I create a BIML solution I have a three step approach:

  1. Create a package in SSIS (Or at least a design)
  2. Create a BIML file that can create the package (without script)
    I call this the vanilla version.
  3. Gather meta data if not already available and add script to the BIML file(s) to create all the packages

For this blog post I'll focus on step 2. Matt Mason of Microsoft already wrote a nice walkthrough that serves well for step 1. Step 3 is something for a future blog post or a sponsor.

Preparing the environment

If you didn't follow the walkthrough by Matt Mason you need to create the environment with SQL Server Management Studio:

  1. Create a database (CDCTest)
  2. Create the DimDepartmentGroup_CDC table that will be used as a source.
  3. Enable CDC for the database
  4. Add a primary key to the table created in step 2.
  5. Enable CDC for the DimDepartment_CDC table that was created in step 2
  6. Create a destination table
  7. Create a table to store the cdc state

 

If you did follow the walkthrough skip step 1,3 and 7

The necessary T-SQl code (available at GitHub Gist):

Creating the BIML file for the initial load

The initial load package is relatively simple. This is an overview of the resulting package with the CDC Components marked in yellow.

Control Flow Data Flow
image image

 

The necessary BIML code to create the initial load package
(available at GitHub Gist):

The hard part in this code is getting the necessary information for the CDC Control Task. You'll typically find itin the XML code of the vanilla package I described earlier. Open this package by clicking View Code in the context menu (F7) and find the CDC ControlTask

image

In this xml snippet you'll find the CreationName, an required attribute of the CustomTask in Biml. The marked ObjectData snippet can be copied into an CData tag within the ObjectData node of the CustomTask.

Creating the BIML file for the incremental load

In the incremental load we will use the CDC Data Flow Components. A complicating factor in creating the BIML file is the necessity of meta data of columns. Column information can normally be queried by BIML/SSIS. For these custom components we have to provide them. Before creating the BIML let's look at the resulting package:

Control Flow Data Flow
image image

Some remarks:

The CDC Control Tasks will typically handle changes of more than one table. For now I've implemented this using a separate sequence container for the DepartmentGroup table. You can create additional containers for extra tables or create several packages. Then you will typically move these control tasks to a master package.

Changed- and deleted rows are handled in a batch using staging tables. These staging tables are created at the start of the container and dropped at the end.

The necessary BIML code to create the incremental load package
(available at GitHub Gist):

More Information:

  1. MSDN: About Change Data Capture in SQL Server
  2. Whitepaper by Paul Randal: Tracking Changes in your Enterprise Database
  3. SQLCAT Article: Tuning the Performance of Change Data Capture in SQL Server 2008
  4. mssqltips.com blog by Ray Barley: Process Change Data Capture in SQL Server Integration Services (2008)
  5. Matt Mason: CDC in SSIS for SQL Server 2012

» Similar Posts

  1. Four Approaches to Data Warehousing
  2. Building a data warehouse while the source systems are still in development
  3. Create a LightSwitch Application to Maintain Metadata

» Trackbacks & Pingbacks

    No trackbacks yet.

» Comments

    There are no comments.

Comments are closed