What is Change Data Capture?

Andy MacConnell
8 min readJan 24, 2024

--

Change Data Capture (CDC) refers to the process of generating and storing data that has changed in some way within a data store.

In this article I’ll discuss what Change Data Capture is, why it could be a valuable tool in your architectural toolkit and walk you through getting it up and running using Microsoft SQL Server within a Docker environment.

Lets dive in!

Note: If you wish to follow along, ensure you have a container based runtime installed ✅

Photo by Mathew Schwartz on Unsplash

Lets create an ubiquitous language first

As with most things in the technology and software world, navigating the language used to describe something can be one of the most challenging things, so lets define the language used within this article here so that future use of terms isn’t as confusing.

CDC: Change Data Capture

MSSQL: Microsoft SQL Server

Source Table: The table we are tracking change within

LSN: Log Sequence Number

What is CDC?

Lets start with the definition as provided by Qlik:

Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real-time to a downstream process or system.

So, given data in a database, when data is changed within a given table, being tracked, then the change is captured and made available for processing by another process or system.

Pretty cool right?

Why could capturing change data be valuable?

There can be many benefits to capturing, and ultimately acting upon, change data captured within a given database. These can range from simply creating an audit log of change within your database to creating real time analytical dashboard or triggering business workflows, moving to a more event driven architecture and much more. Lets attempt to illustrate with an example.

Lets say that you are working for a business which owns an online store. The team you work within is responsible for user administration and has been tasked with automating what is currently a manual process that happens once week. The process itself entails going through each of the new user registrations within your site and sending out welcome letters to each with a voucher. As part of this change, the product team have also expressed that there is a need to have an aggregated view, in real time, of new account registrations within the store.

Some constraints that are currently in place are:

  • We’re strangling out the existing monolith and there is an embargo on extending its’ functionality
  • We can’t change the schema of the database but do need to be made aware of when data within it changes

So could we tackle this using CDC?

In short, yes!

By enabling the capturing of change data here, we could begin to track any changes to our User table. The CDC process would insert a “created” operation entry into our change tables which we could begin to take action on within new consuming services responsible for carrying out the distribution of our communications. With the changes being emitted in real time as they happen, we could also look at creating dashboards, or aggregate and projecting change events elsewhere for our analytics to target as their source data.

Note: I will build upon this later (particularly in later articles on how we can move this data away from the database entirely). For now, this is just for illustration purposes. Key takeaway here should be the availability of, and visibility on, what has changed within your data as well as what it could be used for based on your own use cases and business needs.

What does change data look like I hear you ask!?

We mentioned above in our example that the database would insert a “create” operation entry into our source change table. But what does this entry look like?

Within Microsoft SQL Server, the change table will largely take on the shape of your target table schema, with the addition of a few extra metadata columns. So if we were to track change within a User table within the dbo namespace, which had the schema:

User Schema — Generated Using Mermaid CLI

SQL server will create a table named [cdc].[dbo_User_CT] with the following schema:

User Change Table Schema — Generated Using Mermaid CLI

Note: I had to add an alias to each metadata column in the above image due to lack of character support for columns containing a $ within Mermaid

Now, if you like me, the first time you lay your eyes on the change table where like:

Then you are not alone!

Lets look at those metadata columns in more detail as described by Microsoft.

__$start_lsn

This is the commit log sequence number that was assigned to the change when it was committed. As SQL uses an internal transaction log for any changes committed to it, this corresponds with that log sequence commit.

For more information on SQLs Transaction Log architecture, please reference this MSDN article.

__$end_lsn

As of SQL Server 2012, this column is always null but always present, so we can disregard this for now.

__$seqval

This represents the sequence of the operation as represented in the transaction log. Despite alluring to the fact it can be used for ordering your change events, Microsoft advise against this and to __$command_id instead.

__$operation

Denotes the operation that occured for the given change. The integers for each operation map to the following:

1 => Delete

2 => Insert

3 => Update (old values i.e existing value was changed)

4 => Update (new values i.e a value was added to a column that never had a value previously)

__$update_mask

A bit mask that is able to identify those columns that have changed.

__$command_id

The order of operations within a given transaction per change.

Note: Microsoft change data schema definitions can be found here

Running a CDC enabled MSSQL instance within Docker

Ok, enough theory, lets create a CDC enabled instance locally using Docker!

Note: If you wish to do this directly in SQL Server Management Studio, you can run the SQL scripts below which enable change tracking on both the database and desired tables, skipping Docker entirely.

Photo by Ian Taylor on Unsplash

Lets first create our Dockerfile, using the official MSSQL image as our base:

FROM mcr.microsoft.com/mssql/server:2019-latest

USER root

ENV ACCEPT_EULA=Y
ENV MSSQL_SA_PASSWORD=YOURSUPERSTRONGPASSWORD
ENV MSSQL_PID=Developer

COPY ./scripts /tmp/
RUN chmod -R +x /tmp/

RUN /opt/mssql/bin/mssql-conf set sqlagent.enabled true

ENTRYPOINT /tmp/entrypoint.sh

Lets walk through what is happening here line by line:

FROM mcr.microsoft.com/mssql/server:2019-latest

Here we are stating we wish to build our image from the MSSQL image hosted within the Microsoft Artifact Directory.

USER root

Set the current user as root for elevated permissions (don’t do this in production…)

ENV ACCEPT_EULA=Y
ENV MSSQL_SA_PASSWORD=YOURSUPERSTRONGPASSWORD
ENV MSSQL_PID=Developer

Here we are setting some environment variables within our container based on Microsoft's’ required configuration.

For CDC support, the minimum PID required for SQL Server is Developer . This is due to requiring the SQL Agent which will be responsible for running CDC jobs.

COPY ./scripts /tmp/
RUN chmod -R +x /tmp/

Here we are copying a directory containing our SQL Server scripts to configure CDC later when we override the containers entrypoint. We are also elevating the permissions on the tmp directory recursively so we can execute the scripts inside.

RUN /opt/mssql/bin/mssql-conf set sqlagent.enabled true

By default, the SQL Agent will not be enabled on container start up. Here, we are enabling the SQL Agent so our future CDC configuration queries and jobs will be successfully executed.

ENTRYPOINT /tmp/entrypoint.sh

Here we are overriding the entrypoint of the container so that we can run our configuration based SQL scripts on container start up.

Entrypoint.sh contains the following:

#!/bin/bash

/tmp/configuredb.sh & /opt/mssql/bin/sqlservr

which will run our configure.sh and sqlserver executable. configuredb.sh contains:

# Wait for SQL Server to start (10 seconds)
sleep 15

/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P ${MSSQL_SA_PASSWORD} -i /tmp/create-database.sql
/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P ${MSSQL_SA_PASSWORD} -i /tmp/create-user-table.sql
/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P ${MSSQL_SA_PASSWORD} -i /tmp/enable-cdc.sql

Here we are waiting on SQL server to initialise using a simple sleep for 15 seconds (this could be much more sophisticated!) then executing scripts which will scaffold our CDC enabled database.

IF NOT EXISTS (SELECT 1 FROM sys.databases WHERE name = 'CDCDB')
BEGIN
USE master
CREATE DATABASE [CDCDB];
PRINT 'db Created...'
END

create-database.sql contains the above, simply creating our CDCDB if it does not already exist.

 USE [CDCDB]
GO

IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'User')
BEGIN
PRINT 'CREATING USER TABLE...'
CREATE TABLE [USER](
[UserId] INT IDENTITY(1,1) NOT NULL,
[FirstName] NVARCHAR(50) NOT NULL,
[LastName] NVARCHAR(50) NOT NULL)
END

create-user-table.sql contains the above, creating a table within our CDC database with the schema used in the earlier example.

And finally, enable-cdc.sql which is responsible for enabling CDC on our database instance and configuring our User table to track change data:

USE [CDCDB]
GO

BEGIN
PRINT 'Enabling DB CDC...'
EXEC sys.sp_cdc_enable_db
PRINT 'DB CDC Enabled...'
END

BEGIN
PRINT 'Enabling User Table CDC....'
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'User',
@role_name = NULL;
END

You may notice that sys.sp_cdc_enable_db and sys.sp_cdc_enable_table are both executed. The former enables CDC at the database level, which is a prerequisite for the latter which will enable CDC on our table.

To verify that both are running as expected, you can execute the following queries on master :

SELECT * 
FROM sys.databases
WHERE is_cdc_enabled = 1

SELECT *
FROM sys.tables
WHERE is_tracked_by_cdc = 1

Which should yield a result which states that both our database and table are CDC enabled and currently tracking!

You can then begin to insert, update or delete data within the User table and query (not advised out with development!) the cdc.dbo_User_CT change table to view the change data!

All code can be found on GitHub here.

Following up

With Change Data now being emitted from our database, we can begin to take action on the data being changed, as its changing. I will cover this off in a future article using technology from the Apache stack and dotnet.

Resources

--

--