SAP CDC with Azure Data Factory

Using a self-hosted Integration Runtime

Ana Escobar
datamindedbe
7 min read · Jan 16, 2024



What is CDC?

CDC stands for Change Data Capture, and it’s a technique used to track changes in data so that they can be propagated to other systems or applications in real-time. This process is crucial for maintaining data consistency across different parts of a system or between different systems.

There are two main types of CDC: physical and logical. Physical CDC captures changes at a low level in the database, such as changes to individual records or table structures. Logical CDC, on the other hand, captures changes at a higher level, focusing on the meaning or content of the data, which makes it more suitable for scenarios where the structure of the data may evolve.

SAP CDC is a type of logical Change Data Capture. It focuses on capturing changes in the meaning or content of the data rather than the low-level details of the database structure.
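To make this concrete, here is a minimal, generic sketch of what a logical CDC feed conceptually looks like. It is illustrative only and not the exact format SAP ODP or Data Factory produces:

```python
# Illustrative only: a generic logical-CDC change feed, not the exact
# format produced by SAP ODP or Azure Data Factory.
change_events = [
    {"op": "insert", "table": "CUSTOMERS", "key": {"ID": 42},
     "after": {"ID": 42, "NAME": "Acme", "isActive": True}},
    {"op": "update", "table": "CUSTOMERS", "key": {"ID": 42},
     "before": {"isActive": True}, "after": {"isActive": False}},
    {"op": "delete", "table": "CUSTOMERS", "key": {"ID": 42}},
]

# A downstream consumer applies the events in order to keep a replica in sync.
for event in change_events:
    print(event["op"], event["key"])
```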

After this brief introduction to Change Data Capture (CDC), let me explain how to use Azure Data Factory to build an SAP CDC pipeline using Azure’s SAP connector, and what you need to take into account.

Once we create a Data Factory in Azure and launch Azure Data Factory Studio, we will see 5 different resources:

  1. Pipelines
  2. Change Data Capture
  3. Datasets
  4. Data Flows
  5. Power Query

We will focus on three of them: Datasets, Data Flows, and Pipelines.

Datasets

Here is where we define the source and sink data. When creating a new source dataset, we select SAP CDC, give it a name, and select or create a new linked service. This linked service is how we connect to our SAP data using an Integration Runtime (IR), the compute infrastructure Data Factory uses to provide data integration capabilities across different network environments. We go for the self-hosted Integration Runtime option, which requires a VM with the SAP .NET connector installed. Microsoft’s documentation is very helpful in this case; this page explains how to set it up. The rest of the linked service configuration can be completed with the SAP credentials needed to access the views.
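For reference, the resulting linked service definition, viewed as code, looks roughly like the sketch below (expressed here as a Python dict; the exact typeProperties names depend on your setup, so verify them against the JSON view of your own linked service):

```python
import json

# Sketch of an SAP CDC (SapOdp) linked service connecting through a
# self-hosted Integration Runtime. The keys inside typeProperties are
# illustrative assumptions; check your linked service's JSON view.
sap_cdc_linked_service = {
    "name": "LS_SAP_CDC",                      # hypothetical name
    "properties": {
        "type": "SapOdp",
        "connectVia": {                        # points at the self-hosted IR
            "referenceName": "SelfHostedIR",   # hypothetical IR name
            "type": "IntegrationRuntimeReference",
        },
        "typeProperties": {
            "server": "<sap-application-server>",
            "systemNumber": "<system-number>",
            "clientId": "<client>",
            "userName": "<sap-user>",
            # In practice the password should be referenced from Azure Key
            # Vault rather than stored inline.
        },
    },
}

print(json.dumps(sap_cdc_linked_service, indent=2))
```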

After specifying the linked service, we can select an ODP context, for example SAPI. Then we select the ODP name: the view we want to extract the data from. The view must have the CDC option enabled on the SAP side; otherwise, we won’t be able to perform incremental loads of the data.
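Viewed as code, the source dataset essentially just points at the linked service and names the ODP context and object. A hedged sketch, again as a Python dict with assumed property names:

```python
# Sketch of the SAP CDC source dataset. "SAPI" and the object name are
# examples; the typeProperties keys are assumptions and should be
# verified against the dataset's JSON view.
sap_source_dataset = {
    "name": "DS_SAP_view_name",                # hypothetical name
    "properties": {
        "type": "SapOdpResource",
        "linkedServiceName": {
            "referenceName": "LS_SAP_CDC",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "context": "SAPI",                 # the ODP context
            "objectName": "<view_name>",       # the ODP name / view
        },
    },
}
```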

For the sink, we need to create another dataset, but this time its type determines the output format of the data (the data changes). For this example, let’s suppose we select Azure Data Lake Storage > DelimitedText (CSV). Here, we have to provide the file path where we want to store this data. I recommend having a root folder, e.g. sap-cdc, and inside it another folder for the replication data, e.g. replication-{view_name}, since later we will see that we also need to store staging data to be able to do incremental loads.
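A sketch of what the DelimitedText sink dataset on Azure Data Lake Storage could look like, following the folder convention above (names are hypothetical):

```python
# Sketch of a DelimitedText (CSV) sink dataset on ADLS Gen2, following the
# sap-cdc / replication-{view_name} convention. Container and folder names
# are examples.
csv_sink_dataset = {
    "name": "DS_CSV_view_name",                        # hypothetical name
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "LS_ADLS",                # hypothetical ADLS linked service
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "sap-cdc",               # root container/folder
                "folderPath": "replication-view_name", # one folder per view
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}
```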

We need to create one source dataset and one sink dataset per SAP view that we want to track changes from.

Data Flows

This is where we define the transformation logic the pipeline will execute. We start by selecting the source we just configured in our datasets. Important: under Source options, set the Run mode to "Full on the first run, then incremental", since we only want to get the changed data on each execution. We also need to provide a Key column; we can select any ID column our view has.

After that, we can apply a set of operations like filtering, selecting columns, sorting data, flattening data, etc. Let’s suppose we only want the records where the isActive column is set to True. We can create a new filter operation, click Open expression builder, and write an expression like equals(isActive, 'true'). There’s no limit to the number of operations a pipeline can have, so feel free to create as many as needed. The last operation should be the sink, linked to our previously created sink dataset.

Data Flow in Azure Data Factory

Pro tip: we can change the final name of the CSV file by clicking the sink operation > Settings > File name option set to Pattern; under Pattern, click Open expression builder and write something like concat('view_name', toString(currentTimestamp(), 'yyyyMMddHHmmss', 'Europe/Brussels'), '.csv'). This will create a CSV file named with the view_name (static part) followed by the execution timestamp in the specified format every time the data flow is triggered.

Again, we need to create as many data flows as SAP views we want to track changes from. But why not create just one data flow with as many flows inside it as needed? We might think this will lower our costs, but it actually increases them. The duration of the pipeline will also increase significantly. The pipeline will only succeed if all flows succeed; otherwise, it will be marked as failed. Another big disadvantage is that we will only have one staging folder per Data Flow, so all staging data will end up in the same folder (not the best approach). The staging folders are where we store the metadata for the initial and delta loads, so that the pipeline knows what has changed since the last run, which is where the name Change Data Capture comes from.

That single-data-flow approach was tested though, and in the image below we can see exactly when it was deployed:

Increase of costs using a single Data Flow with 6 flows

Pipelines

Last but not least, we need to create a pipeline resource so we can trigger our data flow. To do this, we just drag and drop the data flow we created into the pipeline resource. As part of the configuration, we need an Azure Integration Runtime; unfortunately, we cannot use the self-hosted IR we previously created. This increases the costs, since we need a pipeline per data flow, which means we start an Azure IR per data flow. Now you might understand why having a single data flow with multiple flows can be thought of as a way to lower costs.

In this step, we also need to set up the staging storage folder: the same root folder, e.g. sap-cdc, and a specific folder for each view, e.g. staging-{view_name}, so we follow the same naming convention as with the replication folder.
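Putting it together, the pipeline’s data flow activity references the data flow, the compute, and the staging location. A hedged sketch of that activity (property names are my best understanding and should be checked against the pipeline’s code view):

```python
# Sketch of an Execute Data Flow activity inside the pipeline, including
# the staging folder convention (sap-cdc/staging-{view_name}) described
# above. Names and exact property keys are illustrative.
execute_data_flow_activity = {
    "name": "Run_view_name_data_flow",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataFlow": {
            "referenceName": "DF_view_name",           # hypothetical data flow name
            "type": "DataFlowReference",
        },
        "staging": {
            "linkedService": {
                "referenceName": "LS_ADLS",
                "type": "LinkedServiceReference",
            },
            "folderPath": "sap-cdc/staging-view_name", # one staging folder per view
        },
        "compute": {
            "computeType": "General",                  # general-purpose Spark cluster
            "coreCount": 8,
        },
    },
}
```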

Finally, we can trigger the pipeline now to check that it behaves as expected and, once verified, create a new Trigger to schedule its execution every x amount of time, e.g. every 15 minutes.
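For example, a 15-minute schedule trigger attached to the pipeline looks roughly like this sketch (pipeline and trigger names are hypothetical):

```python
# Sketch of a ScheduleTrigger that runs the pipeline every 15 minutes.
schedule_trigger = {
    "name": "TR_every_15_minutes",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",
                "interval": 15,
                "startTime": "2024-01-16T00:00:00Z",
                "timeZone": "UTC",
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "PL_sap_cdc_view_name",  # hypothetical pipeline
                    "type": "PipelineReference",
                }
            }
        ],
    },
}
```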

Conclusion

Developing an SAP CDC pipeline with Azure Data Factory incurs high costs, especially due to Integration Runtimes. Every time we initiate the Azure Integration Runtime, a (general-purpose) Spark cluster is launched. Consider a scenario where we have 10 or more pipelines running every 15 minutes; this implies that we would be initiating and terminating 10 or more Spark clusters every 15 minutes.

Alternative

One of the many alternatives is Apache Kafka with Confluent Cloud. They offer many SAP connectors; the one that caught my attention was the ODP Source Connector from their partner INIT Individuelle.

  • The installation is easy and straightforward: one just needs to download and unzip the connector archive from Confluent Hub and register it with the Kafka Connect cluster (see the sketch after this list).
  • There’s one external dependency: the SAP Java Connector (SAP JCo) library, which can be downloaded from the SAP marketplace and needs to be copied manually into the connector’s classpath.
  • It’s fully integrated with the Confluent Control Center, which offers an interactive web UI.
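Registering a source connector with a Kafka Connect cluster follows the standard Connect REST API pattern shown below; note that the connector class name and the SAP/ODP-specific configuration keys are placeholders to be taken from the INIT connector’s documentation, not values confirmed here.

```python
import requests

# General Kafka Connect pattern: POST the connector name + config to the
# Connect REST API. The class name and SAP/ODP-specific keys below are
# placeholders; take the real ones from the connector's documentation.
connect_url = "http://localhost:8083/connectors"   # hypothetical Connect endpoint

connector = {
    "name": "sap-odp-source-view-name",
    "config": {
        "connector.class": "<odp.source.connector.class>",  # placeholder
        "tasks.max": "1",
        # SAP connection settings (via the SAP JCo library) would go here,
        # e.g. host, system number, client, user, password.
        "topics": "sap-cdc-view-name",                       # hypothetical target topic
    },
}

response = requests.post(connect_url, json=connector, timeout=30)
response.raise_for_status()
print(response.json())
```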

Kafka is proven to integrate well with SAP systems. For example, it works well with SAP Datasphere, as Kai Waehner mentions in one of his recent blog posts:

SAP Datasphere in conjunction with the data streaming platform Apache Kafka enables a reliable, scalable and open data fabric for connecting SAP business objects of ECC and S/4HANA ERP

