Make your Apache Kafka to SAP data integration a success — considerations and challenges of a Kafka to SAP adapter
When implementing Apache Kafka in an enterprise context, integration with SAP systems is indispensable. In this article, we cover common challenges and considerations that are crucial for the success of your SAP data integration project with Apache Kafka.
Connecting source systems to Apache Kafka to produce data can be challenging in its own right, but we do not address that here: we assume the sources already send data to Apache Kafka.
Considerations and challenges
The following topics are particularly relevant when integrating data from Apache Kafka into SAP:
- Data transformation, staging, and semantic correctness
- Order of data (temporal dependency)
- Referential integrity
- Error handling
Before we get into each topic, I would like to briefly introduce our company.
At Xeotek, we offer solutions that help our customers with their Apache Kafka projects. Xeotek SAAPIX is our smart data integration solution for Apache Kafka and SAP. We developed SAAPIX so that our customers can focus on the actual business logic instead of technical problem-solving. Wherever it is helpful for the reader, I share insights into the design decisions we made when developing SAAPIX.
Data transformation, staging, and semantic correctness
An essential part of every data integration pipeline is the data transformation process: the source data models need to be transformed into the target data models.
A good working and cooperation model with all stakeholders involved is crucial for the success of such projects. I refer to an earlier article on data-centric monitoring, which no Apache Kafka project should be without; we come back to this topic in the section on error handling. Essentially, everyone involved, in particular business analysts, must be able to view the data in Apache Kafka at any time.
If you already divide your software landscape into domains (see Domain-Driven Design), the transformation into the target model should be located near the target system, inside the target domain. This way, other potential consumers can reuse the source system's data in the future, reducing the number of point-to-point connections in your landscape. This decision is especially important when a data object for the target system must be composed from multiple source data objects.
There are two possible scenarios, often needed side-by-side:
- data is transformed on a record per record basis, or
- data objects need to be composed of data objects from multiple sources.
In SAAPIX, we opted for a plug-in architecture for the data conversion components. A component is loaded as a plug-in and is responsible for the data transformation process of a specific target data object. Our customers can develop these components on their own, focusing only on the business and transformation logic. The sources are assigned to the corresponding transformation component via configurations in the SAAPIX database. If the target data object needs to be assembled from data objects spread across multiple topics, several sources can be assigned and the data is processed in batch style. We call this the staging process.
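The plug-in approach can be sketched as follows. This is an illustrative minimal sketch, not the actual SAAPIX API: the names `TransformationComponent`, `ContractTransformer`, the topic `contracts.v1`, and the field mappings are all assumptions made up for this example.

```python
from abc import ABC, abstractmethod

class TransformationComponent(ABC):
    """One plug-in per target data object (interface name is illustrative)."""

    @abstractmethod
    def transform(self, record: dict) -> dict:
        """Map a source record to the target data model."""

class ContractTransformer(TransformationComponent):
    """Hypothetical plug-in: maps a source contract record to a target model."""

    def transform(self, record: dict) -> dict:
        return {
            "ContractID": record["id"],
            "PartnerID": record["customer"],
            "ValidFrom": record["start_date"],
        }

# A registry assigns source topics to transformation components,
# mirroring the configuration-based assignment described above.
REGISTRY = {"contracts.v1": ContractTransformer()}

def process(topic: str, record: dict) -> dict:
    """Look up the plug-in responsible for a topic and apply it."""
    return REGISTRY[topic].transform(record)
```

With this split, adding a new target data object means writing one new plug-in class and one registry entry, leaving the surrounding pipeline untouched.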
There can be multiple reasons why data cannot be transformed correctly, e.g., semantic incorrectness or corrupt or invalid data (or data models).
It is important to consider in advance how to deal with such errors in the delivery. Depending on your requirements, you may need to abort the whole data loading process or only skip the specific record and forward it to your error handling pipeline. As both scenarios are very common, we have decided to support both strategies in SAAPIX.
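The two strategies can be sketched like this. The function and exception names are illustrative assumptions, not SAAPIX internals:

```python
class AbortDelivery(Exception):
    """Raised to stop the whole data loading process."""

def run_delivery(records, transform, strategy="skip"):
    """Transform a batch of records.

    On a transformation failure, either abort the whole delivery
    or skip the record and forward it to an error pipeline;
    both strategies are common in practice.
    """
    delivered, errors = [], []
    for record in records:
        try:
            delivered.append(transform(record))
        except Exception as exc:
            if strategy == "abort":
                raise AbortDelivery(f"record {record!r} failed: {exc}")
            # skip the record and keep it, with the reason, for error handling
            errors.append({"record": record, "reason": str(exc)})
    return delivered, errors
```

Which strategy is right depends on the target: a batch that must be consistent as a whole favors abort, while independent records favor skip-and-forward.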
Order of data (temporal dependency)
Even if the source systems produce data independently of each other, a temporal (and referential — we get to this later) dependency can implicitly exist.
A simple example to clarify this is a business transaction which relates to a contract. If the contract is not loaded to the SAP system in advance, the business transaction cannot be processed. Therefore, orchestration of the order in which the various data sources are loaded is required.
Note: not every source sends data all the time. To stick to our example: if no new contract exists, the source sends no data at all.
In SAAPIX, you can configure the order in which the data records are loaded and the strategy to apply when no data is available. Additionally, a notification is sent when no data is available.
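A simple orchestration loop along these lines illustrates the idea. The configuration keys (`source`, `on_empty`) and strategy names are assumptions for this sketch, not SAAPIX configuration:

```python
import logging

log = logging.getLogger("loader")

# Configured load order with a per-source strategy for when no data arrives.
LOAD_ORDER = [
    {"source": "contracts", "on_empty": "continue"},   # no new contracts is fine
    {"source": "transactions", "on_empty": "continue"},
]

def orchestrate(fetch, load):
    """Load sources in the configured order.

    fetch(source) returns a (possibly empty) list of records;
    load(records) delivers them to the target system.
    """
    loaded = []
    for step in LOAD_ORDER:
        records = fetch(step["source"])
        if not records:
            log.warning("no data from %s", step["source"])  # notification hook
            if step["on_empty"] == "abort":
                break
            continue
        load(records)
        loaded.append(step["source"])
    return loaded
```

Because contracts are loaded before transactions, the temporal dependency from the example above is respected even though the sources produce independently.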
Referential integrity
Data objects are often related to one another or have fixed references. Ensuring a data record can be fully processed is a challenging task, as only the target system can validate whether all data referenced by a specific record is present. To prevent the system from being filled with incomplete data, i.e., records whose referenced data objects are not available, such data should not be delivered but instead forwarded for further inspection.
Depending on your error handling, it is reasonable to output an error that indicates the reason and the data for downstream error correction in such cases.
In SAAPIX, this check is the last component before the data is delivered to the target system. Although we have developed components for the most common systems, this part is highly individual because it depends on the type of data and the respective target system.
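A minimal pre-delivery check might look like this. It is a sketch under the assumption that the set of known contract IDs is available locally; as noted above, only the target system can validate references conclusively, so this can only catch the obvious cases early:

```python
def check_references(record, known_contracts, error_sink):
    """Deliver only records whose contract reference is already present.

    Records with an unknown reference are not delivered; they are
    forwarded, together with the reason, to the error pipeline.
    Field name `contract_id` is illustrative.
    """
    contract_id = record.get("contract_id")
    if contract_id not in known_contracts:
        error_sink.append({
            "reason": f"unknown contract reference: {contract_id}",
            "record": record,
        })
        return False
    return True
```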
Error handling
Three factors determine a proper error handling capability:
- Transparency: the reason for the error and the corresponding data.
- Correction: can steps be taken to enable further processing?
- Reloading: is it possible to reprocess a corrected data set?
Depending on your requirements, it might also be necessary to reprocess the full delivery with the correct data or implement fallbacks in case new data is not available.
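The three factors can be sketched as a small error-record lifecycle. The structure and function names are illustrative assumptions, not a prescribed format:

```python
import uuid

def to_error_record(record, reason):
    """Transparency: keep the failing payload together with the reason,
    under a stable id so a corrected version can be reloaded later."""
    return {
        "error_id": str(uuid.uuid4()),
        "reason": reason,
        "payload": record,
    }

def reload_corrected(error_record, corrections, deliver):
    """Correction + reloading: apply fixes to the stored payload
    and push the corrected record back into the normal processing path."""
    fixed = {**error_record["payload"], **corrections}
    deliver(fixed)
    return fixed
```

In a Kafka setup, the error records would typically live in a dedicated topic (often called a dead-letter topic), from which corrected records are resent.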
To monitor and correct data in case of failure, a monitoring tool should be considered early on. A so-called data-centric monitoring solution is also helpful when building applications for Apache Kafka and can save a lot of time. Why data-centric? Because the focus is specifically on the data in the topics: combined with rights management, you can filter for specific data objects, analyze data, and create or change data records.
Our monitoring solution KaDeck for Apache Kafka enables not only developers but all involved stakeholders to analyze data in the topics, modify data sets, and resend them. Moreover, KaDeck enables better collaboration, better and faster integration tests, and smoother handovers to application operations, eliminating unnecessary meetings and reducing misdevelopments. (Try our free Community edition to get an impression.)
Modularity and flexibility
Having different components for the same scenarios makes your system landscape more complex and inflexible and eventually drives up IT costs. If it is conceivable that you will need to integrate more data sources into similar target systems in the future, a modular and flexible adapter design can save a lot of costs in the long run.
There are many approaches that lead to more modular and flexible software. The approaches that we have used when developing SAAPIX are:
- Plug-in architecture for business logic and specific system-related components
- Configurations for topics, components and much more
- Micro-service architecture
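Configuration-driven wiring, the second point above, can be as simple as a mapping from topics to components and targets. The configuration keys, topic names, and target names here are made up for illustration:

```python
# Hypothetical configuration: topics and their processing components are
# wired via data, not code, so a new source only needs a new entry.
PIPELINE_CONFIG = {
    "contracts.v1": {"component": "ContractTransformer", "target": "SAP_FI"},
    "partners.v2": {"component": "PartnerTransformer", "target": "SAP_FI"},
}

def components_for_target(config, target):
    """List which plug-ins feed a given target system."""
    return sorted(
        entry["component"]
        for entry in config.values()
        if entry["target"] == target
    )
```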
Conclusion
Data integration projects are naturally very complex and involve many stakeholders.
If the considerations mentioned in this article are taken into account early on in architecture and project planning, long-term problems can be prevented and such projects implemented successfully.
Tell us about your experience or contact us with any questions. We are happy to support you with your project.
Visit our website to learn more about our smart Apache Kafka to SAP data integration solution: www.xeotek.com/saapix .