Event-Driven Challenges: Seamlessly Processing Millions of SOAP & REST Events with Zero Latency

Ambar Singh
Deutsche Telekom Digital Labs
7 min read · Mar 19, 2024

A Customer 360° view is essential in today’s world to understand, engage, and serve customers, which in turn leads to higher customer satisfaction and long-term loyalty. We set out to build a complete view of our customers, known as the ‘Customer 360° Platform’. Of course, its architecture has many components providing key features, and remember, “Rome wasn’t built in a day ⏳”. Ensuring that our Customer 360° database stays properly synchronised with the existing customer databases (legacy DBs) led us on a quest to develop an event-processing mechanism based on a Change Data Capture (CDC) solution.

In this blog, we unveil the details of our journey towards real-time synchronisation of millions of customer events with minimal latency. We’ll explore the challenges we encountered along the way and, most importantly, the remarkable results.

Requirements & Challenges

  • Why do we require customer data change events? — Our strategy involves setting up a new customer 360 database and gradually migrating applications from legacy customer databases. Consequently, we’ll have two sets of customer databases, and to ensure that our Customer 360° database remains synchronised with our legacy customer databases, we need a solution that notifies us of customer data changes.
  • Customer Data Change Events — Our legacy databases already have notification channels capable of pushing event notifications triggered by customer data changes. However, a significant challenge we faced was the diversity of event formats and schemas unique to each legacy database.
Customer data change from legacy database to Customer 360°
  • Event Transformation — Events arrive in a variety of formats, including SOAP (XML) and REST (JSON), each with its own schema. To ensure consistency, we had to harmonise and transform these events into a standardised format.
Event Type and their estimated numbers
  • Event Broadcasting — Beyond transforming these events, our architects strategically chose to distribute the transformed events to other DT internal services. This means that a single event can be sent to multiple consumers based on their specific needs and specifications.
  • Streamline Event Filtering — Each consumer has its own set of required events and only needs those that are essential to it. This necessity drove the development of an event filtering feature.
Event Transform, Filter & Broadcast

Obstacles, Issues and their Solutions

Let’s explore our event-driven solution and the challenges, obstacles, and issues we encountered during the implementation of this event processing system, and, of course, how we managed to resolve them.

  • Choosing the right Message Queue tool — With so many popular message queues available, such as Apache Kafka, RabbitMQ, MSK, ActiveMQ, and many more, deciding ‘which one and why’ posed quite a challenge for us. After careful consideration of factors like availability, maintainability, scalability, and resiliency, we ultimately opted for AWS MSK.
  • Build SOAP + REST Web Services — The events we receive are based on both REST and SOAP protocols, so our Event Processor service must be implemented in a way that allows it to consume both types of events.
    Dealing with WSDLs can be quite a headache — Sometimes you need to write extra lines of code for a SOAP web service, and solving WSDL parsing issues also requires some expertise 🤓.
    Hundreds of concurrent requests with large XML payloads — Handling these was yet another aspect that required attention, demanding robust memory management and efficient processing strategies.
    High event spikes during business hours — The legacy customer change events do not occur uniformly around the clock; instead, we observed substantial spikes during business hours. This pattern required maintaining higher concurrency levels to manage the increased workload effectively.

To tackle these challenges, we implemented an ‘asynchronous’ approach to transform customer events within our Event Processor service. Here’s how it works: when the Event Processor receives a customer change event, it promptly collects the event payload and responds immediately with an HTTP 200 OK status. Internally, the collected event payloads are processed by Java threads in parallel, ensuring fast transformation, and are then sent to the MSK topic. This approach has not only enabled us to maintain high throughput but has also reduced the latency of processing customer events to near insignificance.
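The sketch below illustrates this acknowledge-then-process pattern. It is a minimal example, not our exact implementation: it assumes a Spring Boot controller, a hypothetical /customer-events endpoint, a fixed worker pool, and an assumed ‘customer-events’ topic on MSK.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.springframework.http.ResponseEntity;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CustomerEventController {

    // Bounded worker pool; the size here is an assumed value, tuned for business-hour spikes.
    private final ExecutorService workers = Executors.newFixedThreadPool(32);

    private final KafkaTemplate<String, String> kafkaTemplate;

    public CustomerEventController(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Hypothetical endpoint: accepts raw SOAP (XML) or REST (JSON) payloads as text.
    @PostMapping("/customer-events")
    public ResponseEntity<Void> receive(@RequestBody String rawPayload) {
        // Acknowledge immediately; transformation and publishing happen off the request thread.
        workers.submit(() -> {
            String normalized = transformToCanonicalFormat(rawPayload);
            kafkaTemplate.send("customer-events", normalized); // assumed topic name
        });
        return ResponseEntity.ok().build();
    }

    // Placeholder for the SOAP/REST-to-canonical-schema transformation logic.
    private String transformToCanonicalFormat(String rawPayload) {
        return rawPayload; // the real logic would parse the XML/JSON and remap fields
    }
}
```

One trade-off of acknowledging before processing is that any failure after the 200 response must be handled through retries, dead-lettering, or monitoring rather than being surfaced to the caller.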

  • ‘OOMKilled’ Kubernetes Error — In Kubernetes, ‘OOM’ stands for ‘Out of Memory’, and it means that a container or pod was forcefully terminated because it used more memory than its allowed limits. Our Event Processor faced this issue a couple of times during performance testing. Upon closer inspection, we discovered memory-leak vulnerabilities in third-party JAXB dependencies and within our sensitive-logging snippets. Moreover, we weren’t entirely satisfied with the behaviour of the garbage collector, especially given our use of Java 11.
OOMKilled Issue in Event Processor

This led us on a journey to optimise memory management and container resource allocation, and to fix third-party vulnerabilities, for smoother operation of the Event Processor:

- Java 17 upgrade

- Upgraded JAXB dependencies

- Refactored sensitive-logging code snippets for optimised memory usage (see the sketch below)

- Delved deeper into garbage collection (GC) strategies

These changes have not only improved performance but have also contributed to our system’s resilience and efficiency.
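As a rough sketch of the sensitive-logging refactor mentioned above (illustrative, not our exact code), the idea was to stop eagerly building large log strings from full event payloads:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class EventLogging {

    private static final Logger log = LoggerFactory.getLogger(EventLogging.class);

    public void logIncomingEvent(String eventId, String largeXmlPayload) {
        // Before: eager concatenation builds a large throwaway string on every call,
        // even when DEBUG logging is disabled.
        // log.debug("Received event " + eventId + " payload=" + largeXmlPayload);

        // After: guard the statement and log only a bounded summary of the payload.
        if (log.isDebugEnabled()) {
            log.debug("Received event {} payloadLength={}", eventId, largeXmlPayload.length());
        }
    }
}
```

Guarding the statement and logging a bounded summary keeps extra copies of large payloads from lingering on the heap purely for logging purposes.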

Garbage Collector Comparison JDK 11, JDK 16 & JDK 17
  • Challenges with Amazon MSK — During our experience with MSK, we primarily encountered the following two issues:
    MSK Broker Storage Problem — Amazon MSK employs Amazon Elastic Block Store (Amazon EBS) for broker storage, with a default retention period of 7 days and a capacity of 30 GB. During our performance testing, we simulated a load of 200–500 transactions per second over a week, resulting in ‘kafka.server.LogDirFailureChannel No space left on device’ errors. This led to connection timeouts, as brokers could no longer allocate storage.
    Event Retries Exhaustion and Timeouts — In the initial days of the Event Processor, we encountered event-retry exhaustion and timeouts approximately 3–4 times per month. The ‘retries exhausted’ problem was triggered by the ‘NOT_ENOUGH_REPLICAS’ error, which occurs when an old broker is replaced by a new one (typically within 30 seconds). Unfortunately, we had enabled acknowledgement from all replicas, which was not the intended configuration. As a result, messages were retried because a replica was not available during the switch, but the ‘delivery.timeout.ms’ setting was quite low, leading to timeouts that cascaded until all retries were exhausted. This typically resulted in message losses ranging from 10,000 to 50,000 within 5 minutes, particularly under high load.
Amazon MSK Broker Storage Problem
Increased Retry Rate Resulted To Increase Error Rate
Retries Exhaustion and Timeouts

To resolve these issues, we enabled autoscaling for broker storage so that enough disk space is always available. Additionally, we configured CloudWatch alerts for proactive monitoring and issue detection. In response to the retry problem, we made adjustments at the application level, tuning Kafka producer configurations such as ‘acks’, ‘retry.backoff.ms’, ‘delivery.timeout.ms’, and more. These configuration changes not only mitigated the occurrence of the ‘NOT_ENOUGH_REPLICAS’ issue but also allowed sufficient time for proper retries.
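For illustration, the producer-side tuning described above roughly corresponds to settings like the following; the class, bootstrap address, and values are assumptions rather than our production configuration.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventProducerFactory {

    public static KafkaProducer<String, String> create(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Illustrative: waiting on all replicas ('all') was not the intended setting in our
        // case; the right value depends on your durability requirements.
        props.put(ProducerConfig.ACKS_CONFIG, "1");

        // Give a replaced broker time to rejoin before retries are abandoned (assumed values).
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 1000);
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);

        return new KafkaProducer<>(props);
    }
}
```

Note that Kafka requires ‘delivery.timeout.ms’ to be at least ‘linger.ms’ plus ‘request.timeout.ms’, so raising it is what gives the retries room to succeed during a broker switch.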

  • Filtering and Broadcasting — Filtering and broadcasting with Apache Kafka (MSK) posed no major challenge, as it already offers robust options such as Kafka Streams and Kafka Connect. In the interest of finer control over event filtering and broadcasting, we chose Kafka Streams. Using Spring Kafka Streams, we efficiently process events against consumer-specific filter rules and route them to the appropriate channels based on their subscriptions.
Event Processor Architecture
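To give a sense of what consumer-specific filtering and routing looks like, here is a minimal sketch using the plain Kafka Streams DSL; the application id, topic names, and filter predicates are hypothetical placeholders for the actual subscription rules.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EventBroadcastTopology {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-broadcaster");      // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // assumed address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Canonical, already-transformed customer events (assumed topic name).
        KStream<String, String> events = builder.stream("customer-events");

        // One filtered branch per consumer, each with its own subscription rule.
        events.filter((key, value) -> value.contains("\"eventType\":\"ADDRESS_CHANGED\""))
              .to("consumer-a-events"); // hypothetical downstream topic

        events.filter((key, value) -> value.contains("\"eventType\":\"CONTRACT_UPDATED\""))
              .to("consumer-b-events"); // hypothetical downstream topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Each downstream consumer then reads only its own topic, which keeps the filtering logic in one place rather than duplicated across consumers.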

Summary

In our pursuit of developing a comprehensive data platform with a focus on customer insights, we’ve encountered numerous hurdles, embraced innovations, and arrived at solutions. Constructing such a platform from scratch involves intricate components and complexities that demand attention. As the saying goes, “Great achievements take time”.

This article aims to guide you through the unique journey of how our team, specializing in data platforms, achieved the real-time processing of a large volume of customer events with minimal latency. We delve into the obstacles we faced, each met with inventive solutions. These challenges range from tracking changes in customer data to the transformation and standardization of diverse event formats, as well as filtering and broadcasting. Throughout this article, we share our insights and experiences.

Ultimately, we explore event-driven solutions, and below, you’ll discover some of our test results. We remain committed to our ongoing quest for knowledge and self-improvement, constantly working to enhance our event processing capabilities. As we continue to improve our data platform, we eagerly anticipate more exciting and rewarding challenges ahead.

Test: Total Event Processed in 7 Days (@200–500 TPS)
Test: Total Event Consumed in 2Days
Events Received @ 38k per min (634 TPS) in Production
