Debugging a critical memory leak in Rapido’s message transporter service

Swaminathan Muthuveerappan
Published in Rapido Labs
3 min read · Mar 9, 2023

Rapido Communication Engine (RCE) is a service built from scratch at Rapido by the Application Platform team. Its main purpose is to transport messages from Rapido’s internal backend systems to the customer and captain mobile applications. The messages can be push notifications, banners, alerts, etc. The service uses the lightweight MQTT (Message Queuing Telemetry Transport) protocol as its transport for delivering messages.

Rapido Comms Engine (RCE) Architecture

After the production release of RCE, we observed frequent service restarts caused by out-of-memory (OOM) errors as memory utilisation climbed. A closer look at the heap footprint of the service revealed spikes at periodic intervals, as shown in the screenshot (Fig 1). This kind of pattern commonly appears when the garbage collector has to kick in frequently because of a memory leak. With this hypothesis, we started gathering and analysing the memory footprint of the service.

Fig 1. Memory Utilisation of the service

We used Eclipse Memory Analyzer to analyse a heap dump taken from one of our production RCE service instances. This tool shows detailed information about the JVM heap, such as a histogram of all the objects and probable leak suspects. In the “Histogram” view, after sorting objects by retained heap size, we identified several Kafka producer client objects in the heap, as shown in the screenshot below (Fig 2). This was unexpected, because the service is supposed to create only one instance of the client, during its initialisation phase. RCE uses Kafka to push events such as “message_created” and “message_pushed_to_mobile” to downstream systems for analysis. Using the statistics from the heap dump analysis, we picked out the places that were calling the Kafka.createProducer() function, which creates a producer client used for producing messages to Kafka. To our surprise, this function was being called in the service health check API.

Fig 2. Eclipse Memory Analysis
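The post does not say how the production heap dump was captured. As a minimal sketch, assuming a HotSpot-based JVM, a dump in the .hprof format that Eclipse Memory Analyzer opens can be written from inside the process through the JDK's HotSpotDiagnosticMXBean. The class name HeapDumpSketch, the helper dumpToTempDir and the output location are illustrative choices, not part of RCE.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only: writes a heap dump of the current JVM to a temp directory.
final class HeapDumpSketch {

    /** Dumps the live (reachable) objects of this JVM and returns the .hprof path. */
    static Path dumpToTempDir() {
        try {
            // A fresh directory guarantees the target file does not exist yet,
            // which dumpHeap requires.
            Path dir = Files.createTempDirectory("heap-dump");
            Path out = dir.resolve("service-heap.hprof");

            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // live = true: only reachable objects are written, which is what
            // matters when hunting for objects that are being retained.
            bean.dumpHeap(out.toString(), true);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dump = dumpToTempDir();
        System.out.println("Wrote " + dump + " (" + Files.size(dump) + " bytes)");
    }
}
```

The resulting file can be opened in Eclipse Memory Analyzer and sorted by retained heap size in the “Histogram” view. For an instance that is already running, the jmap and jcmd tools bundled with the JDK are common alternatives that need no code change.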

The RCE health check API checks the availability of the service and all its dependent components, such as Kafka and the database. To check the availability of Kafka, we called the Kafka.createProducer() function from the Kafka Confluent library, because this call fails when Kafka is unreachable and thereby fails the health check, which is the expected outcome. On a successful connection, however, the function created a new producer on every health check call, and these producers accumulated in the heap, ultimately leading to out-of-memory (OOM) issues.
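The pattern described above can be reproduced in a few lines. This is a sketch under stated assumptions rather than RCE's actual code: StubProducer is a hypothetical stand-in for the Kafka producer client, and its static OPEN list models the resources a real producer keeps alive until it is closed. LeakyHealthCheck and isKafkaHealthy are likewise made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a health check that creates a producer per call.
final class LeakyHealthCheck {

    /** Anti-pattern: a new producer is created on every call and never closed. */
    static boolean isKafkaHealthy() {
        try {
            new StubProducer(); // a real client would fail here if Kafka were unreachable
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            isKafkaHealthy();
        }
        System.out.println("Open producers after 5 health checks: " + StubProducer.OPEN.size());
    }
}

/** Hypothetical stand-in for a Kafka producer client (not a real Kafka class). */
final class StubProducer implements AutoCloseable {
    // Every producer that was created and not yet closed stays reachable from here,
    // mimicking the buffers and background resources a real producer holds.
    static final List<StubProducer> OPEN = new ArrayList<>();

    StubProducer() {
        OPEN.add(this);
    }

    @Override
    public void close() {
        OPEN.remove(this);
    }
}
```

Because health checks are polled continuously, the number of retained producers in this sketch grows with every call and never drops, which matches the many producer objects seen in the heap histogram.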

The fix was to stop creating a Kafka producer client during service health checks; the producer is now created only once, when the service starts. After deploying the fix, we observed close to zero service restarts, and the memory footprint was largely stable.
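A minimal sketch of that shape, again with hypothetical names: CountingProducer stands in for the Kafka producer client, SharedProducerService for the service, and isUsable() is an assumed probe, since the post does not say how the health check verifies Kafka after the fix. The point it illustrates is only that the producer is constructed once at startup and the health check reuses it.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: one producer per service instance, reused by health checks.
final class SharedProducerService implements AutoCloseable {

    // Created exactly once, when the service starts.
    private final CountingProducer producer = new CountingProducer();

    /** The health check reuses the existing producer instead of creating a new one. */
    boolean isKafkaHealthy() {
        return producer.isUsable();
    }

    @Override
    public void close() {
        producer.close(); // released once, at shutdown
    }

    public static void main(String[] args) {
        try (SharedProducerService service = new SharedProducerService()) {
            for (int i = 0; i < 100; i++) {
                service.isKafkaHealthy();
            }
            System.out.println("Producers created: " + CountingProducer.CREATED.get());
        }
    }
}

/** Hypothetical stand-in for a Kafka producer client (not a real Kafka class). */
final class CountingProducer implements AutoCloseable {
    // Counts constructions so the sketch can show how many clients were ever built.
    static final AtomicInteger CREATED = new AtomicInteger();
    private volatile boolean closed;

    CountingProducer() {
        CREATED.incrementAndGet();
    }

    /** Assumed probe; a real check would depend on the client library in use. */
    boolean isUsable() {
        return !closed;
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

However many times the health check runs, the creation count in this sketch stays at one per service instance, so memory use no longer depends on how often the endpoint is polled.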

Credits & thanks to all the members involved in this task.

Team involved in developing the MQTT based message transporting service

Rishikesh CT, Senior Product Engineer Backend
Inder, Senior Product Engineer Mobile
Subhadip Sinha, Product Test Engineer
Swaminathan Muthuveerappan, Engineering Manager
