Troubleshooting Kafka for 2000 Microservices at Wix

Natan Silnitsky
Wix Engineering
Dec 9, 2022


Debugging microservices’ event-driven communication is not trivial, especially when it is based on an event streaming platform like Apache Kafka. While scaling becomes easier and the distributed system becomes more decoupled and resilient, the operational aspects become much harder.

Microservices communicating via Kafka with production issue in Inventory service

How do you figure out why your service’s database has incorrect values? (And increase your org’s dev velocity?) How do you get to the root cause and reduce debugging time? (And save money and effort?) When event processing fails or becomes extremely slow and a consumer lag builds up, how do you bring it back down?

In the past, Wix developers experienced stressful, anxiety-filled moments when they needed to fix an urgent production issue related to the event streaming layer, but were blind to the root cause and powerless to fix it in a straightforward fashion. Instead, they had to drop processing of many events or execute time-consuming DB-fixing scripts.

Over the last few years, my Data Streams team, in charge of the event-driven infrastructure on top of Kafka, created many tools and services that give Wix developers the power and visibility to easily troubleshoot and fix event-streaming-related production issues. We have seen incident resolution time decrease dramatically, with more pinpoint solutions that avoid costly database repairs.

Following are 5 must-have features for easily troubleshooting and fixing event-streaming-related production issues:

1. Trace events flow

Ecommerce flow
Consider the classic ecommerce flow, where a user places an order and triggers a chain of events:

  • An OrderCreated event is produced by the Order service
  • The Payment service consumes the OrderCreated event, performs the payment and produces a PaymentCompleted event
  • The Inventory service consumes the PaymentCompleted event and produces an InventoryUpdated event
Events flow following order creation

Of course, the actual Wix ecommerce system is much more complex and has many different events flowing between many different microservices.

Data corruption
At some point Wix developers noticed that the Inventory service database did not reflect the actual inventory levels of some products.

Even with this simplified example, you can start to see how hard it is to track the events flow and get a clue about the root cause of the issue. Was the OrderCreated or PaymentCompleted event produced correctly? Do they contain all relevant products? Or maybe the Inventory service failed to process the PaymentCompleted event correctly?

Without proper tracing tools, developers were left searching logs of each service for clues, or manually streaming many events in multiple topics in the hopes of finding the potentially faulty events.

Request Id Propagation
In order to easily trace the flow of each request throughout the ecommerce services, we put the original requestId in the event headers. These headers are then deserialized and written as extra metadata to the monitoring and logging infrastructure. The requestId is also put in the headers of newly produced events, which recursively propagates this metadata to all participating services.

requestId and userId propagation
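
Below is a minimal sketch of this pattern using the plain Kafka Java client (Greyhound does this automatically; the topic name, header key, and payload here are illustrative): copy the requestId header from the consumed record onto every record produced while handling it, so downstream services inherit it.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class RequestIdPropagation {

    // Re-attach the incoming requestId header to the outgoing event,
    // so the next hop (and its logs/metrics) can see the same requestId.
    public static void handle(ConsumerRecord<String, String> consumed,
                              KafkaProducer<String, String> producer) {
        Header requestId = consumed.headers().lastHeader("requestId");

        ProducerRecord<String, String> next =
            new ProducerRecord<>("payment-completed", consumed.key(), "{...}"); // illustrative topic/payload
        if (requestId != null) {
            next.headers().add("requestId", requestId.value());
        }
        producer.send(next);
    }

    // On the logging/metrics side, the header is decoded and attached as metadata.
    public static String extractRequestId(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("requestId");
        return h == null ? null : new String(h.value(), StandardCharsets.UTF_8);
    }
}
```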

This “Context propagation” is provided out-of-the-box for all Wix services by Greyhound, a JVM library and polyglot sidecar that forms a data streaming abstraction layer on top of Kafka, with many features that Kafka itself lacks, like produce/consume retry strategies and higher-performance parallel and batch consumers.

By having the requestId attached to all logs and metrics, we can easily see in Grafana dashboards which services successfully processed events related to this requestId:

events tracing between artifact via request id

Logs are also easily filterable by the requestId, so you can see relevant logs for this event in all services at once, including the logs of the relevant inventory database changes.

OpenTelemetry
In addition to placing the requestId in Kafka record headers, Greyhound also exports OpenTelemetry trace and span information. While this data is currently only available for each service internally, in the near future full distributed tracing information will be provided, including a full breakdown of operations.

event handling breakdown via Open Telemetry
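
As an illustration of what such a consumer-side span can look like, here is a minimal sketch using the standard OpenTelemetry Java API (this is not Greyhound's internal code; the instrumentation name and attributes are examples):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class ConsumeTracing {

    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("inventory-service"); // instrumentation name is an example

    // Wrap the event handler in a CONSUMER span so the handling time and
    // failures show up in the trace for this request.
    public static void handleWithSpan(String topic, Runnable handler) {
        Span span = tracer.spanBuilder(topic + " process")
            .setSpanKind(SpanKind.CONSUMER)
            .setAttribute("messaging.system", "kafka")
            .setAttribute("messaging.destination.name", topic)
            .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            handler.run(); // the actual event handler
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}
```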

2. Easily look up a specific event payload

Continuing Ecommerce data inconsistency investigation
Let’s continue the ecommerce example from the tracing segment above. We want to investigate why the Inventory service database does not reflect the actual inventory levels of some products.

Let’s say we are satisfied that all events were successfully produced and consumed by looking at the tracing dashboards for the suspicious user requests. Next we want to investigate if all products were accounted for in the OrderCreated and PaymentCompleted events related to the suspicious user requests.

Kafka Control Plane Apps
In order to do that we need to look at the relevant event payloads. Some Kafka control plane applications allow you to stream and filter events, including UI for Apache Kafka, Confluent Control Center, and Conduktor. You can also use Kafka CLI tools to find relevant events.

Wix Custom Control Plane — Event Stream View
At Wix, we created our own custom control plane in order to utilize the power of Wix’s Greyhound platform’s metrics and configuration. One of the first features we implemented was the ability to stream events from specific topics. The user first chooses which region and Kafka cluster they wish to stream from. They can also choose to filter the stream, to start from some time in the past, and/or from a specific partition or offset. They can also specify text to search for in the Kafka record key, payload or headers.

Event payload streaming on Wix Kafka Control Plane

In our ecommerce Inventory investigation case, we can search for the suspicious requestIds, look inside their payloads and check whether all the items are accounted for.

Duplicate processing and Idempotency
As it turns out, the events had all the items in place, but upon closer examination of the logs, it was discovered that the Inventory service processed the same events twice, because of a partial failure during processing that happens from time to time. The service developers hadn’t included deduplication protection in their database updates to make them idempotent.
For more information on how to introduce idempotency to your service, you can review the 5th pitfall from one of my previous articles.
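
For illustration, here is a minimal deduplication sketch, assuming a Postgres-style database and a processed_events(event_id PRIMARY KEY) table created up front (the table, column, and method names are hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentHandler {

    // Record the event id first; only apply the business update if this is
    // the first time we see it. Duplicate deliveries commit a no-op.
    public static void handleOnce(Connection conn, String eventId, Runnable applyDbUpdate)
            throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO processed_events (event_id) VALUES (?) ON CONFLICT DO NOTHING")) {
            insert.setString(1, eventId);
            boolean firstTime = insert.executeUpdate() == 1;
            if (firstTime) {
                applyDbUpdate.run();   // e.g. decrement inventory counts, in the same transaction
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```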

3. Investigate “slow” consumers root cause

Consumer Lag Alert
At Wix, once a consumer fails to keep up with the rate of produced events for more than 30 minutes, an alert is triggered in Grafana and the consumer’s owning team is notified in Slack. For Kafka, this phenomenon is called consumer lag (for queue-based brokers such as RabbitMQ it’s simply called a large and growing queue).

Kafka Lag alert on Wix Slack channel

Inventory Service Lag
Let’s go back to our ecommerce example. A consumer lag is building up for the Inventory service’s Payments topic consumer. The developers are starting to worry that the inventory database will not reflect the reduced inventory from recently completed purchases.

Inventory service consumer lag

Produced/Consumed Rate Grafana graph
As they start to investigate the consumer lag following the Slack alert, WixStores developers go to the Inventory service Grafana dashboard and see the difference in throughput between producer and consumer for the Payments topic. While the produce rate is ~4K RPM, the consume rate is only ~1K RPM.

“Stuck” Consumers Grafana graph
Another very important graph that Inventory service devs can view is called Longest Running Consumer Handler, where they can see whether any (or all) consumer instances are “stuck” on handling.

Longest running Consumer Handler gauge shown in Grafana graph

This graph is made possible by Greyhound, our Kafka wrapper library and service. Greyhound exposes a Prometheus gauge for each topic consumer’s handlers that measures the currently longest-running consumer handler (the longest-running provided lambda function) in this pod. For more on how Greyhound works, you can check out this video.
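
To make the idea concrete, here is a sketch of how such a gauge can be built (this is not Greyhound's actual code; the metric name and the Micrometer usage are my own illustration): track the start time of every in-flight handler invocation and expose the age of the oldest one.

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class LongestRunningHandlerGauge {

    private final ConcurrentHashMap<Long, Long> inFlightStartMillis = new ConcurrentHashMap<>();
    private final AtomicLong invocationIds = new AtomicLong();

    public LongestRunningHandlerGauge(MeterRegistry registry, String topic, String group) {
        // Gauge is read at scrape time: age (in seconds) of the oldest in-flight handler.
        Gauge.builder("longest_running_handler_seconds", this, g -> g.longestRunningSeconds())
             .tag("topic", topic)
             .tag("group", group)
             .register(registry);
    }

    // Wrap every handler invocation so its start time is tracked while it runs.
    public void record(Runnable handler) {
        long id = invocationIds.incrementAndGet();
        inFlightStartMillis.put(id, System.currentTimeMillis());
        try {
            handler.run();
        } finally {
            inFlightStartMillis.remove(id);
        }
    }

    private double longestRunningSeconds() {
        long now = System.currentTimeMillis();
        return inFlightStartMillis.values().stream()
                .mapToLong(start -> now - start)
                .max()
                .orElse(0L) / 1000.0;
    }
}
```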

Restart “stuck” Kubernetes pod
Back to the investigation. According to the above Grafana graph, it seems like the Inventory service Payments topic consumer is stuck in one of the Kubernetes pods. The developer decides to restart the pod in the hope that the issue will go away and the Payments topic consumption rate will go back to normal.

But alas, it doesn’t help. The consumer rate stays lower than the producer rate. But not to worry, there are other monitoring tools she still hasn’t used.

Consumers Lag View in the Wix Kafka Control Plane
In order to help the developers investigate the root cause of consumer lags, we created a dedicated view in our custom control plane where the developers can easily see which Kafka topic partitions have the most lag.

Consumer Lag breakdown by partition
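
Our view is built on Greyhound and control-plane internals, but the same per-partition numbers can be derived with the stock Kafka AdminClient, roughly like this sketch (admin connection properties are assumed to be provided):

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class PartitionLag {

    public static Map<TopicPartition, Long> lagByPartition(Properties adminProps, String group)
            throws Exception {
        try (AdminClient admin = AdminClient.create(adminProps)) {
            // Committed offsets of the consumer group, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(group)
                     .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets of the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Lag = end offset - committed offset, per partition.
            return committed.entrySet().stream().collect(Collectors.toMap(
                Map.Entry::getKey,
                e -> latest.get(e.getKey()).offset() - e.getValue().offset()));
        }
    }
}
```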

In the ecommerce inventory case, after the pod restart, the lag kept growing for the same partition, just with another pod being assigned to it.

View “last” event
In case the lag is mostly found in one partition, the devs can click on the “Last Message” button and view the current event being processed, to understand whether its payload or headers may be causing the issue.

view “stuck” event payload
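
With the plain Kafka client, a comparable “peek” can be done by assigning the partition directly (without joining the consumer group) and reading the record at the group's committed offset; a rough sketch, assuming bootstrap servers and deserializers are already set in the supplied properties:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PeekRecord {

    // Read the record at a given partition/offset without group membership
    // and without committing anything.
    public static ConsumerRecord<String, String> peek(Properties props, String topic,
                                                      int partition, long offset) {
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition(topic, partition);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, offset);
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            return records.isEmpty() ? null : records.iterator().next();
        }
    }
}
```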

After reviewing the event payload, the developer understood that the event contained a huge number of items. The Inventory service fetched them all into memory, which caused the Kubernetes pod’s service container to run out of memory, the JVM process to become unstable, the consumer handler to get “stuck”, and the lag to build up.

4. Skip or replay events

Continuing with our ecommerce example, in the last segment, we reached a situation where the Payments topic Inventory service consumer lag was caused by a “rogue” event with a huge payload.
The developer decided that the best course of action would be to just skip the event, as she saw in the stream view (described above in section 2) that the issue was isolated to just this one event. Skipping also means she can fix the underlying issue in a more relaxed setting, and then replay the event once the bug is fixed.

How to skip or replay events in Kafka?
1. Kafka SDK seek API
While Kafka allows consumers to skip or replay events by using the seek API, it’s not realistic to expect that the consumer code will know in advance the exact scenarios in which events need to be skipped, how many events, and so on.
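
For reference, this is the raw building block (plain consumer API, not Greyhound): seeking forward skips records, seeking back replays them; the hard part, as noted, is knowing when and how far to seek.

```java
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;

public class SeekExamples {

    // Skip the current "poison" record: the next poll starts right after it.
    static void skipCurrent(Consumer<String, String> consumer, ConsumerRecord<String, String> rec) {
        consumer.seek(new TopicPartition(rec.topic(), rec.partition()), rec.offset() + 1);
    }

    // Replay: the next poll re-reads from the given offset.
    static void replayFrom(Consumer<String, String> consumer, TopicPartition tp, long offset) {
        consumer.seek(tp, offset);
    }
}
```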

2. Kafka-consumer-groups.sh command
In the past, we utilized the kafka-consumer-groups.sh CLI script in order to change a consumer group’s current offsets, but it required shutting down all consumer pods so they would be able to start again with the newly configured offsets.

Another disadvantage is that this command line interface is not easy or straightforward to work with, so it was mostly up to our streaming infra dev team to execute these skips/replays on behalf of other backend developers at Wix.
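
For example, skipping or replaying with the stock tooling looks roughly like this (the broker, group, and topic names are illustrative; the group must be inactive before --execute will apply the change):

```
# Preview a replay from a point in time (dry run)
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --group inventory-service --topic payments \
  --reset-offsets --to-datetime 2022-12-09T10:00:00.000 --dry-run

# Skip one event by shifting the committed offsets forward by 1 (per partition)
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --group inventory-service --topic payments \
  --reset-offsets --shift-by 1 --execute
```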

3. Change consumer group name
In cases where it’s possible to drop all unprocessed events, another option is to change the consumer group name and start processing events from the “latest” offsets.
If replaying all available events is acceptable, the “from beginning” (earliest) configuration can be set instead.
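
In plain Kafka consumer configuration terms, this is just a new group.id (which has no committed offsets yet) combined with auto.offset.reset; the group name below is illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class FreshGroupConfig {

    // A brand-new group.id has no committed offsets, so auto.offset.reset decides
    // where it starts: "latest" drops the unprocessed backlog, "earliest" replays
    // every retained event.
    static Properties freshGroup(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "inventory-service-v2");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest"); // or "earliest" to replay
        return props;
    }
}
```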

4. Easily request Greyhound to skip or replay events
Recently, my streaming infra team developed a way to request from Greyhound, our Kafka wrapper library and service, to skip or replay events.

Currently the request is done via RPC, but soon there will be a new Skip/Replay events view in our Wix Custom Kafka Control Plane.

The user requests to change a consumer group’s offsets for one or more partitions, either with an explicit offset value or from some timestamp.

Skip/Replay events API
  1. The inventory service configuration is changed to include the new reset-offsets configuration.
  2. The configuration change causes the service pods to restart and the consumers to start from the requested offsets.
  3. An additional mechanism makes sure that the reset will not happen again on subsequent restarts, but its details are not in the scope of this article.

Inventory service consumer group lag finally drops
Once the developer requests the skip from Greyhound and the Inventory service consumer group successfully skips the problematic event, the consumer group lag starts to drop and the production issue is resolved.

Kafka consumer lag dropping after skip

5. Redistribute Single Partition Consumer Lag

There are times when a consumer lag is caused by a misconfigured Kafka event key. When an event is produced with a key, all events that share the same key are produced to the same partition and thus are processed sequentially.

produce with same key always goes to same partition

Events with the same key produced to the same partition. Each Partition is processed sequentially by a different Consumer pod (C1..C4)
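
In producer code this is simply a keyed record; with the default partitioner the partition is derived from a hash of the key, so a low-cardinality key pins most traffic to a few partitions (topic and key names below are illustrative):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProduce {

    static void produce(KafkaProducer<String, String> producer, String verticalAppId, String payload) {
        // All events sharing this key land on the same partition and are handled
        // sequentially - a low-cardinality key (e.g. a vertical app id) therefore
        // concentrates traffic on few partitions.
        producer.send(new ProducerRecord<>("payments", verticalAppId, payload));
    }
}
```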

Misconfigured Kafka event key
Continuing with our ecommerce example, on Black Friday 2021 the Payments topic Inventory service consumer lag was caused by relatively slow consumer handlers, but also by a problematic choice of Kafka event key by the Payments service. The key was configured to be the vertical app id, which has low cardinality, so many events went to the same partition instead of utilizing all the partitions of the topic.

Quicker lag resolution
Involving the developers of the Payments service to change the key and deploy a new version took a long time and caused the issue to remain unresolved for a few hours. As a lesson, the Wix streaming infra dev team decided to develop a feature that redistributes the events causing the single-partition lag among all partitions.

Out-of-order processing
While redistribution of the events can cause out-of-order processing, for critical situations where the lag must be removed quickly, or in cases where order doesn’t matter to this particular consumer, this type of remediation can be quite useful.
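
Conceptually, the remediation can be sketched with the plain Kafka clients like this (this is not Wix's implementation; it assumes byte-array (de)serializers are configured and that the original consumer group's offset is then advanced past the drained range, e.g. with the skip mechanism from section 4): drain the hot partition's backlog and re-produce each record with a null key so the default partitioner spreads it across all partitions.

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class RedistributePartition {

    static void redistribute(KafkaConsumer<byte[], byte[]> reader,
                             KafkaProducer<byte[], byte[]> producer,
                             String topic, int hotPartition, long fromOffset, long toOffset) {
        TopicPartition tp = new TopicPartition(topic, hotPartition);
        reader.assign(Collections.singletonList(tp));
        reader.seek(tp, fromOffset);

        long next = fromOffset;
        while (next < toOffset) {
            ConsumerRecords<byte[], byte[]> batch = reader.poll(Duration.ofSeconds(1));
            if (batch.isEmpty()) break;                 // nothing more to drain
            for (ConsumerRecord<byte[], byte[]> rec : batch) {
                if (rec.offset() >= toOffset) break;
                // null key => the record is no longer pinned to the hot partition
                producer.send(new ProducerRecord<>(topic, null, rec.value()));
                next = rec.offset() + 1;
            }
        }
        producer.flush();
        // Per-key ordering is lost from this point on, as discussed above.
    }
}
```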

As of December 2022, we’re still prototyping the solution in production. I will update this article with our fully robust, battle-tested solution once it becomes available.

Summary

Investing in good infrastructure for troubleshooting production issues related to event streaming with Apache Kafka is critical for being able to scale an event-driven architecture across your distributed environment of microservices.

Make sure to have the following monitoring capabilities:

  1. Trace events flow
    Ascertain which services successfully processed the events related to a user request and which haven't, for example by using the OpenTelemetry protocol
  2. Easily look up a specific event payload
    Make sure the event payload is not corrupted or the root cause of some unforeseen issue
  3. Investigate “slow” consumers root cause
    Have relevant metrics, such as longest-running handler, and a lag-per-partition view in place

Monitoring features have to be accompanied by remediation tools such as:

  1. Easy events skip/replay
    Have the ability to skip events to remove consumer lag, or replay events to fix database issues
  2. Redistribution of single partition lags
    Have the ability to redistribute single partition lags in order to resolve slow consumption due to a badly distributed Kafka event key

These capabilities (Trace, Lookup, Longest-Running, Skip, Redistribute) form the acronym TLLSR.

Having infrastructure for microservices’ event-streaming communication (like Wix has with Greyhound and a custom Control Plane) can get you closer to your goals of higher dev velocity and reduced costs while maintaining a scalable and resilient event-streaming architecture.

There are several open source (UI for Apache Kafka) and commercially available tools (such as Confluent Control Center and Conduktor) that can help you get there faster.

Thank you for reading!

If you’d like to get updates on my future software engineering blog posts, follow me on Twitter, LinkedIn, and Medium.

You can also visit my website, where you will find my previous blog posts, talks I gave in conferences and open-source projects I’m involved with.

If anything is unclear or you want to point out something, please comment down below.
