Teamwork: Implementing a Kafka retry strategy at Wise
Hey all, I’m Marcela, a Software Developer from Wise! I joined Wise in December 2020 (yes in the middle of the pandemic and London lockdown). It was quite a start: new city, new company, new team, new tech… I’m still learning every day, but so far very happy to be experiencing what all those agile books talk about: deploying to production whenever, no ceremonies, no finger pointing when things go sideways and not depending on other teams to deploy a feature that you own.
Talking about owned features, when I joined, I was very eager to be part of a product team*. Besides the deployment autonomy I just described, we’re also responsible for developing and sharing the domain we’re working on. To do this, we need to think in a product perspective. If this makes you think we might use DDD (Domain Driven Design) then you’re completely right! And that’s why we take full advantage of our cross-functional characteristics. Having teammates with different backgrounds (back end, front end and mobile developers, designers, product managers…) helps us ideate, create and release features that are valuable for our customers.
So how do product teams work together in practice? What I’ve learned through the past few months is that teamwork is essential. When you have an environment where people trust each other, help one another without judgement, amazing things can happen. This is the essential part of our autonomous teams culture.
In this blog I wanted to share a little story of one of my projects: how in my first few months I helped the team enhance an important feature and deployed it to production. Hope you enjoy it!
A little context: our Activity Service
My team works on the Activity Service, which is responsible for showing information on everything that the user does within their Wise account. Because of this service we’re able to show our customers transfer information: when payments are made and how much you spend, all in a convenient and transparent way.
The Activity Service is one of our core services at Wise — if it goes down, you won’t be able to see these beautiful pages:
It also means speed and reliability are extremely important. After all, our mission is to be instant, convenient, transparent and eventually free. So not updating the list of activities when a customer makes a transaction can lead to a very frustrating experience. And we definitely don’t want that.
Problem we were solving: Enhancing a fragile retry mechanism
When everything is up and running, our service is great at saving and showing activities. But when something goes down for just a couple of seconds, let’s say another service somehow died, then…. well we have some manual work to do to fix our activities. So how can we improve that?
We use Kafka as a stream processor, which basically means our service listens to a topic that contains messages, reads and saves them as objects in our database. This topic is being fueled with messages by other internal services. In other words, they’re the producers and publish the messages we listen to into a Kafka topic.
If something goes wrong, we can use our retry policy which, as the name suggests, will make the service retry to consume the failed message. If this still doesn’t work, we’ll publish this message in the dead letter queue. Publishing in the dead queue is important due to the nature of the objects we’re consuming — customer transactions — we can’t simply disregard a message if it fails. Plus we need to make sure we read all the messages coming from a topic. If a message reaches the dead letter, we have a way of checking its problem and fixing it so it can be properly saved for later.
Great! That’s the happy path. But what would happen if, for instance, the database connection went down?
Kafka would still send us messages, but we wouldn’t be properly consuming them. If the database was down we’d read a message and not save it, read another message and not save it again… read, not consume!
The difference is that when reading it the service would return an error because the message couldn’t be saved. In other words, it couldn’t be consumed. But Kafka’s offset would still change to the next message in the topic; a user wouldn’t be able to see the updated activity list:
Let’s imagine the user had just spent money buying some groceries with their Wise debit card. Although the balance would reflect the correct amount, the activity list wouldn’t show the latest activity (grocery shopping) because we have yet to consume the message that refers to it.
Implementing a new Kafka retry strategy
We realised we can leverage Kafka retries to fix this, but the problem was that our retry configuration wasn’t great. We’d retry only once and right after the error popped up. This obviously didn’t give the service that failed (in our case the database) enough time to recover itself from the problem.
We needed to change the amount of retries we had. Having only one wasn’t ideal. The other problem was the amount of time passing between an error occurred and the retry to consume a Kafka message. This led us to an interesting question: how much time is enough?
After some discussion we decided to join both problems into one solution: for each retry, the amount of time for the next retry would increase in an exponential manner. We started with 500 milliseconds, so basically we’d retry in the following intervals:
500 ms -> 1s -> 2s -> 4s -> 8s -> 16s -> 30s -> 30s
We also capped the maximum interval at 30 seconds and decided we’d retry eight times. These values were based on metrics the team collected which showed that our common problems were blips** from other services, so these intervals should get us covered.
After deploying those changes into production we started to notice some slowness when consuming messages. Well, we did add an extra load into our topic, you see. When a message failed to be consumed we were basically telling Kafka to retry sending it to us. To do that, Kafka would add that same message to the same topic our service is consuming from. In other words, you could say we were walking in circles.
“Hum…” we thought. “How could we avoid that?”. Again, this matters because it directly affected our service’s speed (Remember: we need to offer a product that’s instant). We could’ve been consuming a new message but we were retrying to consume a failed one instead.
That’s when Jesús came into action (yes, one of our team members): “We could create another topic, exclusively for the retried messages”. Fantastic. And that’s what our flow ended up being like:
To put into words, whenever a message fails to be consumed it’s published into another topic, the retry topic. In this new topic the failed message will be retried according to the retry policy I explained above.
With this strategy in place our main topic is free to keep on consuming only new messages — yay!
Learnings from this project
Kafka is a great tool — when used correctly. It comes with many things already built for you, it’s just a matter of knowing where to look. I’m far from being an expert on Kafka but I can say I learned a lot about it by making this change in our service. Of course, my teammates were there to help and guide me to the correct answers as well. I’m very grateful for having the opportunity to deepen my technical knowledge in a safe environment.
Thanks for reading :D
P.S. Interested to join us? We’re hiring. Check out our open Engineering roles.
*Product team: Cross functional team that works on a specific functional area of the product line. Develops domain expertise that can serve multiple products across the company’s product portfolio.
**Blip: When a service fails to respond a health check, but it’s not down