Reactive Streams at Gympass

How we have been building stream processing using Apache Kafka and Akka

Matheus Felisberto
Wellhub Tech Team (formerly Gympass)
7 min read · Jan 21, 2021


You have probably heard that today’s world requires our systems to handle huge amounts of data. You may also have heard that batch processing isn’t an option anymore: no matter how the data will be used, real-time is required and everything must be kept up to date all the time. How can that be achieved efficiently across different use cases?

Unbounded data in Iguaçu Falls, Brazil

A cup of context

On a given weekend, spending some quality time watching the Creed movies on Netflix, you felt motivated to start practicing boxing too. You enrolled in Gympass and started searching for a gym near your work, adding the address and a few filters, such as facilities with a shower and parking, since you have coworkers and need somewhere to park your Ferrari. Now you’re waiting for the results that match your preferences. The question is, how long are you willing to wait? 3 seconds? 300 milliseconds? I’d say I don’t want you to see our loading screen at all, even though it’s pretty. It’s always frustrating to wait too long for anything, whether you ordered a meal, bought a poster of the Ninja Turtles, or bought something else you don’t even need right now. We all have this little anxiety.

The use case I’ve just described requires a search engine with support for full-text search, such as Elasticsearch. In fact, our search team based in Lisbon (hiring, by the way) has been working hard on one of the search APIs that we have, the Gym Search.

After the search results, you’ve chosen a gym and want more information about it: maybe a few photos of the facility, the opening hours, because you’re a very busy person, and, last but not least, to book a trial class. The functional requirements here are different from the previous ones. You’ve already found the gym you were looking for, and it doesn’t make sense to use a search engine as a key-value database. Given an ID, the service needs to retrieve the whole object, with all details, in an unstructured manner, at once. A NoSQL database such as DynamoDB should do the job. Since one microservice should do one thing and have its own database, here we go with another API, the Gym Profile.

There are literally hundreds of thousands of users spread across the globe doing the same thing: searching for activities, checking in at their favorite gyms or studios, booking live classes or a personal trainer session. Faced with such a challenge, our services need to be not just responsive but also elastic, meaning they have to handle 10k users as well as 10M users without any code change, even under such a varying workload.

Having the best tool for the job requires different techniques, patterns, and efficient ways to replicate data. Leveraging CQRS and Event Sourcing, we have a reliable way to publish state changes decoupled in both space and time, since the only thing a microservice needs to know is the broker address.

Errors. Yeah, they happen in the most unimaginable ways. It’s a huge mistake to think they will never occur because of the tests you ran on your machine. Given the complexity of a distributed system, it’s only natural that errors will occur; see the 8 Fallacies of Distributed Systems if you want to know more. Nevertheless, microservices should be resilient in the face of such situations.

To the most watchful reader: I’ve just described the Reactive Manifesto using a small slice of the service mesh at Gympass.

Let’s talk business

State changes need to be made available to all interested consumers as fast as possible, somehow. According to the Event Sourcing pattern, every event needs to be stored as a sequence, in an append-only fashion. That’s where Apache Kafka comes into play. Nothing describes it better than its own documentation:

“Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications”

It’s worth pointing out that an immutable, append-only log is also central to the Kappa Architecture.

When it comes to high performance, Apache Kafka isn’t playing around. There is a lot of hard work at each layer to achieve a low-latency system, such as zero-copy. Transferring data from disk over the network normally requires copies between the kernel and the application. The zero-copy technique eliminates those copies by sending data directly to the socket, instead of pushing it up to the application in user space and then back down to the socket. By avoiding these unnecessary context switches and saving memory bandwidth, performance can be increased by as much as 65%.

Zero copy illustrated
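As a minimal sketch of the idea in Scala (not Kafka’s actual code), the JVM exposes zero-copy through FileChannel.transferTo, which delegates to sendfile on Linux so the bytes go straight from the page cache to the socket buffer; the file name and address below are made up:

```scala
import java.io.FileInputStream
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

object ZeroCopySketch extends App {
  // Made-up destination and file; imagine a log segment being sent to a consumer.
  val socket      = SocketChannel.open(new InetSocketAddress("localhost", 9092))
  val fileChannel = new FileInputStream("segment.log").getChannel

  val size     = fileChannel.size()
  var position = 0L
  while (position < size) {
    // transferTo asks the kernel to move the bytes directly to the socket,
    // skipping the usual kernel -> user space -> kernel round trip.
    position += fileChannel.transferTo(position, size - position, socket)
  }

  fileChannel.close()
  socket.close()
}
```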

Even though applications that replicate data into their own specialized retrieval systems are expected to be eventually consistent, that doesn’t mean they can be out of date forever, which is what would happen if an event got lost. To avoid this situation, by persisting data on disk and replicating it, the broker provides a resilient and fault-tolerant home for events. But there are also guarantees on the application side that need to be made based on the business. Consider a payment application that consumes an event from this arbitrary broker for every user that signs up, and processes that event in order to make the charge. What would happen if that event got delivered twice? The user would be charged twice, the user would be mad, and, in any reasonable scenario, I would be updating my LinkedIn profile.

When data is replicated this way, an event can be delivered more than once until the broker receives an acknowledgment from the consumer, quite similar to TCP. As long as handling the message is idempotent, the final result won’t be inconsistent. The delivery semantics are crucial, since catastrophic side effects can happen, so choosing a stream processing framework requires knowledge of the use cases.
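To make that concrete, here is a minimal sketch of an idempotent handler for the sign-up example above. SignUpEvent, processedEvents, and charge are hypothetical names, and a real implementation would keep the set of processed event IDs in a persistent store rather than in memory:

```scala
import scala.collection.mutable

object IdempotentHandlerSketch {
  final case class SignUpEvent(eventId: String, userId: String)

  // In production this would be a persistent store keyed by eventId;
  // an in-memory set keeps the sketch self-contained.
  private val processedEvents = mutable.Set.empty[String]

  private def charge(userId: String): Unit =
    println(s"charging user $userId") // hypothetical payment call

  def handle(event: SignUpEvent): Unit =
    if (processedEvents.add(event.eventId)) {
      // First delivery of this eventId: safe to apply the side effect.
      charge(event.userId)
    }
    // Any redelivery is ignored, so at-least-once delivery still
    // produces a consistent final result.
}
```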

Akka Streams

At Gympass we have a polyglot service mesh, recently including Kotlin, but the majority of services are still written in Scala, following the functional paradigm. The Akka toolkit is well known here, since we use the HTTP module in every Scala API that we have. Akka also has a module for streams that implements the Reactive Streams specification, meaning asynchronous, non-blocking stream processing with backpressure.

The backpressure mechanism prevents a subscriber from receiving more work than it can handle. A Source is a producer: it emits elements to downstream components. A Flow is where elements are processed as they flow through it, and finally the Sink is the subscriber. The way it works is simple: the Sink has a buffer capacity and requests elements from upstream components accordingly. This is known as a dynamic push-pull model. Since Akka Streams is built on top of Akka, this capacity maps to the actors’ mailboxes. The cool thing here is that none of this needs to be built from scratch.

Backpressure mechanism illustrated
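Here is a minimal sketch of such a pipeline with Akka Streams; the names and numbers are arbitrary, and the point is that demand flows from the Sink back up to the Source:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

object BackpressureSketch extends App {
  implicit val system: ActorSystem = ActorSystem("streams-example")

  // Source emits, Flow transforms, Sink consumes. The Sink signals demand
  // upstream, so the Source never emits more than the Sink has asked for.
  val source = Source(1 to 1000)
  val double = Flow[Int].map(_ * 2)
  val sink   = Sink.foreach[Int](println)

  source.via(double).runWith(sink)
}
```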

Alpakka

Alpakka is a library for building integration pipelines in a reactive, stream-aware way, offering connectors to many tools such as Apache Cassandra, Hadoop, Elasticsearch, and, of course, Apache Kafka. Here is an example from the documentation that demonstrates how simple it is to implement a consumer.

From Alpakka Kafka docs
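The snippet in the documentation is roughly along the lines of the sketch below; the broker address, group id, and topic name are placeholders:

```scala
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

object PlainConsumerSketch extends App {
  implicit val system: ActorSystem = ActorSystem("consumer-example")

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092") // placeholder broker address
      .withGroupId("example-group")           // placeholder group id
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  // Every record flows through the stream under backpressure; here we just
  // print its value, a real consumer would do something useful with it.
  Consumer
    .plainSource(consumerSettings, Subscriptions.topics("example-topic"))
    .runWith(Sink.foreach(record => println(record.value())))
}
```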

What the Alpakka Kafka connector does behind the scenes is manipulate the KafkaConsumer methods pause() and resume(). This is a really good thing, since none of that complexity ends up being our concern; the polling, the processing, and potentially the stages are all asynchronous, as Sean Glover explains here.

Gluing it all together

We’ve been developing a service named Tagus, which holds all kinds of contracts a user can have. Written in Go and backed by a relational database, as you might have guessed from the context, this service publishes every state change into Kafka for any interested consumer to use as needed.

In a different bounded context live the User Profile API and the User Profile Worker. The latter holds the Alpakka consumer, catching every event from the user topics and saving it as an unstructured document in DynamoDB. The API then serves these documents with better performance than calling the source of truth, since it doesn’t need to execute complex joins. Unless you have superpowers, this API brings the data back faster than you can blink, with an average p95 of 26.65ms.

Sourced in a relational database and sank in a NoSQL
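A simplified sketch of what such a worker pipeline can look like with Alpakka Kafka is shown below; saveToDynamo is a hypothetical stand-in for the actual DynamoDB write, and the topic and group names are placeholders:

```scala
import akka.actor.ActorSystem
import akka.kafka.scaladsl.{Committer, Consumer}
import akka.kafka.{CommitterSettings, ConsumerSettings, Subscriptions}
import org.apache.kafka.common.serialization.StringDeserializer

import scala.concurrent.{ExecutionContext, Future}

object UserProfileWorkerSketch extends App {
  implicit val system: ActorSystem  = ActorSystem("user-profile-worker")
  implicit val ec: ExecutionContext = system.dispatcher

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092") // placeholder broker address
      .withGroupId("user-profile-worker")     // placeholder group id

  // Hypothetical stand-in for the DynamoDB write the real worker performs.
  def saveToDynamo(json: String): Future[Unit] = Future.unit

  Consumer
    .committableSource(consumerSettings, Subscriptions.topics("users"))
    .mapAsync(parallelism = 4) { msg =>
      // Persist the document first, then emit the offset to be committed,
      // so an event is only acknowledged after it has been stored.
      saveToDynamo(msg.record.value()).map(_ => msg.committableOffset)
    }
    .runWith(Committer.sink(CommitterSettings(system)))
}
```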

On the producer side, there is another important guarantee that can’t be violated: temporal order. To increase throughput, topics in Apache Kafka are split into partitions so that work can run in parallel. While this is great, it also has a downside. The broker guarantees ordering within a partition, but not across the topic as a whole, which means that if the Tagus service publishes a message such as USER_UPDATED for the same user twice, to different partitions, the guarantee is broken. To avoid that, the producer uses a partition key to make sure that every message with a given key ends up on the same partition. In our case, the user id is perfect.
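As a sketch of that producer-side guarantee, assuming a plain Kafka producer and made-up topic and payload names, keying each record by the user id is all it takes:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KeyedProducerSketch extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder address
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)

  // The user id is the record key: Kafka hashes the key to pick the partition,
  // so every USER_UPDATED event for the same user lands on the same partition
  // and keeps its temporal order.
  val record = new ProducerRecord[String, String](
    "users",                      // placeholder topic name
    "user-42",                    // partition key: the user id
    """{"type":"USER_UPDATED"}""" // simplified payload
  )

  producer.send(record)
  producer.close()
}
```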

Conclusion

Event-centric thinking is everywhere now, and a convergence between the requirements of everyday applications and the big data & analytics world, as subtly pointed out above, can be observed. There is always more than one way to build something. The most important thing is to know which one is best suited, much like a tailor who measures every inch instead of selling an average size.


