How we made strategic architecture decisions for the Inception Service
Authored by: Danielle Zegelstein, Engineering Manager | Serge Vartanov, Staff Software Engineer | Jeff Glasse (The Belated Engineer), Senior Engineering Manager | Cooper Jackson (cajaks2), Staff Site Reliability Engineer
In our first post, we touched on the execution of building the Inception Service for Swipe Night Season 2. In this follow-up, we dive into the decisions that were made, the challenges that came up along the way, and how we delivered the Inception Service that changed Swipe Night into what it has become today. With only three weeks until launch, we saved this part of the project as a MAX addition because we were hard at work on core functionality and this was a value-add. For the Inception Service to become usable, we had to address the Kafka in the room. Our main technical challenges included:
- Getting through a backlog of Episode 1 users while optimizing cost. We had to figure out how to render personalized videos while not blowing the budget.
- Starting in week 2, for every new person who played Swipe Night Episode 1, we needed to render their personalized video within 10 minutes of when they started playing.
The Kafka Conundrum
It’s true that Kafka is an industry standard for event-driven architecture, and it suited our particular requirement: videos had to be personalized, rendered, transcoded and made available via CDN in response to an event being fired, and the workflow had to run end-to-end within a 10-minute window. Once set in motion, the average workflow took around 3 minutes, so it didn’t have to start immediately, so long as there wasn’t a long queue or backlog. Kafka was the right technology for this, but our scale requirements presented a stability problem: we had over a thousand Kafka consumer workers, and the removal of any worker pod triggered a reassignment of the other thousand pods in the consumer group.
Each message represented a request to render a personalized video — a computationally intensive task that could take 2–6 minutes, during which the system might experience a rebalance. We had to ensure that pods became aware of a reassignment as soon as possible and had a strategy for quickly returning to a stable state without wasted effort. We had to build this system in such a way that it could repair itself. And it was through trial and error that we got to a model that was self-healing.
We did this in part by configuring our pods to heartbeat in a background process. At the first sign of a heartbeat error, a pod would kill the rendering work and flush the current batch without committing the offset. When a rebalance occurred, consumers were assigned a new partition as quickly as possible, but unless the previous batch of messages was manually flushed, both batches (or more) could be consumed in parallel. The orphaned batch’s offsets would never be committed to Kafka, and the load of multiple video rendering processes running on a single pod consumed all available CPU on the machine, causing serious performance issues.
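The heartbeat-and-abort behavior described above can be sketched roughly as follows. This is a minimal illustration and not our production consumer: `process_batch`, `HeartbeatLost`, and the `send_heartbeat`, `render`, and `commit` callables are hypothetical stand-ins for whatever the Kafka client and renderer actually expose, and a plain thread stands in for the background process.

```python
import threading
from typing import Callable, Iterable


class HeartbeatLost(Exception):
    """Hypothetical error raised by the heartbeat call during a rebalance."""


def process_batch(
    batch: Iterable[str],
    render: Callable[[str, threading.Event], None],
    send_heartbeat: Callable[[], None],
    commit: Callable[[], None],
    heartbeat_interval_s: float = 0.01,
) -> bool:
    """Render each message; commit the offset only if no heartbeat failed.

    Returns True if the batch was committed, False if it was abandoned.
    """
    abort = threading.Event()  # set when the consumer should drop its work
    done = threading.Event()   # set when the batch is finished either way

    def heartbeat_loop() -> None:
        # Runs beside the render work so a slow render never delays a heartbeat.
        while not done.is_set():
            try:
                send_heartbeat()
            except HeartbeatLost:
                abort.set()  # first sign of trouble: tell the renderer to stop
                return
            done.wait(heartbeat_interval_s)

    beater = threading.Thread(target=heartbeat_loop, daemon=True)
    beater.start()
    try:
        for message in batch:
            if abort.is_set():
                return False  # flush the batch, leave the offset uncommitted
            render(message, abort)  # the renderer is expected to poll `abort`
        if abort.is_set():
            return False
        commit()
        return True
    finally:
        done.set()
        beater.join()
```

The renderer receives the `abort` event so that a long render can stop partway through instead of burning CPU on a batch that now belongs to another consumer.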
Because the orphaned partition had been reassigned, two consumers could also end up computing the same batch, though only one could commit to Kafka. With hundreds of messages in a batch, this could lead to thousands of redundant videos being rendered, as both the orphaned consumer and the proper consumer did the same work. To prevent messages from being lost or rendered multiple times, we also tracked video personalization progress in a separate microservice, which consumers checked in with after picking up a batch of messages and before rendering a video.
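The check-in step can be illustrated with a small sketch. Everything here is an assumption made for illustration: `RenderProgressStore` is an in-memory stand-in for the separate progress service, and `try_claim`, `mark_done`, `release`, and `render_batch` are hypothetical names. A real service would also need something like a lease timeout so that a claim held by a dead pod eventually expires.

```python
import threading
from typing import Callable, Dict, Iterable, List


class RenderProgressStore:
    """In-memory stand-in for the separate progress-tracking service."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status: Dict[str, str] = {}  # request_id -> "in_progress" | "done"

    def try_claim(self, request_id: str) -> bool:
        """Return True only for the first consumer to ask about this request."""
        with self._lock:
            if request_id in self._status:
                return False
            self._status[request_id] = "in_progress"
            return True

    def mark_done(self, request_id: str) -> None:
        with self._lock:
            self._status[request_id] = "done"

    def release(self, request_id: str) -> None:
        """Give up an unfinished claim so another consumer can retry it."""
        with self._lock:
            if self._status.get(request_id) == "in_progress":
                del self._status[request_id]


def render_batch(
    batch: Iterable[str],
    store: RenderProgressStore,
    render: Callable[[str], None],
) -> List[str]:
    """Render only the requests this consumer manages to claim."""
    rendered: List[str] = []
    for request_id in batch:
        if not store.try_claim(request_id):
            continue  # another consumer already has it, or it is finished
        try:
            render(request_id)
        except Exception:
            store.release(request_id)  # do not lose the message on failure
            raise
        store.mark_done(request_id)
        rendered.append(request_id)
    return rendered
```

With a shared store, an orphaned consumer and the proper consumer that both pick up the same batch render each video at most once between them, and a failed render goes back into the pool instead of being lost.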
10 Minutes or Less
For those who played Swipe Night episode 1 the week it premiered, we had ~1 week to render their personalized video for episode 2. We wanted this process to take almost that whole week to smooth out throughput and minimize costs. For every million users completing episode 1, rendering and transcoding a personalized video at an average of 3 minutes each came to 50 thousand hours of compute capacity that had to be scheduled before episode 2 launched. We were very conscious about not needlessly over-provisioning resources for this. For the most part, this felt like an offline processing problem.
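The capacity arithmetic above is easy to reproduce. The sketch below uses the figures from this post (one million users, 3 minutes per video, roughly one week); the function names are hypothetical, and the worker estimate assumes each worker renders one video at a time and stays perfectly busy for the whole window, which is an idealization.

```python
import math


def backlog_compute_hours(users: int, minutes_per_video: float) -> float:
    """Total render time for the backlog, in hours."""
    return users * minutes_per_video / 60


def workers_needed(compute_hours: float, window_hours: float) -> int:
    """Smallest number of always-busy workers that clears the backlog in time."""
    return math.ceil(compute_hours / window_hours)


hours = backlog_compute_hours(users=1_000_000, minutes_per_video=3)
print(hours)                                        # 50000.0
print(workers_needed(hours, window_hours=7 * 24))   # 298
```

Spreading the work across the full week keeps the worker count near this floor. Compressing the same backlog into a single day would need roughly seven times as many workers for the same total compute.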
For those playing episode 1 once episode 2 was available, we had less than 10 minutes (roughly the length of time it took to play Episode 1) to render and transcode their video, meaning that no partition could have a consumer lag greater than 3 messages. This is the ongoing state for this feature, and it required efficiency. We were able to slightly optimize our process knowing that members needed to complete episode 1 before beginning episode 2. We started by kicking off the personalized video process when a user completed episode 1. A day or two before episode 2 was released, we started the personalized video process whenever a member began or came back to Swipe Night to ensure those members would have the personalized experience ready for episode 2.
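The 3-message lag limit follows from dividing the 10-minute deadline by the average render time. Here is a rough sketch of that calculation and of a check that flags partitions over the limit. It assumes a consumer renders one message at a time and uses the average render time, so it is optimistic for slow renders; `max_tolerable_lag` and `lagging_partitions` are hypothetical helpers, and the lag values in the example are made up.

```python
from typing import Dict, List


def max_tolerable_lag(deadline_minutes: float, avg_render_minutes: float) -> int:
    """Most queued messages a partition can hold and still meet the deadline."""
    return int(deadline_minutes // avg_render_minutes)


def lagging_partitions(lag_by_partition: Dict[int, int], max_lag: int) -> List[int]:
    """Partitions whose consumer lag exceeds the tolerable limit."""
    return sorted(p for p, lag in lag_by_partition.items() if lag > max_lag)


limit = max_tolerable_lag(deadline_minutes=10, avg_render_minutes=3)
print(limit)                                          # 3
print(lagging_partitions({0: 1, 1: 4, 2: 3}, limit))  # [1]
```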
Having a second Kafka consumer group that could be spun up as needed was really helpful in allowing us to build one system to handle these different traffic patterns.
The requirements changed after episode 2 went live because people could play episode 1 and immediately need their personalized video when loading episode 2. We no longer had a week to process incoming videos. We started processing personalized videos on episode start rather than episode complete. That allowed us enough time to have a fully rendered video by the time users were able to start episode 2. We also adjusted how many workers were running, pre-processing as many videos as we could and then processing videos as quickly as possible in real time as traffic to Swipe Night ebbed and flowed across the world. We had this scaling strategy configured in advance so that we could address changes in traffic preemptively.
This was a buzzer-beater, down-to-the-wire type of project. In fact, we didn’t know that it would work until 10 hours prior to announcing it company-wide. Cool innovation was definitely at the heart of this project, because at the end of the day, we’re looking to build MAX. The final product was a huge milestone at Tinder, but the journey that got us there was deeply rooted in teamwork, ambition, and a good sense of humor. When things were challenging, we were able to find the humor in it all and persevere. Embracing these moments is what makes working at Tinder so fun.
If you’re interested in working on an ambitious and good-humored team, we’re hiring!