How Tinder delivers your matches and messages at scale

Written By: Dimitar Dyankov, Sr. Engineering Manager | Trystan Johnson, Sr. Software Engineer | Kyle Bendickson, Software Engineer | Frank Ren, Director of Engineering

Intro

Up until recently, the Tinder app delivered new matches and messages by polling the server every two seconds. Every two seconds, everyone who had the app open would make a request just to see if there was anything new; the vast majority of the time, the answer was "No, nothing new for you." This model works, and has worked well since the Tinder app's inception, but it was time to take the next step.

Motivation and Goals

There are many downsides to polling. Mobile data is needlessly consumed, many servers are needed to handle so much empty traffic, and on average actual updates come back with a one-second delay. However, polling is fairly reliable and predictable. When implementing a new system, we wanted to improve on all of those negatives while not sacrificing reliability. We wanted to augment real-time delivery in a way that didn't disrupt too much of the existing infrastructure but still gave us a platform to expand on. Thus, Project Keepalive was born.

Architecture and Technology

Whenever a user has a new update (match, message, etc.), the backend service responsible for that update sends a message to the Keepalive pipeline; we call it a Nudge. A Nudge is intended to be very small — think of it more like a notification that says, "Hey, something is new!" When clients get this Nudge, they will fetch the new data, just as before — only now, they're sure to actually get something since we notified them of the new updates.

We call this a Nudge because it’s a best-effort attempt. If the Nudge can’t be delivered due to server or network problems, it’s not the end of the world; the next user update sends another one. In the worst case, the app will periodically check in anyway, just to make sure it receives its updates. Just because the app has a WebSocket doesn’t guarantee that the Nudge system is working.

To start with, the backend calls the Gateway service. This is a lightweight HTTP service, responsible for abstracting some of the details of the Keepalive system. The gateway constructs a Protocol Buffer message, which is then used through the rest of the lifecycle of the Nudge. Protobufs define a rigid contract and type system, while being extremely lightweight and super fast to de/serialize.
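The post doesn't include the actual schema, but as a rough illustration, a Nudge message might look something like the sketch below. Every field name and type here is hypothetical, not Tinder's real contract; the point is that the payload carries just enough to say "something new is waiting," not the update itself.

```protobuf
// Hypothetical Nudge schema -- illustrative only, not Tinder's actual contract.
syntax = "proto3";

message Nudge {
  string user_id = 1;    // recipient whose devices should be notified
  NudgeType type = 2;    // what kind of update is waiting server-side
  int64 created_at = 3;  // unix timestamp, useful for staleness checks
}

enum NudgeType {
  NUDGE_TYPE_UNKNOWN = 0;
  NUDGE_TYPE_MATCH = 1;
  NUDGE_TYPE_MESSAGE = 2;
}
```

Keeping the Nudge this small is what makes best-effort delivery cheap: the client still fetches the real data through the existing update endpoints.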

We chose WebSockets as our realtime delivery mechanism. We spent time looking into MQTT as well, but weren't satisfied with the available brokers. Our requirements were a clusterable, open-source system that didn't add a ton of operational complexity, which eliminated many brokers out of the gate. We looked further at Mosquitto, HiveMQ, and emqttd to see if they would nonetheless work, but ruled them out as well: Mosquitto for not being able to cluster, HiveMQ for not being open source, and emqttd because introducing an Erlang-based system to our backend was out of scope for this project.

The nice thing about MQTT is that the protocol is very lightweight on client battery and bandwidth, and the broker handles both the TCP pipe and the pub/sub system in one. Instead, we chose to separate those responsibilities: a Go service maintains the WebSocket connection with the device, and NATS handles the pub/sub routing. Every user establishes a WebSocket with our service, which then subscribes to NATS for that user. Thus, each WebSocket process is multiplexing tens of thousands of users' subscriptions over one connection to NATS.

The NATS cluster is responsible for maintaining a list of active subscriptions. Each user has a unique identifier, which we use as the subscription topic. This way, every online device a user has is listening to the same topic — and all devices can be notified simultaneously.
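To make the routing concrete, here is a minimal Go sketch of the per-user subject fan-out. It swaps NATS for an in-memory broker (a map of subject to subscriber channels) purely for illustration; the real system publishes to a NATS cluster, and the names below are mine, not Tinder's.

```go
package main

import (
	"fmt"
	"sync"
)

// Broker is an in-memory stand-in for NATS, used only to illustrate
// the routing model: the subject is the user's unique identifier,
// and every online device for that user subscribes to the same subject.
type Broker struct {
	mu   sync.RWMutex
	subs map[string][]chan string // subject (user ID) -> subscriber channels
}

func NewBroker() *Broker {
	return &Broker{subs: make(map[string][]chan string)}
}

// Subscribe registers one device's channel under the user's subject.
func (b *Broker) Subscribe(userID string) chan string {
	ch := make(chan string, 1)
	b.mu.Lock()
	b.subs[userID] = append(b.subs[userID], ch)
	b.mu.Unlock()
	return ch
}

// Publish delivers a nudge to every device subscribed to the user,
// so all of a user's online devices are notified simultaneously.
func (b *Broker) Publish(userID, nudge string) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subs[userID] {
		ch <- nudge
	}
}

func main() {
	broker := NewBroker()
	phone := broker.Subscribe("user-123")  // same user,
	tablet := broker.Subscribe("user-123") // two devices

	broker.Publish("user-123", "new-message")

	fmt.Println(<-phone, <-tablet) // prints "new-message new-message"
}
```

In production, the WebSocket process plays the role of the subscriber channels here, holding one NATS connection and fanning messages out to the sockets it owns.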

Results

One of the most exciting results was the speedup in delivery. The average delivery latency with the previous system was 1.2 seconds — with the WebSocket nudges, we cut that down to about 300ms — a 4x improvement.

The traffic to our update service — the system responsible for returning matches and messages via polling — also dropped dramatically, which let us scale down the required resources.

Finally, it opens the door to other realtime features, such as allowing us to implement typing indicators in an efficient way.

Lessons Learned

Of course, we faced some rollout issues as well. We learned a lot about tuning Kubernetes resources along the way. One thing we didn't consider initially is that WebSockets inherently make a server stateful, so we can't quickly remove old pods; instead, we use a slow, graceful rollout process that lets connections cycle out naturally in order to avoid a retry storm.
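In Kubernetes terms, a gradual rollout for stateful, long-lived connections might look like the Deployment fragment below. This is an illustrative sketch, not Tinder's actual manifest; the field names are standard Kubernetes, but the values are placeholders.

```yaml
# Illustrative Deployment fragment for slow, graceful WebSocket rollouts.
spec:
  strategy:
    rollingUpdate:
      maxUnavailable: 0   # never terminate pods faster than replacements come up
      maxSurge: 1         # roll one pod at a time
  template:
    spec:
      # Give long-lived sockets time to drain before the pod is killed,
      # so clients reconnect gradually instead of in a retry storm.
      terminationGracePeriodSeconds: 3600
```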

At a certain scale of connected users we started noticing sharp increases in latency, but not just on the WebSocket; this affected all other pods as well! After a week or so of varying deployment sizes, trying to tune code, and adding lots and lots of metrics looking for a weakness, we finally found our culprit: we managed to hit physical host connection tracking limits. This would force all pods on that host to queue up network traffic requests, which increased latency. The quick solution was adding more WebSocket pods and forcing them onto different hosts in order to spread out the impact. However, we uncovered the root issue shortly after — checking the dmesg logs, we saw lots of “ip_conntrack: table full; dropping packet.” The real solution was to increase the ip_conntrack_max setting to allow a higher connection count.
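For anyone hitting the same wall, the conntrack limit is a kernel sysctl. The snippet below is a generic sketch: the exact key name varies by kernel version (older kernels exposed `net.ipv4.ip_conntrack_max`, newer ones use `net.netfilter.nf_conntrack_max`), and the value shown is illustrative, not what Tinder used.

```shell
# Inspect the current limit and how close you are to it:
sysctl net.netfilter.nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count

# Raise the limit at runtime (size the value to your workload):
sysctl -w net.netfilter.nf_conntrack_max=1048576

# Persist the change across reboots:
echo 'net.netfilter.nf_conntrack_max = 1048576' >> /etc/sysctl.d/99-conntrack.conf
```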

We also ran into several issues with the Go HTTP client that we weren't expecting: we needed to tune the Dialer to hold open more connections, and to always ensure we fully read and closed the response Body, even if we didn't need it.

NATS also started showing some flaws at high scale. Once every couple of weeks, two hosts within the cluster would report each other as Slow Consumers; basically, they couldn't keep up with each other (even though they had more than enough available capacity). We increased the write_deadline to allow extra time for the network buffer to be consumed between hosts.
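`write_deadline` is a nats-server configuration option: the maximum time the server will wait for a write to flush to a client before marking it a Slow Consumer and dropping it. The fragment below shows where it lives; the value is illustrative (the shipped default has varied across NATS releases), not the one Tinder settled on.

```
# nats-server configuration fragment (illustrative value)
write_deadline: "30s"
```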

Next Steps

Now that we have this system in place, we’d like to continue expanding on it. A future iteration could remove the concept of a Nudge altogether, and directly deliver the data — further reducing latency and overhead. This also unlocks other real-time capabilities like the typing indicator.

Links

Scarlet for Android: https://medium.com/tinder-engineering/taming-websocket-with-scarlet-f01125427677

NATS: https://nats.io/