Fault-tolerant and Scalable chat service on Spoon Radio

Published in Spoonlabs, Dec 26, 2018

Authors: Edward J. Yoon, James Kim, Caley Baek, Ella Kang

Abstract

We recently deployed and released our chat application, Heimdallr[1], on the Spoon Radio[2] production infrastructure. Heimdallr is a large-scale chat application server inspired by the LINE+ chat service architecture, written in Scala on top of Akka's actor model. In this post, we describe how we serve a fault-tolerant and scalable chat service in production.

Problems with Socket.IO

The legacy chat application was built on Node.js and Socket.IO. It was single-threaded, so it could not take advantage of multi-core processors. Internally, the rooms and users variables of Socket.IO on a server contain only the users connected to that particular server instance, because these variables are not shared between instances. Each server instance therefore needs a way to communicate and share data, such as socket.io-redis, even when the instances run on the same machine. This leads to complex communication patterns and makes it difficult to manage many small Node.js processes across multiple servers.
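
The isolation problem can be illustrated with a minimal Python sketch. It is not Socket.IO code: the `Instance` class, its `rooms` table, and the room and user names are hypothetical stand-ins for two server processes that each keep their own state.

```python
from collections import defaultdict


class Instance:
    """One chat server process with its own, unshared room table (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.rooms = defaultdict(set)  # room id -> users connected to THIS process
        self.delivered = []            # (user, text) pairs this process delivered

    def join(self, room, user):
        self.rooms[room].add(user)

    def broadcast_local(self, room, text):
        # Only users connected to this process are visible here.
        for user in sorted(self.rooms[room]):
            self.delivered.append((user, text))


a, b = Instance("a"), Instance("b")
a.join("room-1", "alice")
b.join("room-1", "bob")

# A message sent through instance "a" never reaches bob on instance "b".
a.broadcast_local("room-1", "hello")
```

Without some shared channel between the two processes, bob stays unreachable, which is the gap an adapter such as socket.io-redis is meant to fill.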

The new architecture of the Heimdallr ChatService consists of an HTTP server, a ChatRoomActor, and a UserActor, built on Akka's actor model, which provides parallel message stream processing. Each ChatRoom and User is sharded across multiple servers at the load balancer level. To synchronize messages among servers, we use Redis PubSub in a master/slave setup for redundancy.
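
As a rough illustration of this design, the sketch below routes users to servers with a stable hash and fans messages out through a shared channel. It is written in Python rather than Scala, and everything in it is an assumption made for illustration: `Bus` is an in-memory stand-in for Redis PubSub, `ChatServer` plays the role of a server hosting room state, and `pick_server` imitates load-balancer sharding. Heimdallr's actual actors and routing rules may differ.

```python
import zlib
from collections import defaultdict


class Bus:
    """In-memory stand-in for a pub/sub broker (hypothetical, not Redis)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> callbacks

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in list(self.subscribers[channel]):
            callback(message)


class ChatServer:
    """One server holding only the room members connected to it."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.rooms = defaultdict(set)   # room id -> local users
        self.inbox = defaultdict(list)  # user -> messages received

    def join(self, room, user):
        # Subscribe to the room channel the first time a local user joins it.
        if room not in self.rooms:
            self.bus.subscribe(room, lambda msg, room=room: self._deliver(room, msg))
        self.rooms[room].add(user)

    def _deliver(self, room, message):
        for user in self.rooms[room]:
            self.inbox[user].append(message)

    def send(self, room, message):
        # Publishing to the shared channel reaches every subscribed server.
        self.bus.publish(room, message)


def pick_server(servers, key):
    """Stable hash routing, standing in for sharding at the load balancer."""
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


bus = Bus()
servers = [ChatServer("s0", bus), ChatServer("s1", bus)]
for user in ["alice", "bob", "carol"]:
    pick_server(servers, user).join("room-1", user)

pick_server(servers, "alice").send("room-1", "hi")
```

Whichever server each user lands on, the message reaches all three, because delivery goes through the shared channel instead of a per-process room table.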

Gradual Migration to Heimdallr

Unlike a traditional web service application, a chat application requires long-lived TCP socket connections, so deploying without downtime is somewhat tricky. By sharing a single common Redis PubSub service with the legacy chat application, we were able to migrate traffic gradually without stopping the online service.
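
The post does not say how traffic was split between the two systems. One common way to shift traffic gradually is a stable percentage rollout, sketched below in Python. The function name `backend_for`, the backend labels, and the 100-bucket scheme are all assumptions for illustration, not the team's actual mechanism.

```python
import zlib


def backend_for(user_id, rollout_percent):
    """Send a stable rollout_percent share of users to the new backend."""
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100  # stable bucket in 0..99
    return "heimdallr" if bucket < rollout_percent else "legacy"


users = ["user-%d" % i for i in range(1000)]
at_10 = {u for u in users if backend_for(u, 10) == "heimdallr"}
at_50 = {u for u in users if backend_for(u, 50) == "heimdallr"}
```

Because each user's bucket is fixed, raising the percentage only moves more users to the new backend and never moves one back. Since both backends publish to the same PubSub service, users in the same room can still talk to each other during the transition.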

Dealing With Failure in Heimdallr

The actor model of Akka is designed to help you achieve a high level of fault tolerance. One simple example is a server-level fault. Once a supervisor detects the failure of an actor, the system sends a failover message to the clients so that they can try to reconnect to another healthy server.
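
The failover flow can be sketched as follows. This is plain Python, not Akka supervision code, and the `Supervisor`, `Server`, and `Client` classes along with the `on_failover` callback are hypothetical names chosen for illustration. The sketch also assumes the client simply picks the first healthy server.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.clients = []


class Client:
    def __init__(self, name):
        self.name = name
        self.server = None
        self.events = []

    def connect(self, server):
        self.server = server
        server.clients.append(self)

    def on_failover(self, healthy_servers):
        # React to the failover message by reconnecting elsewhere.
        self.events.append("failover")
        if healthy_servers:
            self.connect(healthy_servers[0])
        else:
            self.server = None


class Supervisor:
    def __init__(self, servers):
        self.servers = servers

    def report_failure(self, failed):
        # Mark the server as failed, then notify each client it was serving.
        failed.healthy = False
        healthy = [s for s in self.servers if s.healthy]
        orphans, failed.clients = failed.clients, []
        for client in orphans:
            client.on_failover(healthy)


s0, s1 = Server("s0"), Server("s1")
supervisor = Supervisor([s0, s1])
client = Client("alice")
client.connect(s0)

supervisor.report_failure(s0)
```

After the failure is reported, the client ends up attached to the remaining healthy server without any action from the user.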

Conclusion

Heimdallr is not mature yet, and we need more experience running it. We will continue engineering and open sourcing Heimdallr. Keep in touch!