Scalability at BUX
Using an actor system to go from 100 users to 1.5 million
In the spring of 2014, when the BUX development team started work on our trading app, we had little idea just how much we would grow over the next 4 years. If I take a look at the architecture diagram that Joost van de Wijgerd, CTO of BUX, drew on a whiteboard more than 4 years ago, I can honestly say that not a lot has changed in our setup. Except for 1 or 2 components that we’ve added for new functionality, it’s exactly the same. A cynic might say “well, that just means you haven’t innovated”. But think about this: 4 years ago, we were a small start-up working in a sublet office on a shoestring budget. What we needed to do was prove to our seed investors that we could build the product the BUX founders envisioned, and we needed to do that quickly. It took us 6 months to get an app into the App Store. Some tech start-ups would have cut as many corners as possible to get something out the door that fast. Hack now, worry about it later, when the money starts flowing. At BUX, we adopted a different mindset.
Will it scale?
The question we always ask ourselves when building a new feature, when adding a new component to our tech stack, or when thinking about how we want to deploy our infrastructure, is “but will it scale?”. That is how we went from a start-up with 100 users in the first week after launch to a respected FinTech company with more than 1.5 million registered users, without changing anything in our core architecture. Without having to rethink our services, without having to throw work away and without having to start from scratch. Did we have teething issues? Did we make mistakes? Is hindsight always 20/20? Of course. No plan is foolproof and no architecture is perfect. This post is not about describing a silver bullet — there’s no such thing. It’s about giving a short introduction to a real-life use case of when an actor system is useful. It’s about describing how an actor system works and about how it helped us achieve our goals of horizontal scalability. And how, in many instances, it made our lives much easier.
BUX basic architecture
Let’s talk a bit about the basic architecture of the BUX app. When people ask me what I do, I always say “we’re building an app”, or “I work for a company building a trading app”, when in fact, I am a backend developer. Except for a small hobby project, I’ve never written code for a mobile app. So what do I actually do at BUX? Well, as my title suggests, I’m part of the team responsible for writing the server-side code. Like many mobile applications these days, our apps are “thin clients”, meaning that most, if not all, of the relevant user data is stored on our servers and not on your phone. The backend team provides a RESTful API which is called by the Android and iOS apps to provide our users with a rich trading experience.
Anatomy of a request
If you haven’t played with the BUX app yet, I encourage you to do so. Even if I do say so myself, it’s a fun experience, regardless of whether you have any trading knowledge or not. In any case, here’s one of the most-used screens of the app, what we call the portfolio page:
Here I can see my currently open positions and the performance for each. If you were to guess, how do you think the data for this page is being built? Where is it being stored, how is it being retrieved? Obviously the user data is stored in some database. And as previously mentioned, we’re using a REST API to communicate with the outside world. So the following schema is not that far away from our actual setup:
The following JSON could be returned to the mobile clients in order to build the page shown to the user:
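Something along these lines, for example. The exact field names and values are illustrative, not our real API contract:

```json
{
  "positions": [
    {
      "positionId": "pos-001",
      "product": {
        "securityId": "sb26500",
        "displayName": "Germany 30"
      },
      "direction": "BUY",
      "investedAmount": { "currency": "BUX", "amount": 50.00 },
      "performance": { "percentage": 2.5 }
    }
  ]
}
```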
“Traditional” data storage and caching
But where to store this information? Because I don’t want this article to turn into a “X is better than Y” discussion, I’m not going to mention any (noSQL) database names. I truly believe you can make almost any technology stack work for almost every use case, it’s just a question of how much time and money you’re willing to spend. I will mention, where relevant, the technology stack that BUX has chosen.
You could choose to store the user data in any of the (noSQL) databases out there and would probably get good results. But let’s imagine that you want a certain level of performance when retrieving the portfolio page. After all, your users want to see their positions as quickly as possible, not wait around for half a minute while their page loads. What happens when you’ve reached a level of requests per second where your database can’t keep up anymore? Leaving aside knee-jerk reactions like “we can switch the database”, what most people would do in this case is implement a caching solution to make the queries as fast as possible:
This would work fine and it’s probably a good idea to have a caching layer in front of your database anyway. But if you’ve reached a level of requests per second that would affect the performance of your database queries, you most likely already have a highly available setup as well:
So now you’re looking at a distributed cache. Every time your database updates, you have to make sure your cache on all servers is up to date as well. I’m sure you can imagine (or even experienced first hand) the difficulty of tracking down a bug that manifests itself once in a blue moon but that leaves one of the caches, on a single server, out of date.
Introducing elastic actors, a distributed, persistent actor system
So what’s the alternative? What if your cache were updated at the same time as your database? What if you had a framework that not only took care of that for you, but also made sure that the system is load balanced and horizontally scalable? That’s where the actor system comes in for BUX. We use a distributed, persistent actor system called elasticactors. It’s an open source framework, developed by Joost, CTO and co-founder of BUX. You can take a look at all the framework code here:
Simply put, the state of each user in the system is represented in JSON format:
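For example, a simplified, illustrative user state (the real state holds much more) could look like this:

```json
{
  "userId": "8f14e45f-ceea-467f-a1d6-91b50e4ac0f4",
  "nickname": "traderjoe",
  "positions": [
    { "securityId": "sb26500", "direction": "BUY", "amount": 50.00 }
  ]
}
```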
This JSON is deserialized into (and serialized from) a simple POJO:
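A minimal sketch of such a state class. The class and field names here are assumptions for illustration, not the actual BUX classes:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative actor-state POJO: a plain Java object that any JSON
// mapper can serialize to, and deserialize from, the user's JSON state.
class UserState {

    static class Position {
        private final String securityId;
        private final String direction;
        private final double amount;

        Position(String securityId, String direction, double amount) {
            this.securityId = securityId;
            this.direction = direction;
            this.amount = amount;
        }

        String getSecurityId() { return securityId; }
        String getDirection() { return direction; }
        double getAmount() { return amount; }
    }

    private final String userId;
    private final List<Position> positions = new ArrayList<>();

    UserState(String userId) { this.userId = userId; }

    String getUserId() { return userId; }
    List<Position> getPositions() { return positions; }
}
```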
Accessing the actor state
We call the above Java class (or the JSON representation of it) the “actor state”. In order to access the above state we have to “ask” the user actor for its own state. The user actor itself is the only one who has access to that state. We can do that from a simple REST endpoint. And using Spring 5 WebFlux functionality, we can keep the boilerplate code to a minimum:
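A sketch of the idea, using stand-in types rather than the real elasticactors or Spring API: `ask(message, responseClass)` asynchronously completes with the actor’s reply, and the real controller simply wraps that in a WebFlux `Mono`:

```java
import java.util.List;
import java.util.concurrent.CompletionStage;

// Stand-in for the framework's actor reference (names are assumptions).
interface ActorRef {
    <T> CompletionStage<T> ask(Object message, Class<T> responseType);
}

class PositionsRequest {}

class PositionsResponse {
    final List<String> securityIds;
    PositionsResponse(List<String> securityIds) { this.securityIds = securityIds; }
}

class PositionsEndpoint {
    private final ActorRef userActor;

    PositionsEndpoint(ActorRef userActor) { this.userActor = userActor; }

    // The single meaningful line: ask the user actor for its positions.
    // In the real controller this would be a WebFlux handler method
    // returning a Mono<PositionsResponse>.
    CompletionStage<PositionsResponse> getPositions() {
        return userActor.ask(new PositionsRequest(), PositionsResponse.class);
    }
}
```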
In a single line of code, I’m “asking” the actor for its state (or a part thereof), by sending it a message called PositionsRequest. And I’m expecting a response of the class “PositionsResponse”. For completeness, this is how these classes might look:
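Both are plain POJOs. The fields shown here are illustrative assumptions, only the class names come from the article:

```java
import java.util.Collections;
import java.util.List;

// A query message often needs no payload: the actor id itself already
// identifies whose positions are being requested.
class PositionsRequest {}

// The reply message carries the data the controller needs to build the
// JSON response for the mobile clients.
class PositionsResponse {
    private final List<String> positions;

    PositionsResponse(List<String> positions) {
        this.positions = Collections.unmodifiableList(positions);
    }

    List<String> getPositions() { return positions; }
}
```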
What’s an actor?
I’m sure you’re seeing a pattern by now. Almost everything in our framework is a POJO. Some annotations here and there, maybe an interface to implement. But by having simple building blocks, everything is straightforward, easy to follow, to understand and to code against. Once a new developer learns a couple of conventions we have in our code base, they’re good to go almost from day one.
But what is the “ActorRef” object I’m creating and referencing in the controller? As the name suggests, it’s the “actor reference”, the handle that points to the actual actor in our system. An actor always has two components: the actor state and the actual actor, which is responsible for all the business logic.
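A hypothetical sketch of such an actor. The names and the shape of `onReceive` are stand-ins for the real framework’s base classes:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the framework's actor reference.
interface ActorRef {
    void tell(Object message);
}

class PositionsRequest {}

class UserActor {
    // the actor state: only this actor ever touches it
    private final List<String> positions = new ArrayList<>();

    // Called by the framework for every message sent to this actor.
    public void onReceive(ActorRef sender, Object message) {
        if (message instanceof PositionsRequest) {
            // reply to whoever asked, with a snapshot of the state
            sender.tell(new ArrayList<>(positions));
        }
        // every new message type adds another branch here
    }
}
```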
I’m overriding the “onReceive” method that handles all incoming messages. Right now, I’m only handling a single message, but you can imagine that as an actor grows, its responsibilities grow as well. It will end up handling more messages and having an endless if/else or a big switch statement is not something that’s easily maintainable. Fortunately, the framework allows for separate message handlers for each individual message. All we have to do is create a method with the appropriate annotation and we’re good to go:
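The shape of that looks roughly like this. The annotation here is a stand-in I define locally for illustration; the real annotation lives in the elasticactors API:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in for the framework's handler annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface MessageHandler {}

class PositionsRequest {}

class UserActor {
    // One method per message type, instead of a growing if/else chain;
    // the framework routes each message to the matching handler.
    @MessageHandler
    public void handle(PositionsRequest request) {
        // build a PositionsResponse and send it back to the sender
    }
}
```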
Under the hood
You’re probably wondering, by this point, how the actor receives the message and how the controller receives the response. It’s all done by the actor framework: through RabbitMQ (or any other messaging bus, if desired) the actors listen to messages and can also send replies on the same messaging bus. The framework hides a lot of this complexity, obviously. For example, in the controller, when the ask method is called, a lot of things happen in the background for the controller to be able to listen to a possible response from the actor it’s “asking” for positions.
Changing actor state
Of course, queries are only half the picture. What about actually changing the actor state? Well, same as a query, everything is done by sending the actor a message. The actor will have to know how to handle that message. We can build a new message class that will change the actor’s state, let’s say by adding a position to the user’s portfolio.
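A hypothetical message class for that (the name and fields are illustrative):

```java
// A state-changing message: it carries everything the actor needs to
// add a new position to the user's portfolio.
class AddPositionMessage {
    private final String securityId;
    private final String direction;
    private final double amount;

    AddPositionMessage(String securityId, String direction, double amount) {
        this.securityId = securityId;
        this.direction = direction;
        this.amount = amount;
    }

    String getSecurityId() { return securityId; }
    String getDirection() { return direction; }
    double getAmount() { return amount; }
}
```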
In the controller we can again have a very simple endpoint that sends this message to the actor:
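Sketched with the same stand-in `ActorRef` as before (not the real API): the essence of the endpoint is a single `tell()` call, with no response to wait for:

```java
// Stand-in for the framework's actor reference: tell() is fire-and-forget.
interface ActorRef {
    void tell(Object message);
}

class AddPositionMessage {
    final String securityId;
    AddPositionMessage(String securityId) { this.securityId = securityId; }
}

class PositionsWriteEndpoint {
    private final ActorRef userActor;

    PositionsWriteEndpoint(ActorRef userActor) { this.userActor = userActor; }

    // In the real code this would be a WebFlux POST handler; we just
    // send the message and return immediately.
    void addPosition(String securityId) {
        userActor.tell(new AddPositionMessage(securityId));
    }
}
```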
In this case, I’m not waiting for a reply, so I’m not “asking” the actor anything. I can then use the “tell” method. The actor will handle the message in a very similar way to the PositionsRequest message:
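Roughly like this (again a sketch with assumed names): the handler simply mutates the in-memory state, and the framework takes care of persisting it afterwards:

```java
import java.util.ArrayList;
import java.util.List;

class AddPositionMessage {
    final String securityId;
    AddPositionMessage(String securityId) { this.securityId = securityId; }
}

class UserActor {
    // the actor state, owned exclusively by this actor
    private final List<String> positions = new ArrayList<>();

    // Annotated as a message handler in the real framework. No explicit
    // save call: the framework persists the state after handling.
    public void handle(AddPositionMessage message) {
        positions.add(message.securityId);
    }

    List<String> getPositions() { return positions; }
}
```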
The next time we ask the actor for its state, we will receive the updated positions view.
Actor system data storage
And that’s it! The actor is now able to answer our request and “tell” us what its state is, and we are able to change said state. But how does the actor come to be? Where is its state stored, how is it retrieved in the first place? When handling a message, the actor has access to the state that is loaded in memory and can modify it. The next time it handles a message, any modifications will still be there, since the actor state simply stays in memory. But what about when the server restarts? What if the actor state is not loaded in memory, where does it come from? Well, the answer to that is simple: from persistent storage!
The actor framework has a simple FIFO cache in which it keeps the actor states. It loads the JSON state from persistent storage every time an actor is accessed and its state was not yet loaded in the cache. At BUX we persist all our actors in a Cassandra database, but this can be replaced with any type of storage. We chose Cassandra because it’s right for our use case: it has excellent write performance, it can scale and it’s highly available. When an actor state changes, its state is persisted both in the cache and in Cassandra. By default, after handling every message, an actor will persist its own state. The developer does not need to worry about that, it’s handled automatically by the framework. Of course, it’s a waste to persist the actor state if it has not changed (when handling a query, for example). We are in control of which messages trigger a state persist, with a simple annotation:
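The idea looks roughly like this. The annotation name and its attributes here are stand-ins I define locally for illustration; the real framework’s annotation differs in its details:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in for the framework's persistence-control annotation: messages
// listed as excluded do not trigger a persist after being handled.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PersistenceConfig {
    Class<?>[] excluded() default {};
}

class PositionsRequest {}

// Handling a PositionsRequest never changes the state, so persisting
// after it would be a wasted write.
@PersistenceConfig(excluded = { PositionsRequest.class })
class UserActor {
}
```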
OK, I hear you say, but what’s so special about this? This whole thing could have been accomplished with a simple database and a straightforward write to cache first, database second. I agree. That works fine, for a single server. But what happens when you want to scale out? Your cache will only be up-to-date on the server that handled the write request.
But will it scale?
We discussed what an actor system is and by this point I believe you have a pretty good idea about how our actor system stores and retrieves state. But what about the distributed part? Well, that’s where the “elastic” name comes in. The elasticactors system is partitioned into shards. When the system is bootstrapped for the first time, it’s configured with a certain number of shards. To give you an idea about scale, our production systems usually contain 256 shards. Each actor created in elasticactors is assigned to a single shard, by using a hashing algorithm on the actor id. That means that, for as long as that actor exists, it will always be assigned to the same shard.
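The assignment can be sketched like this. The actual hash function the framework uses may differ; what matters is that it is a pure function of the actor id, so the shard never changes:

```java
// Illustrative shard assignment: a stable hash of the actor id, modulo
// the configured shard count, pins the actor to one shard for life.
class ShardRouter {
    static final int NUM_SHARDS = 256;

    static int shardFor(String actorId) {
        // mask the sign bit instead of Math.abs, which breaks on MIN_VALUE
        return (actorId.hashCode() & Integer.MAX_VALUE) % NUM_SHARDS;
    }
}
```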
OK, so now we’ve assigned each actor to a shard. The actual “distributed” part comes in when we add servers to the cluster. Each shard is assigned to a server through, you guessed it, a hashing algorithm. I’m not going to go into the details, but if the number of machines in a cluster stays the same, a shard is always guaranteed to live on the same server. Every time. So, now we can scale out our cluster.
The load balancer is not aware of where a request needs to be handled, but in practice that’s not a big problem. A controller on any of the servers in the cluster gets the REST request. It then forwards a request to an actor. Does that actor live on a shard assigned to the same server? Great, the actor framework just forwards the message to the actor with one less network hop. It doesn’t? No problem: the message can just be sent on the messaging bus and it reaches the actor anyway. Because the actor state is kept in memory, queries are really fast. And because the actors are distributed across the whole cluster and the framework itself makes sure that the right actor receives the right message, the developers don’t have to worry about keeping the cache in sync. And, finally, because elasticactors is designed to scale, servers can just be added to the cluster without any problems. When a new server joins the cluster, shards from existing servers are redistributed as equally as possible across all the machines. And when a shard moves to a new machine, so does every actor assigned to that shard.
By scaling horizontally not only do we reduce the amount of queries each server needs to perform, but we are also creating more memory space for our actors. You can imagine that as an app grows, so does its user base. If yesterday you could fit most of your user actors in memory on 3 servers, tomorrow 5 might barely be enough. For an actor system to be as efficient as possible it needs to have as many (if not all) of the actors that are regularly queried available in memory.
The main difference between having a single database and a distributed actor system is how you think about your data. An actor becomes both data (through the actor state) and business logic (through the message handlers). Obviously, in certain cases, this is not desired. But for many of our use cases our actor system gives us great performance out of the box with scalability included for free. It’s also a very simple model to develop against, making for a really shallow learning curve for new hires.
Now, should you start using actors for all your business problems? As the old saying goes, if you only have a hammer, everything looks like a nail. There are limitations to the actor model. The biggest one is that, by default, you simply cannot ask the system something like “how many users have fewer than 2 positions?”. This would be a really simple query for many databases. But in our actor system you can only get the state of a single actor at a time. Of course, there are ways around that, but they usually involve relying on a separate database for queries like this.
I mentioned in the beginning that an actor system is not a silver bullet. I hope that now that we’ve reached the end of this article, you have a better understanding of what an actor system can do for you but, more importantly, where the caveats are when using one.