A digital symphony — The architecture of API Fortress
A deep dive into the design of API Fortress, as an example of a microservices-based, actor-powered distributed software architecture
If you were to judge my personality based on the things I like about software engineering, you’d think I’m a very boring person.
The thing I like to design and build the most is jobs orchestration. I like to imagine how each piece of the software fits in the greater schema of things, and how the system will react in various scenarios, such as high load and failures.
I often find myself daydreaming about huge complex systems processing humongous amounts of data, distributing their jobs, and recovering from failures.
I’m not entirely sure if this makes me a weirdo, but it certainly is weird.
The first iteration of API Fortress was a prototype, like any first iteration of any software, so I really couldn't express my obsession fully. During its major second iteration, around 2016, I went into full overdrive, and here's what I came up with.
In this article, I’m sharing with you only the computational part of the API Fortress platform. I may eventually write more about the data structures and the front end in a later post.
I’m not saying this architecture is perfect, but I decided to share it with you because I thought it may be a source of discussion within your team, and maybe you could come up with an even cleverer idea for your software.
The microservices
Microservices have been a buzzword for quite some time, and I've seen both necessary and pointless migrations to this design in the last few years.
For us, it was a necessity.
The reason was the need to separate out stateless components, each with a unique role within the cluster.
All the microservices (orange) can receive tasks generated by any dashboard or core server, while the scheduler can introduce tasks to be performed by any capable service in the cluster.
They are easy to monitor, easy to update without disruption, decoupled, and their state is completely ephemeral.
The dashboard/core-server is still a big ass piece of software, though, and obviously, it carries the state of the logged-in users.
So how do these microservices talk to each other?
Microservices chatting #1 — an epistolary story
The message queue of choice, RabbitMQ, takes care of most of the internal communication for asynchronous tasks. Here’s how:
RabbitMQ plays four extremely important roles:
- It decouples each microservice from the rest of the cluster. Different implementations, as well as one-to-one and one-to-many consumption, become completely transparent;
- It implements a subscription-based model, where each service subscribes to the queues it cares about. By doing so, the message becomes the protagonist of the conversation;
- It determines how many messages should be delivered to each consumer before waiting for acknowledgements, thereby orchestrating the intensity of the parallel computation (see the consumer sketch after this list);
- It decides the fate of each message, ensuring it gets delivered to the right recipient. If the recipient is unavailable, the message is handed to a similar recipient or held until the recipient comes back. RabbitMQ also makes sure to receive a confirmation (acknowledgement) that the process spawned by the message came to a successful conclusion.
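To make the last two points concrete, here's a minimal consumer sketch in Scala using the official RabbitMQ Java client. The host, queue name, and prefetch value are illustrative assumptions for the example, not the actual API Fortress configuration.

```scala
import com.rabbitmq.client.{CancelCallback, ConnectionFactory, DeliverCallback}

object MailerConsumer extends App {
  // Host and queue name are made up for this sketch.
  val factory = new ConnectionFactory()
  factory.setHost("rabbitmq")
  val connection = factory.newConnection()
  val channel = connection.createChannel()

  // Durable queue: messages survive a broker restart.
  channel.queueDeclare("mailer.tasks", true, false, false, null)

  // Prefetch: at most 5 unacknowledged messages in flight per consumer.
  // This is the knob that sets how much parallel work each service takes on.
  channel.basicQos(5)

  val onDeliver: DeliverCallback = (_, delivery) => {
    val payload = new String(delivery.getBody, "UTF-8")
    // ... deserialize, decide whether the email is needed, render the template, send ...

    // Manual acknowledgement: only now does RabbitMQ consider the task concluded;
    // if the consumer dies before this point, the message is requeued for another one.
    channel.basicAck(delivery.getEnvelope.getDeliveryTag, false)
  }
  val onCancel: CancelCallback = _ => ()

  // autoAck = false: we acknowledge explicitly once the work is done.
  channel.basicConsume("mailer.tasks", false, onDeliver, onCancel)
}
```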
Microservices chatting #2 — being more direct
Certainly, RabbitMQ does an excellent job for asynchronous tasks, but at other times microservices need to provide prompt replies to the requesting agent.
Therefore, direct HTTP APIs are also in use for this specific scenario.
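As a trivial illustration of that synchronous path, here's what a direct call between two services could look like, using the JDK HTTP client from Scala. The service name and route are hypothetical; the article doesn't document the real internal endpoints.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration

object DirectCall extends App {
  // Hypothetical endpoint: ask the mailer service for its status and wait for the answer.
  val client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build()
  val request = HttpRequest.newBuilder(URI.create("http://mailer:8080/status")).GET().build()

  // Unlike the RabbitMQ path, the caller blocks here until the reply arrives.
  val response = client.send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode()} ${response.body()}")
}
```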
The actor model at work
Now, each one of the services we’ve mentioned does many things to perform its tasks.
The mailer, for example, quite a simple service, has to deserialize the message, determine whether the email really needs to be sent, retrieve the recipients, process a template and eventually send it. Not only that, but it is also capable of processing a certain number of email sends at the same time. Not bad for a tiny microservice.
Not to mention the core-server, the actual system that processes the tests: retrieve the test code, load the context, initialize the test, launch it, store the results, propagate the effects, trigger the notifications…
There's no such thing as simple software. Because:
- It may start simple, but it takes minutes to become a mess;
- Stop for a moment and think about what a so-called simple task really does behind the scenes, divide it into its smallest components, and evaluate what can go wrong. You quickly realize nothing can be totally dismissed as simple;
This is why I adopted the actor model, in the splendid implementation known as Akka, by Lightbend.
The actor model is material for a whole, never-ending article of its own, so I'm definitely not getting into the details here, but you need some context.
To put it VERY briefly, the objective is to split a big process into a multi-step flow, where each operation is an "actor": a standalone worker which communicates with the other actors only by receiving and emitting messages, performs one task at a time, and in case of problems does not propagate the effects of its failure, while a supervisor decides what countermeasures should be taken.
Each actor can do one thing at a time, as previously stated, so to allow multiple tasks of the same type to be executed, all you have to do is create more actors of the same type; a router then decides which instance of the actor receives each message, based on a specific routing logic. This allows some beautiful, quite deterministic fine tuning, but it doesn't end here.
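Here's what that looks like with Akka classic actors, using the mailer as the subject. The message, actor names, and pool size are illustrative assumptions; this is a sketch of the pattern, not the actual API Fortress code.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Illustrative message type for the mailer's work unit.
final case class SendEmail(recipient: String, template: String)

class EmailSender extends Actor {
  def receive = {
    case SendEmail(recipient, template) =>
      // Render the template and hand it to the SMTP server, one message at a time.
      println(s"sending '$template' to $recipient")
  }
}

object Mailer extends App {
  val system = ActorSystem("mailer")

  // Five identical workers behind a router: the router picks which instance
  // receives each message, so five emails can be in flight at once.
  val senders = system.actorOf(RoundRobinPool(5).props(Props[EmailSender]), "senders")

  senders ! SendEmail("user@example.com", "test-failed")
}
```

The router is itself an actor, so callers only ever see its address; round-robin is just one of the routing logics Akka ships (smallest-mailbox, consistent-hashing, and so on).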
Failure handling is also very important. We mentioned supervision; here are two examples of how we implemented it.
I/O actors share a supervisor due to their peculiarity of holding connections to databases. In this example, if a connection exception happens within the actors, it may very well mean that their connection to the database has gone bad, so the supervisor decides to restart the actors, hoping that re-initializing the connection will solve the issue. If notified of constant failures, after a certain number of attempts, the supervisor will consider the actors unable to perform their job and will take them down. This is a realistic scenario when the database becomes unreachable from where the actors live.
The system will, therefore, try to heal itself when possible, but it won't keep failing endlessly: it will take appropriate countermeasures instead.
The Execute actor is the completely opposite example: when it trips over a syntax error in the test code, it is told to resume operations. The code came in via a message and does not relate to the actor's state, so it wouldn't benefit from a restart.
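Both policies can be expressed as Akka supervisor strategies. The sketch below folds them into a single supervisor for brevity (in the real system the I/O actors and the Execute actor have their own); the exception types, retry counts, and actor names are assumptions for the example.

```scala
import java.sql.SQLException

import akka.actor.{Actor, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.{Restart, Resume, Stop}

import scala.concurrent.duration._

// Hypothetical exception representing a syntax error in the incoming test code.
class TestSyntaxException(msg: String) extends RuntimeException(msg)

// Stand-in for one of the I/O actors.
class ResultWriter extends Actor {
  def receive = { case results => println(s"writing $results to the database") }
}

class IoSupervisor extends Actor {
  // A broken database connection may heal with a restart, but only up to a point:
  // after 3 failures within a minute the child is stopped ("taken down").
  // A syntax error came in with the message and says nothing about the actor's
  // state, so the actor simply resumes and waits for the next message.
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 1.minute) {
      case _: SQLException        => Restart
      case _: TestSyntaxException => Resume
      case _                      => Stop
    }

  private val writer = context.actorOf(Props[ResultWriter], "writer")

  def receive = {
    case msg => writer.forward(msg)
  }
}
```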
The actors’ commune
The final step. The scariest, most profound and philosophical. Even though it does not involve weed and bongos.
As you may have guessed by now from the diagrams, both dashboards and core-servers can scale horizontally. Add an instance, and you will increase the computational power. But by how much?
From what you know of the architecture so far, if you, say, start with one dash/core and add a second one, you're not really doubling the potential of the system, because there are many factors to take into consideration. For example, say that, as bad luck would have it, most of the hardcore users logged into dash/core #1 and they're launching tests manually like crazy. Or say that dash/core #1 received two fat tasks that are taking a long time to execute. Or say that dash/core #1 has a lot of records to write to the database, piled up in an actor's mailbox.
Or even worse, say that dash/core #2 is incapable of communicating with the database for some reason, while dash/core #1 is perfectly able to.
Long-term stats will tell you that adding a new dash/core doubled your computational power, but that's just an approximation. It doesn't mean that in the day-by-day work we're experiencing the full potential of what we're paying for!
WARNING. Get ready for a massive diagram. It was probably unnecessary, but I wanted to impress you with my charting skills.
In the picture, we have two cores. The idea I'm trying to represent with the blue lines is the capability of the actor routers (which don't have a graphical representation here, but they are there) to determine whether the actors are ready to execute a task or buried in work.
As the two services exchange information on their status, each one knows whether its counterpart on the other service is in a better condition, and if so, the work gets forwarded there.
Communication overhead aside, you can now leverage the full potential of your setup by distributing the workload evenly.
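Akka's cluster-aware routers capture this idea: routees live on every node, and the router can pick the one whose node reports the lightest load. The sketch below uses the adaptive pool from the akka-cluster-metrics module; it's one way to obtain the behaviour described above under those assumptions, not necessarily how API Fortress wires it internally, and the actor name and role are hypothetical.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.cluster.metrics.AdaptiveLoadBalancingPool
import akka.cluster.routing.{ClusterRouterPool, ClusterRouterPoolSettings}

// Hypothetical stand-in for the test-running actor on each dash/core node.
class TestRunner extends Actor {
  def receive = { case job => println(s"running $job on ${self.path.address}") }
}

object ClusterRouting extends App {
  val system = ActorSystem("apif")

  // Routees are spawned on every node carrying the "core" role; the adaptive pool
  // uses the gossiped node metrics (heap, CPU) to pick the least loaded routee,
  // so a busy or struggling dash/core naturally receives less new work.
  val runners = system.actorOf(
    ClusterRouterPool(
      AdaptiveLoadBalancingPool(),
      ClusterRouterPoolSettings(
        totalInstances = 20,
        maxInstancesPerNode = 10,
        allowLocalRoutees = true,
        useRoles = Set("core"))
    ).props(Props[TestRunner]),
    name = "runners")
}
```

The metrics extension gossips each node's load figures across the cluster, which is essentially the status exchange the blue lines stand for; it requires the akka-cluster-metrics dependency and a cluster-enabled actor provider in the configuration.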
Conclusion
No actors, nor rabbits were harmed during the development of API Fortress.
Me? I’m older, wiser. Still daydreaming about the next big fat thing.