My previous post concluded with approaches to enhance the success rate of all sorts of dates. This one discusses the next step: breakups. Let's start with a story.
In the beginning, I put the whole back-end on one machine (A), and the server application innocently talked to the database over the loopback interface. Until one day, the disk stopped working, and everything was gone. I realized a "single point" is risky. So I duplicated the server executable on another machine (B) and moved the DB onto an individual machine (C). To enable simultaneous access to both A and B, I configured an nginx reverse proxy on a fourth machine (D), with its other side facing the outside network. Until one day, the disk on C stopped working, and everything was gone. I realized that every "single point" in a system is risky. So I added a secondary DB instance on E and configured master-slave replication. In addition, I set up a Memcached instance to speed things up.
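The reverse proxy on D might look roughly like the fragment below. This is a hypothetical sketch, not the actual configuration from the story; the upstream addresses are placeholders for the server applications on A and B.

```nginx
# Placeholder addresses standing in for machines A and B.
upstream game_backend {
    server 10.0.0.1:8000;  # server app on A
    server 10.0.0.2:8000;  # server app on B
}

server {
    listen 80;  # the side facing the outside network

    location / {
        proxy_pass http://game_backend;  # nginx balances across A and B
    }
}
```

With the default round-robin policy, requests alternate between the two upstreams, so either machine can die without taking the service down.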
I believe this fairly realistic architecture would be able to survive for a few months in this fictitious story, so I will use it as the sand table for the following discussion.
Breakups, as the name might suggest, can be used to prevent back-end breakdowns.
Putting everything in a monolithic executable is considered a bad practice, and people tend to break the functionality into smaller components, so-called microservices. This practice introduces additional inter-process / network communication and imposes service-management complexity. So why bother? And how do we break up gracefully?
The first time I felt the impulse to break the back-end application up was after the architecture was adopted by an online-game team. The development pace became faster, with new versions released on a weekly (if not daily) basis for new features, promotional activities, and rule adjustments.
One day, I was asked to adjust the rules of a ranking sub-system. The PM changed his mind very often, and this was the third time. But this time I unknowingly coded a null pointer exception that could only be triggered under some very subtle conditions. It was so subtle that all the tests failed to find it. The bug was eventually triggered in the production environment, and the server instances on A and B turned into core dumps one after another at night (when the conditions were met, I guess). That night, the throughput dashboard made a grimace at me when alert messages woke me up at 3:00 am. The grimace looked approximately like this.
I fixed it on the spot. But I worried about future sleepless nights with every new release. And I also worried about worrying about it during future sleepless nights. I decided I would not lose sleep over this ranking service, at least. In fact, it is not very important, as not many users care about their ranking (only the top 50, maybe). It is not even on the critical path: if something goes wrong, the server simply returns an empty result and the user sees a locally stored ranking. So the incident was like a stereo-system glitch crashing a whole car, and I decided to isolate this stereo system from the main body. Technically, I moved all the ranking logic to a new service process, refactored its public functions into RPC calls, and kept everything else almost the same. Now an exception in the ranking sub-system affects the ranking sub-system only, and I will be able to sleep better and create fewer exceptions.
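The degraded-on-failure behavior described above can be sketched in a few lines. This is a minimal illustration, not the actual code from the story; `rpc_call` is a hypothetical stand-in for whatever RPC stub the API layer holds.

```python
def get_top_ranking(rpc_call, user_id, limit=50):
    """Ask the ranking service for the top rankings over RPC.

    Ranking is off the critical path, so any failure in the
    isolated service degrades to an empty result instead of
    crashing the main server process.
    """
    try:
        return rpc_call("ranking.top", user_id=user_id, limit=limit)
    except Exception:
        # The client falls back to its locally stored ranking
        # when it receives an empty list.
        return []
```

The key point is that a crash in the ranking process now surfaces here as a failed RPC, which the caller can absorb, rather than as a core dump of the whole server.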
Rule number one: non-critical functionalities, and those released frequently, should be separated from critical ones.
I remembered the lesson of the single point of failure, so I deployed multiple instances of the ranking service and made the RPC component capable of load balancing.
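Client-side load balancing over the duplicated instances can be as simple as round-robin. A minimal sketch, assuming the RPC component holds a list of instance addresses (the addresses below are placeholders):

```python
import itertools


class RoundRobinPicker:
    """Rotate through the ranking-service instances for each RPC."""

    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def pick(self):
        # Each call returns the next instance in turn, so load
        # spreads evenly and one dead instance is not a single point.
        return next(self._cycle)


picker = RoundRobinPicker(["10.0.0.5:9000", "10.0.0.6:9000"])
```

A production RPC layer would also detect dead instances and retry on the next one, but the rotation above is the core of the idea.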
Services as Software Modules
After this, I continued the practice of isolating risk at smaller granularities. I borrowed the idea of modular programming and grouped relevant APIs together into one service each, and found that it also made development and debugging much easier, as I could view all the relevant code in one place without distraction from irrelevant code. Oh, and compiling became much faster than before because less source code was involved. As a result, the original server application was reduced to a thin API layer that merely aggregates the returns of the services, which now form a new logic layer.
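The thin API layer can be pictured as a pure fan-out-and-aggregate function. This is an illustrative sketch only; the service names are invented, and the plain callables stand in for the real RPC stubs:

```python
def profile_api(user_id, ranking_svc, experience_svc, money_svc):
    """Thin API layer: no business logic, just aggregation.

    Each *_svc argument is a stand-in for an RPC stub into one
    service of the logic layer.
    """
    return {
        "ranking": ranking_svc(user_id),
        "experience": experience_svc(user_id),
        "money": money_svc(user_id),
    }
```

Because the layer holds no logic of its own, it rarely changes and rarely crashes; the churn, and the risk, lives in the individually deployable services behind it.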
Rule number two: a server application can be divided into services the same way one defines software modules.
Slow vs. Fast
I was happy with the architecture until one day the play-money APIs became too slow under high load. Generally speaking, this group of APIs writes changes to a cache, which is in turn flushed to disk occasionally in an asynchronous manner. Writing to a cache should not be slow, and other APIs that write to the cache, such as the experience APIs, did not slow down like this.
The problem was solved with another breakup, after a better understanding of the service. In this game there are two types of currency: 1) play money, which users earn by playing the game, and 2) paid money, which users can only buy with real dollars. For paid money, one request incurs synchronous writes to both cache and disk, and the response waits until all the transactions and changes are saved in the DB. Because they deal with secondary storage, the paid-money APIs are considerably slower than the play-money ones. I had overlooked this speed difference and intuitively grouped the two kinds of money APIs into one service based on rule number two. As a result, the slow paid-money APIs occupied resources (I mean threads) shared with the fast play-money APIs, which were blocked like cars stuck behind bicycles in the same lane.
Rule number three: slow services should be separated from fast ones.
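The same idea can be shown inside a single process with two thread pools, one per speed class. This is a hypothetical sketch of the principle, not the story's actual fix (which split the two API groups into separate services), and the handler names are invented:

```python
from concurrent.futures import ThreadPoolExecutor

# Fast play-money requests touch only the cache; slow paid-money
# requests wait on synchronous DB writes. Giving each class its own
# pool means the slow ones can no longer starve the fast ones.
fast_pool = ThreadPoolExecutor(max_workers=16)  # cache-only writes
slow_pool = ThreadPoolExecutor(max_workers=4)   # cache + DB writes


def handle(request, play_money_api, paid_money_api):
    """Dispatch a request to the pool matching its speed class."""
    if request["type"] == "paid":
        return slow_pool.submit(paid_money_api, request)
    return fast_pool.submit(play_money_api, request)
```

Running the two classes as separate services takes this one step further: the pools live in different processes on different machines, so the isolation survives crashes and deploys as well as load spikes.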
If you liked this read, please clap for it or follow by clicking the buttons. Thanks, and hope to see you next time.