Designing a Code-Deployment System (Question from AlgoExpert)
I’ve been doing a lot of reading lately on designing large distributed systems and I came across the first question asked on the AlgoExpert platform by Clement Mihailescu under the SystemsExpert module. You can find the question here.
Before diving into the solution given by the course, I tried my own approach to see how it would compare to the instructor's answer. Needless to say, my approach differs on quite a few points and I wanted to share my take on it.
The problem statement
The question reads:
Design a global and fast code-deployment system.
That’s it. Well, that’s not a whole lot of information, but it’s ok as the platform is designed to simulate a real interview question. So we need some clarification on the requirements. Here are a few questions I came up with:
- Does the system handle repositories?
- Is there a fixed number of stages (e.g. build and deploy) or is it customizable?
- How many triggers are expected on any given day?
- Does it have to be highly available? (e.g. can it afford to be down for, let’s say, a total of one week over a year?)
The course actually comes up with more clarifying questions, so do check them out.
Long story short (if you don’t want to watch the video) here are the actual requirements of the system:
- It will handle only build and deploy.
- It needs to be available most of the time, but since it is an internal tool, we can afford some downtime.
- The system must keep a historical record of the builds/deployments
- The system is expected to run around 5000 builds/deployments per day
- It has to be code agnostic
- A build should take around 15 minutes
- It should be triggered when a commit gets merged into master
With that in mind, let’s try and design a system that handles all the cases.
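As a quick sanity check on those numbers, here is a rough capacity estimate. It is only a sketch: it assumes builds arrive evenly over the day, and the `peak_factor` parameter is a hypothetical knob (not part of the requirements) for burstier traffic.

```python
import math

BUILDS_PER_DAY = 5000      # from the requirements
BUILD_MINUTES = 15         # from the requirements
MINUTES_PER_DAY = 24 * 60


def average_concurrent_builds(builds_per_day, build_minutes):
    """Average number of builds in flight, assuming evenly spread arrivals."""
    return builds_per_day * build_minutes / MINUTES_PER_DAY


def workers_needed(builds_per_day, build_minutes, peak_factor=1.0):
    """Workers required if each worker runs one build at a time.

    peak_factor is an assumed multiplier for traffic peaks (1.0 = no peaks).
    """
    average = average_concurrent_builds(builds_per_day, build_minutes)
    return math.ceil(average * peak_factor)


print(round(average_concurrent_builds(BUILDS_PER_DAY, BUILD_MINUTES), 1))  # 52.1
print(workers_needed(BUILDS_PER_DAY, BUILD_MINUTES))                       # 53
print(workers_needed(BUILDS_PER_DAY, BUILD_MINUTES, peak_factor=3.0))      # 157
```

So even under the friendliest assumption we need on the order of 50 machines building at the same time, which is why horizontal scaling matters in what follows.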
Splitting the requirements
Given the requirements, we can immediately identify 2 separate parts of our system: the build service and the deploy service. Let’s start with the build module.
The Build module
The build module will be the starting point of the process. We need to consider a few things:
- We need a history of the builds.
- We need to store the build binaries somewhere.
- We need to get informed when a build request is coming, and we need to inform any listening services when the build has been completed (whether successful or not).
- We need to account for horizontal scaling
So the first point speaks for itself: we need a database. I considered a SQL database, like MySQL or PostgreSQL, as the data is fairly structured and ACID transactions should take care of the concurrency problems. The database will feature only one table with the following schema:
- sha: is of type VARCHAR and will store the SHA of the commit
- machine_id: is of type VARCHAR and will store the id of the machine that’s executing the job
- status: is of type VARCHAR and will store the status of the build, namely: RUNNING, COMPLETED, FAILED
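To make the schema concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for MySQL or PostgreSQL. The table name `builds`, the column sizes, the choice of `sha` as primary key, and the helper names `start_build` and `finish_build` are my assumptions for illustration, not something the requirements dictate.

```python
import sqlite3


def create_builds_table(conn):
    """Create the single table described above (names and sizes are assumed)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS builds (
            sha        VARCHAR(40) PRIMARY KEY,
            machine_id VARCHAR(64) NOT NULL,
            status     VARCHAR(16) NOT NULL
                       CHECK (status IN ('RUNNING', 'COMPLETED', 'FAILED'))
        )
        """
    )


def start_build(conn, sha, machine_id):
    """Record a build as RUNNING. Returns False if the sha is already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO builds (sha, machine_id, status) "
                "VALUES (?, ?, 'RUNNING')",
                (sha, machine_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key clash: another machine already took this commit.
        return False


def finish_build(conn, sha, succeeded):
    """Move a build to COMPLETED or FAILED."""
    with conn:
        conn.execute(
            "UPDATE builds SET status = ? WHERE sha = ?",
            ("COMPLETED" if succeeded else "FAILED", sha),
        )
```

With `sha` as the primary key, the database itself rejects a second machine trying to start the same commit, which is the kind of concurrency problem the transactions are meant to cover. The trade-off is that re-running a build for the same commit would need an extra column or a different key.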
The binaries can be stored in any blob storage like S3 or Azure blob storage. As simple as that.
To handle the incoming requests and to notify the status of the build, we can use a queue system like RabbitMQ. We would effectively have 2 types of events namely: build:init and build:completed. The first will be consumed by the service whenever there’s a new commit and the second will be published when a build has been completed, regardless of whether it failed or succeeded. We might want to run a single job per service, so when the service receives a build:init, it will cancel the subscription to the queue until the build has been completed.
With async communication, it is then fairly easy to scale horizontally: every newly spawned service subscribes to a particular topic (or channel) on the queue system and starts listening for events without affecting the other pieces.
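The event flow described above can be sketched with a toy in-memory broker. This is not RabbitMQ code: `InMemoryBroker`, `consume_one`, and `run_build_worker_once` are hypothetical names, and the `build` callable stands for whatever actually compiles the code.

```python
import queue

BUILD_INIT = "build:init"
BUILD_COMPLETED = "build:completed"


class InMemoryBroker:
    """Toy stand-in for a real broker: one FIFO queue per topic."""

    def __init__(self):
        self._topics = {}

    def publish(self, topic, message):
        self._topics.setdefault(topic, queue.Queue()).put(message)

    def consume_one(self, topic):
        """Return the next message on the topic, or None if it is empty."""
        try:
            return self._topics.setdefault(topic, queue.Queue()).get_nowait()
        except queue.Empty:
            return None


def run_build_worker_once(broker, machine_id, build):
    """Take at most one build:init message, run it, publish build:completed.

    Returns True if a job was processed, False if the queue was empty.
    """
    message = broker.consume_one(BUILD_INIT)
    if message is None:
        return False
    sha = message["sha"]
    try:
        succeeded = build(sha)
    except Exception:
        succeeded = False  # a crashing build is reported as FAILED
    broker.publish(
        BUILD_COMPLETED,
        {
            "sha": sha,
            "machine_id": machine_id,
            "status": "COMPLETED" if succeeded else "FAILED",
        },
    )
    return True
```

Because the worker only asks for the next message once it is idle, it never holds more than one job, which is the effect the cancel-and-resubscribe dance is after. With a real RabbitMQ consumer, a prefetch limit of 1 combined with manual acknowledgements should give similar behaviour, though I haven’t tried it in this design.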
Here’s a diagram describing the process:
But what if the build service goes down for whatever reason? Very good question! If a build service dies in the middle of a build, the build stays in RUNNING status forever and the client is left hanging, not knowing. That’s not a good feeling. So how can we mitigate it? One idea is to use health checks on the build services. Every x seconds, a specialized service performs a synchronous health check against each build_service; if a check times out, we simply assume that the build service died and mark its RUNNING database entry as FAILED. This couples the health_check_service to the build_service, but I think that’s ok in this scenario: we can see the health_check_service as an extension of the build_service that lives somewhere else so it doesn’t get dragged down with it. So we can update our diagram like so:
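As a complement to the diagram, one pass of the health_check_service could look roughly like this. It is a sketch under assumptions: it expects a `builds` table with `sha`, `machine_id`, and `status` columns, and `is_alive` is a hypothetical callable that would wrap the real check (for example an HTTP request with a timeout).

```python
import sqlite3


def running_machines(conn):
    """Machines that currently own at least one RUNNING build."""
    rows = conn.execute(
        "SELECT DISTINCT machine_id FROM builds WHERE status = 'RUNNING'"
    ).fetchall()
    return [row[0] for row in rows]


def fail_builds_on_dead_machines(conn, is_alive):
    """Mark RUNNING builds as FAILED when their machine fails the check.

    is_alive(machine_id) -> bool is assumed to return False on timeout.
    Returns the number of builds that were marked FAILED.
    """
    failed = 0
    for machine_id in running_machines(conn):
        if is_alive(machine_id):
            continue
        with conn:  # one transaction per dead machine
            cursor = conn.execute(
                "UPDATE builds SET status = 'FAILED' "
                "WHERE machine_id = ? AND status = 'RUNNING'",
                (machine_id,),
            )
        failed += cursor.rowcount
    return failed
```

The service would call `fail_builds_on_dead_machines` every x seconds. The `status = 'RUNNING'` condition in the UPDATE matters: it leaves builds that already finished on that machine untouched.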
The Deploy module
I think I would be very redundant if I explained everything in detail for the deploy module, as I would take the exact same approach as before except for one difference: this time we don’t need blob storage. Obviously, the name of the database and the names of the events are going to be different too.
So to recap quickly:
- The events expected will be: deploy:init and deploy:completed
- The database schema will be the same, except this time the name will be deployments
- There won’t be any blob storage
Here’s the diagram:
Combining the modules together
Now that the 2 modules are ready, we need to combine them into a single system. We haven’t talked about how the whole thing gets triggered. According to the requirements, the pipeline is triggered whenever a commit gets merged into master. We won’t cover that, but repository managers like GitHub or GitLab usually give the option to set custom triggers. We’ll assume the request will come through that.
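Even without covering the trigger itself, the filtering step is small enough to sketch. The payload shape below is an assumption modeled on the `ref` and `after` fields that push webhooks typically carry; `build_init_event` is a hypothetical helper, not part of any provider’s API.

```python
MASTER_REF = "refs/heads/master"


def build_init_event(push_payload):
    """Turn a push webhook payload into a build:init message.

    Returns None when the push is not to master or carries no commit SHA,
    so only commits landing on master start the pipeline.
    """
    if push_payload.get("ref") != MASTER_REF:
        return None
    sha = push_payload.get("after")  # assumed: SHA of the new head commit
    if not sha:
        return None
    return {"event": "build:init", "sha": sha}


print(build_init_event({"ref": "refs/heads/master", "after": "abc123"}))
# {'event': 'build:init', 'sha': 'abc123'}
print(build_init_event({"ref": "refs/heads/feature-x", "after": "def456"}))
# None
```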
Do we need a Load Balancer? I’m not sure. Considering this would be a pipeline, there’s only one entry point, and the requests will be balanced by the queueing system internally. So maybe in this particular case, it makes little sense to use a load balancer to distribute traffic. Maybe a simple reverse proxy is more suitable for this particular scenario. If we want to scale globally, we can simply (ah… simply…) replicate the system in different regions.
So using a reverse proxy, we can set it up so that when a new request comes in, it performs an HTTP request to the queue system to publish a build:init event. The event is consumed by one of the build_service instances, which then cancels its subscription to the queue. After completing the build, the service reinstates the subscription and publishes a deploy:init event. Similarly, a deploy_service consumes that message and performs the same steps as the build module. Once completed, we could take advantage of a notification_service (e.g. email, sms, whatever) to inform the user that the pipeline has completed, successfully or not.
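The hand-off between stages can be written down as a small pure function. Two things here are my assumptions rather than part of the design above: a failed build skips the deployment and goes straight to the user, and the `notify` topic is a hypothetical name for whatever feeds the notification_service.

```python
def next_messages(event, payload):
    """Return the (topic, message) pairs to publish after a stage completes."""
    sha = payload["sha"]
    status = payload["status"]
    if event == "build:completed":
        if status == "COMPLETED":
            # Successful build: hand the commit over to the deploy module.
            return [("deploy:init", {"sha": sha})]
        # Assumed behaviour: a failed build ends the pipeline early.
        return [("notify", {"sha": sha, "stage": "build", "status": status})]
    if event == "deploy:completed":
        return [("notify", {"sha": sha, "stage": "deploy", "status": status})]
    return []  # unknown events are ignored
```

Keeping this logic free of any broker calls makes it easy to test, and it makes the pipeline order explicit in one place instead of spreading it across services.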
Furthermore, let’s not forget about the history of builds. We can build a data aggregator service that collects the status of every pipeline in a single table that can be queried at any time. It consumes *:completed messages, stores them in its table, and updates the row whenever a new stage of the pipeline completes. Within a single pipeline it shouldn’t have any concurrency issues, because I’ve never seen a build and its deployment finish at the same time. The aggregated data can then be queried via a different client.
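Here is a minimal sketch of the aggregator’s write path, again using sqlite3 as a stand-in database. The `pipelines` table, its column names, and `record_completion` are assumptions made for illustration.

```python
import sqlite3

# Maps each *:completed event to the (assumed) column it updates.
STAGE_COLUMNS = {
    "build:completed": "build_status",
    "deploy:completed": "deploy_status",
}


def create_pipelines_table(conn):
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pipelines (
            sha           TEXT PRIMARY KEY,
            build_status  TEXT,
            deploy_status TEXT
        )
        """
    )


def record_completion(conn, event, sha, status):
    """Store the outcome of one stage in the pipeline's row."""
    column = STAGE_COLUMNS.get(event)
    if column is None:
        raise ValueError(f"unexpected event: {event}")
    with conn:  # both statements commit together
        # Create the row the first time we hear about this commit...
        conn.execute("INSERT OR IGNORE INTO pipelines (sha) VALUES (?)", (sha,))
        # ...then fill in the column for the stage that just completed.
        # The column name comes from the fixed mapping above, never from input.
        conn.execute(
            f"UPDATE pipelines SET {column} = ? WHERE sha = ?", (status, sha)
        )
```

A client can then read the whole history with a plain `SELECT * FROM pipelines`, without touching the build or deploy databases.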
For the sake of brevity, because believe me, I’m more bored writing this than you are reading, here’s what could have been done better:
- We haven’t talked about the data residing in the blob storage. Unless we have infinite money and our provider has infinite storage space, the binaries should be deleted after each build.
- Since we now have a data aggregator service, storing the data in each service seems quite redundant.
- This system adds more complexity as we need to keep track of the machine executing the job and we have to manually cancel a subscription to the queue and register again. Maybe a polling system to a simpler queue would have been better?
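To give an idea of what the polling alternative from the last point could look like, here is a rough sketch in which the database table itself acts as the queue. It assumes an extra QUEUED status that is not part of the schema described earlier, and `claim_next_build` is a hypothetical helper that each worker would call in a loop.

```python
import sqlite3


def claim_next_build(conn, machine_id):
    """Claim the oldest QUEUED build for this machine.

    Returns the claimed sha, or None if nothing is waiting or another
    worker claimed the row first.
    """
    with conn:
        row = conn.execute(
            "SELECT sha FROM builds WHERE status = 'QUEUED' "
            "ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in the WHERE clause makes the claim conditional:
        # if another worker got there first, no row is updated.
        cursor = conn.execute(
            "UPDATE builds SET status = 'RUNNING', machine_id = ? "
            "WHERE sha = ? AND status = 'QUEUED'",
            (machine_id, row[0]),
        )
        return row[0] if cursor.rowcount == 1 else None
```

A worker that gets None simply sleeps and polls again, so there is no subscription to cancel or reinstate. The price is polling traffic on the database.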
And here are some advantages:
- Highly scalable, adding or removing services won’t impact how the system behaves.
- Highly available. If any of the services goes down while performing operations, the health_check_service will take care of that case.
What do you think? I would love to hear comments on things I missed out or blatantly got wrong. How would you improve this design? Let me know in the comments.
Thanks for reading.