Lessons learned with GitLab Runner on Kubernetes
At 90 Seconds, we use GitLab to host our internal source code and run CI/CD pipelines with the built-in GitLab CI. As our engineering team grows, the need for a proper CI/CD setup grows with it, especially since a good setup can speed up the entire development process severalfold. Earlier this month, we decided to take a hard look at the current setup and challenged ourselves to improve it.
The Setup
Initially, a fully managed solution that works out of the box was everything we needed to get the ball rolling. GitLab has a really nice Kubernetes integration, so it took us just a few clicks to get a fully working environment.
However, over the next couple of months, we found several drawbacks:
- Performance: by default, GitLab CI gave us a two-node setup without autoscaling, which didn't work well for our team, especially when the workload fluctuates
- Cost: although the amount was not very significant, it was hard to justify for the value we were getting out of it
For those reasons, we explored the option of a custom Kubernetes setup. Our initial assessment was that autoscaling would let us spin runners up and down in no time while costing much less.
We used Terraform to spin up a new GKE cluster, with some extra customisation needed for our environments (a rough sketch of the build node pool follows the list below):
- Autoscaling is enabled so that we only use resources when we actually need them; GKE scales the node pool down to 0 instances when no job or build has been active for a period of time.
- Preemptible VMs are used for one of the node pools, which serves the less important but time-consuming build tasks. We still have the same amount of resources allocated, but at roughly one fifth of the cost!
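As a rough illustration, here is what such a node pool might look like in Terraform. This is a minimal sketch, not our exact configuration: the resource names, cluster reference, machine type and pool sizes are placeholders, and the autoscaling and preemptible settings are the parts that matter.

# Minimal sketch of a preemptible, autoscaling GKE node pool for CI builds.
# Names, machine type and sizes are placeholders, not our production values.
resource "google_container_node_pool" "ci_builds" {
  name    = "ci-builds"
  cluster = google_container_cluster.ci.name   # placeholder cluster resource

  autoscaling {
    min_node_count = 0   # scale to zero when no builds are running
    max_node_count = 5
  }

  node_config {
    preemptible  = true  # roughly 5x cheaper, fine for interruptible build jobs
    machine_type = "n1-standard-4"
  }
}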
With the above infrastructure in place, our next step was installing GitLab Runner. We decided not to follow the official documentation but instead use the Helm chart for its flexibility, which also makes upgrades and maintenance a lot easier in the future. Here are the prerequisites:
- A GitLab registration token, which can be retrieved from the repository settings
- A bucket to store build caches
- A service account with access to that bucket (Terraform can handle this too; see the sketch after this list)
- A service account key to be used with the Helm chart
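For reference, the bucket, service account and key could look roughly like this in Terraform. Again a minimal sketch with placeholder names; the key's private_key output is what ends up in the Kubernetes secret the chart's cache settings refer to.

# Minimal sketch: cache bucket, service account and key for the runner.
# Resource names, the bucket name and the location are placeholders.
resource "google_storage_bucket" "runner_cache" {
  name     = "example-gitlab-runner-cache"
  location = "ASIA-SOUTHEAST1"
}

resource "google_service_account" "runner_cache" {
  account_id   = "gitlab-runner-cache"
  display_name = "GitLab Runner cache access"
}

resource "google_storage_bucket_iam_member" "runner_cache" {
  bucket = google_storage_bucket.runner_cache.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.runner_cache.email}"
}

# The private key from this resource is what the Helm chart consumes.
resource "google_service_account_key" "runner_cache" {
  service_account_id = google_service_account.runner_cache.name
}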
And finally, the deployment values for Helm.
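The exact keys depend on the chart version, so treat the following gitlab-runner.yaml as a sketch of the older values layout rather than a drop-in file; the token, bucket and secret names are placeholders.

# gitlab-runner.yaml -- sketch of Helm values for the gitlab-runner chart.
# Key names vary between chart versions; check your chart's values.yaml.
gitlabUrl: https://gitlab.com/
runnerRegistrationToken: "<registration-token>"   # from the repository settings
concurrent: 10

rbac:
  create: true

runners:
  image: ubuntu:18.04
  privileged: true          # needed for Docker-in-Docker builds
  cache:
    cacheType: gcs
    gcsBucketName: example-gitlab-runner-cache    # placeholder bucket name
    secretName: google-application-credentials    # secret holding the SA key

We can use either Terraform's Helm provider or the CLI to proceed: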
helm upgrade --install gitlab-runner <chart-path> -f gitlab-runner.yaml
Check the deployment status
$ kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
gitlab-runner-112233-4455   1/1     Running   0          14h
Checking the repository settings page, we see a new runner appearing in the list.
Then we start some builds and see our runners in action:
runner-hagzat7x-project-123 4/4 Running 0 27m
runner-hagzat7x-project-456 4/4 Running 0 23m
runner-hagzat7x-project-789 4/4 Running 0 13m
We have just finished setting up a Kubernetes cluster dedicated to GitLab Runner!
Based on our calculations, this setup will only cost us $100–150 per month, a significant reduction!
Reduce Build Time
With the previous setup, a standard build usually took 28+ minutes to finish, which is quite slow, and we don't want our folks waiting too long before they can continue their work.
Ruby on Rails is our primary development framework. As we all know, a RoR application depends on a fair number of gems, and a CI build spends a significant amount of time installing them, so caching really helps to speed up this step.
A cache configuration was already in place. However, since we spread the workload across many runners, the default local caching didn't reduce build time much.
We spent some time learning how shared caching works. Since our workers are just Kubernetes pods, any local cache is discarded before the next build. For that reason, we need to store the cache in centralised storage such as Amazon S3 or Google Cloud Storage. Let's see what that looks like.
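The runner side is the GCS cache configured in the Helm values above; on the project side, a .gitlab-ci.yml along these lines caches the installed gems between builds. The job name, image and paths here are illustrative, not our exact pipeline.

# .gitlab-ci.yml -- illustrative caching setup for a Rails project.
# The runner uploads/downloads this cache to the GCS bucket configured earlier.
cache:
  key: "$CI_COMMIT_REF_SLUG"        # one cache per branch
  paths:
    - vendor/bundle                 # where bundler installs the gems

test:
  stage: test
  image: ruby:2.6
  script:
    - bundle install --path vendor/bundle
    - bundle exec rspec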
And the result
Much faster, right? But less time for our folks to chill now :(
Final Diagram
TODO
Although we have a good start on improving our CI/CD pipeline, there are still many things left to enhance:
- There are some concerns about using Docker-in-Docker (DinD) for publishing Docker images. It might be better to use dedicated instances in an autoscaling group, which would probably have less overhead and slightly better performance, at the cost of more management.
- The time runners spend pulling images should also be taken into account. We could probably run an in-house caching Docker registry to speed this up.
As we've seen, GitLab CI is quite flexible to work with and provides a lot of functionality. Despite the complexity, there is no doubt that keeping the CI/CD pipeline reliable and stable is an important task that every organisation should focus on.
Tommy (Tuan) Nguyen — Kubernetes Enthusiast — Senior DevOps Engineer @ 90 Seconds