Implementing CI/CD at Kloeckner-i — A History Lesson
How It Started
At Kloeckner-i we had the opportunity to start on a green field with the whole development department — ideal conditions for creating the workflow we always wanted. A service architecture, Docker, container orchestration, zero downtime deployments and all these fancy words.
We had the idea of creating standalone review environments for each feature branch. The environments needed to be as close to production as possible. In the end we came up with these requirements:
- Each commit triggers our GitLab pipeline and this starts a deployment.
- Each feature branch gets its own review environment.
- For the master branch and production environment the deployment works exactly the same — each commit to master also starts the deployment to production.
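To make this concrete, here is a minimal sketch of what such a pipeline could look like in a .gitlab-ci.yml. The job names and the deploy.sh script are placeholders for illustration only; our real deployment logic lived behind Shia (and later Stanley), not in a shell script.

```yaml
# Minimal sketch: per-branch review deployments, production deployment from master.
stages:
  - build
  - deploy

build:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

# Every feature branch gets its own review environment, named after the branch.
deploy_review:
  stage: deploy
  script:
    - ./deploy.sh review-$CI_COMMIT_REF_SLUG $CI_COMMIT_SHA
  environment:
    name: review/$CI_COMMIT_REF_SLUG
  except:
    - master

# Commits to master go to production through exactly the same mechanism.
deploy_production:
  stage: deploy
  script:
    - ./deploy.sh production $CI_COMMIT_SHA
  environment:
    name: production
  only:
    - master
```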
When we started this project in 2016 the concepts were quite new for all of us. Of course there were books and thousands of blog posts about the topic, but reading about it and actually implementing it are very different things. Thankfully we had the support of business and the product owners. Nevertheless there were still several discussion sessions about the new workflow: the usual work of bringing a DevOps culture into a company.
First Implementation
We started with a small team (4 developers, 1 DevOps engineer) but had a big vision, so we needed to take some shortcuts. One of these was Rancher, at that time a wrapper around different container orchestration solutions with an easy-to-use web GUI and a powerful API. Rancher also had the concept of different environments and templates, perfect for our use case. So we had the pipelines in GitLab and the platform to run our containers, but the layer in between was missing. Where should we put the logic for the new workflow, and how would the GitLab runners talk to the Rancher API? We needed a small tool to fill the gap between the two. Shia was born.
Nothing worked perfectly from the start; however, the business wanted to go live with the first application, Direct, so we started with just the production and integration environments. Creating review environments on the fly was not usable yet, for several reasons. One big blocker was the lack of support for secret management via the Rancher API. We were in close contact with the Rancher folks, tested beta features and finally got the new version with the secret feature we needed. Some publicly available Docker images we wanted to use also didn't support secret injection the way we needed, so we jumped in and added this feature ourselves (e.g. here and here), which was thankfully accepted quickly.
Learnings
Our new application was a collection of services (something between SOA and microservices): a user-service, a customer-service, a mail-service and so on, plus two JavaScript frontends, RabbitMQ, PostgreSQL databases and, for monitoring and logging, Kibana, Prometheus and Grafana. Everything was packed into Docker containers, each with its own Git repository and its own deployment pipeline. Just perfect.
Perfect, that is, until developers want to roll out big features as one release. Frontend changes and the related service changes could not go live at the same time, and we didn't want someone hitting two different merge buttons in GitLab simultaneously either. Actually I remember having long discussions about this during the definition of our workflow in 2016. We were aware of this problem and even welcomed it: it pushes developers to create smaller, atomic changes at a higher deployment frequency. We trained the newcomers (our team was growing quite fast, to 10 developers and 3 DevOps engineers) on this, but the best way of overcoming old habits is still to deploy a change and break production with it. Learning by doing… and everybody learned this rather quickly ;-)
With a higher deployment frequency and fast-moving small parts, the probability of problems, bugs and errors increased, so we needed a very fast and easy way to roll back the software. Here we simply reused our normal deployment workflow. Rolling back the user-service by one version, for example, meant picking the correct commit from the history in GitLab and re-triggering the deployment step of that pipeline. Of course, after checking that everything is fine and the problem is solved, you still need to clean up the Git repository by reverting the master branch to the version that was re-deployed, but the rollback itself only depends on how fast your actual deployment is. For us this takes less than one minute on average.
A side note: we did not need this rollback very often, maybe once per quarter. All these things work hand in hand to produce a robust production system: deploying only small, atomic changes, the pipelines, the review environments, automatic acceptance tests and, of course, the people living this process.
Time for Something New
By the end of 2017 we had already realized that some of our shortcuts from the early days would come back as bigger problems. The tools we used for deployment (Rancher, Shia, GitLab) were tightly coupled and would only fit new requirements from our business after bigger changes and refactoring. Maintaining Rancher also took time: rolling out a new Rancher version required almost a week of testing.
At the time we were not only running Direct but also Part Manager and a third tool called Consignment. They all used our service architecture, the same customer data, the same RabbitMQ and the same mail service. We had created all of this for our American branch, but then a new player entered the game: Becker Stahl, a German branch of Kloeckner. They had different customers, different data, different backend systems and wanted a completely different tool: the Becker Order Book. This, and the need to separate the German from the American customer data, raised the idea of having different clusters for different applications.
In the best agile manner we accepted these challenges and wanted to solve all of them at once in one bigger project. The developers and QA didn't really need the help of us three DevOps engineers; they could work on their own: deploy, roll back and, from time to time, break production. Monitoring, logging, backups, on-call… everything was in place. So we were free to start.
Kubernetes was already on our minds. Even when we started with Rancher in 2016, we checked out its Kubernetes integration. However, at that time Rancher played better with Cattle, and the developers were more into its docker-compose-like files. But by the end of 2017 even Rancher was preparing the next big step: version 2.0, completely focused on Kubernetes.
Kubernetes It Was
Instead of waiting for the Rancher 2.0 release and afterwards maintaining several of these instances to manage the Kubernetes clusters for us, we decided to use plain Kubernetes. We were already hosting our machines on Google Cloud Platform and could directly use Google Kubernetes Engine for our clusters. That shrank the maintenance time and cost by a lot.
However, our knowledge of Kubernetes was limited, and it is always hard to use a new tool correctly without some experience. So we held a workshop session with an expert and hired two new DevOps engineers who were already familiar with running Kubernetes in production (and believe me, that was not an easy task).
Next we created the requirements for our new tooling. In the future, when the product people came to us asking for a new application for a Kloeckner branch, we would need to be able to put it into an existing cluster, or create a new one including tools, workflow, etc., within 1–2 days. Each cluster should look and feel the same (namespaces, ingresses, monitoring, logging, deployment). That makes it easier for developers to jump between projects and also removes the pain of documenting and maintaining special cases.
Kubernetes was set, and our GitLab pipelines did a great job, but the part in between was missing again. We abandoned Shia and started a new tool for this: Stanley. The first implementation was a simple Python command line tool started on the GitLab runners. After a short time we switched to the Kubernetes operator approach. Now Stanley is a framework consisting of different operators and controllers, mainly running inside each cluster.
In the end, our actual deployment is just a simple call against the Kubernetes API. The deploy step in the pipeline creates or updates a custom resource with all the information the Stanley operator needs: Git project, Git commit, Git branch and the desired namespace. The operator watches for events on this custom resource and applies the changes. When a corresponding event comes in, the operator creates a new review environment, and tools like our DB-Operator and secret syncer set up everything for the application.
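To illustrate the idea, such a custom resource could look roughly like the sketch below. The API group, kind and field names are assumptions made up for this example, not Stanley's actual schema:

```yaml
# Hypothetical example of the custom resource a deploy step might create or
# update. Group, kind and field names are illustrative only.
apiVersion: stanley.example.com/v1
kind: AppDeployment
metadata:
  name: user-service-feature-login
spec:
  gitProject: backend/user-service   # Git project the pipeline runs in
  gitCommit: 4f2a9c1d                # commit that should be rolled out
  gitBranch: feature/login           # branch, used for the review environment
  namespace: review-feature-login    # desired target namespace
```

In a setup like this, the deploy job in GitLab only needs permission to apply that one resource; everything else happens inside the cluster.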
The Becker Order Book had a new home in Kubernetes and the first version of Stanley was more or less working. Now we needed to migrate Direct, Part Manager and Consignment from Rancher to another Kubernetes cluster. We converted each service to Kubernetes and ran parallel deployments to Rancher and Kubernetes for each of them before we finally moved the data and DNS entries. In April 2019 the move was done, the project was a success and everybody was very happy.
And suddenly a beast of a problem revealed itself for the first time: SAP Hybris. But that is a story for another day.