In this article, I will share the 4 most significant changes we made to our platform and apps in Telenor in order to deploy new features as quickly as possible. These are not meant to be rules per se, but rather adjustments that have worked out very well for Telenor and which I think can be applied in most organisations, whether they run OpenShift/Kubernetes or not.
Why does it matter?
I am by nature a very impatient person, and having to wait longer than necessary from when a change is made until the end user can use it does not make sense to me. 🤷
From a business and economic perspective, shaving even minutes off idle time has a great positive impact on the company: multiply the number of code changes by the time developers spend waiting for features to be deployed, which in itself can be frustrating. Then there are the dependencies between teams, e.g. Team A has to deploy something to production that Team B requires in order to deploy their stuff. This inter-team dependency cost is often ignored in large organisations. 💰 Also worth mentioning is the context switching involved when the waiting period becomes too long to just wait out, so you have to start doing something else while waiting for a deployment.
There is no doubt that OpenShift has contributed hugely to the success of deploying apps in Telenor, but most of the points I make in this article are problems which are not solved by OpenShift. After reading and implementing these suggestions, you will be able to deploy code to production with confidence.
1. Don’t be scared of deploying to production
This is by far the most important experience and probably the one that cannot be solved with only technical improvements. If you have ever deployed code to production, you have most likely experienced the uncertainty that everything might not work as you expect. Often you have to assert that it works by doing some manual testing, and still you have some sense that it might not work for all end users. That has to stop, now.
The deployment dashboard
One of the big pain points related to deployments in Telenor was that developers did not know exactly what changes were going into production. Running on OpenShift, we only knew that an app was being copied/promoted from the test cluster to the production cluster. If the same person deployed to production all the time, they probably remembered what went into production the last time and could make an educated guess about what would make up the next deployment.
However, if you want your whole team to deploy with great confidence, you need a way to visualize what is currently in production, what is currently in test and what will be deployed from test to production.
Our simple deployment dashboard is custom made, but there are off-the-shelf products like Spinnaker and GitLab which provide what you need plus tons of other features. We did not want to invest a lot of time and resources into integrating something big, which is why we spent a couple of days creating a custom-tailored dashboard which showed us what features were going to be deployed to production. It is like a to-do list with the tasks already completed, waiting to be ticked. ✔ Anyone can make one of these dashboards, and I would say it is a requirement if you want to deploy rapidly to production during daytime. ☀️
Implementation TL;DR: We used Jenkins to build the image and set Git metadata on the app artifact, in our case the container image. We could then use the OpenShift REST API to get the commit id of a given container image from both the test and the production cluster. Then we used the Bitbucket API to get the commits (diff) between the two commit ids and displayed them in the deployment dashboard. That way, developers would know at any time which changes would go into production, based on the commit messages.
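The core of the steps above can be sketched roughly as follows. This is a minimal illustration and not the actual dashboard code: it assumes the commit history has already been fetched (for example from the Bitbucket API) as a newest-first list, and that the commit ids running in test and production have already been read from the image metadata. The names Commit and pending_commits are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Commit:
    sha: str
    message: str


def pending_commits(history: List[Commit], prod_sha: str, test_sha: str) -> List[Commit]:
    """Return the commits that are in test but not yet in production.

    `history` is assumed to be ordered newest first, as most Git hosting
    APIs return it. The result is also newest first.
    """
    shas = [commit.sha for commit in history]
    if prod_sha not in shas or test_sha not in shas:
        raise ValueError("both commit ids must be present in the history")

    test_index = shas.index(test_sha)
    prod_index = shas.index(prod_sha)
    if prod_index < test_index:
        raise ValueError("production is ahead of test")

    # Everything from the test commit down to, but not including,
    # the production commit is waiting to be deployed.
    return history[test_index:prod_index]


if __name__ == "__main__":
    # Hypothetical history, newest first.
    history = [
        Commit("d4", "Add feature toggle"),
        Commit("c3", "Fix login redirect"),
        Commit("b2", "Update invoice layout"),
        Commit("a1", "Initial release"),
    ]
    for commit in pending_commits(history, prod_sha="a1", test_sha="c3"):
        print(commit.sha, commit.message)
```

The dashboard then only has to render the returned commit messages as the list of changes waiting to be promoted.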
2. Make your Git workflow as simple as possible
Adopting the Gitflow workflow has some benefits, but it introduces even more problems:
• Manual work creating and merging feature, release and hotfix branches into develop and master branches: do we really need 5 different branch types? 🤯
• Cumbersome versioning work: do you really need to version the Git branch of a web app or microservice if it is backwards compatible? OpenShift also provides rollback mechanisms if a «release» needs to be reverted
• Release branches all too often become enormous and the release date is usually set beforehand, creating greater risk when deploying because everything has to be perfectly in sync, which is hard in a big organization
We decided to follow the increasingly popular trunk-based development model. If we take advantage of short-lived branches which get merged into master as soon as the feature is ready, we can spend less time wrapping our heads around the current state of all Git branches, and more time creating new useful features. This requires that only production-ready code is merged to the master branch. It took a couple of weeks of trial and error before everything went smoothly, and the learning curve was somewhat steep, but in the end, trunk-based development has become the standard for all future apps deployed on OpenShift.
3. Build your app using the right tools
Whenever you make a change to the code and push it to version control, it should be built immediately to shorten the feedback loop, shifting left to detect problems as early as possible. In short, shift left is a practice in which software components are small and are built, tested and deployed as soon as possible to prevent surprises at the end of a development cycle.
Most organisations have a CI server which builds the applications, which is something you should continue doing. What we need to look at are the tools used to build the deployable artifacts.
Jib: Building Java container images on OpenShift/Kubernetes
Note! This section might be a bit technical if you are not currently running your apps in containers. 👷🏻‍♂️
We started out using Docker for building on OpenShift, but it was sloooow and it didn’t cache the Java dependencies. We also didn’t use multistage Docker builds on OpenShift because OpenShift was using an old Docker version, so we had two builds: one for building the runnable JAR file, and one for building the container image itself. They ran sequentially, and the start-up time for the builder Pods added extra time. This was way too boring to wait out, so we started looking into alternatives.
We tried Kaniko, a tool by GoogleContainerTools on GitHub. It works in the same way as Docker but without the infamous Docker daemon. You would start a Kaniko agent on the cluster through the Kubernetes plugin for Jenkins. We thought we had found our optimal solution. Kaniko is great in theory, but it required a great amount of time to set up correctly so that it worked with all apps. You still have to write a Dockerfile, and writing one which fits all Java apps was not possible. Another disadvantage of using Kaniko is that it is cumbersome to build app images locally on a developer machine, so we decided to look for other alternatives. 🔍
Fast forward 4 months and the entire fleet of Java apps had been switched to Jib, a project by GoogleContainerTools, which lets you build OCI/Docker images without Docker itself. You only need the JDK, which is needed anyway to build Java apps. Big win for us! The old build system with two individual builds took approximately 15 minutes for the largest legacy app, while the new Jib-based approach reduced that to around 5 minutes! Great success! 🎉
4. Adjusting readiness probes with precision
If you have worked with monolithic applications, you have probably experienced that they have a significant start-up time, from when the process is started until it is ready to receive traffic. It is not uncommon for this to take several minutes, and if you consider how many deployments you wish to make every day throughout a year, that adds up to well… a lot of waiting time.
If you are using Kubernetes or OpenShift…
The official Kubernetes documentation describes readiness probes as follows:
Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don’t want to kill the application, but you don’t want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.
What did we do?
We looked at what is possibly our largest legacy app, which was already running on OpenShift. Starting out, we registered that the startup time for the app was 4.5 minutes at its worst, so we set the initial delay of the readiness probe to start checking our app after 5 minutes had passed, just to be sure.
However, sometimes the app started in around 2 minutes, and sometimes in around 3 minutes, but the initial delay still stayed at the highest startup time to make sure the app was not marked as unready, which by default happens if 3 readiness checks fail consecutively.
From the Kubernetes documentation on startup probes:
failureThreshold: When a Pod starts and the probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in case of liveness probe means restarting the container. In case of readiness probe the Pod will be marked Unready. Defaults to 3. Minimum value is 1.
In OpenShift and Kubernetes, it turns out that you can specify the failure threshold, e.g. so that 9 readiness checks may fail before the probe gives up on the app.
We set the initial delay (initialDelaySeconds) to the lowest registered start-up time, which was 60 seconds on a good day, and then we set the retry interval (periodSeconds) to 10 seconds. The failure threshold, counted in number of checks rather than seconds, was then calculated as:
(highest_initialDelaySeconds - lowest_initialDelaySeconds) / periodSeconds
Or in other words:
(300 - 60) / 10 = 24 was our new failureThreshold.
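The calculation above can be written as a small helper. This is only a sketch under a few assumptions: the function name readiness_probe_settings is hypothetical, rounding up when the window is not an exact multiple of the period is my own choice, and the result is floored at 1 because that is the documented minimum. The dictionary keys are the real probe field names quoted above.

```python
import math


def readiness_probe_settings(fastest_startup_s: int, slowest_startup_s: int, period_s: int) -> dict:
    """Derive readiness probe settings from measured start-up times.

    The probe starts checking at the fastest observed start-up time and
    keeps retrying until roughly the slowest observed start-up time.
    """
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    if slowest_startup_s < fastest_startup_s:
        raise ValueError("slowest start-up time cannot be below the fastest")

    # Number of checks needed to cover the gap between fastest and slowest
    # start-up. Rounding up is an assumption made for this sketch.
    checks = math.ceil((slowest_startup_s - fastest_startup_s) / period_s)

    return {
        "initialDelaySeconds": fastest_startup_s,
        "periodSeconds": period_s,
        "failureThreshold": max(1, checks),  # Kubernetes requires at least 1
    }


if __name__ == "__main__":
    # Numbers from the article: 60 s best case, 300 s worst case, 10 s period.
    print(readiness_probe_settings(60, 300, 10))
```

The returned values map directly onto the readinessProbe section of a container spec, so the same numbers can be re-derived whenever the measured start-up times change.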
With this adjustment in place we managed to reduce the deployment waiting time even further, reducing the time from 5 minutes to potentially 1 minute for every deployment! Victory! ⏳🎉
More to come?
We have looked at different conceptual ideas and concrete technical improvements that can help make deployments a little less scary, some of which are low-hanging fruit and some of which require cultural change.
How can we make deployments suck even less? We have started looking into feature toggles, which might reduce the risk even further and allow for greater flexibility with regard to A/B testing.
Thanks for reading this far and stay tuned for more articles!
Øyvind Ødegård, DevOps engineer @ Dfind Consulting