Testing against a forest of dependencies by separating them into trees — Utilizing K8s operator pattern

Published in Agoda Engineering & Design · 6 min read · Mar 25, 2020

Have you ever run into problems when updating multiple dependencies?

At Agoda we have a microservice architecture, which is great for scaling each service independently. But how do we manage the versions of all the services that we scale?

Our testing system requires ~20 services to be deployed in order to run regression tests. Moreover, all of those services should be as up to date as possible. So how did we update the services in the past?

Previously, when any service released a new version, we listed all new versions and upgraded them together. Unfortunately, upgrading was never 100% smooth and resulted in us spending a lot of time investigating and identifying problems which we would then escalate to the service owner.

Here is an example of how we used to upgrade versions. Firstly, we used the latest tag for all required services. Whenever there was a problem, such as a failed test or a failed deployment, we had to investigate it ourselves and try pinning tag versions at random.

[Figure: Reverting versions one by one at random until one succeeds]

Secondly, the sample above looks simple because only 3 services are running together. In reality, 20 services were reverted at random without any control, which is not acceptable. Furthermore, how do you know whether a failure comes from your new version itself or from an environment issue?

As DevOps, we have a responsibility to make those workflows faster, more reliable and easier for any team in the company to integrate with. Consequently, we designed and implemented a new system called Samsahai to serve those purposes.

What is Samsahai?

Samsahai is a Thai name which means three friends; in our system they are Kubernetes, Go and Docker. The Samsahai system is built on Kubernetes CRDs (Custom Resource Definitions). A CRD extends the Kubernetes API so that we can create our own resource types.

The concept behind Samsahai is to have 2 environments, staging and active, running in parallel with verified service versions. Since we should not always trust the latest tag, isn’t it better to have a system that helps us verify each version and narrow down the problem?

From here on, an environment is referred to as a namespace and a service as a component.

The staging namespace is the environment for verifying new component versions.
The active namespace is the environment for running developers’ pull requests before they are merged to master; this namespace runs component versions that have all passed verification.

Feel free to check it out on the Samsahai GitHub!

CRD (Custom Resource Definitions)

As mentioned before, our system mainly uses Kubernetes CRDs to control the workflow. Here are the new resource kinds that we created:

Cluster scope
1. Teams
→ Monitor all events that happen to a particular integrated team
2. ActivePromotions
→ Monitor all active promotion events that are currently happening
3. ActivePromotionHistories
→ Store the logs of a particular active promotion

Namespaced scope
1. DesiredComponents
→ Store the latest version of a particular component
2. Queues
→ Order the queue of component verifications
3. StableComponents
→ Store the passed version of a particular component
4. QueueHistories
→ Store the queue logs of a particular component verification
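
To make the relationship between the namespaced kinds more concrete, here is a minimal Python sketch that models them as plain data classes. The class and field names are assumptions made for illustration; they are not Samsahai's actual CRD schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DesiredComponent:
    """Latest known version of one component (hypothetical fields)."""
    name: str
    version: str


@dataclass
class StableComponent:
    """Last version of one component that passed verification."""
    name: str
    version: str


@dataclass
class TeamNamespace:
    """Namespaced resources that belong to a single team."""
    team: str
    desired: Dict[str, DesiredComponent] = field(default_factory=dict)
    stable: Dict[str, StableComponent] = field(default_factory=dict)
    queue: List[DesiredComponent] = field(default_factory=list)

    def pending_upgrades(self) -> List[str]:
        """Names of components whose desired version is not yet stable."""
        return sorted(
            name
            for name, want in self.desired.items()
            if name not in self.stable
            or self.stable[name].version != want.version
        )
```

In this sketch, a component appears in `pending_upgrades()` as long as its desired version has not yet been recorded as stable, which is the gap the verification queue is meant to close.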

Our workflow, simply put

QA of Avengers team: I want an environment for running regression tests with up-to-date services every day. What should I do?
DevOps: You just need to open a pull request that creates your team environment and lists your desired components.

[Figure: Avengers team pull request]

Once the pull request is merged, the Teams CRD will automatically create the avengers namespace by using the kubectl command below.

kubectl apply -f avengers-team.yaml

As mentioned before, you need to specify your desired components. Only one configuration yaml file is needed for this, but I will present it piece by piece. Here is an example of how to configure your deployment:

[Gist: Example of desired components configuration]

Next, I’m going to explain how the CRDs mentioned above work.

DesiredComponents is triggered by a webhook. The new version is then added to Queues, and queued components are deployed one at a time. Once its verification passes, the new component is added to StableComponents; this is the 1st workflow.

In the 2nd workflow, the component fails verification. It is therefore re-queued, and processing moves on to the next item in the queue, which is the 3rd workflow.

Finally, in the 4th workflow, the last item in the queue passes verification. The queue is then empty and will be processed again when a new desired version arrives.
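
The queue behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Samsahai's implementation (which is written in Go): `process_queue`, the `verify` callback and the `max_rounds` guard are all hypothetical names introduced here.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple

Component = Tuple[str, str]  # (name, version)


def process_queue(
    queue: Deque[Component],
    stable: Dict[str, str],
    verify: Callable[[str, str], bool],
    max_rounds: int = 10,
) -> None:
    """Deploy and verify queued components one at a time.

    A component that passes verification is recorded as stable.
    A component that fails is put back at the end of the queue.
    max_rounds is a hypothetical guard so the sketch always terminates.
    """
    rounds = 0
    while queue and rounds < max_rounds:
        name, version = queue.popleft()
        if verify(name, version):
            stable[name] = version         # passed: record as stable
        else:
            queue.append((name, version))  # failed: re-queue for later
        rounds += 1
```

With a queue holding a Wordpress and a Mariadb upgrade, a Wordpress failure on the first attempt would move it behind Mariadb, so Mariadb is verified next and Wordpress gets another attempt afterwards.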

[Gist: Example of staging and active namespaces flow configuration]

What about the failed Wordpress v.3? No worries, the gist above includes a configuration for the retry limit. If verification keeps failing until the retry limit is reached, Samsahai begins a re-verification process by redeploying all stable components for the Mariadb v.3 queue. If that still fails, even though those versions passed before, the problem can be classified as an environment issue.
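
The reasoning behind re-verification can be sketched as follows. This is a simplified illustration under stated assumptions: `classify_failure`, `max_attempts` and the `verify` callback are hypothetical names, and treating a successful re-verification as a sign that the new version itself is at fault is an inference, not something Samsahai's documentation states.

```python
from typing import Callable, Dict


def classify_failure(
    component: str,
    version: str,
    stable: Dict[str, str],
    verify: Callable[[Dict[str, str]], bool],
    max_attempts: int = 3,
) -> str:
    """Return "passed", "component issue" or "environment issue".

    verify receives the full set of versions to deploy and reports
    whether the tests passed. max_attempts is an assumed setting name.
    """
    # Stable versions plus the one new version under test.
    candidate = {**stable, component: version}
    for _ in range(max_attempts):
        if verify(candidate):
            return "passed"
    # Attempt limit reached: redeploy only the previously verified versions.
    if verify(stable):
        return "component issue"    # environment is healthy, new version is not
    return "environment issue"      # even previously verified versions fail
```

The key point is the last step: if versions that already passed verification now fail, the new version cannot be the cause.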

Now, it is time for active promotion.

[Figure: The scheduler triggers the active promotion flow and a new active namespace is created]

Firstly, when active promotion is triggered, it creates a new namespace, called pre-active, named after the team plus 6 random characters, as in the image above.

**avengers-abcdzx** is called the pre-active namespace until it is switched to become the active namespace.
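
As a rough illustration of this naming scheme, the following Python sketch builds such a name. The function name is hypothetical, and using only lowercase letters for the suffix is an assumption based on the example above; the real character set may differ.

```python
import random
import string
from typing import Optional


def pre_active_name(team: str, rng: Optional[random.Random] = None) -> str:
    """Build a name such as "avengers-abcdzx": team name plus 6 random letters."""
    if rng is None:
        rng = random.Random()
    # Assumption: the suffix uses lowercase ASCII letters only.
    suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(6))
    return f"{team}-{suffix}"
```

A random suffix lets a fresh pre-active namespace exist alongside the current active one without a name clash.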

Secondly, all stable components from the staging namespace are deployed to this pre-active namespace and tested again. If everything is all right, the pre-active namespace is switched to active immediately.

On the other hand, if pre-active verification fails, that’s fine: your existing active namespace will not be destroyed until you have a new one ready.
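
The switch-over rule can be summarised in a small Python sketch. The names `promote` and `verify` are hypothetical, and discarding a failed pre-active namespace is an assumption made here for illustration; the text above only guarantees that the existing active namespace is kept.

```python
from typing import Callable, Dict, Optional, Tuple


def promote(
    active: Optional[str],
    pre_active: str,
    stable: Dict[str, str],
    verify: Callable[[str, Dict[str, str]], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (namespace to use as active, namespace that can be removed).

    The stable versions are deployed to the pre-active namespace and
    tested there. Only a successful run replaces the active namespace.
    """
    if verify(pre_active, stable):
        # Switch: pre-active becomes active, the old active can go.
        return pre_active, active
    # Keep the existing active namespace; drop the failed pre-active one.
    return active, pre_active
```

Because the old namespace is only released after the new one has passed, developers always have a working active namespace to run their pull requests against.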

Transparency

How can our users easily find out the verification status?

Earlier, we added the configuration for components, staging and active promotion to our yaml file. Lastly, we would like to add a reporting configuration.

[Gist: Example of report configuration]

For now, Samsahai supports reporting through Slack, REST APIs and shell scripts, with messages that can be customised using Go templates. The component upgrade configuration lets you choose how often you would like to receive notifications; the possible values are listed as comments in the gist above.
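
To give a feel for template-based messages, here is a small Python sketch that uses the standard library's `string.Template` as a stand-in for Go templates. The message format, field names and `render_report` function are assumptions for illustration and do not reflect Samsahai's actual template variables.

```python
from string import Template

# Hypothetical message template; Samsahai itself uses Go templates.
MESSAGE = Template("Component upgrade $status: $component $version (team $team)")


def render_report(team: str, component: str, version: str, passed: bool) -> str:
    """Fill the template with the outcome of one verification."""
    return MESSAGE.substitute(
        team=team,
        component=component,
        version=version,
        status="passed" if passed else "failed",
    )
```

For example, `render_report("avengers", "wordpress", "v3", False)` yields `Component upgrade failed: wordpress v3 (team avengers)`, which could then be posted to a chat channel or passed to a script.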

[Figure: Notification of failed component upgrading]

Once the verification flow is done, a notification is sent to the Slack channel. Last but not least, Samsahai provides APIs described with Swagger, so users can freely retrieve CRD data.

[Figure: Notification of successful active promotion]

Finally,

After all sections are configured, we end up with one big yaml file, shown below. Now we can apply it with the kubectl command and Samsahai will do the work for you.

[Gist: Example of config.yaml]

kubectl apply -f config.yaml

Find more configuration examples in the Samsahai Example GitHub!

Why Samsahai?

Samsahai is a helpful tool for software engineers to manage their service versions. No longer will you have to waste time updating dependencies and spinning up complex environments to get a new one ready. On top of that, it is easy to integrate and monitor.

Currently, many teams at Agoda have integrated with our system, and we have received a lot of feedback from them.

Lastly, as our company has been using Kubernetes successfully for a couple of years, we think it’s time to open source this project and contribute it back to the great community within and around Kubernetes. Please feel free to check out the Samsahai GitHub if you are interested. All contributors are welcome.