Cloudix: Building a multi-tenant platform service in the highly regulated insurance industry

Jonas Samuelsson
If Technology
Mar 21, 2022

Starting to feel the heat from ever-evolving cloud services and experimenting dev teams, we felt the need to move beyond our on-prem, server-based mindset. How could we provide a centrally managed platform service that would bring value to dev teams in the cloud?

In this article I will walk through the vision we had for building a multi-tenant system hosting solution (what we call our “hotel” service in the cloud, which goes by the name Cloudix). I will cover the initial stumbling steps, the problems we faced and the technology decisions we made along the journey. We ended up with a solution we had not considered at all in the beginning, and many of the challenges were solved naturally by the decisions we made.

When we started out three years ago, we initiated the search for a platform to base our hotel service on. We quickly realized that Kubernetes was winning the race. The developers in the company were already spinning up containers and clusters for experimentation, gaining new learnings every day.

We had good experience serving as an internal Ops team, solving many issues with automation and central hosting on-premises, so naturally we wanted to do the same in a cloud-native version to enable the cloud transformation the company had decided on.

The platform concept I first envisioned back in 2019 was built on multi-tenancy. I knew what I wanted, I was just not sure how to get there. We brought all the best practices we had learned from hosting an on-prem hotel and multi-tenant environment, but we had little experience with Kubernetes (K8s) and container technology.

The initial cluster design

As Azure was the primary cloud of choice in the company, the choice landed quite naturally on Azure Kubernetes Service (AKS). Since the service is branded as “managed” we thought it would be easy enough. In short, we quickly learned that the “only” thing you get for free is the K8s management control plane.

Below you see our initial design of the system, with an internal build dispatcher service that handled deployments similar to the automation we had on-prem. We also set up environments matching what we had in our on-prem services.

We quickly faced issues when rebuilding and managing the clusters. We were relying on service connections with saved secrets for each namespace and system, so spinning up all the solutions from scratch would be a pain (a simplified sketch of such a push step follows the list below). Some other drawbacks included:

  • Upgrading the clusters involved work from the dev teams.
  • The push-based approach included many steps and manual interactions.
  • Cluster setup was done with Ansible, relying on re-applying the last known good state in a semi-manual fashion.
  • Recreating clusters relied on restoring backups.
  • Upgrading Kubernetes versions could potentially bring down all the test environments, and production, for a longer period.
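
For illustration, a deployment step in the old push model looked roughly like the following Azure Pipelines task (a simplified sketch; the service connection and system names are hypothetical):

    # azure-pipelines.yml (sketch of the old push model)
    # One service connection with a saved secret per namespace and system.
    - task: Kubernetes@1
      inputs:
        connectionType: 'Kubernetes Service Connection'
        kubernetesServiceEndpoint: system-a-test   # hypothetical saved connection
        namespace: system-a
        command: apply
        arguments: -f manifests/

Every system and environment needed its own saved connection like this, which is exactly what made rebuilding clusters from scratch so painful.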

The considerations we have to make in our industry

It is a real balancing act working in a regulated industry. On the one hand we have the rules and regulations that come with operating in the industry we do. On the other hand we have developers and teams wanting to adopt new tools and technology, like in any other technology-driven company.
The considerations we have to make boil down to balancing control against the ability to move fast.

The redesign phase

We knew we had to solve the initialization and setup of the clusters, and be able to recreate them from scratch, in a better way; we were just uncertain how to achieve this. We also had a feeling that keeping several test environments in the same cluster was not great either.

Enter GitOps. It was a term we had never heard before, and we started scrambling. The more we read, tried and learned, the more we understood that this could help us solve many of the issues we had been facing. Is GitOps a solution to every problem? Certainly not, but it gets us much further by letting us define a desired state for our multi-tenant platform.
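
To make the idea concrete, here is a hypothetical layout of a desired-state repository (the structure is our own illustration; GitOps itself does not mandate any particular layout):

    platform-desired-state/      # hypothetical Git repository
    ├── clusters/
    │   ├── dev/                 # everything the dev cluster should run
    │   ├── test/
    │   └── prod/
    ├── infrastructure/          # ingress, mesh, monitoring, policies
    └── apps/                    # tenant systems, one folder per system

A controller inside each cluster continuously pulls this repository and reconciles the cluster to match it; nothing is pushed in from the outside.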

What does GitOps solve for us?

Having everything defined as code in Git repositories gives us:

  • Stability and full control of the changes we introduce in our environments, from both the platform and the system perspective. Changes to the infrastructure are made the same way, through automation.
  • Platform as code, with a technology that extends the “as code” approach to the systems as well.
  • A developer-centric experience for managing applications and systems.
  • The ability to move fast and recreate clusters more frequently and quickly, while ensuring a consistent state across environments.
  • Disaster recovery capabilities from the start.
  • Auditability: a ledger of changes so that we can clearly state who did what, and when.
  • Access management and security with RBAC and read-only roles for developers (a minimal example follows after this list).
  • Less interdependency, i.e. being able to update system and platform components individually and independently of each other, with less downtime.
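
As an example of the read-only access mentioned in the list, a ClusterRole and binding along these lines would give developers visibility without write access (a minimal sketch; the group name is a hypothetical Azure AD group):

    # Minimal sketch: read-only access for developers
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: developer-read-only
    rules:
      - apiGroups: ["", "apps", "networking.k8s.io"]
        resources: ["pods", "pods/log", "services", "deployments", "ingresses"]
        verbs: ["get", "list", "watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: developer-read-only
    subjects:
      - kind: Group
        name: dev-team-readers   # hypothetical AAD group
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: ClusterRole
      name: developer-read-only
      apiGroup: rbac.authorization.k8s.io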

The “final” cluster design and tooling

We reworked the environments and the number of clusters. We decided to start with Dev, Test and Prod clusters: instead of five environments in two clusters, we moved to three environments, each in its own K8s cluster. The clusters are also provisioned in separate Azure subscriptions.
The clusters are private clusters (using Private Link) with advanced networking (Azure CNI) enabled. In the end, security demands and other requirements made us take this route. It created some additional effort for us but has worked well once we sorted out all the issues.
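
Provisioning a cluster along these lines with the Azure CLI looks roughly like this (a sketch with placeholder names; exact flags vary between CLI versions):

    # Sketch: private AKS cluster with Azure CNI and Calico network policy
    az aks create \
      --resource-group rg-cloudix-prod \
      --name aks-cloudix-prod \
      --enable-private-cluster \
      --network-plugin azure \
      --network-policy calico \
      --vnet-subnet-id "$SUBNET_ID"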

So where did we land with the technology choices? Below are some of the tools we are using:

  • Linkerd as service mesh and mTLS provider
  • Nginx as ingress controller
  • Flux v2 as GitOps enabler
  • The Sealed Secrets controller from Bitnami for secret handling in the cluster
  • Kustomize for templating and native manifest management
  • Azure DevOps for hosting the projects and Git repositories
  • Network Policy from Calico to separate traffic in the cluster where required
  • Prometheus and Alertmanager for monitoring, letting us define alerts natively as code
  • Azure Policy for Kubernetes (the only component we chose to leave out of the GitOps toolkit to start with)
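
To give a feel for how Flux ties it together, a GitRepository source plus a Kustomization along these lines tells each cluster where its desired state lives (a sketch; the repository URL and paths are hypothetical, and authentication is omitted for brevity):

    apiVersion: source.toolkit.fluxcd.io/v1beta2
    kind: GitRepository
    metadata:
      name: platform
      namespace: flux-system
    spec:
      interval: 1m
      url: https://dev.azure.com/example/platform/_git/desired-state  # hypothetical
      ref:
        branch: main
    ---
    apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
    kind: Kustomization
    metadata:
      name: platform
      namespace: flux-system
    spec:
      interval: 10m
      sourceRef:
        kind: GitRepository
        name: platform
      path: ./clusters/prod   # one path per cluster environment
      prune: true             # remove resources that disappear from Git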

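Secrets deserve a special mention: plain Kubernetes Secrets are only base64-encoded and cannot safely live in Git. With the Sealed Secrets controller the workflow is roughly as follows (a sketch; file names are hypothetical):

    # Encrypt a regular Secret manifest with the cluster's public key
    kubeseal --format yaml < db-secret.yaml > db-sealedsecret.yaml
    # db-sealedsecret.yaml is safe to commit to Git; only the controller
    # running inside the cluster holds the private key to decrypt it.

This is what lets us keep everything, secrets included, in the desired-state repositories.
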
Results, conclusion and final words

Don’t get me started; this is something I could write an essay about, but rather than doing that I hope I have sparked some interest.

At the time of writing we have our first critical system in production, and more are in the pipeline. The months we have been running show great stability and increased speed for maintenance and development. The dev teams are catching on fast!

Would we have imagined at the start that we would be running open-source components and tooling in this fashion in a Microsoft-driven cloud? Not in our wildest fantasy!
Be open to new technology and influences, and stay curious; that is probably our main learning.

You can also watch our KubeCon 2022 presentation on the topic here:
https://www.youtube.com/watch?v=urWojY1jxdc


Jonas Samuelsson
If Technology

More than 10 years of experience running platform services for internal dev teams. A strong believer in central platform services that provide value for agile teams.