Happy helming at Docplanner

Przemysław Robak
Docplanner Tech
Jan 16, 2020

What is the Docplanner platform?

Docplanner is the largest healthcare platform in the world, with 30 million users per month. Services are available in 14 countries, from Asia (Turkey) through Europe (Poland, Spain, Italy) to Latin America (Brazil, Colombia, Mexico). The database contains 2.4 million doctors, and patients book over 2 million visits each month. Importantly, thanks to its SaaS software, doctors and entire clinics can reach patients more easily and quickly and can improve their online presence.

How do we develop software?

Our product is developed by over 100 developers in two cities:

  • Warsaw — in this office we focus mainly on the part of the website dedicated to patients and reservations (marketplace)
  • Barcelona — here developers primarily build SaaS for doctors.

From a technological standpoint, both teams share a common infrastructure. We have several Kubernetes clusters with a total of 100 nodes; the 3 main clusters run in the AWS cloud in the North Virginia, São Paulo, and Frankfurt regions. We use Terraform, EKS, and kops to manage the clusters. The applications are written in various languages: PHP, JS, C#, and Go. Due to this linguistic diversity, we maintain approximately 150 different deployments and over 2,000 pods.

Let’s clear up our resources!

The linguistic diversity and the company's rapid growth created a mess in our Kubernetes resources. At the time, our priority was simply to keep applications operational. The infrastructure team was small, and developers did not have access to the Kubernetes API, so the disorder was not a pressing problem yet. We had no central place from which to manage the infrastructure and no single source of truth. We made changes manually on all clusters, which meant that not all applications were consistent. We synchronized state with the kube-backup tool, which supports git-crypt encryption. At some point, that ceased to be enough.

The platform was constantly evolving, and new functionalities kept being added to it. We were not able to quickly deliver the final product: a microservice. We decided that we needed a tool that would automate our work and accelerate the flow. We wanted a tool that could introduce order following the idea of Infrastructure as Code.

As I mentioned before, we knew and used Terraform to create infrastructure (RDS databases, permissions, etc.). We were looking for a solution that would ideally be built on the same technology or would simply work in a similar way; that would give us an advantage at the start. An essential selection criterion was the ability to activate our developers: we wanted to enable them to launch or modify microservices themselves. All the while, we kept Helm in the back of our minds, since we already used it to install official charts, e.g., the ingress controller. We expected that choosing this tool would give us the most transparent form of application description and would not pose difficulties for people inexperienced with Kubernetes. Our existing Continuous Deployment process was also an important factor: we didn't want to rebuild it completely, only adapt it to our needs.

The decision could not be hasty. We decided to do research and pick 3 tools for which we would create a proof of concept. At this stage, attending the KubeCon Barcelona conference in 2019 helped us a lot. There we learned what Pulumi is and what Helm version 3 would look like. Next, we wanted to describe one of our applications with each tool and run it in a test environment, list the pros and cons of each solution, and ultimately choose the best one.

Round 1: Pulumi (https://www.pulumi.com/)


When we first heard about it, we were very excited; after all, it worked just like Terraform. We wanted to test it immediately. An undoubted advantage was the ability to use the programming language of our choice: TypeScript, JavaScript, Python, Go, or .NET Core. Importantly, Pulumi lets you manage not only applications in Kubernetes but also the entire infrastructure, e.g., EKS. See for yourself how easy and straightforward it is.

example from pulumi.com

Our infatuation ended right after the conference, when we calmly analyzed the materials from the talk and explored the documentation. It turned out that support for Go, the language we trusted most, was only in preview and would not become stable soon. Nevertheless, we started testing.

We managed to run a simple application, although it was not easy. We encountered various kinds of errors and began to seek help for them. The documentation turned out to have many gaps, and most existing answers covered only the primary language, TypeScript. We began to wonder whether anyone was using this software at all. It was hard to find community support or real use cases.

Round 2: Kustomize (https://kustomize.io/)

We chose Kustomize for testing because it is a native Kubernetes solution for configuration management, built into kubectl. Nice: we would not have to rework our CD pipelines significantly. We got the application running very quickly.

Unfortunately, Kustomize did not meet most of our requirements. The tool lacked a central place from which we could pull our base YAML definitions. There was also no management of the deployment lifecycle; for example, we run migrations before updating an application, among other key steps. In the end, it is a simple templating tool.
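For context, a Kustomize setup from that era looked roughly like this (paths, names, and versions below are illustrative):

```yaml
# kustomization.yaml in an overlay directory (illustrative)
# It points at a shared base and patches it for one environment.
bases:
  - ../../base
patchesStrategicMerge:
  - replica-count.yaml        # e.g., bumps spec.replicas for production
images:
  - name: my-app
    newTag: "1.4.2"           # pin the image tag for this environment
```

This covers patching nicely, but there is no hook mechanism to run a migration job before the new version rolls out, which is what we were missing.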

Round 3: Helm (https://helm.sh/)

And finally, we looked at the tool known to everyone: Helm. We were aware of the upcoming revolutionary version 3; however, when we were creating the PoC, its alpha version was not yet usable. We were able to describe the application and launch it with a simple chart, using both the stable version and the in-progress one. We were pleased with the results, despite the rather long time required to create the chart. We even ventured to check the "Helm for Helm", i.e., helmfile, but it was not very interesting to us.

Decision and assumptions

The result did not surprise us — Helm was the tool for us. We had to develop an implementation plan and get started. We started by establishing detailed assumptions:

  • ultimately we need to use version 3
  • no tiller server
  • we want to create one generic chart that would allow us to launch our applications easily
  • a migration job should be done before the new application release
  • we need to determine the place where we keep all charts
  • charts should build automatically
  • sensitive values must be encrypted and kept in the repository, and CD should decrypt them during the release
  • in our flow, we should be able to skip the full application build and run only the release upgrade.

Implementation

Helm was close to releasing stable version 3, so it was clear we would ultimately use it. As I mentioned earlier, due to the instability of the alpha version, we had to start the project on v2. Our research suggested that migrating from v2 to v3 shouldn't be a problem. Immediately after the final release, the official 2to3 plugin appeared, and thanks to it, transferring configurations and states was effortless.
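For reference, the migration with the official plugin boils down to a few commands (the release name below is a placeholder):

```shell
# Install the official Helm v2-to-v3 migration plugin
helm3 plugin install https://github.com/helm/helm-2to3
# Move v2 configuration (repositories, plugins) over to v3
helm3 2to3 move config
# Convert a single v2 release into a v3 release
helm3 2to3 convert my-release
# Remove leftover v2 data once everything is migrated
helm3 2to3 cleanup
```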

The plugin even allows you to migrate states without a working tiller, which mattered to us because we used another plugin: tillerless. For security reasons, we did not want to install the tiller server on the cluster. The tillerless plugin runs tiller locally when installing or updating charts and then shuts it down; tiller is not required to run all the time. All states are kept as encrypted objects in a single namespace. A simple alias also helped:

alias helmtiller='helm tiller run helm -- helm'
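With that alias in place, tiller runs only for the duration of a single command (the release and chart names here are placeholders):

```shell
# tiller starts locally against the "helm" namespace,
# the command runs, and tiller exits afterwards
helmtiller upgrade --install my-app ./chart -f values.yaml
helmtiller list
```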

To automate our work as much as possible and let developers decide on the shape and behavior of their applications, we created a generic chart. We called it docplannerlib. By adding it as a dependency of an application chart, we can create all the Kubernetes objects the application needs (Deployment, Service, Secret, ConfigMap, Job, CronJob, HPA, RBAC). We've also created other charts that can be enabled conditionally, e.g., redis, varnish, etc.
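To illustrate the dependency mechanism (the version numbers and repository URL below are assumptions, not our real values), an application chart pulls the library in via its Helm v2 requirements file:

```yaml
# requirements.yaml of an application chart (Helm v2 dependency format)
dependencies:
  - name: docplannerlib
    version: "1.0.0"                      # hypothetical version
    repository: "s3://docplanner-charts"  # hypothetical s3-backed repo
  - name: redis
    version: "2.1.0"
    repository: "s3://docplanner-charts"
    condition: redis.enabled              # subchart rendered only when enabled
```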

Here’s the set of values for the sample application:
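A sketch of what such a values file can look like (every key and number here is illustrative, not the actual docplannerlib schema):

```yaml
# values.yaml of a sample application (illustrative)
image:
  repository: registry.example.com/marketplace-api
  tag: "1.4.2"
replicaCount: 3
resources:
  requests:
    cpu: 100m
    memory: 256Mi
migrations:
  enabled: true        # run the pre-release migration job
redis:
  enabled: true        # conditionally pull in the redis subchart
hpa:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
```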

In most of our applications, we perform various tasks before releasing a version. We run a job with a Helm hook that creates databases, configures external services, runs migrations, checks whether the application works, etc. If this process fails, the release stops with a failed status, and we can be sure it will not reach production.
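A pre-release job of this kind can be sketched with the standard Helm hook annotations (the image reference and command are placeholders):

```yaml
# Illustrative pre-release job template inside a chart
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-migrations
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade        # run before the release
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrations
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          command: ["./run-migrations.sh"]         # placeholder entrypoint
```

If the job fails, Helm marks the release as failed and the new pods are never rolled out.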

As storage for charts, we chose S3; we thought it would be a universal place for them. To get protocol support, we used a plugin. Charts are tested and then built each time their version changes. Because we keep the charts together with the project, we created a simple, readable structure:
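An illustrative layout (the actual file names may differ):

```
my-app/
├── src/                    # application code
├── Dockerfile
└── chart/
    ├── Chart.yaml
    ├── requirements.yaml   # depends on docplannerlib
    ├── values.yaml
    └── secrets/            # values encrypted with git-crypt
```

With the helm-s3 plugin, publishing a chart is then a matter of a one-time `helm s3 init` on the bucket and a `helm s3 push <chart>.tgz <repo>` per build.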

All secret files are also stored in the repositories, encrypted with git-crypt and not readable by developers.
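The git-crypt setup can be sketched as follows (the file pattern, GPG identity, and key path are assumptions):

```shell
# One-time setup: mark which files git keeps encrypted at rest
git-crypt init
echo 'chart/secrets/** filter=git-crypt diff=git-crypt' >> .gitattributes
git-crypt add-gpg-user ops-team@example.com   # hypothetical GPG identity

# During the release, CD unlocks the repository with an exported key
git-crypt unlock /secure/git-crypt.key
```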

When doing a release, we use a kind of transaction that allows us to skip the Docker image build when the changes do not strictly concern application code. Examples include increasing the resources assigned to a deployment, increasing replicas, or adjusting health checks. This considerably speeds up the whole process; for this purpose, we use the Buddy software. The CD tool does not have Helm support yet, but it gives us a lot of customization options. We created a Docker image that contains all the dependencies needed to run an action. The simplified flow looks like this:
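In pseudo-shell, the steps above might look like this (the registry, paths, and variables are placeholders):

```shell
# Illustrative release script run inside the CD action image
set -euo pipefail

# Skip the image build for infrastructure-only changes (replicas, resources, ...)
if [ "$CODE_CHANGED" = "true" ]; then
  docker build -t registry.example.com/my-app:"$SHA" .
  docker push registry.example.com/my-app:"$SHA"
fi

# Decrypt secret values for this release
git-crypt unlock "$GITCRYPT_KEY_FILE"

# Upgrade the release with a temporary, local tiller
helm tiller run helm -- helm upgrade --install my-app ./chart \
  --set image.tag="$SHA" \
  -f chart/values.yaml -f chart/secrets/production.yaml
```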

Does it work?

Many people have spoken negatively about Helm. They complained about poor security, the need to write your own templates, too much logic inside them, problems maintaining IaC rules, etc. Some proposed other tools; others even suggested using Kustomize with Helm's artifacts. Despite these warnings, we made the decision and implemented it successfully. Everyone is happy with the results of our several months of work. We waited a long time for version 3, but it is here now, and most of the problems mentioned above have been resolved. That should please the skeptics.

If you enjoyed this post, please hit the clap button below :) You can also follow us on Facebook, Twitter, and LinkedIn.

Recruitment alert! We are looking for a Site Reliability Engineer to join our Tech team in Poland or Spain. Check out our offers! :)

Poland

Spain
