Helm Templating: walking towards team autonomy

Silvia Cobo
Published in Clarity AI Tech
Oct 4, 2023

One of our main goals in the SRE team is to make other teams as autonomous as possible. We want other teams to be able to manage their own resources in a self-service way, reducing bottlenecks and dependencies on the SRE team.

This is a very ambitious goal and a difficult one to achieve, because infrastructure teams tend to be overwhelmed by operational tasks day after day. Finding the time and resources to improve others’ autonomy is not straightforward.

Over the last year, we took a huge step towards this autonomy thanks to the Helm Templating initiative.

Helm Templating is an initiative to meet engineering teams’ requirements, using technology as a means.

Some context

Most of our infrastructure runs in Kubernetes, and every time a team needed to deploy a Kubernetes resource, they either had to deal with all the complexity of Helm configuration or wait in the SRE task queue until an SRE engineer was free to help.

Currently, Clarity AI has more than 300 employees, more than 100 of whom are engineers, and we manage about 150 services and more than 900 jobs. Each of these services and jobs requires Kubernetes configuration, and our SRE team is currently made up of 5 people.

5 SRE engineers to provide support for all of this? We clearly needed to put something in place to relieve the toil.

Also, data, frontend and backend engineers shouldn’t need to understand every nook and cranny of Kubernetes to be able to manage their own resources. Providing a way to manage Kubernetes resources with the least possible cognitive load was therefore a win for all the engineers and for the company itself.

We had some goals to achieve within the project:

  • We didn’t want to use new, complicated tooling
  • We didn’t want to make any irreversible decisions
  • We wanted to obtain the biggest benefit with the least effort
  • We didn’t want to expose implementation details to the end users
  • We needed to have TESTS, the part that usually gets forgotten

We wanted to provide a solution able to manage all the Kubernetes resources, but also to handle resource governance, manage AWS resources, activate and configure monitoring, manage routing, etc.

Also, we wanted the solution to add as little cognitive load as possible for the end users, so we analysed all of our use cases in order to standardize most of them and inject those shared configurations into the templates without any user input.
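To make the idea concrete, here is a minimal sketch in Python of layering team-provided values over shared defaults, similar in spirit to how Helm merges user-supplied values over a chart's defaults. It is an illustration under assumptions, not the actual charts: `PLATFORM_DEFAULTS`, `merge_values` and every key name are hypothetical.

```python
from copy import deepcopy

# Hypothetical platform-wide defaults that the SRE team would maintain.
PLATFORM_DEFAULTS = {
    "replicas": 2,
    "monitoring": {"enabled": True, "scrape_interval": "30s"},
    "labels": {"managed-by": "helm-templating"},
}


def merge_values(defaults, overrides):
    """Recursively merge overrides on top of defaults without mutating either."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested maps are merged key by key instead of being replaced.
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# What a team would write: only what is specific to its service (hypothetical).
team_values = {"name": "reports-api", "monitoring": {"scrape_interval": "15s"}}

effective = merge_values(PLATFORM_DEFAULTS, team_values)
print(effective)
```

With this kind of layering, a team declares only what is specific to its service, and everything else comes from the shared defaults.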

This is just an example of how some resources were managed before and after the Helm Templating initiative:

Before:

We needed to manage several different resources and understand the complexity of each of them

After:

We just add a few lines to the service configuration file

How do teams use Helm Templating?

Helm Templating has been designed to be easy to understand and to reduce cognitive load and toil for the end users, but none of this would work without proper documentation. So we developed the first SRE User Manual webpage to act as a single source of truth, with implementation examples, a getting-started guide, etc.

This documentation contains the steps to deploy any service using Helm Templating and to migrate existing services to it.

As a first step, and during the development of Helm Templating, the SRE team migrated all the services to the new system to ensure we covered all the use cases we had so far. Right now, all engineers at Clarity AI are using Helm Templating to deploy new services without SRE intervention (except when they need resources not currently managed by Helm Templating or when we have a new use case).

The implementation and adoption of Helm Templating have been very smooth. All engineers at Clarity AI have welcomed this change and have also helped us to solve some issues and to understand some requirements. Of course, this huge change wouldn’t have been possible without the end users’ support and trust in the project, so a HUGE thanks to the whole Clarity AI engineering team.

What did we achieve?

  • Standardization

Now we know that all of our services share the same roots and that configurations are applied the same way

  • Simplification (got rid of THOUSANDS of configuration lines)

End users are able to deploy their resources by writing a few lines

  • Documentation, a single source of truth

All users can check this documentation to resolve any doubts

What about the tests?

Of course, we managed to set up tests such as:

  • Unit testing
  • Integration testing
  • Canary testing
  • Smoke testing
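As an illustration only (this is not the actual test suite), a unit test for a template can be sketched as "render with known values, then assert on the resulting manifest". In the Python sketch below, `render_deployment` is a hypothetical stand-in for a chart template, and the values keys (`name`, `team`, `image`, `replicas`) are assumptions. With real charts, the manifest would typically come from rendering the chart, for example with `helm template`, before asserting on it.

```python
def render_deployment(values):
    """Hypothetical stand-in for a chart template: builds a Deployment manifest."""
    labels = {"app": values["name"], "team": values["team"]}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": values["name"], "labels": labels},
        "spec": {
            # Fall back to a shared default when the team sets nothing.
            "replicas": values.get("replicas", 2),
            "selector": {"matchLabels": {"app": values["name"]}},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": values["name"], "image": values["image"]}
                    ]
                },
            },
        },
    }


SAMPLE_VALUES = {"name": "reports-api", "team": "data", "image": "reports-api:1.0.0"}


def test_default_replicas_are_applied():
    manifest = render_deployment(SAMPLE_VALUES)
    assert manifest["spec"]["replicas"] == 2


def test_selector_matches_pod_labels():
    manifest = render_deployment(SAMPLE_VALUES)
    selector = manifest["spec"]["selector"]["matchLabels"]
    pod_labels = manifest["spec"]["template"]["metadata"]["labels"]
    # Every selector label must be present on the pod template.
    assert selector.items() <= pod_labels.items()


test_default_replicas_are_applied()
test_selector_matches_pod_labels()
print("all template checks passed")
```

Checks like these catch a broken default or a selector mismatch before anything reaches a cluster.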

And now what?

This project is still alive.

New use cases have to be added, new requirements have to be considered…

Helm Templating is here to stay. It will be great to tell you a story in the future about how we manage ALL use cases using Helm Templating.
