Site Reliability Engineers in a nutshell

Published in Clarity AI Tech, Mar 8, 2023

As part of the Platform/SRE team, we have spent the last two years improving our services while Clarity AI kept scaling up. At the end of 2020 there were about 70 of us, and today (March 2023) we are over 300 employees. It’s no surprise that this growth is a challenge for everyone, especially for the Platform/SRE team, which has to support an increasing number of engineers working at the same time while keeping the scalability, security and reliability of all our services front and center. So, what have we done to keep this Formula 1 car going while improving, maintaining and fixing it on the go, and how did we do it?

Let us wrap up the changes we have made over the last two and a half years:

First of all, it’s important to mention that, whenever possible, we prefer managed services. This lets us focus our efforts on improving other areas and makes the team more efficient. We have migrated to managed services for Kubernetes, databases, observability…

We ran a governance campaign that let us identify unused resources and optimize the use of others. In addition, we have applied TTLs to several buckets, which makes searches more efficient and saves some money along the way.
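As an illustration of the bucket TTLs mentioned above, here is a minimal sketch that assumes the buckets are Amazon S3 buckets and that the TTLs are expressed as lifecycle expiration rules. The rule IDs, prefixes and retention periods are hypothetical and not taken from our setup.

```python
def expiration_rule(rule_id: str, prefix: str, days: int) -> dict:
    """Build one lifecycle rule that expires objects under `prefix` after `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def lifecycle_configuration(rules: list) -> dict:
    """Wrap the rules in the shape S3 expects for a lifecycle configuration."""
    return {"Rules": rules}


# Hypothetical prefixes and retention periods, for illustration only.
config = lifecycle_configuration([
    expiration_rule("expire-tmp-exports", "tmp/exports/", 30),
    expiration_rule("expire-debug-logs", "logs/debug/", 7),
])

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

Keeping the rules as plain data makes them easy to review in a merge request before anything is applied to a real bucket.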

As part of the governance initiative, we also wrapped all of our AWS accounts into an Organization, which gives us a simple, centralized place to apply global policies so that all the accounts benefit from them. Having an AWS Organization in place, with AWS accounts divided by purpose and following AWS best practices, gives us better visibility and control of our infrastructure, so we can make decisions with all the information needed.
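To give an idea of what a global policy can look like, here is a minimal sketch of a service control policy (SCP) document built in Python. The specific rule, preventing member accounts from leaving the Organization, is a common example chosen for illustration; it is an assumption and not necessarily one of the policies we apply.

```python
import json


def deny_leave_organization_policy() -> dict:
    """Build an SCP document that denies the LeaveOrganization action."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLeaveOrganization",
                "Effect": "Deny",
                "Action": "organizations:LeaveOrganization",
                "Resource": "*",
            }
        ],
    }


policy_json = json.dumps(deny_leave_organization_policy(), indent=2)

# With boto3, the document could be registered through the Organizations API:
#   boto3.client("organizations").create_policy(
#       Name="deny-leave-organization", Description="example",
#       Type="SERVICE_CONTROL_POLICY", Content=policy_json)
# and then attached to a root, organizational unit or account with attach_policy.
```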

Regarding internal processes, we have implemented a way to check for new versions of the tools and resources we use and to create merge requests (MRs) in our repositories whenever one is available. This makes maintenance work more manageable and ensures we apply patches and upgrades often.
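The core of such a check can be sketched in a few lines. This is a simplified illustration and not our actual tooling: the tool names and version numbers are made up, versions are assumed to be plain dotted numbers with an optional leading "v", and the step that opens the MR through the Git hosting API is left out.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.3.7' or 'v3.10.0' into a tuple of ints that compares numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def find_outdated(pinned: dict, latest: dict) -> list:
    """Return one MR title per tool whose pinned version is behind the latest one."""
    titles = []
    for tool, current in sorted(pinned.items()):
        newest = latest.get(tool)
        if newest and parse_version(newest) > parse_version(current):
            titles.append(f"Bump {tool} from {current} to {newest}")
    return titles


# Hypothetical inputs: versions pinned in a repository vs. the latest releases.
pinned = {"terraform": "1.3.7", "helm": "v3.10.0", "kubectl": "1.25.4"}
latest = {"terraform": "1.3.9", "helm": "v3.11.1", "kubectl": "1.25.4"}

mr_titles = find_outdated(pinned, latest)
for title in mr_titles:
    print(title)
```

Comparing tuples of integers rather than raw strings matters here: as strings, "3.9.0" would sort after "3.10.0".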

However, our biggest achievement regarding upgrades is that we can now perform some big upgrades without downtime, so they don’t impact the company. We don’t have zero downtime for all of our operations yet, but few tools remain on the list, so we hope to get there soon.

Of course, not everything is about new functionality or tools. We have been working on some of the technical debt in our code and systems, and they have improved substantially, although a lot remains on our to-do list. For instance, we needed to increase our network capacity to be able to provide IPs to the CI/CD pipelines, which were getting stuck as the company grew. That meant migrating a whole production Kubernetes cluster so that we could grow and have more people working at the same time.

Regarding security, one of our main goals was to stop using an old VPN we had in place. We first migrated to Vault, and we now use strongDM to create a zero-trust environment with an accessbot: anyone can request temporary access to our databases by specifying a reason, and admins approve or deny the request.
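The request flow described above (a reason, an admin decision, and access that expires on its own) can be modeled in a few lines. This is a conceptual sketch only: it does not use the strongDM API, and the class, field names and example values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class AccessRequest:
    """A temporary access request that stays inactive until an admin approves it."""
    user: str
    resource: str
    reason: str
    duration: timedelta
    approved_by: Optional[str] = None
    expires_at: Optional[datetime] = None

    def approve(self, admin: str, now: datetime) -> None:
        """Grant access for `duration`, starting at `now`."""
        if not self.reason.strip():
            raise ValueError("a reason is required")
        self.approved_by = admin
        self.expires_at = now + self.duration

    def is_active(self, now: datetime) -> bool:
        """Access is valid only after approval and before it expires."""
        return self.expires_at is not None and now < self.expires_at


now = datetime(2023, 3, 8, 9, 0, tzinfo=timezone.utc)
request = AccessRequest("alice", "orders-db", "investigate failed job", timedelta(hours=1))
request.approve("bob", now)
```

Because the expiry is stored with the grant, nobody has to remember to revoke access afterwards.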

We have also been integrating security tools to analyze and detect malware and vulnerabilities in our systems. In addition, we have been tightening some overly permissive roles used by human users, and we have started using IRSA (IAM Roles for Service Accounts) for our services.
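For readers unfamiliar with IRSA, the key piece is an IAM role whose trust policy only lets one Kubernetes service account assume it through the cluster's OIDC provider. Below is a minimal sketch of such a trust policy. The account ID, OIDC provider, namespace and service account name are placeholders, not values from our environment.

```python
def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Build a trust policy that restricts a role to one Kubernetes service account."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{oidc_provider}:sub":
                            f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


# Placeholder values, for illustration only.
trust_policy = irsa_trust_policy(
    account_id="111122223333",
    oidc_provider="oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE",
    namespace="payments",
    service_account="orders-api",
)
```

The service account is then annotated with the role's ARN (the `eks.amazonaws.com/role-arn` annotation), so pods running under it receive credentials for that role only.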

Regarding other operations, we have introduced Argo Workflows to run cron jobs in a straightforward way.
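As a rough idea of what a scheduled job looks like in Argo Workflows, the sketch below builds a CronWorkflow manifest as a Python dictionary. The job name, schedule, image and command are hypothetical, and the exact fields available (for example `schedule` versus newer alternatives) depend on the Argo Workflows version in use.

```python
import json


def cron_workflow(name: str, schedule: str, image: str, command: list) -> dict:
    """Build a CronWorkflow manifest that runs one container on a cron schedule."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "CronWorkflow",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Skip a run if the previous one is still in progress.
            "concurrencyPolicy": "Forbid",
            "workflowSpec": {
                "entrypoint": "main",
                "templates": [
                    {
                        "name": "main",
                        "container": {"image": image, "command": command},
                    }
                ],
            },
        },
    }


# Hypothetical nightly job, for illustration only.
manifest = cron_workflow(
    name="nightly-report",
    schedule="0 2 * * *",
    image="example/report-runner:1.0.0",
    command=["python", "run_report.py"],
)
manifest_json = json.dumps(manifest, indent=2)
```

Since JSON is valid YAML, the serialized manifest can be written to a file and applied to the cluster like any other Kubernetes resource.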

Overall, we have been talking and cooperating a lot with all of the developers in the company, trying to understand their pain points and to make their work easier.

Our newest initiative is Helm templating, which lets developers deploy services with a minimal amount of YAML configuration, reducing their toil and cognitive load. The next step is to give them a way to create and maintain their own secrets and configurations without going through the Platform/SRE team; today that is a manual process that leads to errors.
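The idea behind keeping the YAML small is that a shared template carries sensible defaults and each service only declares what differs. The sketch below mimics that merge in Python; it is not our actual template, and every key and value shown is a hypothetical example.

```python
import copy

# Hypothetical defaults that a shared chart could provide to every service.
DEFAULTS = {
    "replicaCount": 2,
    "image": {"tag": "latest", "pullPolicy": "IfNotPresent"},
    "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
    "service": {"port": 80},
}


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Deep-merge `overrides` into a copy of `defaults`; overrides win on conflicts."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# All a developer would need to write for a new service in this sketch.
service_values = {"image": {"repository": "example/orders-api", "tag": "1.4.2"}}

rendered = merge_values(DEFAULTS, service_values)
```

Here the developer writes two values and inherits the replica count, resource requests and service port from the shared defaults.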

Stay tuned, because we will talk about our Helm template initiative soon in a separate post.

Written by Silvia Cobo