Riskified is a growing startup. In the past two years we’ve doubled in size to 450 employees, 60% of whom are researchers and developers. On top of that, new code, new services, and new applications are constantly being built and deployed. To cope with this growth, we realized we needed to change our infrastructure.
In the past, any change to a service, or the creation of a new one, required the attention and resources of our SRE team. On the one hand, handling everything ourselves helped us enforce best practices around monitoring, logging, Docker, and the deployment process. On the other hand, we couldn’t support the developers fast enough; we had become the bottleneck.
We had to change our infrastructure so that developers could do more without getting stuck waiting for our help. We wanted to give the developers more autonomy in how their code runs in production, while still enforcing best practices.
In this post, I will explain how we use Kubernetes, along with a few other tools, to give our developers this autonomy.
Autonomy and Safety — Cluster Architecture
The more people in the organization have access to the production environment, the greater the risk that someone accidentally breaks something and causes downtime. To minimize this risk, we decided to set boundaries based on ownership: only the team that owns a service can change it. If you’re not the one waking up at night when there’s a problem, then you shouldn’t have permission to change the service. This way, each team is autonomous within its own scope, but can’t do any damage to services outside of it.
Our cluster architecture is based on this team-ownership concept. Each team has a namespace for its services, with full permissions to its own namespace and read-only permissions to all others. Because we use namespaces to separate services and permissions, each environment (production and staging) needs its own cluster. Every cluster has precisely the same architecture: the same namespaces, the same role bindings, the same services.
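To make this concrete, here is a hedged sketch of how such a per-team permission setup can be expressed with Kubernetes RBAC. The team and namespace names are hypothetical; the `edit` and `view` ClusterRoles are built into Kubernetes:

```yaml
# Hypothetical example: the "checkout" team gets full access to its own
# namespace by binding its group to the built-in "edit" ClusterRole.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: checkout-team-edit
  namespace: checkout
subjects:
  - kind: Group
    name: checkout-team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
---
# Read-only access to every namespace via the built-in "view" ClusterRole,
# bound cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: checkout-team-view-all
subjects:
  - kind: Group
    name: checkout-team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io
```

Since both bindings reference built-in ClusterRoles, the same two manifests can be stamped out per team, which fits the "every cluster is identical" approach.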
A significant advantage of this cluster architecture is that the staging and production environments are identical: no magic, no special config. If the architecture of production changes, staging changes in the same way. Another advantage is that spinning up a new environment is quick and easy, since the whole architecture is already defined.
Enforcing best practices — Helm, Application Chart
Just like developers have code conventions (such as design patterns and code styling) and tools to enforce them, we too have conventions that we want followed. When the SRE team had sole access to production, following the conventions was easy. But now that permissions were given to all developers, we needed tools that would help us enforce conventions and help developers set up their services. We chose Helm as the tool to both enforce our conventions and help developers with deployment.
At Riskified, we have four main types of services: Ruby on Rails apps, Rake tasks, Scala web servers and workers, and static web assets. For each type, we created a base chart containing the Kubernetes YAMLs, with the values left blank. In these charts, we define monitoring, logging, and the other practices we want to enforce. Developers create an application chart that requires the base charts and contains the value files. This way, developers don’t have to know the structure of a deployment YAML; they only need to be familiar with the values we defined. For ease of use, application charts should be as minimal as possible, with the condition that each Docker image is contained in a single chart. This simplifies the deployment process: deploying a new version only requires changing the image tag in one chart.
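As a rough sketch (the chart name, version, repository URL, and values are all invented for illustration; our actual base charts differ), an application chart boils down to a dependency declaration plus a value file:

```yaml
# requirements.yaml -- the application chart only declares a dependency
# on the shared base chart (name and repository are hypothetical)
dependencies:
  - name: rails-base
    version: ~1.0.0
    repository: https://helm-charts.example.com

# values.yaml -- the developer fills in only the values the base chart
# exposes; the Deployment/Service templates themselves live in rails-base
rails-base:
  image:
    repository: example/orders-service
    tag: "1.42.0"
  replicaCount: 3
```

A new version of the service is then deployed by changing only the `tag` value in this one chart.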
By using base charts, we can reduce the migration time of an application to Kubernetes to a minimum. Once there is a Docker image, a developer just needs to add the base charts and set the values. For services that have base charts, it takes us only a few hours to fully migrate them.
Application Chart and Git Repository
Where should we put the application chart? There are two leading practices:
- Adding the Helm chart to the code’s Git repository. With this practice, whenever there is a merge into master, a new Docker image is built, and the image tag in the value files must then be updated. This pollutes the Git log with commits whose only purpose is bumping the image tag.
- A Git repo per application chart. With this practice, each code repo has a separate Git repo that holds its application chart. The disadvantage is that requirements.yaml is shared, though with feature branches it can be separated when needed. This option felt like over-engineering for our company: it creates many new repos that would need to be managed in terms of permissions.
We decided to take a moderate approach, creating a Git repo for each team. In the Git repo, each application chart has a directory.
Services need different values per environment, so it follows that we need a separate value file for each environment. Our developers use feature branches and pull requests to master; we didn’t want to change this practice in the Git repo that holds the charts, so we keep a value file per environment in the master branch.
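Putting these decisions together, a team’s chart repo might look like this (the team, service, and file names are illustrative, not our actual ones):

```
payments-charts/                  # one Git repo per team
└── orders-service/               # one application chart per Docker image
    ├── Chart.yaml
    ├── requirements.yaml         # pulls in the relevant base chart
    ├── values.yaml               # shared defaults
    ├── values-staging.yaml       # staging overrides
    └── values-production.yaml    # production overrides
```

The deploy step then runs something like `helm upgrade --install orders-service ./orders-service -f orders-service/values-production.yaml`; Helm merges the `-f` file over the chart’s default `values.yaml`, so only environment-specific overrides need to live in the per-environment files.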
Expecting developers to fully understand the ops side isn’t realistic. Instead, we try to provide them with tools that shorten their learning curve. For most of our use cases, the base charts give developers everything they need without having to learn the spec of each Kubernetes object. With our help, setting up a service on Kubernetes is one of the shortest tasks a developer has. Our goal is that by the end of the year, as more developers become familiar with Kubernetes, we won’t need to be involved in this process at all.