Evolving Self-service application infrastructure at Gympass [part 1]

Douglas Ferreira
Wellhub Tech Team (formerly Gympass)
5 min read · Aug 26, 2020

Background

Gympass Engineering is based on a microservices mindset supported by several engineering teams across the world. We understand the power of abstractions and invest heavily in creating and testing the ones that empower engineers to follow the “you build it, you run it” principle. This is a conscious effort, and we structured our Production Engineering team within our Product Development organisation to operate in the same manner as other product teams.

Today we'll go deeper into how our Production Engineering team works to enable teams to be autonomous and accountable for their applications while still benefiting from the shared building blocks that form our paved road. Our PE team focuses on two main fronts: self-service platforms and the production environments that serve our end users, partners and clients.

We use Kubernetes and Helm charts to manage our applications. Product engineering teams have full autonomy over their chart resources and can self-service several parts of their application dependencies, such as ingresses, custom domains, load balancers, persistent volumes, and so on. This was done carefully and enabled a resilient architecture without disrupting engineers with non-standard tools.
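
As an illustration (the keys below are hypothetical, not our actual chart schema), exposing an application through an ingress with a custom domain can be as simple as a few chart values:

# Hypothetical values.yaml excerpt; key names are illustrative only
ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx
  hosts:
    - host: my-service.example.com
      paths:
        - /
  tls:
    - secretName: my-service-tls
      hosts:
        - my-service.example.com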

Our infrastructure is 100% based on AWS services: all application services run on EKS, and several AWS services act as application dependencies, such as RDS, DynamoDB and S3 buckets. But we are growing fast and new services are being created all the time, and this is where we get to the trickiest part.

Manual application infrastructure dependency management

Back when we decided to give a big push into automation, we followed a specific pipeline to iterate on those cloud resources, using an Infrastructure as Code (IaC) model with resource definitions written and maintained in Terraform.

Product engineering teams could submit pull requests to our infrastructure Terraform repository, and the Production Engineering team had a code review process with safeguards and controls to trigger the Terraform “apply” pipeline, mostly due to the risk of running into an inconsistent infrastructure state. We run several environments with specific requirements, ranging from latency to compliance, and these safeguarded processes helped engineers move fast, but not as fast as we envisioned.

Another pain point was secret management. We follow a closed Vault approach, so engineers don't have access to production secrets. Every time they needed a new secret, they had to ask the Production Engineering team to create it inside our Vault, and only then could they interact with the new production secret by linking it in the k8s deployment descriptor.
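
Referencing the secret itself in the deployment descriptor is standard Kubernetes. Assuming the Vault-managed secret ends up exposed to the cluster as a Kubernetes Secret (the exact mechanism depends on the Vault integration in use), the container spec would point at it roughly like this:

# Hypothetical deployment excerpt; assumes the Vault secret is synced into a
# Kubernetes Secret named my-app-secrets
containers:
  - name: my-app
    image: my-app:latest
    env:
      - name: DATABASE_PASSWORD
        valueFrom:
          secretKeyRef:
            name: my-app-secrets
            key: database-password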

Finding a new way

These manual approaches didn't meet our expectations. The PE team ran surveys and analyzed past requests to understand our stakeholders and how to put in place a safe but autonomous model for all engineering teams to operate on their cloud resource needs.

Our principles remained the same, so we decided to look into a tailor-made platform that would cover them:

  • Self-service model
  • Secure by design
  • Developer experience mindset

Thinking about well-known common interfaces

The effort we put into building a strong cloud native foundation, using Helm and K8s descriptors to manage our applications, paid off. Gympass engineering teams are already able to define important application infrastructure behaviors, like auto-scaling, and they are familiar with how their applications run and scale. Helm's values.yaml matched our vision for a strong abstraction: our engineers already interact extensively with this regular interface, currently used for application configuration definitions:

resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
hpa:
  enabled: true
  targetCPU: 80
  minReplicas: 10
  maxReplicas: 20

We already have our hands in tools like Kubernetes Operators, which can help us unify how our engineers interact with their resources using CRDs (Custom Resource Definitions). So we decided to go in a cloud native direction for our application infrastructure management.

Gympass Cloud Native Stack

GCNS Overview

The Gympass Cloud Native Stack was created from an application services perspective, positioning our product engineering teams as the protagonists operating their application infrastructure dependencies, while nonetheless keeping the strong safeguards that all great paved roads must have. Now let's get into the details of each GCNS component.

AWS Custom Resource Definition (CRD)

Inspired by projects like the AWS Service Operator, now replaced by ACK (AWS Controllers for Kubernetes), which was recently released in preview mode, we decided to use Kubernetes Operators to manage our AWS resources. They allow us to create CRDs defining new kinds that represent specific resources:

apiVersion: aws.gympass.com/v1alpha1
kind: S3
metadata:
  name: {{ .name }}
spec:
  roleName: {{ .Values.s3.iamRoleName }}
  encryption: {{ .encryption }}
  tags: {{ toYaml .Values.s3.tags }}

The custom resource above describes an S3 bucket; the aws-operator watches for resources of this kind and manages them on the AWS side. The resource definition is bundled into the application Helm chart. Our resources are now created automatically, but then we started to struggle with how to manage bucket permissions, especially since, looking from the application perspective, the bucket must be accessed by an application. For this purpose we created the external-role controller.
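
As an aside, the S3 kind itself has to be registered in the cluster before the aws-operator can reconcile it. We won't reproduce our actual CRD here; a minimal sketch, assuming a plural name of s3s and only the spec fields shown above, would look roughly like this:

# Hypothetical sketch: names and schema are assumptions, not the real CRD
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: s3s.aws.gympass.com
spec:
  group: aws.gympass.com
  scope: Namespaced
  names:
    kind: S3
    plural: s3s
    singular: s3
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                roleName:
                  type: string
                encryption:
                  type: boolean
                tags:
                  type: object
                  additionalProperties:
                    type: string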

External-role Controller

external-role operator overview

With IAM roles for service accounts (IRSA) on Amazon EKS clusters, we can associate an IAM role with a Kubernetes service account. Containers in any pod that uses this service account can then assume the associated role and use the permissions of its attached access policies. On the ServiceAccount resource we just have to annotate the associated role:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: <service_account_name>
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<account_id>:role/<role_name>

Using the same service account annotation approach, we created a custom annotation that the external-role operator watches and reacts to, managing the specific role that represents the application on AWS.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: <service_account_name>
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<account_id>:role/<role_name>
    external-role.gympass.com/iam-oidc-role: <role_name>

When a new service account is created, we can see the external-role controller interaction:

$ kubectl describe serviceaccount <your-service-account> -n <your-namespace>
Events:
  Type    Reason   Age  From                      Message
  ----    ------   ---  ----                      -------
  Normal  CREATED  15s  external-role-controller  role custom-role-name

When a new service account is defined, the external-role controller reacts and automatically creates the custom-role-name role on AWS.
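
With AWS CLI credentials for the account, the result can be double-checked with a standard IAM call:

# Verify the role created by the external-role controller (name taken from the event above)
$ aws iam get-role --role-name custom-role-name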

CRD & Roles

A custom resource like the S3 kind can be configured with the application role; internally, our aws-operator will attach an inline IAM access policy to the referenced role, granting access to operate on the S3 bucket.
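
Again assuming AWS CLI access, the inline policy attached by the aws-operator can be inspected with standard IAM commands (the inline policy name depends on how the operator names it):

# List inline policies attached to the application role, then fetch one of them
$ aws iam list-role-policies --role-name <app_role>
$ aws iam get-role-policy --role-name <app_role> --policy-name <inline_policy_name>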

Application Infrastructure Configuration

When we put everything together, still thinking in terms of a well-known common interface, we ended up with a regular application values.yaml with a few small but important new sections:

resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
hpa:
  enabled: true
  targetCPU: 80
  minReplicas: 10
  maxReplicas: 20
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<account_id>:role/<app_role>
    external-role.gympass.com/iam-oidc-role: <app_role>
s3:
  enabled: true
  buckets:
    - name: <bucket_name>
      encryption: false
      iamRoleName: <app_role>

Using interfaces that engineers are already comfortable with removes a lot of the friction of introducing new concepts; here we can create simple but powerful application definitions.
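
And because all of this lives in the chart, infrastructure changes ship through the same path as code changes; for illustration, a plain Helm release upgrade is all it takes (names and paths below are placeholders):

# Illustrative command; release name, chart path and namespace are placeholders
$ helm upgrade --install <app_name> ./chart -f values.yaml -n <your-namespace>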

In part 2 we'll discuss how we're approaching our Vault interaction with AWS resources like RDS.
