Scaling workloads with the big savings quartet: EKS, Fargate, Karpenter and Keda

Kaio Cunha
16 min read · Feb 10, 2024

Introduction

The guinea pig chosen for this project is Metabase, an open-source business intelligence tool used when corporate folks have big questions to ask about their companies. I had never worked with it, which is exactly why it’s a good choice. It’s written in Clojure and JavaScript, needs a database, and consumes lots of CPU and memory.

This endeavor involves a blend of Terraform, EKS, Fargate and Karpenter, integrated with Istio for service mesh, Prometheus, Grafana, and Keda for monitoring and auto-scaling workloads based on traffic metrics.

All code and documentation can be found here.

This is what we’re aiming for:

These are all the tools used:

  • Terraform: For infrastructure as code.
  • EKS: For managed Kubernetes on AWS.
  • Fargate: For serverless compute on Kubernetes.
  • Karpenter: For autoscaling Kubernetes clusters.
  • Metabase: For business intelligence and analytics.
  • GitHub Actions: For CI/CD.
  • Docker: For local development and testing.
  • Istio: For service mesh and rich traffic metrics.
  • Prometheus and Grafana: For metrics scraping and observability.
  • Keda: For scaling Metabase based on istio_requests_total and memory usage.

Here is the project tree overview:

.
|-- .github
|   |-- workflows
|   |   |-- apply-all.yaml
|   |   |-- apply-workflow.yaml
|   |   |-- destroy-workflow.yaml
|   |   |-- manual-apply.yaml
|   |   |-- plan-workflow.yaml
|   |   |-- stack-workflow.yaml
|   |   `-- uninstall-workflow.yaml
|-- README.md
|-- environments
|   |-- dev
|   `-- lab
|       |-- backend.tf
|       |-- main.tf
|       |-- outputs.tf
|       |-- providers.tf
|       |-- s3-dynamodb
|       |   `-- main.tf
|       `-- variables.tf
|-- infra
|   |-- backend
|   |   |-- main.tf
|   |   |-- outputs.tf
|   |   `-- variables.tf
|   |-- eks-fargate-karpenter
|   |   |-- main.tf
|   |   |-- outputs.tf
|   |   `-- variables.tf
|   |-- rds
|   |   |-- main.tf
|   |   |-- outputs.tf
|   |   `-- variables.tf
|   `-- vpc
|       |-- main.tf
|       |-- outputs.tf
|       `-- variables.tf
|-- stack
|   |-- istio
|   |   |-- istio-ingress.yaml
|   |   |-- istiod-values.yaml
|   |   |-- pod-monitor.yaml
|   |   `-- service-monitor.yaml
|   |-- keda
|   |   `-- values.yaml
|   |-- metabase
|   |   |-- metabase-hpa.yaml
|   |   |-- metabase-scaling-dashboard.yaml
|   |   `-- values.yaml
|   `-- monitoring
|       `-- values.yaml

Step 1: Local environment

We kick off by crafting an environment tailored for the task. This setup includes the essential tools and configurations needed for infrastructure management and development. I recommend WSL if you’re on Windows, but it should also work on macOS.

Just clone this devenv repo and make sure you give it the right permissions:

chmod +x *.sh

Then, run it from the main file:

./main.sh

If you want to install the pre-requisites by yourself, here they are:

  • Terraform CLI
  • AWS CLI
  • kubectl
  • kubectx

Step 2: AWS credentials and Terraform Backend on S3 and GitHub Workflows

Now, you have to run aws configure and provide your AWS access key ID, secret access key, default region, and output format. The same information should be stored on your GitHub repository secrets before we get to the CI/CD bit of this article. To learn how you can create your AWS credentials, follow the instructions here. For GitHub secrets, follow these instructions.

All workflows are configured to use these credentials:

env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  AWS_DEFAULT_REGION: 'us-east-1'

- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v1
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: ${{ env.AWS_DEFAULT_REGION }}

With the local environment and credentials ready, it’s time to cd into (or onto?) the environments/lab/s3-dynamodb folder, where a main.tf is located.

Here, we create an S3 bucket and a DynamoDB table, to serve as a Terraform backend:

module "backend" {
source = "../../../infra/backend"
region = "us-east-1"
bucket_name = "tfstate-somename-lab"
dynamodb_table_name = "tfstate-somename-lab-lock"
}

Run terraform init, then terraform apply, and never look back.

Finally, cd back to the environments/lab folder and run terraform init.

Because of the backend.tf file located in the same dir, you should see something like this:

Initializing the backend...

Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- Checking for available provider plugins...
- Downloading plugin for provider "aws" (hashicorp/aws)...

Terraform has been successfully initialized!

That’s it.

You’ll have your infrastructure state stored on AWS S3 from now on.
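For reference, the backend.tf in environments/lab would look roughly like the sketch below. The bucket, table, and key names here are assumptions that simply mirror the module call above, so adjust them to whatever you actually created:

# Sketch of environments/lab/backend.tf (names are illustrative)
terraform {
  backend "s3" {
    bucket         = "tfstate-somename-lab"
    key            = "lab/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tfstate-somename-lab-lock"
    encrypt        = true
  }
}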

GitHub Workflows Configuration

The following workflows were created to automate the CI/CD process:

  • plan-workflow.yaml: Runs terraform plan to review the changes that will be applied to the infrastructure. It triggers on pull requests to the main branch.
  • apply-workflow.yaml: Runs terraform apply to apply the changes to the infrastructure in a specific order using the target parameter. It triggers on pushes to the main branch. To avoid triggering this workflow, like when only documentation changes are made, add [skip ci] to the commit message.
  • destroy-workflow.yaml: Runs terraform destroy to destroy the infrastructure. It can only be triggered manually from the GitHub Actions page.
  • stack-workflow.yaml: Runs helm upgrade --install or helm uninstall on all the extra Kubernetes components as well as Metabase. It can only be triggered manually from the GitHub Actions page on the cluster-stack or main branches. You can also select which addon or workload to upgrade or uninstall by adding their names as inputs to the workflow:
Cluster Stack Workflow

For quick tests, the best option is to push changes to the cluster-stack branch and trigger this workflow manually. This way, you can test the changes without affecting the main branch.

There is also an apply-all.yaml workflow, which applies both the AWS infrastructure and the cluster stack.

All these workflows can be found in the .github/workflows folder.
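As a taste of what they look like, here is a trimmed-down sketch of what plan-workflow.yaml might contain; the real file in the repo has more steps, and this only shows the trigger and the plan step under those assumptions:

name: Terraform Plan

on:
  pull_request:
    branches: [main]

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: hashicorp/setup-terraform@v2
      - name: Terraform plan
        run: |
          cd environments/lab
          terraform init
          terraform plan
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: 'us-east-1'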

Step 3: AWS Core Infrastructure with VPC, EKS, Karpenter and Fargate

After getting the local environment ready, setting up a CI/CD pipeline with GitHub Actions, and configuring an S3 backend, let’s turn our attention to the core AWS infrastructure.

Creating a VPC:

The first order of business was establishing a VPC, which I named lab-vpc, to get an isolated network space for my lab setup. I made sure the VPC spanned three Availability Zones to enhance availability and fault tolerance.

Here’s how I laid out the VPC:

module "lab_vpc" {
source = "../../infra/vpc"

name = local.name
vpc_cidr = local.vpc_cidr
azs = local.azs
private_subnets = ["10.0.0.0/19", "10.0.32.0/19", "10.0.128.0/19"]
public_subnets = ["10.0.64.0/19", "10.0.96.0/19", "10.0.160.0/19"]
intra_subnets = ["10.0.192.0/19", "10.0.224.0/19"]

tags = local.tags
}
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.5.0"

name = var.name
cidr = var.vpc_cidr

azs = var.azs
private_subnets = var.private_subnets
public_subnets = var.public_subnets
intra_subnets = var.intra_subnets

enable_nat_gateway = true
single_nat_gateway = true

public_subnet_tags = {
"kubernetes.io/role/elb" = "1"
"kubernetes.io/cluster/${var.name}" = "shared"
"profile" = "public"
}

private_subnet_tags = {
"kubernetes.io/role/internal-elb" = "1"
"kubernetes.io/cluster/${var.name}" = "shared"
# Tags subnets for Karpenter auto-discovery
"karpenter.sh/discovery" = var.name
"profile" = "private"
}

tags = var.tags
}

EKS Cluster Setup with Fargate and Karpenter:

Next on my list was setting up an EKS cluster named metabaselab. I chose Fargate for its serverless compute capabilities, perfect for running pods that don’t need the full capacity of an EC2 instance. Karpenter was deployed on Fargate nodes to handle autoscaling, since it’s advised not to run Karpenter on the same nodes it manages. Fargate provides fast autoscaling when needed, along with a 99.99% uptime SLA.

Here’s how it was done:

module "eks_fargate_karpenter" {
source = "../../infra/eks-fargate-karpenter"

cluster_name = "metabaselab"
cluster_version = "1.28"
vpc_id = module.lab_vpc.vpc_id
subnet_ids = module.lab_vpc.private_subnets
control_plane_subnet_ids = module.lab_vpc.intra_subnets

fargate_profiles = {
karpenter = {
selectors = [
{ namespace = "karpenter" }
]
}
kube-system = {
selectors = [
{ namespace = "kube-system" }
]
}
}
}

This applies all the resources defined in source = "../../infra/eks-fargate-karpenter", which, among other things, defines a NodePool that guides Karpenter's decisions:

resource "kubectl_manifest" "karpenter_node_pool" {
yaml_body = <<-YAML
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
nodeClassRef:
name: default
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"]
- key: "node.kubernetes.io/instance-type"
operator: In
values: ["t2.micro", "t3.micro", "t3.small"]
# Resource limits constrain the total size of the cluster.
# Limits prevent Karpenter from creating new instances once the limit is exceeded.
limits:
cpu: 56
memory: 56Gi
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 30s
kubelet:
maxPods: 100
YAML

depends_on = [
kubectl_manifest.karpenter_node_class
]
}

To summarize, it tells Karpenter to only provision t2.micro, t3.micro and t3.small instance types. In a multi-tenant environment, you could have multiple NodePools for different scaling needs, excellent for internal developer platforms.
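For illustration, a second NodePool for burstier or batch-style workloads might look like the sketch below. The name, capacity type, and instance types here are hypothetical, and it still points at the same default EC2NodeClass:

# Hypothetical extra NodePool for spot-friendly batch workloads
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: batch
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["t3.medium", "t3.large"]
  limits:
    cpu: 32
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s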

In case you want to run another workload on Fargate, you just need to create another Fargate profile:

fargate_profiles = {
  karpenter = {
    selectors = [
      { namespace = "karpenter" }
    ]
  }
  kube-system = {
    selectors = [
      { namespace = "kube-system" }
    ]
  }
}

Demonstrating Karpenter’s Scaling Abilities:

The eks-fargate-karpenter module ships with an example deployment with zero replicas, just to test Karpenter:

resource "kubectl_manifest" "karpenter_example_deployment" {
yaml_body = <<-YAML
apiVersion: apps/v1
kind: Deployment
metadata:
name: inflate
spec:
replicas: 0
selector:
matchLabels:
app: inflate
template:
metadata:
labels:
app: inflate
spec:
terminationGracePeriodSeconds: 0
containers:
- name: inflate
image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
resources:
requests:
cpu: 1
YAML

depends_on = [
helm_release.karpenter
]
}

To put Karpenter to the test, simply run kubectl scale deployment inflate --replicas=10. You should see some nodes start popping up:

$ kubectl get nodes

NAME                                        STATUS   ROLES    AGE     VERSION
ip-192-168-1-100.ec2.internal               Ready    <none>   4m      v1.28.0
ip-192-168-1-101.ec2.internal               Ready    <none>   3m      v1.28.0
fargate-ip-192-168-1-200.fargate.internal   Ready    <none>   2m30s   v1.28.0
fargate-ip-192-168-1-201.fargate.internal   Ready    <none>   2m      v1.28.0
fargate-ip-192-168-1-202.fargate.internal   Ready    <none>   1m30s   v1.28.0
fargate-ip-192-168-1-203.fargate.internal   Ready    <none>   1m      v1.28.0

Workloads run on regular EC2 instances; kube-system and karpenter run on Fargate nodes.
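If you want to watch Karpenter make its decisions, tailing its controller logs (and, on the v1beta1 APIs used here, listing its NodeClaims) is usually enough. The namespace and deployment names below assume the defaults used by the Helm install in this setup:

# Follow Karpenter's provisioning decisions in real time
kubectl logs -f -n karpenter deploy/karpenter

# List the NodeClaims Karpenter created for the new nodes
kubectl get nodeclaims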

Step 4: A decoupled database with RDS for Metabase

So, the next thing I did was get an RDS instance up and running for the Metabase database. This step is pretty key because it’s where the data lives, and the lifecycle of Metabase should be independent of the database it uses (best practices). I used the infra/rds module for this. Here’s how I set it up:

module "lab_rds" {
source = "../../infra/rds"

db_name = local.name
db_username = local.name
db_port = 3306
db_password = var.db_password

vpc_security_group_ids = [module.security_group.security_group_id, module.eks_fargate_karpenter.cluster_primary_security_group_id]
subnet_ids = module.lab_vpc.private_subnets

tags = local.tags
}

I added a security group. This basically controls who can talk to the RDS instance:

module "security_group" {
source = "terraform-aws-modules/security-group/aws"
version = "~> 5.0"

name = local.name
vpc_id = module.lab_vpc.vpc_id

ingress_with_source_security_group_id = [
{
source_security_group_id = module.eks_fargate_karpenter.cluster_primary_security_group_id
from_port = 3306
to_port = 3306
protocol = "tcp"
description = "MySQL access from within VPC"
},
]

tags = local.tags
}

This setup places the RDS instance right in the same VPC as the EKS cluster, tucked away in the private subnets for extra security. The security group I set up ensures that only traffic on port 3306 (the MySQL port) can get through, which is just what we need.

I kept the RDS setup separate from the Metabase deployment. This gives me the flexibility to switch databases for Metabase if needed, just by updating the database connection string in the Metabase setup.
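In practice (as you’ll see in the stack workflow later), that swap boils down to pointing the Metabase chart at a different host and password, roughly like this when run from stack/metabase where the chart values live:

# Point Metabase at a different database by overriding the chart values
helm upgrade --install metabase pmint93/metabase \
  --namespace metabase -f values.yaml \
  --set database.host="<new-db-endpoint>" \
  --set database.password="<new-db-password>"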

Step 5: Rolling Out the AWS Load Balancer Controller for service exposure

Next up was deploying the AWS Load Balancer Controller, which is pretty much the gatekeeper for letting traffic from the outside world chat with services inside the cluster. I wove this into the main Terraform setup like so:

data "http" "iam_policy" {
url = "https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json"
}

resource "aws_iam_policy" "load_balancer_controller" {
name = "AWSLoadBalancerControllerIAMPolicy"
description = "Policy for the AWS Load Balancer Controller"
policy = data.http.iam_policy.body
}

resource "aws_iam_role" "load_balancer_controller_role" {
name = "eks-load-balancer-controller-role"

assume_role_policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Action = "sts:AssumeRoleWithWebIdentity",
Effect = "Allow",
Principal = {
Federated = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/oidc.eks.${local.region}.amazonaws.com/id/${local.oidc_id}"
},
Condition = {
StringEquals = {
"${module.eks_fargate_karpenter.oidc_provider.oidc_provider}:sub" : "system:serviceaccount:kube-system:aws-load-balancer-controller",
"${module.eks_fargate_karpenter.oidc_provider.oidc_provider}:aud" : "sts.amazonaws.com"
}
}
}
]
})
}

resource "aws_iam_role_policy_attachment" "load_balancer_controller_policy_attach" {
role = aws_iam_role.load_balancer_controller_role.name
policy_arn = aws_iam_policy.load_balancer_controller.arn
}

resource "kubernetes_service_account" "load_balancer_controller" {
metadata {
name = "aws-load-balancer-controller"
namespace = "kube-system"
annotations = {
"eks.amazonaws.com/role-arn" = aws_iam_role.load_balancer_controller_role.arn
}
}
}

resource "helm_release" "aws_load_balancer_controller" {
name = "aws-load-balancer-controller"
repository = "https://aws.github.io/eks-charts"
chart = "aws-load-balancer-controller"
namespace = "kube-system"

set {
name = "clusterName"
value = local.name
}

set {
name = "region"
value = local.region
}

set {
name = "serviceAccount.create"
value = "false"
}

set {
name = "serviceAccount.name"
value = "aws-load-balancer-controller"
}

set {
name = "vpcId"
value = module.lab_vpc.vpc_id
}

timeout = 3600
wait = true

depends_on = [aws_iam_role.load_balancer_controller_role, kubernetes_service_account.load_balancer_controller]
}

With this controller in place, there are a couple of ways to open up the doors for traffic to flow into the cluster:

Option 1: Going with Istio Ingress Gateway. This route lets you tap into Istio’s bag of tricks — think traffic management, bolstered security, and the ability to keep a keen eye on everything. You’d set up a VirtualService and a Gateway for each service you’re looking to expose. Going this route, you’d typically need a domain name. With a domain, you can create more user-friendly URLs for your services. Instead of relying on automatically generated or IP-based addresses, you can have URLs like service.yourdomain.com which are easier to remember and share.

Option 2: Direct Kubernetes Services of Type LoadBalancer. This is more straight-up, where you create Kubernetes services set to LoadBalancer type, and AWS automatically conjures up a load balancer (Classic, ALB, or NLB, depending on your annotations) for each service. Just set the service type to LoadBalancer and tag it with the right annotations to get the AWS Load Balancer to behave exactly how you need.

I chose option 2.

This method is super straightforward, perfect for when you don’t need the fancy routing and management that Istio offers. It’s ideal for simpler setups where services get their own dedicated load balancer for a direct line to the outside world. Keep in mind that this method will expose your service to the internet.
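For completeness, if you went with option 1 instead, exposing Metabase through the Istio Ingress Gateway would look roughly like the sketch below. The hostname and the gateway selector label are assumptions; adjust them to your domain and to the labels on your ingress gateway pods:

# Hypothetical Gateway + VirtualService for option 1
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: metabase-gateway
  namespace: metabase
spec:
  selector:
    istio: ingressgateway   # assumed default label on the ingress gateway pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "metabase.yourdomain.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: metabase
  namespace: metabase
spec:
  hosts:
    - "metabase.yourdomain.com"
  gateways:
    - metabase-gateway
  http:
    - route:
        - destination:
            host: metabase.metabase.svc.cluster.local
            port:
              number: 80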

Step 6: Configuring Metabase Deployment

Here’s the setup I used to deploy Metabase:

database:
  type: mysql
  port: 3306
  dbname: metabaselab
  username: metabaselab

service:
  name: metabase
  type: LoadBalancer
  externalPort: 80
  internalPort: 3000
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"

resources:
  requests:
    cpu: 1000m
    memory: 800Mi
  limits:
    cpu: 1250m
    memory: 1Gi

monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    port: 9191

strategy:
  type: Recreate

Database & Service Specs:

  • Database Configuration: Hooks up Metabase to a MySQL database, with details like host, port, database name, and username.
  • Service & Resources: Sets up a LoadBalancer service for public access, alongside resource allocations for smooth operation within Kubernetes based on Metabase’s guidelines for memory and scaling.

Scaling with KEDA

For scaling Metabase dynamically, I leaned on KEDA. The ScaledObject configuration looked something like this:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: metabase
  namespace: metabase
spec:
  scaleTargetRef:
    kind: Deployment
    name: metabase
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 30
  pollingInterval: 1
  fallback:
    failureThreshold: 1
    replicas: 1
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring:9090
        metricName: requests_per_second
        # rate() gives the number of requests per second over a 2-minute window; sum() adds the data from all pods.
        query: |
          sum(rate(istio_requests_total{destination_workload="metabase"}[2m]))
        threshold: "100"
    - type: memory
      metricType: Utilization
      metadata:
        value: "110"

This setup allows Metabase to flex its muscles based on the crowd it’s handling (i.e., requests per second) and how heavy the workload is (memory usage). It’s pretty smart about it, too, with a setup that scales from 1 to 10 replicas as needed, checks in every second, and cools down for 30 seconds after scaling actions.

The target memory utilization is set to 110% (the value in the trigger above). This matters because HPAs use resource requests, not limits, when calculating percentage-based scaling. So, my Metabase deployment will scale out in two scenarios (you can verify the request-rate side yourself with the query shown right after this list):

  1. The combined istio_requests_total rate across all Metabase pods reaches 100 requests per second.
  2. Memory utilization reaches 110% of the 800Mi request, i.e. 880Mi.
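For example, you can hit the Prometheus HTTP API from inside the cluster and see the same number KEDA sees; the server address matches the serverAddress in the ScaledObject above, and the throwaway curl pod is just one convenient way to reach it:

# Run the same PromQL query KEDA evaluates (from a temporary pod inside the cluster)
kubectl run promql-check --rm -it --image=curlimages/curl --restart=Never -- \
  curl -sG http://prometheus-operated.monitoring:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(istio_requests_total{destination_workload="metabase"}[2m]))'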

After deploying the ScaledObject (next step), you can check the HPA created by Keda:

sre@99326d08570e:~$ kubectl get hpa -n metabase
NAME                REFERENCE             TARGETS                       MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-metabase   Deployment/metabase   107143m/100 (avg), 75%/110%   1         10        7          88m

When the workload allows it, you can set Keda to scale in to zero replicas for even bigger savings.
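A minimal sketch of that change on the ScaledObject above, assuming Metabase can tolerate a cold start after idle periods; note that the memory trigger alone can’t take a workload to zero, it’s the Prometheus trigger dropping to zero traffic that makes this possible:

spec:
  minReplicaCount: 0   # let KEDA scale Metabase all the way down when traffic drops to zero
  maxReplicaCount: 10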

Step 7: Rolling Out the Cluster Stack: The Essentials + Metabase

Alright, this step is where things really start to take shape for making Metabase smart enough to scale based on how busy it gets and how much it’s thinking (a.k.a. requests per second and memory usage). Not to mention, this is also where we dive deep into the monitoring and observability pool. I laid it all out using a stack workflow that installs the whole stack in the correct dependency order:

name: Cluster Stack Management

env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  AWS_DEFAULT_REGION: 'us-east-1'
  EKS_CLUSTER_NAME: 'metabaselab'
  RDS_PASSWORD: ${{ secrets.RDS_PASSWORD }}

on:
  workflow_dispatch:
    inputs:
      components:
        description: 'Comma-separated list of components to apply (e.g., istio,metabase)'
        required: true
        default: 'keda,metrics-server,monitoring,metabase,istio'

jobs:
  helm:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v2

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_DEFAULT_REGION }}

      - name: Update kube config
        run: aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $AWS_DEFAULT_REGION

      - name: Set up Helm
        uses: azure/setup-helm@v1
        with:
          version: 'v3.13.3'

      - name: Install metrics-server
        if: contains(github.event.inputs.components, 'metrics-server')
        run: |
          kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability-1.21+.yaml

      - name: Helm install kube-prometheus-stack (monitoring)
        if: contains(github.event.inputs.components, 'monitoring')
        run: |
          helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
          helm repo update
          cd stack/monitoring
          helm upgrade --install monitoring prometheus-community/kube-prometheus-stack --namespace monitoring -f values.yaml --create-namespace

      - name: Helm install Istio
        if: contains(github.event.inputs.components, 'istio')
        run: |
          helm repo add istio https://istio-release.storage.googleapis.com/charts
          helm repo update
          cd stack/istio
          helm upgrade --install istio-base istio/base -n istio-system --create-namespace --set defaultRevision=default
          helm upgrade --install istiod istio/istiod -n istio-system -f istiod-values.yaml --wait
          kubectl apply -f pod-monitor.yaml && kubectl apply -f service-monitor.yaml
          helm upgrade --install istio-ingressgateway istio/gateway -n istio-system -f istio-ingress.yaml --wait

      - name: Helm install KEDA
        if: contains(github.event.inputs.components, 'keda')
        run: |
          helm repo add kedacore https://kedacore.github.io/charts
          helm repo update
          cd stack/keda
          helm upgrade --install keda kedacore/keda --namespace keda -f values.yaml --create-namespace

      - name: Fetch RDS Endpoint for Metabase
        if: contains(github.event.inputs.components, 'metabase')
        run: |
          RDS_ENDPOINT=$(aws rds describe-db-instances --db-instance-identifier metabaselab --query 'DBInstances[0].Endpoint.Address' --output text)
          echo "RDS_ENDPOINT=$RDS_ENDPOINT" >> $GITHUB_ENV

      - name: Helm install Metabase
        if: contains(github.event.inputs.components, 'metabase')
        run: |
          if ! kubectl get namespace metabase; then
            kubectl create namespace metabase
            kubectl label namespace metabase istio-injection=enabled
          fi

          helm repo add pmint93 https://pmint93.github.io/helm-charts
          helm repo update
          cd stack/metabase
          helm upgrade --install metabase pmint93/metabase --namespace metabase -f values.yaml --create-namespace \
            --set database.host="$RDS_ENDPOINT" \
            --set database.password="${{ secrets.RDS_PASSWORD }}"
          kubectl apply -f metabase-hpa.yaml && kubectl apply -f metabase-scaling-dashboard.yaml
This workflow is only triggered manually. The best way to work with this workflow is to keep a separate branch just to update and run it.
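If you prefer the terminal over the Actions page, the GitHub CLI can fire the same workflow_dispatch event; the file name matches the one in the project tree, and the components value below is just an example (gh must be installed and authenticated):

# Trigger the stack workflow from the cluster-stack branch, installing only KEDA and Metabase
gh workflow run stack-workflow.yaml --ref cluster-stack -f components='keda,metabase'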

All components installed in this step serve a different purpose:

  • Metrics Server: This step automates bringing the Metrics Server online within Kubernetes. It’s a critical component that makes autoscaling (via HPA) and tracking resource consumption possible.
  • kube-prometheus-stack: This part ensures the kube-prometheus-stack is deployed seamlessly, weaving Prometheus into our monitoring fabric and Grafana for insightful dashboards. Prometheus metrics will be used by Keda HPA to scale based on requests per second and by Grafana to feed the Keda dashboard we are going to use to monitor Metabase scaling.
  • Istio: We roll out Istio across the cluster. The setup includes Istio’s foundation, its control plane (Istiod), and an Ingress Gateway for managing incoming traffic. Istio is like a Swiss Army Knife for traffic management, but we’re going to use it mostly to provide rich traffic metrics to Prometheus.
  • Istio Ingress Gateway: Positioned as the gatekeeper for external service access, it’s on standby for whenever you decide to route traffic through it. Remember option 1 discussed earlier?
  • Istio PodMonitor and ServiceMonitor: These components keep an eye on Istio’s own components and any service running with Istio sidecars, like Metabase. They’re vital for pulling traffic metrics, which in turn enable dynamic scaling based on the load.
  • KEDA Installation: This step brings KEDA into the fold, setting the stage for event-driven scaling within our Kubernetes landscape.
  • KEDA Dashboard: A dedicated dashboard for monitoring KEDA’s ScaledObjects and HPAs, making it easier to keep tabs on how Metabase scales in response to events.
  • Metabase Deployment: Here, we deploy Metabase onto our EKS cluster, using Helm for a smooth setup. The Metabase chart is tweaked to connect with our RDS database and to snugly fit within Istio’s service mesh using istio injection.
  • Metabase HPA: We introduce a Horizontal Pod Autoscaler (HPA) specifically for Metabase, which is created based on Keda’s ScaledObject. This enables our setup to automatically adjust Metabase’s resource allocation based on real-time traffic and memory demand.

Step 8: Accessing Metabase, Grafana, and Prometheus

When it comes to checking out the user interfaces for Metabase, Grafana, and Prometheus, here’s how you can get connected:

  1. Get each service’s external address:

kubectl get svc -A

NAMESPACE    NAME                                    TYPE           CLUSTER-IP       EXTERNAL-IP                         PORT(S)                         AGE
metabase     metabase                                LoadBalancer   172.20.64.5      xxxxx.elb.us-east-1.amazonaws.com   80:30961/TCP,9191:30516/TCP     89m
monitoring   monitoring-grafana                      LoadBalancer   172.20.138.142   xxxxx.elb.us-east-1.amazonaws.com   80:31815/TCP                    90m
monitoring   monitoring-kube-prometheus-prometheus   LoadBalancer   172.20.160.99    xxxxx.elb.us-east-1.amazonaws.com   9090:32740/TCP,8080:30352/TCP   90m

2. You can access the services by appending the appropriate port or path to the external address in your browser (or by port-forwarding them locally, as shown below). For example:

  • Metabase: xxxxx.elb.us-east-1.amazonaws.com
  • Grafana: xxxxx.elb.us-east-1.amazonaws.com
  • Prometheus: xxxxx.elb.us-east-1.amazonaws.com:9090/graph
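If you’d rather not expose Grafana or Prometheus publicly at all, port-forwarding works just as well; the service names below are the ones from the kubectl get svc output above:

# Access Grafana and Prometheus locally without a public load balancer
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090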

As soon as users start accessing Metabase, you can see it scale out on the Keda dashboard:

Keda Dashboard Panel

Wrapping It Up

To wrap up the exploration into scaling workloads with the big savings quartet of EKS, Fargate, Karpenter, and Keda, it’s crucial to emphasize that achieving significant cloud savings requires a harmonious alignment of these technologies. The project, which focused on leveraging Metabase as a test case, has illustrated the dynamic and flexible scaling capabilities possible when integrating these AWS and Kubernetes services effectively.

The journey began with a setup that included EKS for managing Kubernetes clusters on AWS, Fargate for serverless compute power, Karpenter for efficient and responsive autoscaling, and Keda for event-driven scaling based on actual workload demands. Each component plays a vital role in the ecosystem, contributing to both performance efficiency and cost optimization.

For organizations looking to optimize their cloud expenses while maintaining or enhancing service performance, the combination of these tools offers a compelling solution. By leveraging EKS and Fargate, you can manage your Kubernetes workloads with less overhead, allowing for a more streamlined operation. Karpenter further enhances this by intelligently managing resource allocation, ensuring that you’re only using what you need, when you need it. Finally, Keda closes the loop by enabling application scaling based on real-world events and metrics, ensuring that resources are precisely aligned with demand.

However, the key to unlocking these savings lies not just in the adoption of these technologies but in their strategic integration and alignment. It requires a thoughtful approach to architecture, a deep understanding of workload patterns, and a commitment to continuous optimization. When EKS, Fargate, Karpenter, and Keda are used in concert, with each part of the quartet playing to its strengths, organizations can achieve a scalable, efficient, and cost-effective cloud environment.

Cost Optimization — The Sweet Spot

When it comes down to saving bucks in the cloud, the real game-changer is right-sizing. It’s all about getting the fit just right — making sure you have exactly what you need, no more, no less, to keep things running smoothly without overspending.
