Creating Scalable CTF Infrastructure on Google Cloud Platform with Kubernetes and App Engine

Disclaimer: The infrastructure setup for this CTF took heavy inspiration from this article, which really helped us get up to speed with Kubernetes + HAProxy.

Preface: Please don’t use this as a tutorial; this is more of a story and reflection piece on the crazy week prior to DownUnderCTF.com. I'm sure there are plenty of better ways to do things and I would love to hear your comments!

TL;DR

  • Kubernetes is a scary beast
  • HAProxy is an amazing reverse proxy with low latency and low overhead
  • Secure ur cloud challenges properly
  • Prepare your infra > 2 weeks before the CTF
  • Prepare scalable, stable, secure infrastructure.

Running a Capture the Flag competition can be quite a lot more daunting than it initially seems. You obviously have to create challenges with embedded vulnerabilities, but for me, as part of the infrastructure team for DownUnderCTF, the difficult part was designing an invulnerable platform for vulnerable challenges 👀AAAAAAA. We needed to design the infrastructure so that it met these core requirements:

  • Scalable — it needed to handle huge influxes of traffic, especially at the start of the CTF ⛳
  • Stable — It needed to not crash and go boom 🧨
  • Secure — The actual infrastructure itself was never meant to be hacked, and having a vulnerability here can definitely be scary (not foreshadowing at all 👀) and could potentially ruin the CTF. Big yikes🔐
  • Cost Effective — We don’t have unlimited money and we wanted to keep this on the cheap and only spend more when we needed to scale up 💰. How very cashmoney of you

So as a first time CTF organiser, this was a bit of a 🅱RUH moment. Like, how do you even start? What do we use? I know! Kubernetes. Wait, how do you use kubernetes.mp4? What is a workload? How do you even deploy to kube? Oh god, looks like we are diving in the deep end here. So let’s dive into how we managed to get this up and running while learning how everything worked at the exact same time.

So in terms of infrastructure for this particular CTF, we needed 3 main things to be set up: the CTFd platform, the challenges that needed to be hosted (web, pwn, misc, crypto), and automatic provisioning of some Cloud infrastructure, as there were 2 Cloud challenges with complex setups. So ya know, just a walk in the park right😂😂😂😂😂?

Now as someone who has had a bit of experience on Cloud before and built plenty of hacky, bodged together projects (mandatory Tom Scott video) with cloud, I was excited to build something that was actually scalable and would cater to a lot of people. So here was the story of what we came up with.

CTFd — The Competition Platform

Having a stable and secure CTF platform is vital to a CTF’s success. If you under-provision the resources it runs on, you risk crashing your site; competitors will not be able to sign up / submit flags / access the challenges, and it just overall turns people off the CTF. Big bad😡. Our initial design was essentially to run one VM in Google Compute Engine running our custom Docker image of CTFd and pop that behind Cloudflare, then have a Redis and a Cloud SQL instance on the same VPC for data storage. ez right?

Nope 🚫

During the lead up to the CTF we ran some load tests against this setup using this distributed K8S load testing guide, and it simply did not hold up: the website became extremely slow, we were getting lots of 503 errors from Cloudflare, and if the server crashed there was no redundancy, since we were running this on 1 VM.

hide my load testing pain👀

← Actual photo of me looking at the load testing results

So this just simply wouldn’t do, we weren’t hitting 2 of our requirements of Scalable and Stable infrastructure. We needed to design for SCALE.

Google App Engine to the rescue!

Now I thought to myself, I wonder if there is something that can run a Docker image, can scale to meet demand, and doesn’t have a single point of failure.

Real life depiction of what happened, thanks paint.exe

App Engine Flex! This is a fantastic cloud product where you essentially give it a docker🐋image to run, and it listens for HTTP requests and handles them. Not only that, but it will spin up replicas of the image when traffic increases (determined by Google magic)! This hits our requirements of a Scalable platform and a Stable one, since more replicas = no single point of failure + scalable to the demand we need!

Welcome to the world of serverless computing 😍

One problem we ran into was that CTFd listens on port 8000 by default and App Engine requires you to listen on port 8080, but this was easily fixed up in the config of our Docker image.
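For context, a sketch of what that image config change might look like. This assumes CTFd's standard gunicorn entrypoint and the public ctfd/ctfd base image; our actual image config differed.

```dockerfile
# Sketch only: rebase on the public CTFd image and rebind gunicorn
# from the default port 8000 to 8080, which App Engine Flex expects.
FROM ctfd/ctfd:latest

# Override the serve command so the app listens on 8080
CMD ["gunicorn", "CTFd:create_app()", "--bind", "0.0.0.0:8080", "--workers", "4"]
```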

Then we added the App Engine instances to the same VPC of our Redis and Cloud SQL instances and boom it was up and running!

On top of this, a newly released 🅱eta feature from Google Cloud Platform lets you set up serverless Network Endpoint Groups (NEGs). We could then pop a load balancer in front of our App Engine instances and cache static content to make our website even faster!⏩
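The NEG wiring looks roughly like this. Resource names and the CDN flag are assumptions for illustration, not our exact config:

```shell
# Sketch: create a serverless NEG pointing at the App Engine app,
# then attach it to a global backend service with Cloud CDN enabled.
gcloud compute network-endpoint-groups create ctfd-neg \
    --region=australia-southeast1 \
    --network-endpoint-type=serverless \
    --app-engine-app

gcloud compute backend-services create ctfd-backend \
    --global --enable-cdn

gcloud compute backend-services add-backend ctfd-backend \
    --global \
    --network-endpoint-group=ctfd-neg \
    --network-endpoint-group-region=australia-southeast1
```

The backend service then hangs off a global HTTPS load balancer's URL map as usual.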

Bottom one is faster, if my Picasso art doesn’t show that

Okay, now this is epic: we have set up the CTFd infrastructure that will host our CTF. The full config and setup of what we used can be found in our public repo here:

Provisioning Cloud Infrastructure for Cloud Challenges

This part was quite easy as I personally wrote these challenges and was in charge of getting them set up. We had the cloud challenges built out with Terraform scripts, meaning that we had Infrastructure as Code. So all we had to do was run the Terraform scripts to provision everything in order, and it was all pretty straightforward (this was possibly the only thing that worked out of the box when setting up this CTF 😂)

  • Scalable? Yep
  • Stable? Cloud is pretty stable yea
  • Cost Effective? yeah kinda
  • Secure? More on this later 😢

Example Challenge with terraform scripts can be found here:

Kubernetes Challenge Cluster

Setting up the challenges was the most challenging part of the CTF infrastructure (ironic). We had quite a variety of challenges with different technical requirements to make them both exploitable + secure 👀.

So starting with our challenges: all of the challenges that needed to be hosted were dockerised, and all of the challenge authors included a docker-compose.yml file so that you can get them set up in your own environment. Neato, thanks authors :)

For the pwn challenges, we had our own custom 🅱ase image of nsjail, which allowed the pwn challenges to isolate each individual connection. Nice! We 🅱uilt these challenges and pushed the images to Google Container Registry for storage ☁

So we started the deployment of the pwn challenges first, as they were usually pretty straightforward and similar. We used this article to help us deploy the first couple of challenges. Firstly, create a challenge cluster with 3 nodes (VMs in the cluster) like dis:

gcloud container clusters create ctf-cluster --zone australia-southeast1-b --machine-type e2-standard-2 --num-nodes 3 --tags challenges


Okay cool, we have some 🅱Ms in our cluster now; next we need to deploy the challenges. Now what the heck is a Kubernetes, how do you use it, and how can I 🅱uild a Secure platform if I have no idea how it works? Great question, and you will see how that turns out.

paint.exe is a gift from Vincent van Gogh himself

So after plenty of Google fu and research (this is starting to seem like a CTF challenge itself, except there isn’t a flag at the end 😭), I worked out the following:

  • A Deployment is a type of workload, which is the actual thing that runs your challenge. It can contain 1 or more images, where each instance of the challenge is called a replica and is run in a pod
  • A Service is a way to expose this workload so other things can access it

Okay cool, straightforward enough. So here is the deployment.yaml for a challenge, containing both the Deployment and the Service, which allows connections in on port 30002. This was for the shellthis challenge.
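A minimal sketch of what such a manifest looks like. The image path, labels, and container port are assumptions; only the NodePort 30002 comes from the challenge described above:

```yaml
# Sketch: 3 replicas of the challenge, exposed on a fixed port of every node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shellthis
spec:
  replicas: 3
  selector:
    matchLabels:
      app: shellthis
  template:
    metadata:
      labels:
        app: shellthis
    spec:
      containers:
        - name: shellthis
          image: gcr.io/downunderctf/shellthis:latest  # pushed to GCR earlier
          ports:
            - containerPort: 1337
---
apiVersion: v1
kind: Service
metadata:
  name: shellthis
spec:
  type: NodePort        # expose the workload on every node
  selector:
    app: shellthis
  ports:
    - port: 1337
      targetPort: 1337
      nodePort: 30002   # the port competitors connect to
```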

Then a simple

kubectl apply -f deployment.yaml

Will deploy the challenges to the cluster and badabing, it's working. With this setup we have 3 pods running the same challenge, spread across our nodes for redundancy. Cool! We followed this same pattern for all the other challenges with minor tweaks, e.g. some web challenges required a SQL database backend so another image was required, others required a seccomp profile, but it was more or less very similar.

Now how do you actually connect to these challenges? Right now with our setup we have 3 nodes with 3 IP addresses that may or may not be hosting a replica of each pod of each challenge. It’s time to bring in the load balancing. Now, in GKE it is possible to give each individual workload its own Load Balancer, and it will all be automatically set up for you (this is done by changing the Service type to LoadBalancer instead of NodePort); however, there were two main caveats with this approach.

  1. Having 1 load balancer per challenge is extremely costly, and we don’t need to have 20 load balancers running
  2. We wanted the entry point for all challenges to be on the same IP address and same domain, in our case chal.duc.tf, with the port determining which challenge you are trying to connect to, and route that way

So we looked into other solutions

HAProxy Load Balancer

Enter HAProxy, which will act as a reverse proxy in our setup, routing requests and load balancing them.

So this is what we designed: 1 central VM that will be the intake of all requests to all challenges and will load balance the requests across the nodes. Okay, I KNOW what you're thinking,

But Sam, doesn’t that just introduce a single point of failure? Doesn’t that mean if the HAProxy server goes down none of the challenges are accessible?

Still waiting for a design company to hire me for my amazing skills

And the answer to your question is YES, this does in fact introduce a single point of failure. But the risk here is extremely low. You are almost always going to have problems with challenge infra (competitors firing exploit scripts that just 🅱low up the containers are definitely a possibility), but breaking the HAProxy, when all it does is pass on requests to each node in a round-robin fashion? Very low risk.

So again, we used the template given to us in this article, which we learnt so much from, to get our infra up and running. Please check it out if you want implementation details. Once it was up and running, everything was great and all was good in the world. Right? Right????? PLEASEEEE

Nope🚫

At this point the challenges were accessible but we were running into two main problems:

  1. No HTTPS on the web challenges, gotta have that padlock
  2. Stateful challenges (especially web challenges) would break, since each time you sent a request to HAProxy it would send you to a different replica pod of the challenge. So, for example, if you registered an account on a web challenge, it would only exist on 1 of the 3 replicas. 😱

At this point we were on our own. Like, oh shid, there aren't any articles on how to do this, other implementations are way too complex, and our CTF is in 5 days.

“Oh shid “— Sam Calamos 14th September 2020

The first issue, HTTPS, was actually relatively easy to solve: there is an article from HAProxy on how to get TLS up and running and how to terminate TLS at the load balancer and then forward on requests to the nodes. We just had to add an X-Forwarded-Proto header to the request so that our web challenges knew to redirect to HTTPS rather than just HTTP, and we had the padlock 🔏. 1 Victory Royale here
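Roughly, the TLS-terminating frontend looked something like this. The certificate path and backend name are assumptions for illustration:

```haproxy
# Sketch: terminate TLS at HAProxy, then tell the app the original
# request arrived over HTTPS so it generates https:// redirects
frontend web_https
    bind *:443 ssl crt /etc/haproxy/certs/chal.duc.tf.pem
    http-request set-header X-Forwarded-Proto https
    default_backend web_challenge
```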

Now for number 2, you could just have 1 replica of the challenge running, but hey, that's not SCALABLE nor STABLE nor SAFE nor any good at all 😢. A second complication is that there are actually two levels of load balancing going on: first at the proxy level, with HAProxy distributing requests in a round-robin fashion, AND Kubernetes distributing requests to different replica pods depending on whether they were overloaded or not.

My single brain cell trying to work this out

Solving the HAProxy balancing issue actually wasn’t that huge of a deal: we first set up sticky sessions 🅱ased on IP. This essentially sends connections from the same IP to the same node every time. Below is a setup of how we did that, with an example web challenge, circlespace.
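A minimal sketch of that sticky-session config. The node IPs and port are made-up placeholders:

```haproxy
# Sketch: hash the source IP so the same competitor
# always lands on the same node
frontend circlespace
    bind *:30003
    default_backend circlespace_nodes

backend circlespace_nodes
    balance source          # sticky: same source IP -> same node
    server node1 10.0.0.2:30003 check
    server node2 10.0.0.3:30003 check
    server node3 10.0.0.4:30003 check
```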

However, even though requests from the same IP would go to the same node, that doesn’t mean they will always go to the same pod, since Kubernetes can load 🅱alance itself and can have more than 1 of the same pod on the same node.

Then someone from the infra team came up with a genius idea: “Why don’t we make the web challenges a DaemonSet👿?” I was super excited and like, hell yeah! But also, wtf is a DaemonSet? I still didn’t know the big kube very well. Turns out a DaemonSet is a type of workload that deploys your images with exactly 1 pod per node, ALWAYS. Meaning that if you scale up your nodes then your pods will scale up by exactly the same amount!
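Converting a challenge to a DaemonSet is mostly a matter of swapping the workload kind and dropping the replica count, since the replica count is now "one per node". A sketch, with image path and labels assumed:

```yaml
# Sketch: a DaemonSet runs exactly one pod of this challenge on every node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: circlespace
spec:
  selector:
    matchLabels:
      app: circlespace
  template:
    metadata:
      labels:
        app: circlespace
    spec:
      containers:
        - name: circlespace
          image: gcr.io/downunderctf/circlespace:latest
          ports:
            - containerPort: 8080
```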

Very Cool!

It was almost perfect! Our requests were all routed to the same node and we weren’t getting any weird stateful replica issues.

Until

We did a

for i in {1..1000}; do curl <challenge_endpoint>; done

This will make 1000 requests to our endpoint, and what we noticed was that after some time we would be routed to another pod, on another node?!?!?!??!??!?!??!?!?

🅱 R U H

Turns out Kubernetes still loves to reroute requests if a pod is overloaded (I mean, good on u, but also, u suck). So again a bit of Google fu found that we can set externalTrafficPolicy: Local on the Service, which forces the request to be handled by a pod on the node that received it.
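On the Service, that's a one-line change. A sketch (names and ports assumed, matching the circlespace example):

```yaml
# Sketch: keep traffic on the node that received it, instead of
# letting kube-proxy bounce it to a pod on another node
apiVersion: v1
kind: Service
metadata:
  name: circlespace
spec:
  type: NodePort
  externalTrafficPolicy: Local   # only route to pods on the receiving node
  selector:
    app: circlespace
  ports:
    - port: 8080
      nodePort: 30003
```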

And with that, our Kubernetes X load balancer setup was done!

Ahh, but with all successes comes another challenge. Ah yes, here we go. We are now 2 days out until the CTF begins.

CPUs Don’t grow on trees smh 👿

Now, as a fresh young cloud organisation such as downunderctf.com running their CTF off their free trial credits, Google Cloud isn’t exactly that forthcoming with the CPUs they allow you to use. In fact, on a trial account they only allow you a quota of 8 CPUs.

So some basic maths of the infrastructure we had set up so far:

  • 3 * 2 CPUs for the Kubernetes Node cluster
  • 1 CPU for the HAProxy load balancer
  • 1 CPU for the App Engine Flex instance

carry the 4, divide by 6 … 6 + 1 + 1 = 8. Oh shid, we had already hit our limit of CPUs and we hadn't even scaled up yet, and this is what our sign-up stats looked like for the competition.

We’re gonna need more power. I don’t know how fast I hit the Upgrade button to become a paid account, but it was dang fast.

So as soon as I upgraded to a paid account, I thought the quota would be lifted. Nope 🚫

Now, luckily enough, I happen to work for Google! So I internally contacted about 7 people from offices in Taiwan, Kuala Lumpur, Dublin, Indonesia, Singapore and Australia to get this Quota Increase Request fast-tracked.

Now this was all jolly and all, and it took around 16 hours to get this increase request fulfilled. So kudos to the Google Cloud sales team, you’re a life saver.

However, we were now T-9 hours until our CTF started, holyheckers.jpg. Upgrade upgrade upgrade upgrade: we pumped our node pool up to 5 nodes, our App Engine instance was pumped up, and we had up to 20 replicas when needed. And we were ready to go!

Chapter 2: Secure Infrastructure was not secure

It hit 7pm AEST on that warm cosy Friday afternoon and the competitors were off. Not enough testing on some of the challenges led to some redeploys, and the Firebase challenges ran out of quota, requiring another UPGRADE (insert upgrade.png pic here). But other than that, the start of the CTF was quite uneventful!

It was all good until we received this message from one of our event organisers the following day after a very long night of watching the CTF infrastructure very closely.

You ever read a message and go, EXCUSE ME? LIKE WHAT. And then we received this Twitter DM:

Helloo folks. I exploited the ‘addition’ challenge and was able to read /var/run/secrets/kubernetes.io/serviceaccount/token using the following payload

<snip></snip>

Have you checked to see if your infrastructure is vulnerable via this attack playbook? If so, your flagz and other things may be at risk.
https://hackernoon.com/capturing-all-the-flags-in-bsidessf-ctf-by-pwning-our-infrastructure-3570b99b4dd0

Running on 4 hours of sleep, barely able to comprehend my own implementation of Kubernetes, this was like 🅱ruh. What is about to happen? Are we about to get rekt? Is our kube implementation going to explode? This person thankfully came forward and disclosed this, but what if someone else hasn't, and is exploiting the CTF right now? ❗❗❗❗

I swear I heard this noise when I read that message

So, with a quick bit of Google fu and about 1/4 of a brain cell left functioning, we found that thankfully GKE implements RBAC by default on Kubernetes service accounts, and upon checking the permissions of these service accounts, they could do NOTHIN

Service account couldn’t do anything

BIG

SIGH

OF

RELIEF

😌😌😌

We still removed the ability to get the token, since we didn’t fully understand the implications of what anyone could do with it, but we were sure it wasn’t ‘that’ bad. We did this by putting this in our deployment.yaml files:

enableServiceLinks: false      
automountServiceAccountToken: false

and that was that.
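For clarity, those two fields sit on the pod spec (spec.template.spec) of each challenge's workload. A sketch with names assumed:

```yaml
# Sketch: disabling service links and the service account token mount
# on a challenge Deployment's pod spec
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shellthis
spec:
  # ... selector etc. as before ...
  template:
    spec:
      enableServiceLinks: false            # don't inject service env vars
      automountServiceAccountToken: false  # don't mount the SA token in pods
      containers:
        - name: shellthis
          image: gcr.io/downunderctf/shellthis:latest
```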

We also had an exploit in one of our Cloud challenges that could have allowed competitors to completely take over the GCP project it was running in, but hey, that was discovered by me after chatting with the solver and very quickly patched up! Details on how they did it are in their writeup here.

Closing Thoughts

You want to learn how to build a resilient, load-bearing, secure, stable infrastructure stack? 🅱uild a CTF! I honestly learnt so much putting this all together and can’t thank the DownUnderCTF team enough for supporting each other in running such an amazing event. I learnt a huge amount and can’t wait for next year's event 😍😍!

If you do have any thoughts on how we could improve our setup, please let me know in the comments. I would love to have a chat ☕.