How to run a CTF that survives the first 5 minutes
Ah yes, another year, another DownUnderCTF, and the sequel is always better, right? Well, for this year we wanted bigger and better, and for the Infra team this meant creating a better infrastructure platform for our amazing CTF authors and players. We wanted the sequel to go like Shrek 2 did, not Shrek 3 or, god forbid, Shrek 4.
Now at this point you’re probably thinking wtf am I reading, I want TECHNICAL DETAILS. Well, as per last year, this is more of a recount than a tutorial, but it contains a lot of insights if you’re curious about running a CTF on the cloud, and a whole lot of gotchas! There are also a lot of technical words here: spooky clusters, node pools, and load balancers 👻
This post acts as a sequel to my post-event infra writeup from last year.
You can check that out here.
THE INFRASTRUCTURE CHALLENGE
My job as a part of the Infrastructure team of DownUnderCTF 2021 was to create a secure 🔐, stable 🐎, and scalable 🦎 platform for all 69 (haha that’s the funny number!) challenges so that the players could hack the challenges and NOT us (flashbacks2goonies). As per last year, we had 3 main areas of infra to focus on:
- CTF Hosting Platform (CTFd)
- Challenge Servers
- Cloud Based Challenges
Given last year’s setup, we had a few points of improvement that we wanted to address. To give you a quick rundown of what we had previously:
- The CTF Platform CTFd was hosted on App Engine (epic)
- The challenges were hosted on a 3-node Kubernetes Cluster, with each challenge created as a Daemon👿Set so we would have 3x replicas, routed by HAProxy with IP pinning enabled so we could kinda have stateful challenges🐵.
- Cloud Challenges hosted on Cloud (wow nice, could’ve guessed that one✅)
For the most part, running CTFd on App Engine worked perfectly, scaling up as we got load, especially at the start of the CTF, and scaling down when there was less traffic. This did not change THAT much. However, the way that we ran our challenges changed a lot this year.
There were a few issues that the previous setup had.
Firstly, if a player was able to break a challenge with something like rm -rf /*, it would break the challenge for everyone😢, which put a restriction on the types of challenges we could create.
Next, a 3-node cluster worked great; however, it wasn’t truly scalable, and growing it would require manual intervention in the HAProxy routing config. Meaning if we wanted to scale the number of nodes up or down, we would have to do a lot of manual configuration.
IP pinning to nodes caused problems for players connecting over VPNs or via a shared public network, such as a university or public library, since we had introduced rate limiting per IP.
Finally, Cloud Challenges were not secured🔓correctly and we almost had a total project takeover; thankfully someone reported it. This was remedied this year!
Now to be fair, a lot of these issues came purely from a lack of experience and knowledge around Kubernetes, and some basic oversights about running a CTF (this was our first one, so give us a break!).
So how did we address those issues? Well, keep reading and I will tell you :)
TRULY SCALABLE CTF CHALLENGE INFRASTRUCTURE
⚡TIME TO GET TECHNICAL⚡
We knew from the outset that our HAProxy setup had to go👋👋; it was not feasible to continue. It’s not you, it’s us :/. Last year we had 3000 players; this year we were forecasting☀3000–5000 players, and so we needed to build a platform that would allow us to scale to 10,000 or even 100,000 players!
To do this we re-jigged🧩how our challenges were hosted and how players connected to them.
First, we created each hosted challenge as a K8s Deployment this time, instead of a Daemon👿Set, with 2 replicas of each challenge. If the challenge required a stateful datastore component, we would create a separate Deployment with a single replica that hosted this data. The two pods were then able to interact with each other and ONLY each other, thanks to some fancy K8s egress config (see the sketch below).
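For the curious, here is a minimal sketch of what that egress lockdown can look like; the names, labels and port are placeholders rather than our exact manifests, and in practice you would usually also need to allow DNS egress to kube-dns.

```yaml
# Hypothetical NetworkPolicy: the challenge pods may only talk to their
# own datastore pod (names, labels and port are illustrative).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-challenge-egress
  namespace: ctf-challenges
spec:
  podSelector:
    matchLabels:
      app: example-challenge            # the stateless challenge replicas
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: example-challenge-db # the single-replica datastore pod
      ports:
        - protocol: TCP
          port: 5432                    # e.g. a Postgres datastore
```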
Then, for each challenge’s Service, we used a ClusterIP for HTTP challenges and a NodePort for everything else.
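As a rough illustration (placeholder names and ports, not our actual manifests), the two Service shapes look something like this:

```yaml
# HTTP challenge: a plain ClusterIP Service, reached via the Traefik ingress.
apiVersion: v1
kind: Service
metadata:
  name: example-web-challenge
  namespace: ctf-challenges
spec:
  type: ClusterIP
  selector:
    app: example-web-challenge
  ports:
    - port: 80
      targetPort: 3000
---
# Everything else (pwn, rev, etc.): a NodePort Service, reached via the L4 TCP load balancer.
apiVersion: v1
kind: Service
metadata:
  name: example-pwn-challenge
  namespace: ctf-challenges
spec:
  type: NodePort
  selector:
    app: example-pwn-challenge
  ports:
    - port: 1337
      targetPort: 1337
      nodePort: 30007
```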
For example, here is roughly what the Kube Deployment setup looked like for a web challenge like our Farsight challenge.
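The full manifest isn’t reproduced here, but a stripped-down sketch of the same shape looks roughly like this (placeholder names, image and resource numbers, not the real Farsight manifest); the automountServiceAccountToken: false line is the part that stops a compromised pod from using the deployment’s Kubernetes service account.

```yaml
# Stripped-down sketch of a web challenge Deployment (placeholder names,
# image and resource numbers).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web-challenge
  namespace: ctf-challenges
  labels:
    role: chal
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-web-challenge
  template:
    metadata:
      labels:
        app: example-web-challenge
        role: chal
    spec:
      # Stops a compromised pod from talking to the Kubernetes API as the
      # deployment's service account.
      automountServiceAccountToken: false
      containers:
        - name: challenge
          image: gcr.io/example-project/example-web-challenge:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
```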
Please note carefully, future CTF organisers📒:
This is super important and stops players from taking over the Kubernetes service account for that deployment. See the Whale Blog🐋challenge to see what happens when you don’t do this.
Routing to Challenges:
So now we have our challenges deployed, how do our players connect to them?
We had to use two approaches here: one for HTTPS challenges, and one for everything else (pwn, rev, crypto etc.) that doesn’t go over a TLS connection. This was done using two Load Balancers.
Starting with the HTTPS challenges, we essentially followed this guide here https://cloud.google.com/community/tutorials/nginx-ingress-gke with a few modifications. In essence, we put a TRAEFIK entry point behind an L4 Load Balancer, which routes according to the IngressRoutes configured per challenge. We did this via a hostname match on each challenge’s IngressRoute, which you can see in the sketch below. We also added in cert-manager to certify our certificates certifiably 🎀 and give us that sweet 🔒 icon in your browser.
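Here’s a hedged sketch of such an IngressRoute (the hostname, service name and TLS secret are placeholders, not one of our real routes):

```yaml
# Hypothetical Traefik IngressRoute for a web challenge: Traefik matches the
# Host header and forwards to the challenge's ClusterIP Service; the TLS
# secret is issued by cert-manager.
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: example-web-challenge
  namespace: ctf-challenges
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`example-web-challenge.challenges.example.com`)
      kind: Rule
      services:
        - name: example-web-challenge
          port: 80
  tls:
    secretName: example-web-challenge-tls
```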
For the non-HTTPS challenges we had a similar setup to last year, whereby an L4 TCP Load Balancer routed to deployments based on the port exposed. These challenges were built off the base nsjail image, which spawns a new process for each new connection, completely isolated from every other competitor, neato!
Time to Scale!
Now that we had all our challenges in place, we had an awesome base framework to scale from. If a challenge needed to scale up because its existing replicas were overloaded, a new one was just created automatically with autoscaling enabled. We set this up with the following command:
kubectl --namespace ctf-challenges autoscale deployment --min 2 --max 5 --cpu-percent=100 $(kubectl --namespace ctf-challenges -l 'role=chal' get deployment --no-headers -o custom-columns=":metadata.name")
This will create a minimum of 2 replicas per challenge, and if the replicas hit 100% of their CPU allocation, a new one will be spun up, up to a max of 5.
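For reference, this is roughly the HorizontalPodAutoscaler object that the command above creates for each deployment (the deployment name is illustrative):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example-web-challenge
  namespace: ctf-challenges
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-web-challenge
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 100
```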
But hol up, you can’t just scale to infinity???? You have resource constraints on the nodes you are running the deployments on. Too much scale = node ☠ded☠.
Well, this is actually really easy to overcome with ⭐Node Pool Autoscaling⭐. For this CTF cluster we had 4 Node Pools:
- Isolated Preemptible Node Pool
- Isolated Standard Node Pool
- Preemptible Node Pool
- Standard Node Pool
Now that’s a lot of pools, but not for swimming🏊♀️.
We had 2 pools for isolated challenges (more to come about this later) and 2 pools for the regular challenges. We scheduled all our stateless pods on the preemptible Node Pools to save money; if a node was taken back by our Google Cloud overlords, we could just spin the pods back up on another node, no problem. The stateful pods that stored data (DBs, Redis, LDAP) were given a node affinity for the non-preemptible pools so they were less likely to experience interruptions (see the sketch below).
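Roughly speaking, the scheduling rules boil down to pod-spec fragments like the ones below, keyed off GKE’s preemptible node label (these are illustrative fragments, not our exact manifests):

```yaml
# Stateless challenge pods: prefer the cheap preemptible pools.
# (GKE labels preemptible nodes with cloud.google.com/gke-preemptible: "true".)
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: cloud.google.com/gke-preemptible
              operator: In
              values: ["true"]
---
# Stateful pods (DBs, Redis, LDAP): require a non-preemptible (standard) node.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-preemptible
              operator: DoesNotExist
```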
So with this all in place, if we needed to scale the challenges horizontally, they would automatically scale, and boom, STABILITY. And (sorry English teachers, I know I know, I’m not meant to start a sentence with a conjunction) the best part was that if the nodes couldn’t handle the extra load, another one was spun up when required. The opposite was true as well: if there was not much load, the nodes would scale down, saving money 💰 and resources. Pure stonkage.
So with Node Pool autoscaling and App Engine autoscaling both enabled, this was truly scalable CTF infrastructure.
The only gotcha here is the CTF database. Our CTFd database was a CloudSQL instance, which does not support autoscaling of resources (it does for storage only). This means that your database should be scaled vertically enough to handle your expected load, as updating database resources creates a few minutes of downtime.
Now, you may be familiar with hacking platforms such as Hack The Box, TryHackMe or other CTF events where you have something like this: essentially, you can start the challenge on-demand and you get a private instance of the challenge, that only you can f*** up and only have yourself to blame if the instance dies (ideally).
We wanted to replicate this so that our authors could create more unique challenges that we were okay with players messing with, and so players could re-create the environment on-demand.
The problem with this is that there are no open-source solutions that implement this kind of functionality. What does this mean for the Infra team?
Get 2 Work BOI.
Here was the solution we came up with.
The Challenge Manager
We created a custom challenge deployer, called the Challenge Manager, which was in charge of deploying challenges on demand on a PER-TEAM basis. This was hosted on the K8s Cluster in a separate namespace from the main challenges and would deploy the challenges as they were needed.
So how do you get a Pod that is running on a cluster to deploy other Pods on that same cluster? We created a dedicated GCP Service Account for the Challenge Manager to run as, configured with the needed RBAC controls (RAS Syndrome Alert🤖) so that it could deploy and destroy challenges on demand.
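To give you an idea of the kind of permissions involved (this is a hypothetical sketch with placeholder names and namespaces, not the exact kube-ctf RBAC), the Role and RoleBinding boil down to something like:

```yaml
# Allow the challenge manager's service account to create, query and delete
# challenge resources in the isolated-challenge namespace, and nothing else.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: challenge-manager
  namespace: isolated-challenges
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: ["traefik.containo.us"]
    resources: ["ingressroutes"]
    verbs: ["get", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: challenge-manager
  namespace: isolated-challenges
subjects:
  - kind: ServiceAccount
    name: challenge-manager
    namespace: challenge-manager
roleRef:
  kind: Role
  name: challenge-manager
  apiGroup: rbac.authorization.k8s.io
```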
So how does it actually work then?
The Challenge Manager has an API endpoint listening for 3 main requests:
- GET — Finds the relevant deployment of a challenge for a team and returns the details
- CREATE — Creates a new Deployment for a team
- DELETE — uhhhhhhhhhhhhh, surely you can get this 😉
Each call carries authentication passed from our CTFd plugin (more 2 come), which tells the Challenge Manager which team is interacting with the API.
For creation, it does the usual checks, like whether a deployment already exists and all that good stuff, and when a deployment is good to go we pull a template deployment.yml file from Datastore. This template looks very similar to the deployment.yml above, but with a few variables that we dynamically generate based on the team ID and the TTL of the challenge.
This file is then rendered and deployed and the details are passed back to the player through CTFd.
A basic diagram of how this works is below
The final piece of the puzzle for this setup is how to set up challenge expiry. This was done using Kube Janitor. This handy service allows you to tag a deployment with a TTL, and when that time is up the JANITOR comes and takes down the deployment.
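Concretely, the rendered per-team deployment just carries a TTL annotation that Kube Janitor honours; the team-specific name and the TTL value here are illustrative, filled in from the template variables mentioned above:

```yaml
# Fragment of a rendered per-team deployment: kube-janitor deletes the
# resource once the janitor/ttl annotation expires.
metadata:
  name: example-isolated-challenge-team-1234
  namespace: isolated-challenges
  annotations:
    janitor/ttl: "30m"
```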
Then all we needed to do was integrate this into CTFd. We did this by writing a custom plugin for CTFd which did 2 things: firstly, it created a new challenge type called kubectf which had the properties needed to create a new isolated challenge; secondly, it handled the requests to GET/CREATE/DELETE challenge instances.
This all resulted in something that looks like this to the end user:
Each team had a unique host sub-domain and then we used our Ingress Routes like before to route to it. Pretty cool hey!
So now it was time to load test this bad boy. Last year we had around 1000 teams sign up, and this year we were expecting more. This meant that, as a worst possible case scenario, we would have 2000 isolated challenge instances running. So we ran a load test to see how our setup would handle it, spinning up 500 instances of the Zap challenge, and it worked really well! We noticed that the GKE Control Plane API started to slow down quite a bit, so we implemented caching of requests to hopefully reduce the number of requests going to GKE.
So then the big finale was seeing how this would go in prod during the real CTF!
This worked REALLY well and we didn’t have any reported issues using this setup, which is super satisfying. We did only put one challenge up using this method, just to test whether it worked and whether we could scale it up later. We plan to do this more in next year’s CTF with more testing and rigour :). The main thing to look out for here is resource usage, since each Pod will reserve its own amount of compute resources, so calculating🧮that will be important for next year.
SO INFRA WAS PERFECT?
When 7:00 PM rolled around and the huge load of requests came in, it caused our infra to scale up massively, spooking the infra team as we waited to see if our setup would hold😲. We auto-scaled from 2 -> 8 replicas of CTFd on App Engine and our Node Pools scaled from 2 -> 3 nodes. We had a bit of a slowdown while it scaled up, but NO downtime! Contrary to what my favourite meme created about the CTF may lead you to believe 😏
We did have one main issue throughout the CTF: we had about a 1% 500-error rate on CTFd because of this pesky fellow:
Essentially, CTFd was not returning connections to the database back to the pool for reuse, and so new connections were being created up to the max amount allowed by CTFd, until it 🅱r0ke.
We “fixed” this through 3 techniques: increasing the number of workers on CTFd to 9 per replica, reducing the SQLAlchemy connection recycle time to 10 minutes, and finally, running a gcloud app deploy every couple of hours to drain all the connections and have the instances create new ones.
We think this is a problem with CTFd, as we have heard of other CTFs suffering from the same problem; however, we could not find the root cause of it.
We also pushed to prod while implementing some new features, which broke parts of the site, but that was easily fixed using App Engine rollbacks!
Some challenges also had some issues, but the underlying Kubernetes Cluster setup had 0 issues throughout the event. The following is a graph of each node’s usage throughout the entire event, and they were all pretty heavily utilized.
For next year we again want DownUnderCTF to be 🅱igger and 🅱etter and we aim to be the most secure, stable and scalable CTF available to players.
To improve next year we want to:
- Create an Automated CI/CD Pipeline for challenges and CTFd (no more testing in Prod)
- Have a test instance of CTFd running to verify any changes we make before we push them to prod.
- Increase DDoS and Dirbuster protection; we had some players scanning the infrastructure who were promptly banned, but we would like to automate this.
- Have automated solve-script health checks on each challenge which run periodically, so we can verify each challenge is working and be alerted if it isn’t
- Consider alternatives for the CTF platform other than CTFd like rCTF.
We think that these improvements will allow us to scale further and grow to become a staple in the yearly CTF calendar.
Running a CTF can be challenging and requires a lot of foresight when it comes to designing challenges, infrastructure and general management. I learnt so much this year about Kubernetes and the Cloud, building the infrastructure alongside the AMAZING and (I can’t stress this enough) TALENTED team we have behind DUCTF.
The DUCTF Infra Team🔌:
- Tom Nguyen→ https://tomn.me/
- Jordan Bertasso → https://d3lta.dev/
- Sam Calamos → https://samcalamos.com.au/
- Emily Trau → @emilyposting_
Also, a note to our Sponsors: we literally could not have run this event without you ❤❤❤
Even though the event just ended, I am already thinking about some cheeky challenges and setups for next year. If you would like to read an Author’s perspective of running the CTF I would highly recommend you check out Joseph’s reflection here.
If you’re interested in checking out the Source Code for the Infra setup this year we have open sourced it here:
- https://github.com/DownUnderCTF/kube-ctf — Challenge Manager
- https://github.com/DownUnderCTF/ctfd-kubectf-plugin — CTFd Plugin
If you have comments, or questions please feel free to reach out to me! I would love to see what you think about our setup and if you have any improvements.
See you next year!