Infrastructure and Load statistics for csictf 2020

Rishit Bansal
Published in csictf
Aug 1, 2020

Statistics/infra details from a real CTF to efficiently plan infra for your CTF!

About csictf 2020

csictf 2020 was a 4-day CTF, held from 18 July 2020 to 22 July 2020, and hosted over 1400 teams! Since it was a CTF targeted towards newcomers, we anticipated high traffic and knew it would be crucial to structure our infra correctly.

The main goals while planning infrastructure for the CTF were:

a) 100% (or close to :) ) uptime for both the CTF platform and all the challenges.

b) efficiently size cloud machines to minimize cloud credits spent.

c) ensure that the infra is scalable and can be resized easily and quickly based on load.

d) ensure that the response time for challenges is always acceptable, irrespective of the user’s geographical region or the load on the server.

This article will give a brief overview of every component of our infra, along with some statistics we collected after the CTF.

A brief overview of the infra

TLDR, here is the whole infra summarized in a diagram

Yes, I spent a lot of time making this diagram

The whole system can be separated into two main parts:

  1. The CTF Platform (CTFd)
  2. The Challenge Cluster, hosting each of the CTF challenges

Below, we cover each of these parts in more detail, with statistics from our CTF, to give you a reference point for your own CTF.

Lastly, we go over the budget planning involved for the infra.

The CTF Platform

Note: This article mainly deals with statistics from our CTF for the instance running CTFd. If you want to know how to set up CTFd, refer to this article in our series!

We decided to host the CTFd application and its database on the same instance to simplify our deployment. You could instead, for example, link the CTFd instance to a Google Cloud SQL instance, which would greatly reduce the disk I/O and CPU load on the VM hosting the platform.
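As a rough sketch of what that could look like (assuming you use CTFd's official Docker image; the IP, password, and worker count below are placeholders), you would point CTFd's DATABASE_URL at the Cloud SQL instance instead of a local database:

```bash
# Hedged sketch: run CTFd against an external Cloud SQL (MySQL) database.
# The connection string and private IP are placeholders; WORKERS sets the
# number of gunicorn workers (we ran 10, see the note below).
docker run -d -p 8000:8000 \
  -e DATABASE_URL="mysql+pymysql://ctfd:<db-password>@<cloud-sql-private-ip>/ctfd" \
  -e WORKERS=10 \
  ctfd/ctfd
```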

The instance running CTFd was initially an E2* series 2vCPU 8GB RAM machine, but on the day of the CTF we scaled it up to an E2 series 4vCPU 16GB RAM machine. It stayed at that size until the end of the CTF, after which we scaled it down to a 1vCPU 4GB RAM machine.

Here are some graphs showing the load on this machine throughout the CTF:

Note: We were running 10 gunicorn workers of CTFd throughout the CTF on this machine

*We recommend choosing an E2 series machine (over the default N2 type) if you’re using GCP; it should greatly cut down your costs!

CPU Usage

CPU Usage of CTFd instance during csictf

The peaks in CPU usage on the 20th and the 21st correspond to times when we released a new wave of challenges. Regardless, we can see that CPU usage peaked at ~60%, so in retrospect, if you want to cut corners further, a 3vCPU machine or a 2vCPU machine (with the database hosted on Cloud SQL) could also be sufficient.

RAM Usage

RAM Usage of CTFd instance during csictf

The RAM usage shows a roughly linearly increasing graph; we suspect this is the Redis database used by CTFd slowly caching more and more data in memory over time. A 16GB RAM machine is recommended, but if your CTF is not 4 days long, you may be able to reduce this to 12GB or 8GB, depending on the duration. Going any lower than that is not recommended, because, as we can see, the base RAM usage when the CTF starts is ~4–5 GB.

There are more network-related statistics available on the GCP dashboard, like the number of bytes/packets transferred, but I don’t think they yield any useful information. If you’re interested, here is one of them anyway:

Total incoming/outgoing network packets avg over 5 min intervals during the CTF

HTTP Statistics from Cloudflare

All traffic to the CTFd instance was proxied through Cloudflare, so its analytics can be used to gain additional insight into the amount of traffic to expect.

HTTP Requests statistics from Cloudflare during csictf 2020

Traffic per region

Before the CTF, we looked at user registrations and anticipated maximum load from the USA and India, so we positioned our instance in Europe (London), roughly in the middle, to minimize latency for users. Geographical traffic depends on factors specific to our CTF (such as publicity, outreach, etc.) and is subject to change, but here is the breakdown of traffic per region we received on Cloudflare; it can help you plan which region(s) to host your instance(s) in:

Geographic traffic distribution through Cloudflare for csictf 2020

A note on setting up Cloudflare correctly

We recommend using Cloudflare for your CTF, mainly because of its DDoS protection features, error logging, and most importantly, caching of static resources. Cloudflare can greatly reduce the amount of traffic hitting your site by caching requests. Stats from our CTF show that 52% of requests were cached by CF! That’s more than half of the traffic served from cache, cutting the network load on the VM instance roughly in half.

We won’t go over the process of linking your domain to CF, as CF itself provides a pretty nice step-by-step tutorial for pointing your NS records at CF’s nameservers. But here are some other things you should double-check to confirm CF is configured correctly:

  1. Ensure that proxy status is enabled on the CF Panel

The DNS A record you set up on the CF panel should be set to proxy requests through Cloudflare.

Ensure that the A record says “Proxied”

You can verify that requests are being proxied through Cloudflare by checking the HTTP response headers when you visit your server:

Note the cf-* headers, those tell you the response came from a CF Proxy
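A quick way to check this from the command line (replace the domain with your own):

```bash
# If CF proxying is working, you should see headers like cf-ray and
# cf-cache-status in the response.
curl -sI https://your-ctf-domain.example | grep -i '^cf-'
```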

2. Set up your cloud provider’s firewall rules correctly to ensure that CF can’t be bypassed!

You may have noticed that the A record above reveals my server’s real IP! Well, it’s not as bad as you think: try visiting it in a browser, and you should find that the connection is refused. This is because we configured the firewall on GCP to only allow connections from hosts that fall within Cloudflare’s IP ranges:

GCP firewall rules for our CTFd instance

You can verify that CF can’t be bypassed by visiting your server’s real IP directly in a browser and confirming that the request is refused instead of reaching your server.
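For reference, here is a rough sketch of how such a rule could be created with the gcloud CLI (the rule name and the "ctfd" network tag are placeholders; Cloudflare publishes its current IP ranges at cloudflare.com/ips). You would also need to remove any broader default allow rules for ports 80/443 on that instance:

```bash
# Fetch Cloudflare's published IPv4 ranges and allow only those to reach
# the CTFd VM on ports 80/443. Names and tags below are illustrative.
CF_RANGES=$(curl -s https://www.cloudflare.com/ips-v4 | paste -sd, -)
gcloud compute firewall-rules create allow-cloudflare-only \
  --direction=INGRESS \
  --allow=tcp:80,tcp:443 \
  --source-ranges="$CF_RANGES" \
  --target-tags=ctfd
```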

The Challenge Cluster

Note: This article mainly deals with statistics from our CTF for the challenge cluster. If you want to know how to set up such a cluster, refer to this article in our series! You can also refer to the article on ctfup, our own CLI tool to automatically deploy challenges to a GKE cluster from a GitHub repo.

All challenges which had to be deployed on our server were containerized (docker) and deployed on a GKE (Google Kubernetes Engine) Cluster.

The cluster initially consisted of 3 (E2 series) nodes, each with 2vCPU + 8GB RAM, but after day 1 we realized the cluster was being underutilized, so we scaled it down to 2 nodes, and later back up to 3 nodes on day 3 when we released our second wave of challenges and simply needed more room to host more challenge instances.

In retrospect, you could instead consider running a greater number of much smaller nodes (1vCPU + 2GB RAM each) and scaling the cluster up/down during the CTF based on load. This should reduce costs further, as it gives you a finer level of control.
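Resizing a GKE node pool during the CTF is roughly a one-liner; something along these lines (the cluster name and zone are placeholders):

```bash
# Scale the challenge cluster up or down based on load.
gcloud container clusters resize ctf-cluster \
  --num-nodes=3 \
  --zone=europe-west2-a
```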

Out of 58 challenges in total, 28 needed to be deployed/hosted on the cluster.

All our challenge deployments (screenshot taken from scaled-down cluster AFTER the CTF)

Challenge Deployments and Replicas:

Each challenge was deployed as a k8s deployment on the cluster, and the name of the deployment was set to the name of the challenge. Before the CTF, we divided the challenges into two types:

  1. Low-risk challenges: challenges where RCE is not possible or is highly limited. The deployment for these challenges usually had 4 replica pods throughout the CTF.
  2. High-risk challenges: challenges where RCE is possible and in some cases could be used to take down the challenge itself. The deployment for these challenges usually had 7–8 replica pods throughout the CTF.

Resource Limits

For most pods, we used the same resource limits: 200MB of RAM and 0.1 vCPU. There were some exceptions; for example, XSS challenges that needed to run headless Chromium were assigned 0.4 vCPU + 500MB RAM.
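Putting the replica counts and resource limits together, a low-risk challenge deployment looked roughly like the sketch below (the challenge name, image, and port are placeholders, not our exact manifests):

```bash
# Hedged sketch of a low-risk challenge deployment: 4 replicas, each capped
# at 0.1 vCPU and ~200MB of RAM.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-challenge
spec:
  replicas: 4
  selector:
    matchLabels:
      app: example-challenge
  template:
    metadata:
      labels:
        app: example-challenge
    spec:
      containers:
        - name: example-challenge
          image: gcr.io/your-project/example-challenge:latest
          ports:
            - containerPort: 80
          resources:
            limits:
              cpu: 100m
              memory: 200Mi
EOF
```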

Exposing challenges on the k8s cluster

For each deployment, we assigned a unique NodePort service to expose the challenge on a port.

You could instead use LoadBalancer k8s services, but we found that assigning one load balancer per challenge would mean spending ~$50 in credits over the four days of the CTF on load balancing alone! So instead, we decided to expose NodePorts on the cluster and roll our own load balancer in front of the challenge cluster, rather than using the default GKE load balancer.
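For reference, a NodePort service for the hypothetical deployment sketched above would look something like this (30222 mirrors the example port used in the next section):

```bash
# Hedged sketch: expose the challenge on the same NodePort (30222) on every
# node in the cluster. Names and selector match the deployment sketch above.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: example-challenge
spec:
  type: NodePort
  selector:
    app: example-challenge
  ports:
    - port: 80
      targetPort: 80
      nodePort: 30222
EOF
```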

Enter HaProxy!

We provisioned a 2vCPU 8GB RAM (E2 series) instance running only HaProxy; its job was to route connections for each challenge to one of the nodes on the cluster, in a round-robin fashion.

For example, if you were to connect to http://chall.csivit.com:30222/ (one of our web challenges), the request would first hit our HaProxy instance, which would then route it to port 30222 on one of the nodes in the cluster.
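To give an idea of what this looks like in practice, here is a hedged sketch of a per-challenge HaProxy rule (the node IPs are placeholders for the cluster nodes’ internal IPs; this isn’t our exact config):

```bash
# Append a TCP, round-robin rule for one challenge port and reload HaProxy.
sudo tee -a /etc/haproxy/haproxy.cfg >/dev/null <<'EOF'

listen chall-30222
    bind *:30222
    mode tcp
    balance roundrobin
    server node-1 10.128.0.2:30222 check
    server node-2 10.128.0.3:30222 check
    server node-3 10.128.0.4:30222 check
EOF
sudo systemctl reload haproxy
```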

We created GCP Firewall rules to allow each challenge port on the HaProxy instance. (Actually, our CI/CD automatically created these rules using the gcloud CLI tool! We have an article coming soon on our CI/CD, stay tuned)

Example of some firewall rules for the HaProxy instance to allow challenge ports
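Creating one of these rules by hand looks roughly like this (the rule name and the "haproxy" network tag on the VM are placeholders):

```bash
# Allow traffic to a single challenge's NodePort on the HaProxy instance.
gcloud compute firewall-rules create chall-port-30222 \
  --direction=INGRESS \
  --allow=tcp:30222 \
  --target-tags=haproxy
```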

As the load statistics below show, we can safely say we went overkill sizing the HaProxy instance; even a 1vCPU 2GB RAM instance would have been more than sufficient. Probably even less would work, as HaProxy is highly efficient, and all this VM does is balance requests across the other nodes.

Note: Why HaProxy? We used HaProxy because it is highly efficient, and because we needed Layer 4 (TCP) load balancing for the challenges, as some challenges were not web challenges and required SSH, netcat, etc. to connect.

You could also scrap running another load balancer and just route connections directly to the k8s cluster (since Kubernetes internally divides the load between a deployment’s pods), but we were worried that this would allow an attacker to overwhelm one node by spamming it with requests, and besides, we wanted HaProxy’s useful admin panel to monitor our challenges.

HaProxy lets you monitor each routing rule you set up on an admin panel!
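If you want the same admin/stats panel, enabling it is roughly a matter of adding a stats section to haproxy.cfg (the port, URI, and credentials below are placeholders):

```bash
# Enable HaProxy's built-in stats page, protected by basic auth.
sudo tee -a /etc/haproxy/haproxy.cfg >/dev/null <<'EOF'

listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /stats
    stats refresh 10s
    stats auth admin:change-this-password
EOF
sudo systemctl reload haproxy
```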

Load statistics from a cluster node:

CPU Usage

The CPU usage peaked at 60% per node, showing that we were slightly underutilizing the cluster and could have opted for smaller nodes (1vCPU per node, with a greater number of nodes).

RAM Usage

The RAM usage statistics depend entirely on the resource constraints you apply to each pod on the cluster. In our case, that was capped at 200MB per challenge pod.

Load statistics from the HaProxy Instance:

CPU Usage

CPU Usage in the HaProxy Instance

As we can see, there was barely any CPU usage on this node (like 5% tops), and you are better off using a 1vCPU, or maybe even a 0.5vCPU instance.

RAM usage

The HaProxy VM had a constant RAM usage of ~1GB throughout the CTF, so a VM with 2GB RAM should be plenty for this instance; ours was, as mentioned before, severely overkill :)

BONUS: Network Packets per second

This graph shows the number of network packets received by the HaProxy instance per second. As we can see, there are two peaks (each time we released a wave of challenges) where the instance was receiving and routing 3000+ packets per second!

Okay, enough statistics, how much did it cost?

The infra for this CTF was sponsored by Google Cloud, who provided us $500 in GCP credits on top of the $300 that already comes with every new GCP account, for a total of $800.

That’s a lot of credits, but hosting all the infra for the CTF, including the instances we ran during development/testing, cost approximately Rs 10,000 (about $133). This means you could easily host a CTF with just the free credits provided with every new GCP account!

Moreover, as we discovered after the CTF (and covered in this article), you could cut down several parts of the infra. We estimate you could save around $30–40 more, and host a 4-day CTF on just about $100 of credits!

Here is the cost breakdown (the image shows per-month pricing) for the machines we used during csictf 2020.

So the $133 breaks down into:

a. 1 e2-standard-4 VM instance for CTFd (the CTF platform)

b. 3 e2-standard-2 VM instances (1 for HaProxy, and 2 used for our Hack the Box challenges during the CTF)

c. 3 e2-standard-2 VM instances for challenges (nodes on the GKE Cluster)

Some of these instances were also running for 10 days before the CTF, as we had CI/CD set up during development to automatically deploy challenges to a smaller cluster so that we could test them.

The rest of the $133 can be attributed to other costs associated with GCP (static IPs, storage buckets, etc.).

To summarize, we estimate that anywhere between $50 and $100 is sufficient to hold a 4-day CTF and also cover your testing/development needs before the CTF.

Wrapping up

In this article, we went over csictf’s infrastructure, load statistics, and lastly, budget planning for a CTF. I hope you find this information helpful while hosting your own CTF.

Now that you know what the infra looked like, if you’re interested in how to set it up, click the link below for the other articles in our series, which cover topics like setting up a Kubernetes cluster and setting up CI/CD to automatically deploy challenges from a GitHub repository.
