Infra overview and planning: IEEECTF 2020

Sanskar Jaiswal
Published in Techloop
Mar 30, 2021 · 7 min read

About IEEECTF 2020

We at IEEE-VIT conducted a 36-hour CTF from 31st October 2020 to 2nd November 2020, with a variety of challenges aimed at beginners and enthusiasts. It was crucial to plan our infrastructure and tech stack carefully and extensively, because we wanted to ensure that the CTFers playing had a nice, flawless experience. It would be disastrous if they found a bug/vulnerability in our infra that they could exploit; they'd probably spend more time on that than on the actual CTF challenges ;).

We had around 1500 registrations from all over the world, so we knew our infrastructure needed to

  • be scalable
  • be highly available
  • have zero downtime (or at least try to xD)

Also, the fact that we are a non-profit student organisation means that we are broke, and hence have to do all of the above while minimising the hole in our pocket :)

Before we start, here's a big-ass flow chart giving a rough overview of the entire infra setup, for anyone who wants a TL;DR.

Yes I use Figma for stuff other than making memes as well :)

Overview

As is apparent from the flow chart, the infra consisted of three components:

  1. A k8s cluster, where the challenges are hosted.
  2. A Vercel app running our React frontend.
  3. An App Engine instance running our Node.js backend, with Firebase for authentication and Firestore as our NoSQL DB.

Although not shown in the above chart, we also had a CI/CD pipeline set up, for which we used TravisCI.

The Platform

There are multiple CTF platform solutions (picoCTF, FBCTF, CTFd, etc.) already out there which have been battle-tested and built for out-of-the-box usage. If you're low on time and/or don't feel like putting in a lot of effort, we highly recommend checking these out.
But we like making our lives more miserable and difficult, so we decided to make our own CTF platform from scratch.
We decided to go with React for our frontend, as we planned the UI around a big cool globe, and react-globe.gl looked like the package that promised to be the solution to this. If you'd like to know how we implemented a UI as slick as the one shown below, head on over to this article.

We used Firebase auth, as it is ridiculously easy to implement and makes things like “Login with Google” painless. We chose Firestore over MongoDB, mainly because we already had a Firebase project set up, so it was simple, and they also have a pretty generous free tier, which helped us minimise costs.
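To give an idea of how painless this is, here's a minimal sketch of the “Login with Google” flow using Firebase's newer modular SDK (the config values are placeholders, and the older namespaced API we would have used at the time has slightly different imports, but the flow is the same):

```typescript
import { initializeApp } from "firebase/app";
import { getAuth, GoogleAuthProvider, signInWithPopup } from "firebase/auth";

// Placeholder config; the real values come from your Firebase console.
const app = initializeApp({
  apiKey: "<api-key>",
  authDomain: "<project-id>.firebaseapp.com",
  projectId: "<project-id>",
});

const auth = getAuth(app);

export async function loginWithGoogle() {
  // Opens the Google account chooser in a popup and signs the user in.
  const result = await signInWithPopup(auth, new GoogleAuthProvider());
  // The ID token can be sent to the backend and verified there (e.g. with the Admin SDK).
  const idToken = await result.user.getIdToken();
  return { user: result.user, idToken };
}
```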

Since we chose Vercel and App Engine, deployment was fairly easy and straightforward. Vercel's per-commit preview feature was also very helpful during the dev stage, as it let us catch any subtle bugs that arose while implementing the complex globe component. Both Vercel and App Engine are managed services, so we didn't really need to bother ourselves with stuff like load balancing, setting up TLS/SSL, etc., which was also a crucial factor in us choosing these two technologies.

We decided to add Cloudflare to our infra as well, for the following reasons:

  1. to avoid getting DDoS’d
  2. to improve response time through caching
  3. their pretty dashboard xD.

We highly recommend using Cloudflare, as it is really helpful in providing analytics about your website, without invading the privacy of your users. Also, it’s always nice to have DDoS protection, especially when your target audience consists of folks experienced in exploiting vulnerabilities and attacking services.

Geographic spread of our users during IEEECTF 2020
Network Stats for IEEECTF 2020

All of these stats helped us in analysing not only our audience but also our infra setup, so that we could improve next time. One thing Cloudflare is really good at is caching requests, which means your server gets hit less often, decreasing the load on it and keeping response times super fast. The stats also show where most of our users were playing from, which lets us know how far our Publicity and Marketing team were able to get the word out. Unsurprisingly, most of our users were from India, but as you can see we had quite a few users from Europe and some from China, South-East Asia, Australia and the USA.
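One thing worth noting is that Cloudflare only caches what your origin marks as cacheable, so the backend has to opt in via Cache-Control headers. Here's a rough sketch of the idea, assuming an Express-style Node.js backend; the route names are made up for illustration and are not our actual API:

```typescript
import express from "express";

const app = express();

// Public, rarely-changing data (e.g. the challenge list) can be cached at the
// edge for a few minutes, so most requests never even reach the origin.
app.get("/api/challenges", (_req, res) => {
  res.set("Cache-Control", "public, max-age=300, s-maxage=300");
  res.json({ challenges: [] /* ...loaded from the DB in reality... */ });
});

// Per-user data (scores, submissions) must never be cached by the CDN.
app.get("/api/me", (_req, res) => {
  res.set("Cache-Control", "private, no-store");
  res.json({ user: null /* ...resolved from the auth token... */ });
});

app.listen(8080);
```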

The k8s Cluster for Challenges

This section covers the cost, analysis, etc. of our k8s usage. If you want to know how we actually set the cluster up, with code snippets and all, consider checking out this article.

We decided to go ahead with Google Kubernetes Engine (GKE) as we had $300 worth of Google Cloud Platform credits. You can avail of them too by making a new account.

The cluster consisted of 3 nodes (E2 series), each with 2 vCPUs and 8 gigs of RAM, with automatic scaling to make sure that the cluster wasn't being under-utilised and we weren't paying more than we needed to. You can definitely try to fiddle with this configuration to suit your needs. We would highly suggest using the pricing calculator that GCP provides, which lets you analyse just how much money you'd end up paying. Note that the nodes themselves are not that expensive, since at the end of the day they're just preemptible Compute Engine instances. It's the Load Balancer which is going to cost the big bucks. Here's the estimate we calculated:

As you can see, the Load Balancer costs almost 4x more than the physical cluster nodes themselves. Also note that the cost of the Load Balancer is directly proportional to the number of forwarding rules. So practically speaking, the more challenges you have, the higher your Load Balancer cost will be, since each challenge would require its own forwarding rule. If you can't afford to spend a lot of money, consider setting up a load balancer on your own using Traefik or HAProxy.

We had 10 challenges running on our cluster; each challenge had its own Deployment and belonged to either the web or jail category. You'll notice there are 11 Deployments in the screenshot below. The extra Deployment was required to host a MySQL server for our Haunted House question, which involved SQL injection.

Deployments running on the cluster
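As for the SQL injection itself, we won't reproduce the actual Haunted House source here, but the whole category boils down to interpolating user input straight into a query. A toy sketch of that anti-pattern using the mysql2 driver (the host, table and column names are made up for illustration, not the real challenge schema):

```typescript
import mysql from "mysql2/promise";

// Illustrative connection details; "mysql" would be the in-cluster service name.
const pool = mysql.createPool({ host: "mysql", user: "ctf", database: "haunted" });

export async function findRoom(name: string) {
  // BAD (on purpose): user input is interpolated straight into the SQL string,
  // so an input like  ' OR '1'='1  changes the meaning of the query.
  const [rows] = await pool.query(
    `SELECT name, secret FROM rooms WHERE name = '${name}'`
  );
  return rows;
}
```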

All Deployments that needed to be accessed publicly had a corresponding ClusterIP Service, and these were all ultimately linked to our Ingress, which handled the routing.

ClusterIP, Ingress and Load Balancer services for our cluster

All Deployments had a RollingUpdate strategy, to minimise downtime and make sure all challenges were available for users to play at any moment.
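For a RollingUpdate to actually be zero-downtime, new pods should only receive traffic once they're ready to serve, which is what readiness probes are for. Here's a tiny illustrative health endpoint of the kind a readinessProbe could hit; whether a given challenge exposed exactly this is an assumption, since not all of our challenges were even written in Node:

```typescript
import express from "express";

const app = express();
let ready = false;

// ...challenge routes would be registered here...

app.get("/healthz", (_req, res) => {
  // Kubernetes only starts routing traffic to the new pod once this returns 200,
  // so the old pod keeps serving until the replacement is actually usable.
  res.status(ready ? 200 : 503).send(ready ? "ok" : "starting");
});

app.listen(3000, () => {
  ready = true; // flip to ready once the server is listening
});
```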

GKE dashboard showing monitoring stats about all resource types.

One area where we could've done a better job was integrating Prometheus. We didn't leave enough time to experiment with Prometheus and host it as a sidecar. It's definitely worth setting up Prometheus on your cluster, as it helps you make very fine-tuned decisions based on the metrics it collects.
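If you do set it up, the application side is pretty small: each service just exposes a /metrics endpoint for Prometheus to scrape. A rough sketch using the prom-client package with Express; we never actually shipped this, so treat it as a starting point rather than our setup:

```typescript
import express from "express";
import client from "prom-client";

const register = new client.Registry();
client.collectDefaultMetrics({ register }); // CPU, memory, event loop lag, etc.

// A hand-rolled counter, just to show custom metrics alongside the defaults.
const httpRequests = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests handled",
  labelNames: ["route", "code"],
  registers: [register],
});

const app = express();

app.get("/", (_req, res) => {
  httpRequests.inc({ route: "/", code: 200 });
  res.send("hello");
});

// Prometheus scrapes this endpoint on its configured interval.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});

app.listen(3000);
```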

CI/CD

We decided to use TravisCI as our CI/CD provider, since they provide student accounts with 1000 free builds/month, which was enough for our testing and prod purposes. We wrote a simple bash script to automatically deploy all changes pushed to master to our GKE cluster. Setting up TravisCI with GKE is super simple, so we won't be going into that.

One thing to note is that your build time can vary a lot depending on the changes, since all questions are dockerized. Our longest build time was around 10 minutes, mainly due to a web question written using Ruby on Rails (thanks a lot, gem dependencies ;_;).

Conclusion

Overall, this infra setup worked well for us. We could've definitely done things better and more efficiently, but the cluster had almost zero downtime, and we didn't have any users (almost xD) complaining about not being able to play a challenge. This was the first time our team was using k8s in production, and it gives us immense pleasure to say that it went so successfully 🚀. Moreover, the platform too had zero downtime, and we had users appreciating its super cool UI, which was a real big +1 for our frontend and UI/UX team ✨.

Please also check out the related repositories and consider leaving a 🌟 if you like them:

If you enjoyed reading this article, do check out the rest of the articles in our CTF series:
