What’s in a Cloud Strategy?

Shubham Goyal
Published in Mighty Bear Games
6 min read · Jan 22, 2021

“Cloud” isn’t the buzzword it used to be. This is because anyone who needs a server now has a bazillion cloud providers vying for their business. It takes a couple minutes of your time (and the last four digits of your credit card) to put up a server. So which provider should you be using? Should you be using one of them — or all of them? Should you be using managed services or building your own? I’ll try to answer these questions by describing the cloud strategy at Mighty Bear Games, why it has had to evolve over the last few years, and the lessons we learned along the way.

World of Cloud

Photo by Jack Moreh on stockvault

Back in 2018 the team was hard at work on World of Legends, an open-world MMORPG. (Some of us prefer ma-morp-gah.) Like other games in the genre, WoL required a suite of online features to function. Players could roam shared worlds, randomly challenge others to battle, interact with NPCs, level up their characters, defeat time-limited bosses, fight as a guild, compete on global leaderboards, and chat with other players. You get the idea: WoL was massive.

Massive also happened to be the name of the server-side tech that would power WoL. Massive was built using Java and was heavily inspired by Netflix OSS. Without going into too much detail about the architecture (an entire post unto itself), some of the components involved were Kafka, Redis, Eureka, and SQL databases.

We used AWS for Massive. (See how inspired we were by Netflix?) The cloud strategy at the time was simple: we didn’t want vendor lock-in. This meant we would only use vanilla AWS offerings such as EC2 instances, which would let us run any workload we wanted. Anything that required AWS-specific tech was a no-go. That way, if AWS ever shut down or no longer served our purposes, switching to Google Cloud or Microsoft Azure would be a breeze. Now that sounds like a sensible plan. Why would anyone want to be locked into a vendor, right? Some of you reading this will agree. The rest still have a chance at being sensible.

Massive Learnings

We had no idea of the scope of the work we had agreed to undertake. Initially, it was just about trying to get the individual components connected and running on EC2. Infrastructure-as-code using Ansible was a step in the right direction. But of course you have bugs, and to debug you need logs. So, first we needed ssh keys for all the devs to securely access the EC2 instances so that they could grep the logs. Great, that’s one less problem to think about. But one night something went wrong, and our players were unable to enter battles.

Dev 1: Hmmm…there’s exceptions in the logs – looks like a data migration issue in the last update. We need something that can alert us to this going forward. How should we do it?
Dev 2: Ummm…how about we set up a cron job that greps for the words “Error” or “Exception” in the logs and sends a notification to Slack with details?
Dev 1: Wow. That’s a fantastic idea! Let’s get it done asap. That’s one less thing to worry about.

2 weeks later

Dev 1: Hey Dev 3, looks like our chat service is down and I can’t bring it up. Could you have a look?
Dev 3: Sure! But I just got a new laptop because my previous one crashed and I’ve lost the ssh key with the crash. Help me set up this new key and I’ll get right to it.
Dev 1: Sure man. Here you go.

2 hours later

Dev 3: So about the chat service — the instance has run out of space due to the log files. I’ve written a new cron job that removes any logs older than 2 weeks.
Dev 1: Awesome – love the initiative! Thanks! That’s one less thing to worry about.

Engineers trying to run World of Legends (2018, colourised). Photo by Hobi industri on Unsplash

This kept going. The few things that we would occasionally worry about became a laundry list of worries that wouldn’t end.

  • First it was just the logs and errors. But then errors were going undetected because the cron job didn’t account for all the scenarios, so we had to constantly maintain the cron job.
  • There would be random service crashes so a manual restart would be required each time. It occurred to no-one that there’s something called systemd.
  • Kafka instances had to be regularly restarted (we still don’t know why), and once an instance was allocated a new IP, ZooKeeper would refuse to function until it was manually configured with the correct IP.
  • Crashes eventually turned into a memory leak that wouldn’t be diagnosed until Prometheus and Grafana were added for monitoring. Just another component that we decided to manage ourselves.
  • Scaling instances was a fully manual process. If the traffic spiked one night while you were asleep and the current set of instances couldn’t handle it, well, good luck with that.
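To make the first bullet concrete, here is a minimal Python sketch of what a grep-style check amounts to. It is an illustration, not the script that actually ran: the function name, the pattern constant, and the sample log lines are all made up for this example.

```python
import re

# Assumption: the check looks only for the two words mentioned in the chat,
# matched case-sensitively, just as a plain grep would.
ALERT_PATTERN = re.compile(r"Error|Exception")


def find_alert_lines(log_lines):
    """Return the log lines that the grep-style check would flag."""
    return [line for line in log_lines if ALERT_PATTERN.search(line)]


# Invented sample log lines, for illustration only.
sample_log = [
    "INFO  battle-service started",
    "ERROR migration failed: NullPointerException",
    "WARN  connection to chat-service timed out",
    "FATAL out of disk space",
]

flagged = find_alert_lines(sample_log)
```

Only the line containing “Exception” is flagged here. The timeout and the out-of-disk-space lines slip through, which is the kind of gap that turns a quick cron job into something you have to keep maintaining.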

Developers were spending time maintaining servers instead of developing new features. This was a small team of 6 devs tasked with developing full-stack features and also maintaining the infrastructure. There were no dedicated infrastructure engineers or system admins. Costs were higher than they should have been. To top things off, despite all the efforts to avoid vendor lock-in, we still ended up using AWS-specific tech such as Route 53 and S3.

Getting Butter At It

Fast-forward to 2019 when we began experimenting with side projects. With all the learnings from World of Legends fresh in our minds, we decided to shift our cloud strategy: why not use AWS’ managed services for some of our existing components? So instead of managing the Redis instance ourselves, why not use AWS ElastiCache? And instead of running Spring Boot services on EC2, why not containerise the services and run them on AWS ECS, their managed Kubernetes-like service?

Managed services massively improved our productivity. (See what I did there?)

  • Devs didn’t have to worry about crashed services because the ECS engine took care of restarting them.
  • Basic service metrics such as CPU and memory usage were easy to analyse and could be used to trigger scaling events.
  • Logs were directly routed to CloudWatch and nobody worried about disk space.
  • CloudWatch alerts were sent to Slack via Lambda and adding new alerts/metrics was easy enough that we decommissioned the Prometheus setup in favour of CloudWatch.
  • There was no need for configuring instance access since there was no need to ssh.
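For a feel of how little glue the Slack alerts need, here is a minimal Python sketch of a Lambda function that reshapes a CloudWatch alarm into a Slack message. It assumes the alarm reaches Lambda through an SNS topic (a common wiring, but an assumption here), and the function names and the alarm name in the sample are hypothetical. The actual HTTP call to Slack is left out.

```python
import json


def alarm_to_slack_payload(event):
    """Turn an SNS-wrapped CloudWatch alarm event into a Slack message body."""
    # With SNS in between, the alarm arrives as a JSON string in Sns.Message.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    text = "{state}: {name}\n{reason}".format(
        state=alarm["NewStateValue"],
        name=alarm["AlarmName"],
        reason=alarm["NewStateReason"],
    )
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return {"text": text}


def handler(event, context):
    """Lambda entry point; posting to the webhook is omitted in this sketch."""
    payload = alarm_to_slack_payload(event)
    # A real function would POST json.dumps(payload) to a webhook URL here.
    return payload


# Hypothetical alarm, shaped like an SNS notification from CloudWatch.
sample_event = {
    "Records": [{"Sns": {"Message": json.dumps({
        "AlarmName": "chat-service-high-cpu",
        "NewStateValue": "ALARM",
        "NewStateReason": "Threshold Crossed",
    })}}]
}

result = handler(sample_event, None)
```

Adding a new alert then means creating another alarm that publishes to the same topic, with no changes to the function itself.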

At the risk of this sounding like a sponsored segment (it’s not), ECS was a game-changer.

At some point in 2019 we started working on Butter Royale, and given the tight (6-month!) timeline for the project we knew we didn’t have the bandwidth to make the same mistakes as WoL. So we went all in on AWS with DynamoDB, CodePipeline, load balancers, and more. After almost a year in production with 99.9% uptime, regular content updates, and a growing player base, it’s safe to say we made the right decision.
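For a sense of scale, 99.9% uptime leaves a surprisingly small downtime budget. The sketch below does the arithmetic, assuming the figure is measured over a full 365-day year (the period above is “almost a year”, so treat the result as a rough upper bound). The function name is made up for this example.

```python
def downtime_budget_hours(uptime_percent, period_days=365):
    """Hours of downtime allowed over a period at the given uptime percentage."""
    total_hours = period_days * 24
    return total_hours * (100 - uptime_percent) / 100


# 8760 hours in a 365-day year, of which 0.1% may be downtime.
budget = downtime_budget_hours(99.9)
```

That works out to roughly 8.76 hours for the whole year, which is not much room for manual restarts at 3 a.m.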

Conclusion

Leveraging AWS’ offerings allowed the team to

  • Focus on features that were important to players
  • Reduce maintenance overheads
  • Reduce server costs
  • Sleep soundly at night

The only real compromise? We aren’t moving away from AWS anytime soon. Given our past experience, however, the benefits of accepting vendor lock-in clearly outweigh its costs. Still, it would be unfair to say that everyone should be taking the same decision. If we had a team as big as Netflix or the FBI to manage EC2 instances, we could certainly do more.

Ultimately, though, embracing AWS was a deliberate decision that worked out very well for us. With each new project we continue to look deeper into AWS’ offerings. This isn’t to say we pick anything we spot on AWS; instead, we evaluate what AWS has to offer and see where it fits with our needs. Given a choice between self-managing an infrastructure component or using an AWS managed component, there has to be a strong reason (that isn’t vendor lock-in) for us to disregard AWS.

If you enjoyed this article, here’s a great follow-up: Multi-Cloud is the Worst Practice. If you disagree or have other thoughts, I would love to hear about your own experience in the comments!
