Automated Infrastructure Scaling using ML

Games24x7 Blogs
8 min read · Oct 24, 2022


In the fantasy gaming sector, My11Circle has grown exponentially year on year. It’s one of the largest fantasy platforms, where players come in, create their own fantasy teams, and join the various contests we host. Our innovative product features make sure that our players get a real adrenaline rush when they are on our platform. As a result of this exponential growth, the ability to scale rapidly has become a default non-functional requirement for every service.

One of the fantasy sports on our platform is cricket, and it draws a huge volume of players creating fantasy teams and joining contests. The nature of the sport itself adds a good deal of complexity to the player request volume, which we will get to shortly. To cater to such a high volume of requests, we need scalable, resilient services. Once we have scalable services ready, the next set of challenges is to plan the capacity of the infrastructure and a strategy to scale for mega events like the IPL, the World Cup, or any India match.

In this blog post, we will talk about how we plan infrastructure capacity and how we scale for mega cricket events using our Fantasy AutoScaler. But before we jump to these topics, let’s first get a taste of the complexity that cricket brings to the platform and how magnified it is at My11Circle’s scale.

What’s complex in Cricket w.r.t player request volume?

  • Any cricket match scheduled by the cricketing board opens for registration on our platform a few days before the match start time. This is the second phase in the match lifecycle, where a variety of hosted contests of different sizes and prize callouts see a rush of player joins.
  • The next crucial phase begins when the toss happens on the field. The toss is important because this is the first time both teams declare their final list of on-field players.
  • At the toss, our players come back to the platform, amend their fantasy teams, and join more contests until the match starts. This is where we get more than 50% of the team joins for a match.
  • This sweet window of 30-odd minutes poses interesting challenges, including but not limited to supporting around 18K contest-join requests per second within an overall load of roughly 300K requests per second, keeping fast-filling contests available, and maintaining high availability of the entire system.

As described above, the rate of requests increases exponentially after the toss. This is the complexity that the sport of cricket brings, and at My11Circle’s scale it becomes a genuinely challenging problem to solve.

Scaling types

In general, scaling is of two types: reactive and upfront scaling.

Reactive Scaling

In this scaling type, we react and scale on events such as requests per second crossing a threshold, or CPU/network utilization going above baseline thresholds. At My11Circle’s scale, where request volume grows exponentially within a very short time window, reactive scaling does not help. With EC2 and AWS Auto Scaling Groups (ASG), the time to react and scale is too slow: for any scaling event, by the time the new VMs are up and running, it is already too late and our request volume has reached another exponential peak.
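
For context, reactive scaling typically looks like the sketch below: a target-tracking policy attached to an ASG, which only adds capacity after the metric has already breached its target. The ASG name and target value here are hypothetical, purely for illustration.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Reactive scaling: the ASG adds instances only AFTER average CPU has already
# crossed the target, so new capacity arrives minutes after the spike begins.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="contest-join-service-asg",  # hypothetical ASG name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,  # scale out once average CPU exceeds ~50%
    },
)
```

For a traffic curve that goes near-vertical at the toss, instance launch and warm-up time alone make this feedback loop too slow.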

Upfront Scaling

Due to the exponential nature of growth in request volume, we are left with upfront scaling, in which we plan the infrastructure capacity ahead of time and scale our infrastructure a few hours before the match starts.

The first step in upfront scaling is to plan the infrastructure capacity we need for an event. In doing this, we need to strike the right balance between the cost and the size of the infrastructure. If we scale to a larger capacity than needed, we waste a good amount of money. At the same time, if we scale to less capacity than needed, we will not be able to serve our player request volume seamlessly.

Once we have planned the capacity, the second step is to have enough tooling and automation to enforce it and scale the infrastructure a few hours before the match starts. We also need to monitor in real time that our capacity stays intact for the entire match lifecycle.
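
As an illustration of what such a real-time check can look like, the sketch below compares an ASG’s healthy, in-service instance count against its desired capacity. The function name and usage are assumptions for illustration, not our actual implementation.

```python
import boto3

autoscaling = boto3.client("autoscaling")

def capacity_intact(asg_name: str) -> bool:
    """Return True if the ASG has as many healthy, in-service instances as requested."""
    asg = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]
    in_service = sum(
        1
        for instance in asg["Instances"]
        if instance["LifecycleState"] == "InService"
        and instance["HealthStatus"] == "Healthy"
    )
    return in_service >= asg["DesiredCapacity"]

# e.g. run this on a schedule and raise an alert whenever it returns False
```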

A naive approach to Upfront Scaling

  1. Before the start of any mega event like the IPL or the World Cup, we categorize each match in that event into a match liquidity band. For example, an India vs Pakistan match, which receives a large number of player joins, goes into a band named LargeBand. Similarly, an RCB vs MI match in the IPL goes into a band named MediumBand. We keep the number of liquidity bands reasonably low to reduce our overheads.
  2. For each such liquidity band, we then specify the size of infrastructure needed by each of our 50+ microservices to host any match falling in that band. We prepare this band-to-infrastructure-size mapping beforehand. For example, in LargeBand, service A may need 50 machines, while the same service A needs only 30 machines in MediumBand. To reiterate, we do this for all 50+ microservices.
  3. Finally, we schedule the scaling of the infrastructure based on the match start time and the infrastructure size implied by the band allotted to that match.
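
Conceptually, the static mapping and scheduling boil down to something like the sketch below. The band names, service names, machine counts, and lead time are illustrative placeholders, not our real capacity plan.

```python
from datetime import datetime, timedelta

# Static liquidity-band -> per-service capacity plan, prepared before the event
# (numbers are illustrative, not our real plan).
BAND_CAPACITY = {
    "LargeBand":  {"service-a": 50, "service-b": 24},
    "MediumBand": {"service-a": 30, "service-b": 12},
}

# Manual band allotment for each scheduled match (also illustrative).
MATCH_BANDS = {
    "IND-vs-PAK": "LargeBand",
    "RCB-vs-MI": "MediumBand",
}

def scale_out_plan(match_id: str, match_start: datetime, lead_hours: int = 3):
    """Return when to scale out and the per-service machine counts for a match."""
    band = MATCH_BANDS[match_id]
    return match_start - timedelta(hours=lead_hours), BAND_CAPACITY[band]
```

Every entry in these tables is decided and reviewed by hand, which is exactly where the challenges below come from.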

Challenges faced in the naive approach

  1. Match-to-liquidity-band allocation is a manual process. Once we decide on a band for a match, writing and reviewing the cron schedules is also manual. We have automated it to some extent, but it still requires Zoom calls, spreadsheets, and multiple teammates to double- and triple-check everything.
  2. The match-to-liquidity-band allocation is static. For example, in World Cup 2021, when India started losing the series, our players stopped joining India matches with their usual enthusiasm. But we were still scheduling India matches in the large band, so hosting those matches became pricey for us: we were paying for idle infrastructure.
  3. To save cost, we intervene multiple times during a mega event and reallocate matches to different band sizes. But every such intervention means more human effort, spreadsheets, and Zoom calls with folks across various teams. This does not scale from a human-resource point of view and is also prone to human error.

What do we need to effectively scale upfront?

To scale effectively while keeping costs and human effort to a minimum, we need the following.

  1. An effective, automated way to calculate the liquidity for each match in a given series.
  2. A formula to map match liquidity to infrastructure size that we need.
  3. An automated way of scaling out the infrastructure a few hours before the match starts, based on the above liquidity values and the formula.
  4. An automated way to bring our infrastructure down to baseline when the match is over.

Basically, what we are doing here is similar to our naive approach, but this time there are no static bands and no human intervention.

  • Infrastructure size becomes a direct function of match liquidity. Since every match is unique and has its own liquidity numbers, it also scales up to its own infrastructure capacity numbers. If we do all of the above right, our infrastructure will not sit idle and go to waste, and at the same time we will be able to cater to the player request needs of any match.
  • All of this needs to be fully automated to reduce, or rather remove, the human effort involved.

Fantasy AutoScaler’s approach to effective upfront scaling

Let’s reiterate the things needed to effectively scale upfront, but this time with a solution for each.

  • An effective, automated way to calculate the liquidity for each match in a given series.

Solution: A machine-learned model, running in a serverless fashion, that predicts match liquidity from features related to that match, such as the join rate, the teams playing in that match, and the series details.

  • A formula to map match liquidity to infrastructure size that we need.

Solution: A formula built using linear algebra and historical performance-test numbers for each service.
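
As a rough illustration of what such a formula can look like (the per-service constants below are assumptions, not our actual performance-test results): if a match is predicted to produce a peak of R requests per second, a service that sees a fraction f of that traffic and handles r requests per second per machine needs ceil(R × f / r) machines.

```python
import math

# Illustrative per-service constants derived from performance tests:
# the share of overall traffic the service sees, and its throughput per machine.
SERVICE_PROFILE = {
    "service-a": {"traffic_share": 0.20, "rps_per_machine": 1200},
    "service-b": {"traffic_share": 0.05, "rps_per_machine": 800},
}

def machines_needed(predicted_peak_rps: float, headroom: float = 1.2) -> dict:
    """Map predicted match liquidity (peak RPS) to a machine count per service."""
    return {
        name: math.ceil(
            predicted_peak_rps * profile["traffic_share"] * headroom
            / profile["rps_per_machine"]
        )
        for name, profile in SERVICE_PROFILE.items()
    }

# e.g. machines_needed(300_000) -> {"service-a": 60, "service-b": 23}
```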

  • An automated way of scaling out the infrastructure a few hours before the match starts, based on the above liquidity values and the formula.

Solution: Scripting, calculation logic based on the above formula, and dynamic scheduling based on the match start time, all of it running in a serverless Lambda.

  • An automated way to bring our infrastructure down to baseline when the match is over.

Solution: Again, scripting in a serverless Lambda, fired by a time-based trigger.

Ingredients of the Fantasy AutoScaler project

This project is built on serverless technologies to keep the cost of operation at zero. It includes the components below.

  1. A service to gather features for a given match. This service then invokes the Lambda containing the DS model a couple of hours before the match start time.
  2. ML model-based prediction Lambda. This Lambda predicts the liquidity of the match based on the various features. It holds the model inside it, and the model is updated dynamically whenever new learning happens.
  3. ScaleUP Lambda. This Lambda receives the predicted match liquidity, calculates the infrastructure size, and finally makes various API calls using the AWS SDK to scale the ASGs. We use multiple AWS Lambdas and AWS Step Functions to accomplish the job (a minimal sketch follows this list).
  4. Monitoring Lambda. This is responsible for making sure that our infra is actually running at the capacity numbers we need to host the match.
  5. ScaleDown Lambda. This is responsible for bringing the infra back to baseline once the match is complete.
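
To make the ScaleUP step concrete, here is a minimal sketch of a Lambda handler that turns a predicted liquidity into ASG sizes via the AWS SDK. The event shape, per-service constants, and ASG naming convention are assumptions for illustration, not our production code.

```python
import math
import boto3

autoscaling = boto3.client("autoscaling")

# Illustrative sizing constants; real values come from performance tests.
TRAFFIC_SHARE = {"service-a": 0.20, "service-b": 0.05}
RPS_PER_MACHINE = {"service-a": 1200, "service-b": 800}

def handler(event, context):
    """ScaleUP-style Lambda: resize each service's ASG from the predicted liquidity."""
    peak_rps = event["predicted_peak_rps"]  # output of the prediction Lambda
    for service, rps_per_machine in RPS_PER_MACHINE.items():
        count = math.ceil(peak_rps * TRAFFIC_SHARE[service] / rps_per_machine)
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName=f"{service}-asg",  # assumed naming convention
            MinSize=count,
            MaxSize=count,
            DesiredCapacity=count,
        )
    return {"scaled_services": list(RPS_PER_MACHINE)}
```

The ScaleDown Lambda is essentially the mirror image: the same call with baseline counts, fired by a time-based trigger once the match is over.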

Results and Next Steps

  • We have been using the Fantasy AutoScaler for the important cricket events held so far in 2022.
  • With this project, we have significantly reduced our running costs and the human effort needed to host mega events in the cricket world.
  • With the Monitoring Lambda in place, we get notified immediately if our infrastructure is under-provisioned, if a subnet is exhausted and is blocking scale-up activity, or if AWS does not have sufficient hardware. All such checks are in place and have helped us a few times in the past.
  • The next step is to add integration with K8s and other third-party software. By doing this, our costs will drop further, since running third-party software is costly at My11Circle scale.

About the author

Suraj Sharma is an Engineer at Games24x7.

He has been in the software space for around 10 years, mostly working on distributed systems running at high scale and concurrency. He enjoys working on Serverless Architectures, Distributed Computing and Databases.

Find him on LinkedIn here: https://www.linkedin.com/in/surajs121

