Written by Chanel Chang on February 28, 2019
GumGum has been around for over a decade, but we’ve been growing especially fast over the past two years. That growth brings a problem that we, like many other growing companies, are trying to address: “How do you prevent the rate of infrastructure cost growth from surpassing the rate of company growth?”
Here is a graph of our monthly AWS expenses from January 2017 to October 2017. As we grow as a company, so does our monthly AWS cost since we have to run more instances and services to handle the growing number of requests.
And here is a graph of our monetizable traffic growth. Monetizable traffic is measured by the number of monetizable events, which for us consist of views, clicks, plays, and other events on ads that we get paid for. As you can see, the trend looks clearly different from the one in the costs graph.
Before I identify the cause of the problem, I’m going to share a bit more detail about our advertising product. The highly scalable environment I mentioned in the title is the one for ad serving, which is our most mature and cost-heavy product. As you can see in this graph, the number of requests our ad servers receive varies greatly depending on the time of day.
In our largest datacenter, our ad servers receive 18 million requests per minute at peak traffic hours. The lowest number of requests they receive is still 4 million requests per minute. At peak hours, we run 130 c5.9xl instances. The cost of the EC2 instances alone is approximately $120K a month.
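To put those numbers in perspective, here is a small back-of-the-envelope sketch in Python. It assumes that load is spread evenly across instances and that capacity scales linearly with instance count, which is a simplification and not how we actually size our groups. The variable names are only for illustration.

```python
import math

# Figures quoted above for our largest datacenter.
peak_requests_per_min = 18_000_000
trough_requests_per_min = 4_000_000
instances_at_peak = 130

# How much larger peak traffic is than the quietest period.
peak_to_trough = peak_requests_per_min / trough_requests_per_min

# Average load per instance at peak (assumes an even spread).
requests_per_instance = peak_requests_per_min / instances_at_peak

# Instances needed at the trough if per-instance load stayed the same
# (assumes linear scaling, which real systems only approximate).
instances_at_trough = math.ceil(
    trough_requests_per_min * instances_at_peak / peak_requests_per_min
)

print(f"peak/trough ratio: {peak_to_trough:.1f}x")
print(f"requests per instance per minute at peak: {requests_per_instance:,.0f}")
print(f"instances needed at trough: {instances_at_trough}")
```

Under these assumptions, the fleet needed at peak is several times the size of the fleet needed at the trough, which is why a fixed fleet sized for either extreme is a poor fit.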
Below is a simplified breakdown of our infrastructure before we integrated with Spotinst. We ran EC2 instances within an Auto Scaling Group, which received requests from an Application Load Balancer. We used CodeDeploy to deploy new versions of our ad serving application to the Auto Scaling Group. The issue lay with the EC2 instances. As our business grew and we received more requests, we were scaling with more and more on-demand instances. As you may know, on-demand is the most expensive way to pay for the instances you run.
So we didn’t want to run more on-demand instances, but our load swings too heavily throughout the day to run all reserved instances. Our solution was obvious: run spot instances!
But how should we run these spot instances? We had a few requirements. First, we needed a managed service. Being a small company, we needed an easy solution that would not require an engineer, or even multiple engineers, to maintain it. Second, even when running spot instances, we needed to make sure that we maintained capacity in our auto scaling groups. The biggest drawback of spot instances is that they can be taken back by AWS at any time with a two-minute warning. In our smaller datacenters, where we’re running fewer than 10 instances, we can’t afford to lose capacity, as this would lead to production issues. Therefore, we needed a service that would guarantee that a group’s target capacity was always met. Lastly, we wanted something easy to use, preferably something with a console that any engineer can easily find their way around.
Spotinst offers some amazing features that really set it apart from the other options we considered. First and foremost, it is a managed service. Second, it does preventive replacements. This means that if Spotinst predicts that a spot instance running in your group is about to be taken back by AWS, it adds another instance to your group before the original instance gets terminated, thereby maintaining your group’s capacity. It also offers the option to use multiple instance types together. We primarily use c5.9xls for our ad servers, but since these instances are in very high demand, their market is not always stable. So we configured our groups to temporarily run c4.8xls when the c5.9xl market is unstable; when the c5.9xl market stabilizes, Spotinst replaces any c4.8xls with c5.9xls. Finally, Spotinst has a great console that is really straightforward to navigate.
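The fallback behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Spotinst’s API or its actual algorithm: the function names, the `market_stable` mapping, and the way stability is represented are all hypothetical.

```python
from typing import Dict, List, Optional

# Instance types in order of preference (full AWS names for c5.9xl / c4.8xl).
PREFERRED_TYPES: List[str] = ["c5.9xlarge", "c4.8xlarge"]


def pick_instance_type(
    market_stable: Dict[str, bool],
    preferred: List[str] = PREFERRED_TYPES,
) -> Optional[str]:
    """Return the most preferred type whose spot market is currently stable.

    `market_stable` is a hypothetical input; a real service would derive
    it from its own market predictions.
    """
    for instance_type in preferred:
        if market_stable.get(instance_type, False):
            return instance_type
    return None  # nothing stable; a caller might fall back to on-demand


def instances_to_replace(
    running: List[str],
    market_stable: Dict[str, bool],
) -> List[str]:
    """List running instances that are not of the currently best type."""
    best = pick_instance_type(market_stable)
    if best is None:
        return []
    return [t for t in running if t != best]


# While the c5.9xlarge market is unstable, new capacity uses c4.8xlarge.
unstable = {"c5.9xlarge": False, "c4.8xlarge": True}
print(pick_instance_type(unstable))

# Once it stabilizes, the temporary c4.8xlarge instances become candidates
# for replacement.
stable = {"c5.9xlarge": True, "c4.8xlarge": True}
print(instances_to_replace(["c5.9xlarge", "c4.8xlarge", "c4.8xlarge"], stable))
```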
Here is what our infrastructure looks like after we integrated with Spotinst. We still receive traffic through the Application Load Balancer, but now, instead of an Auto Scaling Group running many on-demand EC2 instances, we have Spotinst’s Elastigroups running all spot instances. We also started using Spotinst’s deployment service to switch over from in-place deployment to blue/green deployment.
The switch to blue/green deployment was an added bonus that we weren’t even expecting when we first integrated with Spotinst. Previously, we were doing in-place deployment through CodeDeploy, which had a few drawbacks. First, it was slow because we could only take out a small percentage of instances to deploy to at a time. Second, in our smaller datacenters, where a group had only 4 or 5 instances, taking even a single instance out of the load balancer to deploy to it overloaded the remaining instances and caused latency spikes. By doing blue/green deployment with Spotinst’s roll service, we were able to reduce our deployment time by more than 50%. And since we bring up the green instances and add them to the load balancer before terminating the blue instances, we actually see a temporary spike in instance count, so we don’t have to worry about latency spikes and other issues caused by capacity drops.
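The capacity difference between the two deployment styles is easy to see in a toy model. The Python sketch below tracks how many instances are in service at each step of a deployment. It is a simplified illustration with hypothetical function names, not a description of how CodeDeploy or Spotinst’s roll service work internally.

```python
from typing import List


def in_place_capacity(instances: int, batch_size: int) -> List[int]:
    """In-service instance count at each step of a rolling in-place deploy."""
    timeline = [instances]
    remaining = instances
    while remaining > 0:
        batch = min(batch_size, remaining)
        timeline.append(instances - batch)  # batch removed from the load balancer
        timeline.append(instances)          # batch redeployed and added back
        remaining -= batch
    return timeline


def blue_green_capacity(instances: int) -> List[int]:
    """In-service instance count at each step of a blue/green deploy."""
    return [
        instances,      # blue fleet serving traffic
        2 * instances,  # green fleet added before blue is terminated
        instances,      # blue fleet terminated
    ]


group_size = 4  # a small datacenter group, as in the example above
rolling = in_place_capacity(group_size, batch_size=1)
blue_green = blue_green_capacity(group_size)

print("in-place   :", rolling, "-> minimum in service:", min(rolling))
print("blue/green :", blue_green, "-> minimum in service:", min(blue_green))
```

In this model, a 4-instance group deployed one instance at a time repeatedly drops to 3 instances in service, a 25% capacity loss, while the blue/green rollout never goes below 4.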
Now let’s get into the part of the post that you are probably most interested in: how much money we ended up saving with Spotinst. November 2017 was the first full month that we used spot instances. Although our usage in hours (colored in red) increases from October to November, our total cost (in blue) decreases.
Let me share some specific numbers from November 2017, which was the first full month after we integrated with Spotinst and also our busiest month of the year. Our expected cost for this month was $174,508. Our actual cost was $112,675, which includes Spotinst fees. Our EC2 usage was 191K hours for the single month. In total, we saved $61,833 this month. This is over 35% in savings, which is a significant percentage when your annual AWS bill is in the millions.
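The arithmetic behind these figures is easy to check. The short Python snippet below reproduces the savings and the percentage from the numbers above. The blended hourly rates are derived figures that do not appear elsewhere in this post, and they are approximate because the 191K usage figure is rounded.

```python
# November 2017 figures quoted above.
expected_cost = 174_508   # USD, projected cost for the month
actual_cost = 112_675     # USD, actual cost including Spotinst fees
usage_hours = 191_000     # EC2 instance hours, rounded to the nearest thousand

savings = expected_cost - actual_cost
savings_pct = savings / expected_cost * 100

# Blended average cost per instance hour (approximate, see note above).
expected_hourly = expected_cost / usage_hours
actual_hourly = actual_cost / usage_hours

print(f"savings: ${savings:,} ({savings_pct:.1f}%)")
print(f"blended hourly rate: ${expected_hourly:.2f} expected vs ${actual_hourly:.2f} actual")
```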
In case you don’t believe the calculations that we did on our own, here are the numbers from our AWS Cost Explorer page. The simple trend to notice here is that, once again, even as usage in hours (in green) increases from October to December, our total spend (in blue) decreases. This is thanks to Spotinst.
I want to briefly mention some other AWS services that we were able to migrate to Spotinst: AWS ECS, AWS EMR, and AWS ElasticSearch. The most notable one here is AWS ElasticSearch, because it is stateful. For stateful applications running on spot instances, Spotinst offers a way to automate transferring the data and attaching the EBS volume to a new instance.
Our next step with Spotinst is doing a proof of concept with Ocean. Ocean is a serverless Kubernetes engine that offers cost and performance optimizations for your Kubernetes cluster by managing the scaling of your cluster as well as running it with a variety of instance types and purchasing options (spot, reserved, on-demand, etc.).