How We Overcame Spot Node Exhaustion and EC2 Capacity Issues

Oğuz Küçükcanbaz
Published in Trendyol Tech
May 26, 2023

We serve on Trendyol’s ‘Mediacenter’ team, where we are responsible for processing and running AI model analyses on nearly all images and videos that come into the platform. This ensures that all visual content aligns with the high standard Trendyol’s customers have come to expect.
We analyze between 20 and 50 million images daily, so we need GPU-equipped machines to run our AI models, and each model has an optimal GPU. Since the number of images coming into Trendyol can reach 50 million a day, we also occasionally need to scale up significantly. We chose AWS because of the variety of instance types EC2 offers and its ability to provide a high number of these GPU-enabled nodes.

We are using EC2 node groups in our EKS cluster.

Having a large number of GPU-enabled nodes, of course, means high costs, and to reduce these costs we use spot EC2 instances. For those who are not familiar with them, AWS describes spot instances as follows:

“Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud. Spot Instances are available at up to a 90% discount compared to On-Demand prices.”
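
To give a concrete picture of what this looks like on EKS, here is a minimal sketch of requesting a Spot-backed managed node group with boto3. The cluster name, role ARN, subnets, and instance type below are placeholders rather than our actual configuration, and the gpu-type taint is something we will come back to later in the post.

```python
import boto3

eks = boto3.client("eks")

# Placeholder values -- substitute your own cluster, IAM role, and subnets.
eks.create_nodegroup(
    clusterName="mediacenter-cluster",
    nodegroupName="gpu-a1-spot",
    capacityType="SPOT",                 # Spot capacity instead of ON_DEMAND
    instanceTypes=["g4dn.xlarge"],       # placeholder GPU instance type
    scalingConfig={"minSize": 0, "desiredSize": 50, "maxSize": 60},
    nodeRole="arn:aws:iam::123456789012:role/eks-node-role",
    subnets=["subnet-aaaa", "subnet-bbbb"],
    # Taint so that only workloads tolerating this GPU type are scheduled here
    # (this is what lets us switch deployments between node groups later).
    taints=[{"key": "gpu-type", "value": "a1", "effect": "NO_SCHEDULE"}],
)
```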

Everything seems great, doesn’t it? Now let’s get to the disadvantages of spot nodes. To purchase spot machines you make an automatic bid, and if another user outbids you or requests the capacity on-demand, these nodes are terminated and taken from you. This actually happens quite often, but we usually don’t feel it, because within a few minutes a new spot node arrives, at a different price, in place of the one that left.

Our problem begins when a new spot node does not arrive within a few minutes. For example, for one of our deployments we want 50 spot instances of an instance type called a1, which is in high demand, has good cost/efficiency, and has a GPU. AWS initially provides them, but at an entirely unpredictable time 40 of these 50 nodes are taken from us, and since no spot capacity is available, our response time increases. Sometimes all 50 are taken at once, we end up with 0 nodes, and the service goes down.

For such cases, we developed a simple solution to stay at high scale and avoid going down entirely. In EKS, we keep two node groups for this: one with instance type a1 and the other with a2. On these node groups we created a taint named gpu-type and gave it the value a1 or a2 according to the instance type.

In CloudWatch, there are two metrics for a node group, InService and Desired. Desired shows how many nodes we want for that node group, and InService shows how many nodes are actually running. Our alarm calculates the InService percentage by dividing the a1 node group’s InService value by its Desired value and goes into the ALARM state if the ratio falls below 0.5. For both the OK and ALARM states, it publishes a message to an SNS topic we specified, and that topic triggers a very simple Lambda function. Depending on the incoming status (OK or ALARM), the Lambda function uses kubectl to switch our deployment’s toleration to a1 or a2 and saves the day!
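
To make the alarm part concrete, here is a minimal boto3 sketch of such a metric-math alarm. It assumes the node group’s Auto Scaling group publishes the GroupInServiceInstances and GroupDesiredCapacity group metrics (the “InService” and “Desired” values mentioned above), and the Auto Scaling group name and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder names -- substitute your own Auto Scaling group and SNS topic.
ASG_NAME = "gpu-a1-nodegroup-asg"
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:spot-capacity-alerts"

cloudwatch.put_metric_alarm(
    AlarmName="gpu-a1-spot-capacity",
    # Metric math: ratio of running nodes to desired nodes for the a1 node group.
    Metrics=[
        {
            "Id": "in_service",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/AutoScaling",
                    "MetricName": "GroupInServiceInstances",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            "Id": "desired",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/AutoScaling",
                    "MetricName": "GroupDesiredCapacity",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            "Id": "in_service_ratio",
            "Expression": "in_service / desired",
            "Label": "InService / Desired",
            "ReturnData": True,
        },
    ],
    ComparisonOperator="LessThanThreshold",
    Threshold=0.5,
    EvaluationPeriods=3,
    # Notify the same SNS topic on both state changes so the Lambda can switch back.
    AlarmActions=[TOPIC_ARN],
    OKActions=[TOPIC_ARN],
)
```

Sending both the ALARM and OK actions to the same topic is what lets the Lambda move the deployment back to a1 once that capacity returns.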

Flowchart of the solution we created
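
The Lambda the topic triggers only needs to look at the new alarm state and flip the deployment’s toleration. Below is a minimal sketch, assuming kubectl and a kubeconfig are packaged with the function (for example in a container image or layer), that the taint uses the NoSchedule effect, and that the deployment and namespace names are placeholders rather than our real workload:

```python
import json
import subprocess

# Placeholder deployment name -- substitute your own workload.
DEPLOYMENT = "mediacenter-gpu-worker"
NAMESPACE = "default"


def handler(event, context):
    # The SNS message body is the CloudWatch alarm notification as JSON.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    state = message["NewStateValue"]  # "ALARM" or "OK"

    # ALARM -> a1 capacity is gone, tolerate the backup a2 node group.
    # OK    -> a1 capacity is back, move the workload back to a1.
    gpu_type = "a2" if state == "ALARM" else "a1"

    patch = [
        {
            "op": "replace",
            "path": "/spec/template/spec/tolerations",
            "value": [
                {
                    "key": "gpu-type",
                    "operator": "Equal",
                    "value": gpu_type,
                    "effect": "NoSchedule",
                }
            ],
        }
    ]

    # Equivalent to: kubectl patch deployment <name> --type=json -p '<patch>'
    subprocess.run(
        [
            "kubectl", "patch", "deployment", DEPLOYMENT,
            "-n", NAMESPACE,
            "--type", "json",
            "-p", json.dumps(patch),
        ],
        check=True,
    )

    return {"deployment": DEPLOYMENT, "gpu-type": gpu_type}
```

Patching the pod template’s tolerations triggers a rolling update, so new pods land on whichever node group currently has capacity.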

Of course, the instance type a2 we chose for the backup node group is less efficient, but we don’t mind, because it’s better than going down or creating a large bottleneck. According to our data from the last year, this event occurs about twice a month. The probability will also vary depending on the instance type you choose and the number of nodes you request.

The increasing demand for AI model serving, especially these days, means we are likely to have even more difficulty finding spot capacity for GPU-enabled instances. To minimize downtime, it becomes necessary to explore solutions like this one. It is worth mentioning that we have never experienced this issue with CPU- or memory-optimized nodes.

I hope we’ve been able to inspire someone who wants to work with a high number of spot nodes without experiencing downtime or bottlenecks.

If you find the idea of developing applications on a scale capable of analyzing millions of images and videos every day exciting, why not join us?
