How Salt Security Cut Their EKS Costs by 40% With Autonomous Optimization

Intel Granulate Tech Blog Team
Intel Granulate
5 min read · Apr 17, 2023


We recently hosted a webinar with AWS where we spoke about the Big Data optimization challenges that Salt Security faced as it scaled up its infrastructure due to the rapid growth of customers using its ML-driven API security solution.

About Salt Security

Salt Security offers context-based security for APIs. The platform combines complete coverage with an ML/AI-driven big data engine to provide the context needed to protect APIs across the build, deploy, and runtime phases. It shows customers all of their APIs, stops attackers in the early stages of an attempted attack, and shares insights that improve their API security posture.

The API security company has grown quickly, becoming a unicorn after just a few years. It experienced hypergrowth in its customer base, which brought a corresponding surge in traffic.

Salt’s cloud-native technology runs in a Java runtime environment on Amazon’s managed Kubernetes service, EKS. To power its machine-learning module, it uses Apache Spark as the computing resource and Apache Cassandra for data storage.

Read on to find out how Granulate’s continuous workload optimization solution helped Salt overcome its machine-learning challenges, reducing its EKS costs by 40% in the process.

Salt Security’s Learning Cycle Challenge

For the solution’s machine-learning cycles to run efficiently, it was essential that Spark and Cassandra run on the same node, maximizing stability and minimizing latency.

But this meant that whenever CPU demand spiked in Spark, there was a knock-on effect on the Cassandra component of the implementation.

As Salt Security scaled, it had to process more and more data. They needed to resolve this issue while continuing to run Spark and Cassandra on the same node without over-provisioning resources, in order to keep cloud costs down.

The solution is based on EKS and deployed across multiple availability zones and regions. Spark and Cassandra, which power the machine learning, needed to work together flawlessly to enable efficient learning cycles.

Periodically, the platform runs a learning cycle in which Cassandra serves as the data storage and Spark serves as the computing resource. Both need to be on the same node for data locality, which improves job completion time. For machine learning, you want this cycle to be as short as possible while remaining stable, with as few interruptions as possible.
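In Kubernetes terms, this kind of co-location is typically expressed with pod affinity. A minimal sketch of what such a constraint could look like (the pod names, labels, and image are hypothetical, not Salt’s actual manifests):

```yaml
# Hypothetical Spark executor pod that asks the scheduler to place it
# on a node already running a Cassandra pod, preserving data locality.
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor
  labels:
    app: spark
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: cassandra
          topologyKey: kubernetes.io/hostname  # "same node" granularity
  containers:
    - name: spark
      image: spark:3.4.0
```

With `topologyKey: kubernetes.io/hostname`, the scheduler will only place this pod on a node that already hosts a pod labeled `app: cassandra` — which is exactly why a CPU spike in Spark lands on the same node as Cassandra.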

As the company grew, it had to process more and more data in each learning cycle. This meant that Spark’s CPU usage would spike during every learning cycle and affect Cassandra.

As can be seen in the above image of the simplified architecture of the solution, there are two Cassandra clusters (referred to here as Datacenters). One mostly handles transactions, ingesting all of the data and replicating it to the analytics cluster. Every node contains part of Cassandra and part of Spark, which together run the learning cycles.

Looking for a Performance and Cost Solution

Once they figured out what the issue was, Salt looked for a solution that would allow Spark and Cassandra to run on the same nodes without throttling Spark or setting any CPU limits, as they didn’t want to degrade the quality of the learning cycles.

When the problem was identified as a CPU issue, they first tried splitting Cassandra and Spark onto different nodes. However, this removed the data locality, which made the learning cycles take much longer.

After that, Salt tried over-provisioning, a common fallback for DevOps teams that don’t have the resources to dive deep into an issue. This wasn’t an effective solution because Spark’s CPU only spiked while a learning cycle was running, which was roughly 30% of the time. They found themselves with unused resources and over-provisioned machines the other 70% of the time, which was far from cost-effective.
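The cost of static over-provisioning is easy to see with a back-of-envelope calculation. The numbers below are hypothetical (the article gives only the 30%/70% split, not actual vCPU figures):

```python
# Back-of-envelope illustration of static over-provisioning.
# Hypothetical figures: only the 30% duty cycle comes from the article.
peak_cpu = 16          # vCPUs needed while a learning cycle runs
baseline_cpu = 4       # vCPUs needed the rest of the time
busy_fraction = 0.30   # learning cycles run ~30% of the time

provisioned = peak_cpu  # static provisioning sizes for the peak, always
used = busy_fraction * peak_cpu + (1 - busy_fraction) * baseline_cpu
utilization = used / provisioned
print(f"average utilization: {utilization:.0%}")  # -> average utilization: 48%
```

Under these assumptions, machines sized for the Spark peak sit at under 50% average utilization — paying peak prices while idle 70% of the time.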

Salt Security now had clearly identified performance and cost goals:

Performance Goals

  • Remove CPU limitations on Spark without affecting other pods on the same node
  • Optimize learning cycle duration and pod availability

Cost Goals

  • Maximize resource utilization
  • Manage pods and nodes based on logical units

Granulate: The Optimization Solution for Big Data Workloads

Big Data workloads, like the Spark workload leveraged by Salt Security, are an excellent use case for Granulate. Granulate supports popular Big Data frameworks like Spark and MapReduce, and specializes in optimizing their corresponding runtimes: Scala, Java, and Python with PySpark.

“I was going over the slides and I thought, I need to see if it works; I don’t believe it just hearing about it.

But I have to say that the support I got from the Granulate team and the implementation itself was super simple. It didn’t require any abnormal efforts at my end, so it made my life as a developer easier.”

— Gal Porat, Director of DevOps, Salt Security

The Granulate agent was deployed in learning mode, a process that took just a few days. Once activated, the API security company immediately saw improved performance, reduced CPU utilization, reduced latency, and improved SLA metrics overall.

For Kubernetes environments like Salt Security’s, Granulate offers a continuous workload rightsizing solution. This free orchestration tool reduces Kubernetes costs by rightsizing the CPU and memory reservations of workloads and pods. It can also automatically and dynamically adjust these values in accordance with HPA parameters and scaling policies.
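The values being rightsized are the standard Kubernetes `resources` fields on each container. A hypothetical fragment (names and numbers are illustrative, not Salt’s configuration) shows the knobs a rightsizing tool tunes based on observed usage:

```yaml
# Hypothetical container spec fragment. A rightsizing tool continuously
# tunes requests like these to match real usage instead of guesses.
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor
spec:
  containers:
    - name: spark
      image: spark:3.4.0
      resources:
        requests:
          cpu: "2"       # what the scheduler reserves on the node
          memory: 4Gi
        limits:
          memory: 6Gi    # no CPU limit, so learning-cycle spikes aren't throttled
```

Note the deliberate absence of a CPU limit, which matches Salt’s performance goal of never throttling Spark; the cost lever is keeping the *requests* honest rather than sized for the worst case.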

“Sizing container workloads and properly setting resource limits is extremely complex and a time-consuming endeavor. And this is mostly because containers are inherently a multi-tenant system and properly setting and allocating memory and resources is a specialized task.

This is where partner Granulate comes in and helps you, as the customer, understand where and how much CPU and memory you’re using, so you can efficiently optimize your container sizes.”

— Shardul Vaidya, Partner Solution Architect, AWS

It is important to note that Granulate’s capacity optimization solution can be used with or without the optimization agent to better rightsize Kubernetes workloads. And on top of the reduced CPU usage, the tuning of CPU and memory requests that Granulate provides delivers an additional layer of cost reduction.

With the CPU usage reduction that Granulate provided, Salt Security was able to run Spark and Cassandra pods on the same nodes without over-provisioning. Ultimately, they achieved an impressive 40% cost reduction.

Learn more about how Granulate optimizes Big Data workloads autonomously and continuously.
