Are You Fair to Your SaaS Neighbor?

Boris Livshutz
Nov 30, 2022

A customer-centric approach to SaaS, Part 3

In my previous two blog posts, I discussed the reasons your business must have visibility into the tenants on your SaaS applications. Now I’d like to address a very tough and important challenge all SaaS companies face: what to do when your systems get overloaded during busy times.

The easiest thing to do when your system is struggling is to ignore the tenants entirely: reject incoming requests indiscriminately and hope for the best. Tenant visibility doesn’t even come into play. The problem with this approach is that you might be ensuring a terrible experience for all your customers when only one of them is overloading your systems. This is known as the “noisy neighbor” problem: one or a few of your customers place such a tremendous load on your system that it adversely affects all customers using the platform.

But it doesn’t have to be this way; a system can be designed such that the usage of one customer doesn’t have to hurt the rest of your customers. In this blog post, I will discuss various approaches that can do just this.

At this point, you may be thinking you don’t need to worry about noisy neighbors. If you already do auto-scaling and use features like AWS Lambda, isn’t that enough to adjust to varying load, regardless of which tenant is causing it? Well, as many companies have learned the hard way, auto-scaling is extremely hard to get right, and most smaller companies have a hard time making it work.

Even if you do master this advanced technology, and successfully auto-scale at the right times, it won’t prevent all overload issues; during a spike in load, your new capacity won’t magically appear instantaneously. First, your system will take time to detect the heavier load and ensure it continues long enough to justify scaling. And even then, after your scaling kicks in, each new instance has to go through several steps to be ready to serve the load:

  • Each new instance first has to be assigned to you by the cloud.
  • The instance is then bootstrapped with its virtual resources.
  • Next, the applications have to start up, caches have to be warmed up, connections must be established, and so on.

Even in the best of times, this process takes at least five minutes to complete. By then, a big request storm might have already brought down your system. And spinning up all those new instances at once can overload your system even further.

But what about Lambda…can’t it scale infinitely? Well, there may not be a technical limit on how many Lambda instances can spring up suddenly, but what about the cost? And what about the downstream services they will end up calling? Many budgets have been wiped out in the first month after launching a system with unlimited use of Lambda, because you are charged both per invocation and for the duration of each invocation. Once the budget folks come at you with their pitchforks, you will put strong limits on when Lambda functions can be executed, and this brings you back to the original problem of not being able to handle load spikes.
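To see how fast the bill grows, here is a back-of-envelope calculation in Python. The prices are illustrative, based on AWS’s published list prices at the time of writing (about $0.20 per million requests and $0.0000166667 per GB-second); check current pricing before trusting the numbers.

```python
# Back-of-envelope Lambda cost during a sustained spike.
# Prices are illustrative, taken from AWS's published list prices at
# the time of writing; check current pricing before relying on them.
PRICE_PER_REQUEST = 0.20 / 1_000_000  # $ per invocation
PRICE_PER_GB_SECOND = 0.0000166667    # $ per GB-second of execution

def lambda_cost(invocations: int, duration_s: float, memory_gb: float) -> float:
    """Cost of `invocations` calls, each running `duration_s` seconds
    at `memory_gb` of configured memory."""
    request_cost = invocations * PRICE_PER_REQUEST
    compute_cost = invocations * duration_s * memory_gb * PRICE_PER_GB_SECOND
    return request_cost + compute_cost

# A noisy tenant sending 2,000 requests per second for a full day,
# each call running 300 ms at 1 GB of memory:
per_day = lambda_cost(invocations=2_000 * 86_400, duration_s=0.3, memory_gb=1.0)
print(f"${per_day:,.0f} per day")  # roughly $900/day, about $27,000/month
```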

But even if you have the budget and start invoking Lambda functions en masse on each load spike, the usual result will just be congestion further downstream. The functions will probably involve database calls, queue requests, use of physical resources such as a disk, and so on. Yes, the functions will get invoked without delay, but they will eventually end up waiting on the database storm they created.

So, just like auto-scaling, Lambda is not a magic bullet, and it does not alleviate the need for a real solution to the noisy neighbor problem. Let’s explore the DIY and open-source solutions other companies have turned to instead.

Many companies use the most basic solution, called rate limiting. They simply place a limit on how many requests a service will accept in a given time window, and reject any requests above that limit. While this solution is easy to implement and configure, it is not very effective at solving the problem. To avoid rejecting legitimate traffic, limits are usually set so high that they only trigger in the most extreme, DDoS-like situations. A company that wants to be less conservative leaves it to its operations teams to set limits based on actual traffic data.
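For concreteness, here is a minimal sketch of this basic approach, a token-bucket limiter in Python. Note that every request shares one bucket regardless of which tenant sent it, and the rate and burst values are placeholders of the kind an operations team would have to pick.

```python
import threading
import time

class TokenBucket:
    """A global limiter: allow at most `rate` requests per second on
    average, with bursts up to `capacity`, and reject everything else."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill tokens for the time elapsed since the last check.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # over the limit: reject this request

# The hard part is not this code; it is choosing these two numbers.
limiter = TokenBucket(rate=1_000, capacity=2_000)

def handle(request):
    if not limiter.allow():
        return 429  # HTTP Too Many Requests
    ...             # process the request normally
```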

The difficulty is that the optimal limit is a moving target that changes with every software update, infrastructure change, or shift in load mix. Setting limits in constantly changing environments becomes cumbersome, time-consuming, and error-prone. Because of all this, most operations teams eventually resort to the first approach, setting limits to a constant high value, which leads back to the same problem we are trying to solve.

As it turns out, rate limits set this way simply don’t limit load. Ultimately, either performance will degrade during load spikes, or you will have to grossly over-provision resources to handle any load storm. Over-provisioning wastes money, because the business is paying for a large amount of unused capacity.

The other problem is fairness. At some point, one customer’s traffic might explode until it accounts for most of the requests on your system. By rejecting all requests that exceed the limit, you are most likely rejecting requests from customers who are only lightly using the system, while the noisy neighbor greedily gets most of its requests serviced.
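A toy simulation makes this concrete (the tenant names, request counts, and limit below are invented for illustration):

```python
import random

# Toy simulation of the fairness failure. Tenant "A" is the noisy
# neighbor sending 9,500 requests in one window; ten quiet tenants
# send 50 each. A single global limit admits the first 1,000 requests.
random.seed(42)
requests = ["A"] * 9_500 + [t for t in "BCDEFGHIJK" for _ in range(50)]
random.shuffle(requests)

GLOBAL_LIMIT = 1_000
served: dict[str, int] = {}
for tenant in requests[:GLOBAL_LIMIT]:  # first come, first served
    served[tenant] = served.get(tenant, 0) + 1

print(served)
# About 95% of the admitted requests belong to noisy tenant A, while
# each quiet tenant gets only ~5 of its 50 requests through.
```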

So while this traditional simple solution tries to prevent overload, it is far from ideal: it raises costs and still allows a bad tenant (the noisy neighbor) to hurt everyone. And many companies, especially smaller ones, don’t have the internal engineering expertise or resources to implement anything more intelligent.

Some very large companies have built their own advanced solutions to this problem. There are plenty of articles, talks, and blog posts from companies such as Netflix, Amazon, WeChat, and even Lyft. These companies have enormous resources and have found thoughtful solutions to the problems with the basic approach I described above. While these advanced solutions vary, the common ideas are as follows.

They all try to identify the capacity of each system and use certain metrics (usually request wait time in queue) to decide when to shed load. Then they decide which load to shed, based on the priority of the calling service and fairness to each user. For example, they would rather kill trivial background tasks than financial transactions, and they avoid killing requests from the same user each time; instead, they try to spread out the pain.
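A stripped-down sketch of that core idea might look like the following, in Python. The wait threshold, priority scale, and request names are my own illustrations, not any particular company’s actual values.

```python
import heapq
import itertools
import time

# Sketch: use queue wait time to detect overload, then shed by priority
# at admission. All thresholds and priorities here are illustrative.
MAX_WAIT_S = 0.5          # overloaded once the oldest request waits this long
CRITICAL = 0              # priority 0 is never shed (e.g. financial transactions)
_seq = itertools.count()  # tie-breaker so heap entries never compare requests

queue = []  # entries: (priority, enqueued_at, seq, request); lower = more important

def _oldest_wait() -> float:
    return time.monotonic() - min(t for _, t, _, _ in queue) if queue else 0.0

def submit(priority: int, request) -> bool:
    """Admit a request, or shed it if the system is overloaded and it is expendable."""
    if _oldest_wait() > MAX_WAIT_S and priority > CRITICAL:
        return False  # shed a trivial background task, never a payment
    heapq.heappush(queue, (priority, time.monotonic(), next(_seq), request))
    return True

def take():
    """Workers drain the most important request first."""
    return heapq.heappop(queue)[3] if queue else None

submit(0, "charge-card")          # critical: always admitted
submit(9, "rebuild-thumbnails")   # first to be shed under load
```

A real system would also track per-user accept and reject counts so the same tenant is not shed every time; that fairness bookkeeping is a large part of what makes these solutions genuinely hard.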

Another important feature is that these solutions, running on each service, communicate with one another. Making decisions in a coordinated fashion helps avoid retry and re-login storms. It also limits wasted work: killing a request deep in a call graph throws away all the upstream work already spent on it, so it is better to reject it at the edge.

These solutions are quite impressive engineering feats, and they are indeed effective ways to deal with the noisy neighbor problem. But if you are reading this blog post, they are probably out of reach for you. They require immense engineering resources, teams on top of teams, and constant updates to account for changes in the underlying applications.

If you don’t have a massive and sophisticated engineering organization, you might want to look for an open-source library that can help. Libraries such as Kanaloa and Netflix’s concurrency-limits offer a few of the features I’ve discussed above. While they do help, they are hard to maintain: they require customization and configuration that must be revisited as your environment and usage patterns change. They are not dynamic enough to be plug and play.

I know this wasn’t a very uplifting blog post for those of you who were hoping for a quick fix. Nevertheless, I hope this content has helped you understand that problems with heavy load and the never-ending noisy neighbors on your system are not easy to solve, and that you should give them more of your attention.

I hope the big takeaway for you is that managing dynamically changing load is intrinsically difficult because you are working in an environment that is never static: your customers’ load patterns are changing, your product is changing, and your infrastructure is changing. To address such a complex problem, any solution must also be highly dynamic!

But all is not lost. In my next blog post I will go over what a comprehensive solution should look like. If you have any ideas or suggestions on how to elegantly solve this problem, please write to me and let me know.
