Is Your SaaS Product “Tenant-Aware”?

Boris Livshutz
Apr 14, 2022

A customer-centric approach to SaaS, Part I

THE PROBLEM

Back in 2010, when I was working at a small startup with a tiny engineering team, I found myself in charge of building a SaaS version of our product, which had previously only been available in an on-premise version. This was no easy task, but after quite a bit of on-the-job learning, homebrew tooling, and many mistakes along the way, we had a successful SaaS offering. To our surprise, however, that was just the beginning of our problems. As we soon learned, our SaaS business was all about the customer, but our platform and tooling were all about the resources. In other words, we had no visibility into the customer (or in SaaS-speak, the tenant). Our platform and tooling were tenant-unaware!

Because our startup itself was an APM vendor, we knew how to monitor our own systems really well; we had rich insights into how our own software was running, but nothing about our tenants’ systems. We had great technical and business metrics about various components of the technology stack and infrastructure, but we had no metrics on our tenants, our customers.

BENEFITS OF IDENTIFYING THE TENANT

Why is it so important to be tenant-aware? From a technical standpoint, you must be tenant-aware so that you can:

● quickly detect outages that are caused by or affecting specific tenants,

● determine SLA compliance,

● enforce tenant isolation, and

● track overall tenant utilization and performance across your services and tech stacks.

On the business side, tenant-awareness is required if you want to build a customer-centric business model with the ability to:

● know the operational cost of each customer,

● provide discounts,

● manage billings, and

● measure customer satisfaction.

And of course, this is only a partial list of the business data that drives top- and bottom-line optimization.

Those tenant-aware benefits are made possible by first being able to collect tenant-specific metrics as you monitor your SaaS cloud and tenant activity, and then making the metrics actionable.

● To DevOps, actionable means tooling for tenant throttling, tenant migration, and other tenant management tasks.

● For the business, actionable means billing based on customer utilization, improving renewal rates by detecting customer satisfaction, and even increasing product adoption by looking at feature utilization.

THE SOLUTION

OK, you are convinced: you must become tenant-aware. But how do you do that? I wish I could say that you just buy one of 20 products out there, install it, and voilà, you now have brilliant dashboards that give you all of the data we just discussed plus control mechanisms for operations and maintenance. Sadly, after decades of SaaS growth, there is no simple vendor solution currently available.

But wait! We all know that every DevOps team runs dozens of monitoring tools that show the health and performance of systems. When systems degrade, team members quickly try to find the root cause of the problem. However, these efforts work well only when the root cause is a faulty piece of hardware, a configuration error, or buggy software code, all of which DevOps can address. If the root cause instead resides in a tenant’s environment, those monitoring tools can’t offer much help. So unfortunately, that leaves you with the age-old solution: gain visibility into your tenants through your own code and tooling. Let’s look at one possible way of doing this.

IDENTIFYING THE TENANT

A common and relatively simple solution exists, if you are willing to involve developers and change code. Developers must add a unique tenant ID to a shared request context so that it is always passed along and visible in every service-to-service and service-to-resource interaction. Once you have changed the code and can identify each tenant, you need to actually capture metrics in production, so you can calculate the two most important metrics for tenant monitoring:

● Frequency — how often a tenant is utilizing a service or resource

● Response Time — how much time a service spent responding to a tenant’s request (on average)
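As a minimal sketch of carrying the tenant ID through a service, assuming Python services that talk over HTTP: the header name `X-Tenant-ID` and the helper functions below are hypothetical, chosen only for illustration. The tenant ID is read once when a request arrives, kept in a context variable for the life of that request, and copied onto every downstream call.

```python
import contextvars

# Hypothetical header name; use whatever your services agree on.
TENANT_HEADER = "X-Tenant-ID"

# Holds the tenant ID for the request currently being handled.
_current_tenant = contextvars.ContextVar("current_tenant", default=None)


def handle_incoming(headers):
    """Read the tenant ID from an incoming request and store it in context."""
    tenant_id = headers.get(TENANT_HEADER)
    if tenant_id is None:
        raise ValueError(f"request is missing the {TENANT_HEADER} header")
    # The returned token lets the caller restore the previous value later.
    return _current_tenant.set(tenant_id)


def finish_request(token):
    """Restore the context to what it was before handle_incoming()."""
    _current_tenant.reset(token)


def current_tenant():
    """Return the tenant ID of the request in progress, or None."""
    return _current_tenant.get()


def outgoing_headers(extra=None):
    """Build headers for a downstream call, carrying the tenant ID along."""
    headers = dict(extra or {})
    tenant_id = _current_tenant.get()
    if tenant_id is not None:
        headers[TENANT_HEADER] = tenant_id
    return headers
```

A context variable is used here rather than a global because it stays separate per thread and per asyncio task, so concurrent requests from different tenants do not overwrite each other.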

Since DevOps already monitors your systems with all sorts of monitoring tools (or uses its own homebrew monitoring implementation), just extend those tools to capture the tenant ID and break out the data they already report per tenant. From this data, you will have access to the two metrics mentioned above, and DevOps can graph, analyze, and alert on them in production.
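If you are on the homebrew path, the two metrics can be collected with very little code. The sketch below is illustrative only: the `TenantMetrics` class and its method names are made up for this example, and it keeps everything in memory, whereas a real setup would ship the numbers to whatever monitoring backend you already use.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class TenantMetrics:
    """Tracks request count and total response time per (tenant, service)."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total_seconds = defaultdict(float)

    def record(self, tenant_id, service, seconds):
        """Record one completed request and how long it took."""
        key = (tenant_id, service)
        self._count[key] += 1
        self._total_seconds[key] += seconds

    @contextmanager
    def timed(self, tenant_id, service):
        """Time the wrapped block and record it, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(tenant_id, service, time.perf_counter() - start)

    def frequency(self, tenant_id, service):
        """Frequency: number of requests seen for this tenant and service."""
        return self._count.get((tenant_id, service), 0)

    def avg_response_time(self, tenant_id, service):
        """Response time: mean seconds per request, 0.0 if none were seen."""
        key = (tenant_id, service)
        count = self._count.get(key, 0)
        if count == 0:
            return 0.0
        return self._total_seconds[key] / count
```

In a request handler this would look like `with metrics.timed(tenant_id, "search"): ...`, with the tenant ID taken from the request context.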

All of you engineers out there will probably now point out that this seems like an oversimplification and doesn’t tell you everything about the tenant. It is true that there are many more metrics we could come up with and monitor, such as low-level resource utilization, function-level analysis inside each service, and many more. However, these can require a massive amount of engineering and overhead cost while adding only minor utility to our goals. Most problems that tenants cause can be discovered by focusing on the high-level services with only these two metrics. Not to mention that we already have plenty of resource monitoring tools that track such lower-level metrics.

ROOT CAUSE ANALYSIS

Now that you have your key metrics, how will you use them to be tenant-aware? Let’s turn our attention back to DevOps, whose job it is to keep an eye out for service problems and, when a problem does occur, to quickly find the root cause and fix it. Let’s look at a couple of problems that DevOps can now quickly find with these new tenant metrics.

Noisy Neighbor Problem

This problem occurs when one tenant, potentially even one of your smallest customers, increases their utilization of your services to such an unanticipated degree that it causes performance degradation for some of your other tenants. Looking at both of our new metrics, DevOps can determine if this is simply due to an increase in tenant requests to one of the services (the first metric, frequency) or if the issue is something deeper. Perhaps the tenant’s data size has grown and each service request takes much longer to fulfill (the second metric, response time).
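To make that distinction concrete, here is a rough sketch that compares a tenant’s current window of activity against a baseline window for one service. The `Window` type, the `diagnose` function, and the 2x threshold are all assumptions made for this example; sensible thresholds depend on your own traffic.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """One tenant's activity on one service over a fixed time window."""
    requests: int           # frequency
    avg_response_s: float   # average response time, in seconds


def diagnose(baseline, current, ratio=2.0):
    """Say which metric, if any, grew by at least `ratio` over the baseline.

    The default ratio of 2.0 is an arbitrary illustration, not a recommendation.
    """
    # Guard against a zero baseline so a brand-new tenant is not divided by zero.
    more_requests = current.requests >= ratio * max(baseline.requests, 1)
    slower = (
        baseline.avg_response_s > 0
        and current.avg_response_s >= ratio * baseline.avg_response_s
    )
    if more_requests and slower:
        return "both"
    if more_requests:
        return "frequency"
    if slower:
        return "response_time"
    return "normal"
```

A result of "frequency" points at a tenant simply sending more requests, while "response_time" suggests that each request has become more expensive, for example because the tenant’s data has grown.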

Obnoxious Neighbor Problem

This problem arises when the noisy neighbor gets worse and starts to severely impact many (or all) tenants, and potentially even endangers the stability of your services.

MAKING IT ACTIONABLE

OK, so let’s say you’ve managed to get visibility into all this. Now that you notice problems that tenants are causing, what do you do about it? There are technical and business solutions to some of these problems. For now, let’s focus on the technical side.

Going back to the noisy neighbor problem, a tenant that is dangerous to others can be moved to their own environment and isolated. In less extreme cases, tenants can be rebalanced across environments so that no single environment has more heavy tenants than it can support. Ideally, you already have tools to reassign tenants to environments and resources. If so, you can provide your operations team with dashboards to identify noisy neighbors and link them to the tools they can run to resolve these issues in real time as they come up, limiting the impact on others. We will discuss how to implement and automate these types of actions in a future post.
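As one possible illustration of the rebalancing idea, the sketch below places tenants greedily: heaviest first, each onto the environment that currently carries the least load, with any tenant above a cutoff set aside for isolation. The function name, the `isolate_above` cutoff, and the notion of a single "load" number per tenant (for example, request count multiplied by average response time) are assumptions made for this example, and the sketch only computes a plan; it does not move anything.

```python
import heapq


def rebalance(tenant_load, env_names, isolate_above=None):
    """Plan a placement of tenants onto environments.

    tenant_load:   {tenant_id: load}, where load is any comparable number.
    env_names:     names of the shared environments available.
    isolate_above: tenants with load above this get their own environment.

    Returns (placement, isolated): placement maps each environment to its
    tenants, and isolated lists tenants to move to dedicated environments.
    """
    if not env_names:
        raise ValueError("at least one environment is required")

    placement = {env: [] for env in env_names}
    isolated = []

    # Min-heap of (current load, environment) so the lightest is popped first.
    heap = [(0.0, env) for env in env_names]
    heapq.heapify(heap)

    # Heaviest tenants first; ties broken by tenant ID for a stable result.
    ordered = sorted(tenant_load.items(), key=lambda kv: (-kv[1], kv[0]))
    for tenant, load in ordered:
        if isolate_above is not None and load > isolate_above:
            isolated.append(tenant)
            continue
        env_load, env = heapq.heappop(heap)
        placement[env].append(tenant)
        heapq.heappush(heap, (env_load + load, env))

    return placement, isolated
```

Greedy placement is not guaranteed to be optimal, but it is simple to reason about and usually good enough as a first pass that an operator can review before migrating anyone.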

And these are just the technical issues we can solve; we also have to look at the business solutions to tenant problems, in the areas of finance, budget, and pricing. Again, we will explore this in future posts.

THE FUTURE

I hope this blog post has inspired you to learn more about your tenants and try to be tenant-aware. As you can see, it’s relatively straightforward to get started, and the benefits are immense. And so far we’ve only focused on the technical aspects. Stay tuned for future posts where we will dive deeper into the business aspects of tenant awareness.

I also didn’t discuss automating the remediation of tenant problems; we’ll dive more into that in the future as well.

Finally, while I claimed that there are no off-the-shelf solutions to this (and just made you go through all those code changes!), there are new solutions being worked on in the industry. We will also explore some of these in future posts. Stay tuned!

Boris Livshutz

Angel Investor | Technical Advisor | Serial Entrepreneur