Framework to think about Service Limits in a Microservices Architecture

Published in Palo Alto Networks Developers, Dec 29, 2022

With more than 100 microservices deployed in production to power the Prisma Cloud Platform at Palo Alto Networks, we have had to deal with some basic reality checks as the business grew from $XM to $XXXM in 3+ years, going from tens to thousands of customers:

Achieving a business flow now involves significantly more moving pieces. Qualitative metrics on why a feature does not function now involve many service components.
Up-time and scalability requirements increase significantly with business growth. With this many services, things get a lot more complex.
Incremental feature growth doesn’t always align with the domain-driven design model adopted by engineering. This is less often talked about amidst the numerous theoretical discussions on the good things microservices can bring to a distributed system, but I’ll save this for another blog post.

One of the key components in maturing a microservices architecture is to be able to articulate its limits.
Looking around for standardized definitions and mechanisms for formally defining service limits, we found none of note that were relevant to our scale. We took this as an opportunity to work out a framework that works for us, and we believe it's generic enough to be applied to any microservices architecture.

Definition of “Service Limits”

“Service limits” can be defined as those metrics that can articulate a microservice’s boundary with respect to its functional use cases (business context), scalability, cost and performance requirements.

Need for Service Limits

Be able to define a service boundary to maintain predictable scalability and performance.
Be able to communicate the business context and the criticality of the ‘user journey’ this service is accountable for.
Be able to inform and optimize Performance and System testing regressions in a fast changing microservices environment.
Be able to inform and contribute to our Earnings/Revenue ratio by ensuring efficient spend on scaling such limits with good business justification.

Notion of Service Chains and Business Context

One of the key pre-conditions to understanding some of the Factors that contribute to service limits is the concept of Service chains.

“Service chains” are sets of microservices that together are expected to deliver key business use cases. A good example is an Ordering Service that has a dependency on a User Service. To communicate the limits of the Ordering Service in business terms, it's critical to understand where this service fits in the broader architecture.

In the above example, User Service -> Ordering Service forms the chain used to communicate service limits for the Ordering Service, e.g. number of orders placed/sec per region = 100. This limit, while specific to the Ordering Service and effectively the RPS allowed per region, is communicated in terms of the users placing the orders.

Factors

The below diagram summarizes the top factors that’d impact each microservice in a service family. We’ll use this diagram to introduce new definitions to help standardize our understanding of Service Limits.

The 4 Golden Pillars for defining Service Limits

Qualitative Growth Scale (QGS)

This helps us predict the limits of a service based on where it fits into the broader architecture. We do not express these limits at the engineering-level granularity of individual services (e.g. RPS/QPS); instead, they relate directly to the business context and give us a way to project and predict growth.

E.g. an ordering service scales by the number of deliveries, and not just by the number of users of the system, who may or may not place an order.
In many cases we may not be able to accurately predict or present the growth scale if the service sits deep down the microservices stack, but it’s still helpful to articulate the specific business metric this service is accountable for.

Quantitative Limits

These are the traditional limits most engineering teams are used to providing when dealing with the scalability limits of the current system.
In an ideal case, these metrics are deduced from the Qualitative Growth Scale defined above to ensure they address the broader business context.
E.g. if the system scales by number of order deliveries, and each order delivery involves 2 DB writes and 3 API calls, then how many deliveries can we support in one day assuming current QPS and RPS limits?
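
The question above can be worked as simple arithmetic. The following is only a sketch: the 2 DB writes and 3 API calls per delivery come from the example, while the function name and the QPS and RPS figures are hypothetical, and it assumes both limits can be sustained evenly over a full day.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_deliveries_per_day(db_qps_limit, api_rps_limit,
                           db_writes_per_delivery=2,
                           api_calls_per_delivery=3):
    """Deliveries per day supportable under both the DB and API limits."""
    # Deliveries per second that each resource could sustain on its own.
    by_db = db_qps_limit / db_writes_per_delivery
    by_api = api_rps_limit / api_calls_per_delivery
    # The tighter of the two resources bounds the service.
    per_second = min(by_db, by_api)
    return int(per_second * SECONDS_PER_DAY)


# Hypothetical limits: 1,000 DB writes/sec and 1,200 API requests/sec.
print(max_deliveries_per_day(db_qps_limit=1000, api_rps_limit=1200))  # 34560000
```

With these assumed numbers the API limit is the binding one (400 deliveries/sec versus 500 for the DB), which is the kind of observation the quantitative limit should surface.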

COGS Limits or “Cost of Goods Sold” Limits

In an ideal world, any scale is achievable provided there is enough budget to spend on infrastructure (and human) resources.

While many growth-stage businesses may not pay much attention to the spend incurred, or even venture down the microservices path, it’s imperative to understand how much scale is worth the cost as the overall architecture evolves into full-blown microservices.

Given the Qualitative Growth Scale and an understanding of the cost of scaling, the service team should be able to deduce the ROI on that cost. Ideally this should prompt discussions with Product Management and Engineering leadership on the validity of such growth and its pricing model.
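
As a rough illustration of that ROI deduction, the sketch below compares the extra revenue unlocked by raising a limit with the extra spend needed to raise it. The function name, its parameters and every figure are hypothetical; a real COGS model would account for far more than this.

```python
def scaling_roi(extra_units_per_month, revenue_per_unit, extra_cost_per_month):
    """Return the ROI of a scaling investment as a fraction (0.5 means 50%)."""
    # Extra revenue from the additional business units (e.g. deliveries)
    # that the higher limit makes possible.
    extra_revenue = extra_units_per_month * revenue_per_unit
    return (extra_revenue - extra_cost_per_month) / extra_cost_per_month


# Hypothetical: 3,000 more deliveries/month at $2 each, for $4,000/month more spend.
roi = scaling_roi(extra_units_per_month=3000, revenue_per_unit=2, extra_cost_per_month=4000)
print(f"ROI: {roi:.0%}")  # ROI: 50%
```

A negative result would mean the additional scale costs more than it earns, which is the case worth raising with Product Management before committing to it.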

Dependency Limits

These mainly deal with dependent microservices and their limitations. Their limits can also be assumed to be defined in the same framework as indicated in this document.

With the Qualitative Growth Scale understood, we can now enumerate the service chains that achieve critical business outcomes; the services in those chains in turn become dependent services. A chain is only as strong as its weakest link, so it is critical that service limits across the chain are well understood.

E.g. the Ordering Service supports 10k RPS but the Banking application only supports 2k RPS. In this case, it would not be prudent to promise a scale beyond 2k RPS unless the Banking application is also able to scale.

Except for the Qualitative Growth Scale (QGS) factor, all other factors can be deterministic in nature, assuming they all align to the QGS.
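
The weakest-link rule in this example amounts to taking the minimum limit across the chain. A minimal sketch, using the figures from the example and a hypothetical helper name:

```python
def chain_limit(limits_rps):
    """Return (bottleneck_service, limit) for a service chain."""
    # The service with the smallest limit caps the whole chain.
    bottleneck = min(limits_rps, key=limits_rps.get)
    return bottleneck, limits_rps[bottleneck]


chain = {"Ordering Service": 10_000, "Banking application": 2_000}
print(chain_limit(chain))  # ('Banking application', 2000)
```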

Tying it up — How do you use these Factors?

Now that we have some context on each factor, we can look at how each one can be made actionable.

Actionability Matrix For Service Limit Factors

Service Limits Definition Checklist

The proposed framework has been a good starting point for us. As with any framework, it needs constant iteration and tuning based on the subjective nature of the software business, the engineering culture of the organization and the demands of architecture evolution.

The direction to service teams can be reduced to a checklist as follows:

1. Enumerate all ‘User journeys’/business use cases that are directly or indirectly supported by your service. This should ideally be driven by the ‘Domain Driven Design’ model for microservices.
2. Rank the above based on business priority.
3. Translate each of the above into individual quantitative metrics such as number of queries or number of requests to your microservice.
4. Review the current capacity plan for existing services and the current QPS/RPS, or a similar quantitative metric that’s relevant to the above quantitative metrics. (Q)
5. Baseline existing COGS, if not done already, for your service needs and the above data points.
6. Document the service limit as Q based on current capacity.
7. For all dependent services (first degree only), identify the respective limits relevant for each use case listed in step 2. Track the minimum limit for each service for each ‘user journey’.
8. Use the limits from the last two steps to deduce the overall system limit for this service for each user journey/business use case.
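
The last few steps of the checklist can be sketched in code. This is an illustrative model only, not our internal tooling: the `UserJourney` type, its field names and all numbers are hypothetical, and it assumes dependency limits have already been expressed in journeys per second.

```python
from dataclasses import dataclass, field


@dataclass
class UserJourney:
    name: str
    priority: int              # 1 = highest business priority
    requests_per_journey: int  # requests one journey sends to our service
    # First-degree dependency limits, in journeys per second.
    dependency_limits: dict = field(default_factory=dict)


def journey_limits(service_capacity_rps, journeys):
    """Return {journey name: journeys/sec}, ordered by business priority."""
    result = {}
    for j in sorted(journeys, key=lambda j: j.priority):
        # Our own limit Q, converted from requests/sec to journeys/sec.
        own = service_capacity_rps / j.requests_per_journey
        # The minimum limit across first-degree dependencies.
        dep = min(j.dependency_limits.values(), default=float("inf"))
        # The overall limit is the smaller of the two.
        result[j.name] = min(own, dep)
    return result


journeys = [
    UserJourney("browse catalog", priority=2, requests_per_journey=2),
    UserJourney("place order", priority=1, requests_per_journey=4,
                dependency_limits={"User Service": 500, "Banking application": 200}),
]
print(journey_limits(service_capacity_rps=1000, journeys=journeys))
```

Here the "place order" journey is capped by a dependency (200/sec) even though the service itself could handle 250/sec, while "browse catalog" is capped only by the service's own capacity.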

I deal with Macro-services — This seems really hard to do!

What’s typical of most architecture evolutions in a growing business is that organizations start out with macro-services (a sizable set of functions and features handled by one service) and eventually decompose them into smaller, well-defined services. Along this journey, defining service limits can be expected to be really hard. The problem is further exacerbated by the weaker separation of concerns in a macro-service environment.

The intent would ideally be to work backwards from ‘User Journeys’, prioritizing user workflows and working out how limits can be applied to those journeys. Following this model, we are no longer tied to either the micro or the macro service model, and we could be anywhere on the spectrum between true microservices and the domain-oriented services that are becoming popular.

Based on our internal trials so far, the framework seems to be true to the overall aspirations of a microservices architecture, which seeks loosely coupled and highly aligned units working together in the most efficient way.

References

Martin Fowler’s DDD for Microservices — https://martinfowler.com/bliki/DomainDrivenDesign.html
Scalability and Performance — O’Reilly’s Production Ready Microservices — https://www.oreilly.com/library/view/production-ready-microservices/9781491965962/ch04.html