Navigating Kubernetes Complexity (Part I)

Nelson Gomes
Pipedrive R&D Blog
May 14, 2024

In recent years, we’ve witnessed a growth in K8s clusters, and we often see services and the companies using them… failing. These applications start small, grow organically, and then start to become slower. Debug complexity increases, customers get frustrated and complain, and companies experience growing pains, disappearing almost as fast as they appeared. How?!? Why!?! This series dissects the challenges enterprises face in managing Kubernetes clusters, offering practical solutions and expert insights. From mitigating exponential call counts to controlling socket event traffic, each article equips readers with proactive strategies to optimize performance and maintain stability.

Keeping K8s afloat

The root cause of K8s troubles is quite simple: complexity! And we can’t blame developers - applications built on top of K8s aren’t regular applications but parts of a complex distributed app. A distributed system is far more than just a few hundred services working inside a Kubernetes cluster, and it can be very difficult to grasp the overall picture.

Complex K8s architecture

Usually, as a system develops, it grows exponentially in complexity. At a certain point in its life, it gets flooded by internal or external requests that generate a massive load on the cluster.

I like K8s. My intention isn’t to shame it, so bear with me as I walk you through some situations that happen in live systems, why they occur and how to prevent them.

1. Cutting exponential call count

Almost all K8s apps are composed of many services, including generic app-management services such as:

  • Customers — to check customer state and plan
  • Users — to check a user state
  • Permissions — to manage permissions

These shared services are usually called by multiple other services to check whether they should keep working on behalf of a given customer. This applies to incoming requests but also to internal ones.

So, if we created an API service that calls other services, a single incoming request would fan out into many requests.

Total number of requests created from a single request

For example, if each incoming request (n) created five requests at the first level (#1, #2, #3, #5, #7) and those generated three additional requests at the second level (#4, #6, #8), the total number of requests would be n*8. This is a very simple example. If you dug deeper into the code and traced it, you’d find many more hidden requests (DNS lookups, database queries, API calls and so on).

If you measure everything you will discover that your application is far more exponential than you expected.

The exponential effect of requests would create an internal load eight times bigger than the initial request, which could crash our app.

Let’s do the math: imagine your cluster can handle 1.2M internal requests per day. Every day, it receives 100k external requests, which generate eight times as many internal ones (roughly 800k internal requests). That leaves room for 400k more internal requests, which might seem like a lot, but an extra 50k external requests would be enough to reach breaking point. More interestingly, every internal request you cut steps you further away from that breaking point, which means doing more with fewer resources and making the load less exponential.

So the load at this point is no longer linear but exponential, and a few extra requests may be enough to crash parts of the app and make it unresponsive and unstable.

How can we prevent this from happening? Luckily there are several easy strategies we can try:

Use context propagation

Context propagation lets you inject and propagate reusable information between requests using headers. This allows you to pass a set of data along to every downstream service instead of having each service request the same data again.

You might be familiar with near caches, where data only has a few seconds of time-to-live (TTL). We can use context propagation in the same way, passing data in headers with a TTL that matches the starting request’s timeout. The client should refresh the data when its TTL expires before propagating it to further child requests, and the headers can also be propagated to other systems, like Kafka.

For example, if we propagate a customer’s information (#2) to child requests, we can save three requests downstream.

Passing data between requests via context propagation; the first part is a hex-encoded Unix timestamp TTL

This lets you propagate data from a previous request downstream:

Passing commonly used data in headers saves three requests in this example.
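
As a rough illustration, here is a minimal TypeScript sketch of the idea for an Express-style service. The x-customer-ctx header name, the customers URL and the hex-TTL encoding are assumptions for illustration, echoing the hex-encoded Unix timestamp shown above:

import express from "express"; // assumes Node 18+, so fetch is available globally

const CTX_HEADER = "x-customer-ctx"; // hypothetical header: "<hex Unix timestamp TTL>;<JSON payload>"

function encodeCtx(payload: object, ttlSeconds: number): string {
  const expiresAt = Math.floor(Date.now() / 1000) + ttlSeconds;
  return `${expiresAt.toString(16)};${JSON.stringify(payload)}`;
}

function decodeCtx(value?: string): object | undefined {
  if (!value) return undefined;
  const [hexTtl, json] = value.split(/;(.+)/); // split on the first ';'
  if (parseInt(hexTtl, 16) * 1000 < Date.now()) return undefined; // TTL expired: caller must refresh
  return JSON.parse(json);
}

const app = express();

app.use(async (req, res, next) => {
  // Reuse customer data propagated by the caller if it is still fresh...
  let customer = decodeCtx(req.header(CTX_HEADER));
  if (!customer) {
    // ...otherwise fetch it once (illustrative URL) and re-encode it for child requests.
    customer = (await (await fetch("http://customers/api/v1/customers/42")).json()) as object;
  }
  res.locals.customerCtx = encodeCtx(customer, 30); // attach this header to every downstream call
  next();
});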

Apply near caching

Although best practices state that information-providing services should be responsible for caching and managing their data, near caches are essential for heavily called APIs because we’re usually calling the same methods repeatedly. We can scale instances, but nothing scales indefinitely, so near caching will stabilize your cluster.

Near caching is a technique consisting of a short-lived (sometimes for a few seconds) client-side cache. If you define global caching policies in your general-purpose HTTP client, all services will use them for common requests. This can save a bundle of requests on an already saturated infrastructure and reduce costs considerably.

Keep in mind that requests usually come in bursts. When a customer starts to load a page, the browser usually makes several requests. If a few requests were made before the current one (#1 and #2 in this example), our cache is already warmed up, so we can reuse that information in the following requests and save some calls.

So, using very simple techniques at scale, we saved five calls out of eight, cutting complexity and making the load less exponential - and trust me, real systems are far more complex than this example, so the benefits can be even bigger.

Using near caches saves calls between consecutive requests - two in this example.
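
A minimal near-cache sketch in TypeScript, wrapping a fetch-based HTTP client. The five-second TTL and the customers URL are illustrative; in practice you would configure this once in your shared client:

// Tiny near cache: remember GET responses for a few seconds to absorb request bursts.
const nearCache = new Map<string, { expiresAt: number; data: unknown }>();

async function cachedGet<T>(url: string, ttlMs = 5_000): Promise<T> {
  const hit = nearCache.get(url);
  if (hit && hit.expiresAt > Date.now()) return hit.data as T; // served from the near cache

  const data = (await (await fetch(url)).json()) as T;
  nearCache.set(url, { expiresAt: Date.now() + ttlMs, data });
  return data;
}

async function handleBurst() {
  // A burst of identical calls: only the first one reaches the Customers service.
  await cachedGet("http://customers/api/v1/customers/42");
  await cachedGet("http://customers/api/v1/customers/42"); // near-cache hit, no network call
}

Note that this naive version doesn’t deduplicate concurrent in-flight requests or bound the cache size; a real client-side cache needs both.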

Log your call traces

Logging your call traces will allow you to detect and analyze the steps taken from the beginning to the end of a request and see which calls are being made. Grafana Tempo is one solution that lets you store and query your traces.

Grafana Tempo
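
If your services aren’t instrumented yet, a minimal Node.js setup along these lines is usually enough to start seeing full call trees. Package names are from the OpenTelemetry JS SDK; the collector URL and service name are assumptions, and exact option names can vary between SDK versions:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

// Auto-instrumentation covers HTTP, Express, DNS and common database clients,
// so the "hidden" requests mentioned earlier show up in the trace as well.
const sdk = new NodeSDK({
  serviceName: "api-service", // illustrative name
  traceExporter: new OTLPTraceExporter({ url: "http://otel-collector:4318/v1/traces" }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Once the traces land in Tempo, you can see exactly which calls each request fans out into and where the time goes.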

Execute requests in bulk

When you need to call the same API multiple times, batch the calls into a single request: you will save resources and network time.
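
For example, instead of fetching users one by one, a hypothetical bulk endpoint (the /users/bulk route and its payload are assumptions) turns N round trips into one:

// N round trips: one request per user id.
async function getUsersOneByOne(ids: string[]): Promise<unknown[]> {
  return Promise.all(ids.map((id) => fetch(`http://users/api/v1/users/${id}`).then((r) => r.json())));
}

// One round trip: same data, a single request (assumes the Users service exposes a bulk route).
async function getUsersInBulk(ids: string[]): Promise<unknown[]> {
  const res = await fetch("http://users/api/v1/users/bulk", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ ids }),
  });
  return res.json();
}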

Rate limit requests (HTTP 429 ‘Too Many Requests’)

Impose sensible global, customer-specific and user-specific rate limits to prevent abuse that may impact the cluster’s usability. Adding a web application firewall (WAF) is also necessary: traditional firewalls have minimal effect nowadays because HTTPS-encrypted traffic pushes the decision of whether to execute a request to the server rather than the firewall.

With these in place you can prevent abuse and control the load your K8s cluster receives.

Abuse prevention flowchart.

This way, WAFs can block undesired requests, and rate limiting at different levels controls abuse by customers and their users, ensuring the cluster doesn’t handle too heavy a load. Of course, we don’t want to block at the cluster level unnecessarily, but if we reach that point the cluster may already be unstable, so returning a 429 status code is a better alternative than failing requests massively.
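
A minimal fixed-window sketch of the idea in TypeScript. The limit, window and x-customer-id header are illustrative; in practice this usually lives in your API gateway or WAF rather than in application code:

import express from "express";

const LIMIT = 100;          // requests allowed per customer per window
const WINDOW_MS = 60_000;   // 1-minute window
const counters = new Map<string, { count: number; resetAt: number }>();

const app = express();

app.use((req, res, next) => {
  const key = req.header("x-customer-id") ?? req.ip ?? "anonymous"; // illustrative rate-limit key
  const now = Date.now();
  const entry = counters.get(key);

  if (!entry || entry.resetAt < now) {
    counters.set(key, { count: 1, resetAt: now + WINDOW_MS });
    return next();
  }
  if (entry.count >= LIMIT) {
    res.setHeader("Retry-After", Math.ceil((entry.resetAt - now) / 1000));
    return res.status(429).send("Too Many Requests"); // shed load instead of failing everything
  }
  entry.count++;
  next();
});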

2. Control socket events traffic

All major web applications currently receive server-to-browser push notifications. While I reckon it’s awesome that browsers can receive near real-time data, which makes apps look and feel like traditional desktop applications, we have to keep in mind that they’re not.

Sometimes, we also make false assumptions about an app’s capabilities.

“Our server can process thousands of events from our backend.” Right… but can your customer’s browser handle that? Customer machines aren’t servers. They don’t have the same processing power and memory as your servers. Trying to mimic that same processing power on the client side is a recipe for failure.

Sending too many events to the browser may cause overall slowness.

So push notifications must be handled with a best-effort assumption: we try to deliver events to the frontend but don’t guarantee it.

In scenarios where customers generate millions of updates, you shouldn’t even try to deliver that many events to their browser. Doing so would send a massive trove of data that could take minutes, or even hours, to flush, and could also impact network latency. After all, not all locations have good bandwidth, and not all computers are powerful.

Finally, delivering so much information to a customer’s browser can be pricey because some data centers charge for outgoing traffic.

Sometimes we have components listening to those events, getting updated, and eventually making other requests. A few updates generate even more requests, creating a snowball effect. Occasionally, we hide components but keep them listening, which adds to the problem by generating invisible renders for every event.

So, what should we do instead?

Rate limit outbound traffic

Pushing too much data will slow down a customer’s browser and network and increase costs, so we need to keep that data at acceptable levels. If a specific user has multiple connections due to multiple browsers or tabs, divide the rate limit by the number of connections so you can control the total amount of data being pushed to that user.
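
A sketch of that idea in TypeScript, assuming a WebSocket-like setup where you track each user’s open connections. The per-second budget and the Connection interface are illustrative:

// Per-user outbound budget, shared across all of that user's open tabs/browsers.
const USER_EVENTS_PER_SECOND = 50;

interface Connection { send(payload: string): void } // e.g. a WebSocket

class OutboundLimiter {
  private sent = new Map<string, number>(); // userId -> events pushed in the current window

  constructor() {
    setInterval(() => this.sent.clear(), 1_000); // reset the budget every second
  }

  push(userId: string, connections: Connection[], event: object): void {
    if (connections.length === 0) return;
    // Each tab effectively gets USER_EVENTS_PER_SECOND / connections.length,
    // so the total pushed to one user stays constant however many tabs they open.
    const used = this.sent.get(userId) ?? 0;
    if (used + connections.length > USER_EVENTS_PER_SECOND) return; // over budget: drop (best effort)

    for (const conn of connections) conn.send(JSON.stringify(event));
    this.sent.set(userId, used + connections.length);
  }
}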

Drop events server side

Drop events older than a few minutes; this means discarding data the moment it starts piling up. What about critical events? We should treat essential data as non-disposable, but only for small batches of critical data that we want the customer to receive in their browser. For example, operation confirmations and similar events should have a priority lane that lets them arrive before other events.

When events start piling up, we need to start dropping them for UI sanity.
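
A server-side sketch of both rules, dropping stale events while letting critical ones (operation confirmations, for example) jump the queue. The event shape is illustrative:

interface PushEvent {
  createdAt: number;    // epoch milliseconds
  critical?: boolean;   // e.g. an operation confirmation the user is waiting for
  payload: unknown;
}

const MAX_AGE_MS = 5 * 60_000; // anything older than a few minutes is considered stale

// Returns the events worth sending: stale ones are dropped, critical ones go first.
function prepareOutbox(queue: PushEvent[]): PushEvent[] {
  const now = Date.now();
  return queue
    .filter((e) => e.critical || now - e.createdAt <= MAX_AGE_MS)
    .sort((a, b) => Number(b.critical ?? false) - Number(a.critical ?? false)); // priority lane
}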

Send all needed data

All needed data should be part of the event; the frontend should NOT make any further calls after receiving a notification. If it does, it generates a nonlinear number of requests based on events: one event means one request, but if it receives 100k… you’ll be in trouble, especially if you have thousands of active users.

Make sure you use an API client that manages requests for you, and keep a near cache on the browser side to hold data for some time and avoid repeated calls for the same data, preferably with some global configuration.

Avoid doing event-triggered requests as much as possible.
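
In practice this simply means putting the full payload in the event instead of an id the frontend would then have to resolve; the deal shape below is illustrative:

// Anti-pattern: the event only carries an id, so every event triggers another API call.
//   { type: "deal.updated", dealId: "42" }

// Better: the event is self-contained, so the frontend updates its local state directly.
interface DealUpdatedEvent {
  type: "deal.updated";
  deal: { id: string; title: string; value: number; status: string };
}

function onDealUpdated(event: DealUpdatedEvent, store: Map<string, DealUpdatedEvent["deal"]>): void {
  store.set(event.deal.id, event.deal); // no follow-up request needed
}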

Put hidden components on standby

Hidden components should still receive data but skip any renders until they’re made visible again, at which point they re-render one final time. If you’re sending out multiple events and rendering several invisible components, you’re hurting browser performance by rendering useless components.

Single-page applications can also have performance issues if bad architectural decisions are made. A good option is to separate data from its view to update it without rendering the view.

Any hidden view in your browser must be passive when not visible.
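
A React-flavoured sketch of a “standby” hook in TypeScript: the component keeps receiving events while hidden but only triggers a render when it’s visible, plus one final render when it becomes visible again. The subscribe function stands in for whatever event bus you use:

import { useEffect, useRef, useState } from "react";

// Assumed event-bus API: subscribe returns an unsubscribe function.
declare function subscribe<T>(topic: string, handler: (data: T) => void): () => void;

function useStandbyData<T>(topic: string, visible: boolean): T | undefined {
  const latest = useRef<T>();                  // always holds the freshest event
  const [rendered, setRendered] = useState<T>();

  useEffect(() => subscribe<T>(topic, (data) => {
    latest.current = data;
    if (visible) setRendered(data);            // render only while the view is visible
  }), [topic, visible]);

  useEffect(() => {
    if (visible) setRendered(latest.current);  // one final re-render on becoming visible again
  }, [visible]);

  return rendered;
}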

3. Self-inflicted frontend DDoS (distributed denial of service)

I sometimes see developers unaware of the impact they may cause. Here are some small things that pop up during discussions and meetings.

“I just added one extra request to my component.“

Just one, right? In fact, the impact depends on how many times that component is shown to different users. It’s not the first time a ‘single’ extra request has crashed a feature: the request multiplied into far more requests than the feature was prepared to handle, and it crashed almost immediately after being deployed to live. The impact also depends on which page the component is used on and how much traffic that page gets.

“My React component had an unforeseen case in its inner state, which led to an infinite loop of API calls due to continuous rendering.”

Now multiply this by thousands of customers, and everything crashes. The funny thing is that even if you rolled out a fix, it would take some time until all browsers detected it and refreshed the JavaScript, which makes this kind of incident difficult to fully resolve.

“We decided to cache some components in the browser DOM for speed.“

In theory, this is fine. In theory!

What happens most of the time is that those hidden components keep working in the background, listening to events and rendering, even though they’re invisible. Worse, sometimes they keep making requests that aren’t needed. When this happens repeatedly, it generates enough requests to slow everything down. Combine these events with hidden components and API calls, and you get a perfect storm: slowness on the front and back end.

A single request shouldn’t be problematic, right? But at scale, with thousands of users, a request becomes thousands of requests. The size of its impact depends on several factors. Is the component used on a main page or one with heavy traffic? Can the backend handle such requests? Does it have caching in place to avoid hitting the database? Does the database have replicas to handle such requests? Is the frontend code coverage good enough to prevent unexpected scenarios?

Let’s dive into how we can mitigate these problems.

Prepare your feature to handle the proper number of requests

Add some caching to it, do some math and scale it accordingly. Don’t assume the database can handle all requests without any help.

Frontend components need good code coverage

Otherwise, they won’t work well in all scenarios, which could cause infinite loops that crash the app. These infinite loops are very difficult to detect and revert because the JavaScript code needs to be invalidated in all browsers, which takes time.

Have service or endpoint metrics with adequate alerting

Calculate the average number of requests an endpoint receives during a week and divide it into hourly chunks to get a weekly per-hour average. Then compare it against an hourly moving average.

Weekly per-hour requests, which are very stable (yellow), and hourly requests (green)

To calculate the hourly request count, you can use:

sum(increase(service_web_request_count{service="service"}[1h]))

For the weekly per-hour request count, you can use:

sum(increase(service_web_request_count{service="service"}[1w]))/7/24

By dividing the two values, you get a ratio that is adaptive and follows request fluctuations over time. This ratio signals how far hourly traffic rises above or falls below the weekly per-hour average.

This chart shows how often hourly traffic fluctuates above and below the weekly per-hour average.
(sum(increase(docbase_api_web_request_count{service="docbase-api"}[1h]))) / 
(sum(increase(docbase_api_web_request_count{service="docbase-api"}[1w]))/7/24)

Now, you can create alerts for these ratios. For example, if you set a threshold of 3.5, or even four, you’ll receive an alert for any unexpected growth in traffic (you can see in the chart that last week we never surpassed three). If you set an alert based on an absolute request count instead, you’d need to update the number every time the traffic pattern changes.

Creating this alert will allow you to monitor traffic changes that would be difficult to detect otherwise. This is especially true with our self-generated DDoS, which could take weeks to detect, even if it is legitimate and authenticated traffic.

Final notes

K8s is a great tool that’s sometimes badly used and poorly understood. It’s easy to do something with it but far more complex to scale it the way we want, and it’s very easy to be overwhelmed and confused when things don’t go as expected.

When using K8s, my advice is to use common sense. Think outside the box, ask yourself what might be happening and how you can improve things, and, of course, think at scale. It’s never just a request. It’s never just a component. Everything can have a big impact if it’s poorly optimized or poorly thought out.

It’s also essential to have the proper tooling in place. Use Tempo for tracing, Loki for logging and Grafana for metrics and alerting. Without these tools, you’re basically blind. For this to work, you need to instrument your backend with OpenTelemetry and your frontend with a RUM (real user monitoring) tool like Faro.

Happy Kubernetting.

(to be continued…)



Nelson Gomes works at Pipedrive as a senior SRE, holds a degree in Informatics from the University of Lisbon and a postgraduate degree in project management.