Navigating Kubernetes Complexity (Part II)

Nelson Gomes
Pipedrive R&D Blog
Jun 25, 2024

In the previous part of this article, we discussed that the root cause of many issues with Kubernetes applications is complexity. As applications grow, they become more complex, leading to increased debugging challenges, customer frustration, and stability problems. To solve the challenges faced by K8s clusters, I suggested strategies such as context propagation, near caching, and rate limiting. These simple but important steps help reduce unnecessary internal requests, control socket event traffic, and prevent self-inflicted DDoS attacks, ultimately optimizing cluster performance and resource utilization. Someone even reached out to me via LinkedIn about exponential request growth on their AWS Lambdas, so today I will continue addressing other cases that cause slowness.

This series dissects the challenges enterprises face in managing Kubernetes clusters, offering practical solutions and expert insights, from internal service abuse and database protection to custom field indexing.

Keeping K8s afloat, part II

4. Common services abuse

As discussed in point 3, ‘Self frontend DDoS’, of the previous article, large and complex systems usually have a few shared services, and we tend to forget that our systems are not just made of public endpoints. We also have inner services that process event data, which generate load and use common services.

Some examples:

  • Bulk updates — when a service calls other APIs thousands of times to execute an operation.
  • Outbox patterns — these allow us to track database changes and trigger events that can send data to browsers or call webhooks for integrations. Webhooks are usually followed by more calls triggered by external integrations.
  • Services that monitor data changes and push those changes to other services.
  • Large synchronization tasks, like calendar, contact or mailbox synchronization.
  • K8s cronjobs that execute periodic maintenance tasks.

Depending on how controlled these services are, any of them could have a significant impact. So, how can we minimize this impact?

Rate limit service to service calls

Control how many calls a service can make per second by using, for example, sidecars like Istio. Forget the ego of "our service is near real-time": if your service can cause big load spikes, it must be kept on a short leash, even if its task then takes minutes or hours to complete.
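
If you can't enforce this at the mesh level, a client-side limiter works too. Below is a minimal token-bucket sketch in TypeScript; the limits and the downstream URL are illustrative, not any particular library's API:

// Minimal token-bucket sketch for outbound service-to-service calls.
// Limits and the downstream URL are made up; a sidecar like Istio can enforce this at the mesh level instead.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private ratePerSecond: number, private burst: number) {
    this.tokens = burst;
  }

  async take(): Promise<void> {
    for (;;) {
      const now = Date.now();
      this.tokens = Math.min(this.burst, this.tokens + ((now - this.lastRefill) / 1000) * this.ratePerSecond);
      this.lastRefill = now;
      if (this.tokens >= 1) {
        this.tokens -= 1;
        return;
      }
      await new Promise((resolve) => setTimeout(resolve, 50)); // wait for the bucket to refill
    }
  }
}

// Allow at most 100 calls per second to the downstream service, with a burst of 20.
const bucket = new TokenBucket(100, 20);

export async function callDownstream(path: string) {
  await bucket.take();
  return fetch(`http://downstream-service${path}`); // hypothetical internal URL
}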

Near caching strategies (client-side)

Although the service providing the data should manage its own cache, near caches can reduce load considerably and, more importantly, help stabilize your cluster when bursts of requests occur. Using them reduces the number of requests needed and minimizes the cost of running a large instance count.
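
A near cache can be as small as a per-instance map with a TTL. A minimal sketch, assuming the data tolerates a few seconds of staleness (names and TTL are illustrative):

// Minimal near-cache sketch: a per-instance, in-memory cache with a short TTL.
// Safe only when slightly stale data is acceptable.
const nearCache = new Map<string, { value: unknown; expiresAt: number }>();
const TTL_MS = 5_000;

export async function getWithNearCache<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
  const hit = nearCache.get(key);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value as T; // served locally, no request leaves the pod
  }
  const value = await fetcher(); // e.g. a call to the service providing the data
  nearCache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}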

Force services to respect HTTP 429 “Too Many Requests”

When it receives a 429, the requesting service should apply exponential backoff until the called service's load decreases and requests start succeeding again.
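
A minimal sketch of that backoff, assuming a fetch-based HTTP client (retry counts and delays are illustrative):

// Exponential backoff that respects HTTP 429 and the Retry-After header.
export async function fetchWithBackoff(url: string, maxRetries = 5): Promise<Response> {
  let delayMs = 200;
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429 || attempt >= maxRetries) return res;

    // Prefer the server's Retry-After hint when present, otherwise back off exponentially.
    const retryAfterSeconds = Number(res.headers.get("retry-after"));
    const waitMs = retryAfterSeconds > 0 ? retryAfterSeconds * 1000 : delayMs;
    await new Promise((resolve) => setTimeout(resolve, waitMs));
    delayMs *= 2;
  }
}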

Shared caches

You can check a shared cache for content even before calling the API. This is not the typical pattern, but it can cut request volumes considerably, especially if you use cache clusters.
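
A sketch of that lookup, assuming an ioredis client; the key names and TTL are illustrative:

// Check a shared (cluster-wide) cache before calling the API.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const TTL_SECONDS = 60;

export async function getUser(userId: string) {
  const key = `user:${userId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached); // another instance may already have populated this entry

  const res = await fetch(`http://users-service/users/${userId}`); // hypothetical internal API
  const user = await res.json();
  await redis.set(key, JSON.stringify(user), "EX", TTL_SECONDS);
  return user;
}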

Add monitoring

You can add monitoring for service-to-service calls in Grafana, so you always know what's happening in your K8s cluster. Monitor traffic volume for pattern changes, especially for all tier-1 services, and compare hourly requests (green) with last week's traffic split into hours (yellow):

Compare hourly traffic (green) and weekly traffic split in hours (yellow).

Divide the hourly traffic by the weekly traffic split into hours and you'll obtain a ratio:

This provides a stable amplitude for your traffic patterns.

This ratio shows that hourly traffic oscillates between 0.2 and 3 times the weekly traffic split into hours. With it, you can add an alert, for example, when the ratio hits 4, which is far better than alerting on a fixed threshold like 50k requests and adjusting it every few weeks as traffic grows. That makes it a very adaptable metric.
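
In code, the ratio is just the current hour's request count divided by last week's average requests per hour. A small sketch with made-up numbers (the counters are assumed to come from your metrics system):

// Hypothetical sketch: compute the traffic ratio used for alerting.
function trafficRatio(requestsLastHour: number, requestsLastWeek: number): number {
  const weeklySplitPerHour = requestsLastWeek / (7 * 24); // average requests per hour last week
  return requestsLastHour / weeklySplitPerHour;
}

// Example: 30k requests this hour against 1.68M last week gives a ratio of 3.
const ratio = trafficRatio(30_000, 1_680_000);
if (ratio > 4) {
  console.warn(`Traffic ratio ${ratio.toFixed(1)} exceeded threshold`);
}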

As an SRE, I find it much harder to detect internal load than external. Why? We protect our front doors more than our inner doors. Simple.

It’s also good practice to distinguish what we call tier-1 services. These are the most critical services that may cause your infrastructure to stop responding in the event of a crash. Make sure you identify these and protect them more than regular services.

5. Protect your databases

Many services have a dedicated database, and incoming requests run queries against it. Most of the time that's fine, since some services don't receive many requests. But what happens if any of the endpoints is abused (intentionally or not)?

We tend to forget that a database can process a limited number of queries per second. So, it’s easy to crash an unprotected database quickly: make a large number of requests to any of the endpoints, and it will eventually stop responding when its query limit is reached.

Here are some ways you can avoid this.

Automatically log slow queries

By automatically logging slow queries to Loki, you can see how many queries each database receives and which ones they are, and add Grafana metrics to track each query's count.

Metrics allow you to detect pattern changes in queries.
Logging slow queries helps to identify problematic queries.
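
A minimal sketch of such logging, assuming a node-postgres pool and that stdout logs are shipped to Loki (for example via Promtail); the threshold is illustrative:

// Wrap queries to measure duration and log slow ones as structured JSON.
import { Pool } from "pg";

const pool = new Pool();
const SLOW_QUERY_MS = 200;

export async function timedQuery(sql: string, params: unknown[] = []) {
  const start = Date.now();
  try {
    return await pool.query(sql, params);
  } finally {
    const durationMs = Date.now() - start;
    if (durationMs > SLOW_QUERY_MS) {
      // A structured log line that Loki can index and Grafana can count.
      console.log(JSON.stringify({ level: "warn", msg: "slow_query", durationMs, sql }));
    }
  }
}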

Use a cache from the beginning

A cache is particularly useful if you know your service will be used on any of your main pages and will receive many requests from the start. Not having a cache is a recipe for failure. I've seen many examples of this: typically, the endpoint crashes within a few seconds of being deployed, given the volume of requests.

Rate limit endpoints

That way, you can control the volume of requests a single user can make and prevent the database from crashing.
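
A rough per-user sketch, assuming an Express service; in production the counters would live in a shared store such as Redis rather than in memory, and the limits are made up:

// Per-user fixed-window rate limiting for an endpoint.
import express from "express";

const WINDOW_MS = 1_000;   // 1-second window
const MAX_REQUESTS = 20;   // per user, per window (illustrative limit)
const counters = new Map<string, { count: number; windowStart: number }>();

const app = express();

app.use((req, res, next) => {
  const userId = String(req.header("x-user-id") ?? req.ip); // hypothetical user identifier
  const now = Date.now();
  const entry = counters.get(userId);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    counters.set(userId, { count: 1, windowStart: now });
    return next();
  }
  if (++entry.count > MAX_REQUESTS) {
    res.setHeader("Retry-After", "1");
    return res.status(429).send("Too Many Requests");
  }
  next();
});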

Separate reads from writes

When querying live databases, it's vital to separate reads from writes. Write to the primary database and read from replicas as much as possible. You can even have a pool of replicas to query, to avoid burdening your primary.
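
A minimal sketch of that split, assuming node-postgres pools and illustrative environment variables; in practice an ORM or a database proxy often handles this routing:

// Route writes to the primary and reads to a replica pool.
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_DB_URL });
const replicas = [
  new Pool({ connectionString: process.env.REPLICA_DB_URL_1 }),
  new Pool({ connectionString: process.env.REPLICA_DB_URL_2 }),
];

export function write(sql: string, params: unknown[] = []) {
  return primary.query(sql, params);
}

export function read(sql: string, params: unknown[] = []) {
  // Pick a random replica to spread read load.
  const replica = replicas[Math.floor(Math.random() * replicas.length)];
  return replica.query(sql, params);
}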

6. Custom fields indexing

Most SaaS services offer a custom fields feature that allows customers to create their own columns. Unfortunately, custom fields make querying slower because these columns aren’t indexed by default, especially for large datasets.

If you also allow your customers to create filters, slow queries are guaranteed, which is a very difficult problem to fix.

So, how can we make it work?

  1. Collect and log all slow customer queries and the filter ID causing them.
  2. Check cardinality for each custom field used in the filter.
  3. Automate index creation using the custom field with the lowest cardinality (lower cardinality means fewer rows traversed); see the sketch after this list. Do this during low-usage periods so the extra querying and index creation don't affect database performance.
  4. Check and drop unused indexes periodically — too many indexes can make a table slower on inserts and updates.
  5. Repeat this every week for every customer with custom fields.
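
A hedged sketch of step 3, assuming MySQL and the mysql2/promise client; the table, column, and helper names are illustrative only:

// Pick the custom field with the lowest cardinality among those used in a slow filter
// and create an index on it. Run this during low-usage hours.
import { Connection } from "mysql2/promise";

export async function indexLowestCardinalityField(conn: Connection, table: string, fields: string[]) {
  let best: { field: string; cardinality: number } | undefined;

  for (const field of fields) {
    // Approximate cardinality: the number of distinct values in the custom field column.
    // "??" is the mysql2 placeholder for identifiers.
    const [rows] = await conn.query("SELECT COUNT(DISTINCT ??) AS cardinality FROM ??", [field, table]);
    const cardinality = Number((rows as any)[0].cardinality);
    if (!best || cardinality < best.cardinality) best = { field, cardinality };
  }

  if (best) {
    await conn.query("CREATE INDEX ?? ON ?? (??)", [`idx_${best.field}`, table, best.field]);
  }
}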

This way, we can detect slow filters and manage indexes needed before customers start complaining — and we want happy customers.

When creating filters, there are some basic rules we should follow.

Don’t let customers be overly creative

Allowing customers to do everything they want is NOT a good idea.

  • Don't allow customers to search text columns for contains (%contains%). These queries force a check on ALL rows of a table and are non-indexable. Instead, offer customers searches on fields that match a value (=) or start with a value (match%). These searches are indexable and much faster to execute, as sketched below.
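
A short illustrative sketch of the difference, with hypothetical table and column names:

// Build an indexable prefix search instead of a full "contains" scan.
export function buildNameSearch(term: string) {
  // Escape LIKE wildcards in user input so they are treated literally.
  const escaped = term.replace(/[\\%_]/g, (c) => `\\${c}`);
  return {
    // Indexable: an index on `name` can serve a prefix match.
    sql: "SELECT id, name FROM contacts WHERE name LIKE ? LIMIT 50",
    params: [`${escaped}%`],
    // Avoid: WHERE name LIKE '%term%' forces a scan of every row.
  };
}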

Organize filters

Organize filters by indexed fields and clauses to optimize queries. Order them by expected cardinality; lower cardinality should come first.

Don’t allow functions in “where” or “order by” clauses

Functions are non-indexable and force a full table scan. The same happens when you order by non-indexed fields, especially with pagination, forcing large tables to be sorted on disk on every single query.

Calculate dates once

Calculate date boundaries once and use the results to query your database. If you wrap the date column in functions, date indexes become useless. Convert the customer's date to UTC instead:

deal_date > CONVERT_TZ('2014-02-02', 'Europe/Lisbon', 'UTC')

Rather than converting all rows to a customer’s timezone:

CONVERT_TZ(deal_date, 'UTC', 'Europe/Lisbon') > '2014-02-02'

Remember that you can't manually index databases that use custom fields and filters: it's unrealistic to continuously manage the performance of thousands of databases. Focus on being dynamic, which will considerably reduce customer complaints about slowness. In some situations, a few indexes can leave database servers less saturated and improve most queries, even non-indexed ones.

Key takeaways

  • Instrument and monitor all systems from beginning to end to understand what’s happening. This includes proper logging on the backend and frontend.
  • Think outside the box. Having CPU power doesn’t mean you should abuse it.
  • Don’t trust external or internal clients, and protect your services from any kind of abuse.
  • Test things thoroughly. Make estimations, and if you can’t, roll features out to smaller user subsets.
  • Ensure that frontend components are properly code-covered and that you have clients that abstract API calls behind a near cache. Make sure hidden components are effectively quiet when hidden.
  • Expect the best from your services, but prepare for the worst.
  • Ensure you don’t make unnecessary calls. Any extra call increases your service’s complexity and load exponentially.

Finally, share knowledge across different teams. Discuss what to do and what not to do and the benefits of proper code coverage. The biggest benefit will be making developers more aware of the impact they might have.

Happy kubernetting.


Nelson Gomes
Pipedrive R&D Blog

Works at Pipedrive as a senior SRE, has a degree in Informatics from the University of Lisbon and a postgraduate degree in project management.