High Scale Web Server With Kubernetes - Part II

Itay Bittan
Jan 30

This is the second part of High scale web server with Kubernetes. We will go over the Kubernetes Horizontal Pod Autoscaler (HPA) and how we are using it at Dynamic Yield.

Overview

While serving a huge number of requests, we can easily observe that our traffic graph looks like a sine wave, with a high rate at midday and a lower rate at night. The difference is relatively big: around 2–3 times more requests during rush hours. Moreover, there are special occasions such as Black Friday, Cyber Monday, sale campaigns, etc., when our traffic can rise by up to 3x.

Using Kubernetes's elasticity capabilities helped us in several ways:

  • Latency: ensuring the best user experience. All you have to know is the optimal load that a single replica can handle.

Horizontal Pod Autoscaler

The Kubernetes HPA supports several options for scaling out and in. We are using both container resource metrics (CPU and memory) and custom metrics (applicative metrics collected by Prometheus).

While resource metrics are straightforward (scale out once targetAverageUtilization crosses 80%), custom metrics are more interesting.

HPA based on resource metrics wasn't sufficient for us. Our web servers mostly generate asynchronous network calls to other internal services and databases, rather than doing CPU-intensive work.
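To illustrate, a typical handler looks roughly like the sketch below (the handler name and internal URL are placeholders, not our actual code); almost all of its time is spent awaiting I/O, so CPU utilization barely moves with load:

import tornado.web
from tornado.httpclient import AsyncHTTPClient


class RecommendationsHandler(tornado.web.RequestHandler):
    async def get(self):
        client = AsyncHTTPClient()
        # While this call is in flight the event loop keeps serving other
        # requests, so the pod is busy without burning CPU.
        response = await client.fetch("http://internal-service/recommendations")
        self.write(response.body)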

In terms of memory, we are not doing anything fancy either. We have some basic LRU/LFU cache layers to save some expensive calls to databases, but those are protected and limited to ensure we won't exceed the container's requests/limits. One thing that does impact our memory consumption is a burst of requests waiting to be handled. In this scenario, our memory can increase drastically, so we keep enough extra headroom to handle a sudden spike in traffic.
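As a rough sketch of what "protected and limited" means here (the function and maxsize are illustrative, not our actual values), the cache gets a hard upper bound so its memory footprint is known in advance:

from functools import lru_cache


@lru_cache(maxsize=10_000)  # hard cap: the cache can never grow unbounded
def get_user_profile(user_id):
    # Placeholder for an expensive database lookup.
    return {"user_id": user_id}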

Having said that, while exceeding the memory limit will kill your container, be aware that exceeding the CPU limit won't kill your pod; it will be throttled instead. My advice is to keep enough memory headroom so you won't see your pods collapse one after another, with not enough time to recover.
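In the pod spec, that headroom lives in the container's resources block; a minimal sketch with illustrative numbers (not our production values):

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: "1"        # exceeding this only throttles the container
    memory: 2Gi     # exceeding this gets the container OOM-killed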

We decided to bet on a custom metric for our HPA: the average requests per pod. We tested how many requests/second each pod can handle while ensuring that our response time meets our SLA and memory/CPU stay stable.
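To watch this while load-testing, a query like the one below, over the request counter introduced in the next snippet, shows the rate each replica is actually handling (it assumes the standard pod target label added by Prometheus' Kubernetes service discovery):

sum by (pod) (rate(ns_app_requests_total[2m]))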

We are using prometheus-client (in our Python Tornado web-server) to collect applicative metrics:

import tornado.httpserver
import tornado.ioloop
from tornado.web import Application
from prometheus_client import Counter


class MyApplication(Application):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.ready_state = True
        self.requests_total = Counter(
            namespace="ns",
            subsystem="app",
            name="requests_total",
            documentation="Total number of requests handled",
            labelnames=("handler", "method", "status"),
        )

    def log_request(self, handler):
        super(MyApplication, self).log_request(handler)
        handler_name = type(handler).__name__
        method = handler.request.method
        request_time = handler.request.request_time()  # available for latency metrics
        status = handler.get_status()
        self.requests_total.labels(handler_name, method, status).inc()


def main():
    application = MyApplication([...])
    server = tornado.httpserver.HTTPServer(application)
    server.listen(80)
    tornado.ioloop.IOLoop.instance().start()


if __name__ == "__main__":
    main()

In the snippet above (inspired by tornado-prometheus), you can see how we count all incoming requests and label them with some useful information for visibility.
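The counter still has to be exposed so Prometheus can scrape it. A minimal sketch using prometheus_client's text exposition format (the /metrics route and handler name are an assumption about the wiring, not necessarily our exact setup):

import tornado.web
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest


class MetricsHandler(tornado.web.RequestHandler):
    def get(self):
        # Serve the default registry in Prometheus' text format.
        self.set_header("Content-Type", CONTENT_TYPE_LATEST)
        self.write(generate_latest())


# e.g. MyApplication([(r"/metrics", MetricsHandler), ...])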

In addition to the custom metric, we set CPU/memory as backup HPA metrics, but they never kicked in. We are using kube-metrics-adapter to collect the custom metrics for the HPA. Now we can use the external metric:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: my_app
  annotations:
    metric-config.external.prometheus-query.prometheus/http_requests_total: |
      sum(rate(ns_app_requests_total{release="my_app"}[2m]))
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my_app
  minReplicas: 1
  maxReplicas: 300
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 80
    - type: Resource
      resource:
        name: memory
        targetAverageUtilization: 80
    - type: External
      external:
        metricName: prometheus-query
        metricSelector:
          matchLabels:
            query-name: http_requests_total
        targetAverageValue: 20

With targetAverageValue: 20, the HPA aims to keep the average at about 20 requests/second per pod. The HPA's reaction time can take up to 2 minutes for us, and it depends on:

  • Prometheus scrape_interval (default: 1 min).

Configuring minReplicas/maxReplicas

There are several things to consider before deciding on the minimum number of replicas:

  • What is the nature of your traffic? Do you have sudden bursts (like publishing a new campaign)? For example, say one of your pods can handle 20 requests/second and you get a sudden burst of 100 requests/second. There's a huge difference between currently having 10 replicas running and having only 2. In the first case, each replica has to deal with 10 additional requests/second: 150% load. With only 2 replicas running, each one has to deal with 50 additional requests/second: 350% load (see the sketch below).
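A back-of-the-envelope sketch of that calculation, using the numbers from the example above:

CAPACITY_PER_POD = 20   # requests/second a single replica handles comfortably
BURST = 100             # additional requests/second arriving at once

for replicas in (10, 2):
    extra_per_pod = BURST / replicas
    load = (CAPACITY_PER_POD + extra_per_pod) / CAPACITY_PER_POD
    print(f"{replicas} replicas -> {load:.0%} load per pod")
# 10 replicas -> 150% load per pod
# 2 replicas -> 350% load per pod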

maxReplicas is easier. Setting limits is always good advice, as someone needs to pay the bill at the end of the day. You don't want to wake up at the end of the month and find out that you ran 10x more replicas than you thought.

Traffic and HPA — 24 hours

In the graph above we can see requests/second (left y-axis) and the HPA replica count (right y-axis) at a 24-hour resolution. The dashed yellow line shows the number of replicas (pods), and the colorful lines below it show the traffic per replica. You can see the correlation between the incoming traffic and the number of replicas running.

The HPA metric, together with some other useful metrics, can be observed without any applicative metric; these series come from kube-state-metrics:

kube_hpa_status_current_replicas{hpa="my_app", namespace="ns"}
kube_hpa_status_desired_replicas{hpa="my_app", namespace="ns"}
kube_hpa_spec_min_replicas{hpa="my_app", namespace="ns"}
kube_hpa_spec_max_replicas{hpa="my_app", namespace="ns"}
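As an aside, these series can be combined, for example to spot when the HPA is running close to its ceiling (a sketch; it assumes both series come from the same kube-state-metrics job so their labels match):

kube_hpa_status_desired_replicas{hpa="my_app", namespace="ns"}
  / kube_hpa_spec_max_replicas{hpa="my_app", namespace="ns"} > 0.9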
Horizontal Pod Autoscaler — 7 days

The graph above (based on those metrics) shows the number of replicas over the last 7 days. You can see that minReplicas is 50 and maxReplicas is 300 (allowing us to handle 3x more than the typical daily peak of ~100 replicas).

You can see the sine pattern, where the traffic doubles in the rush hours (noon to afternoon) compared to night traffic (03:00–06:00). You can also see some spikes from time to time, which means there was a sudden increase in the traffic rate. As you can guess, this elasticity saved us a lot of money, adding and removing resources dynamically as we need them, while ensuring that:

  • Our customers' serving experience stays the same.

Summary

We saw how using custom metrics can help with auto-scaling our service in and out. After running with this setup for almost a year, we can definitely say that it has saved us a lot of time and money, and brought us happiness and peace :)
