CFS Bandwidth Control: Warmup

maxwell92
5 min read · Jun 14, 2019


In Mark's post we learned that Kubernetes enforces Pod CPU constraints through the cgroup cpu subsystem, namely the cpu.cfs_period_us and cpu.cfs_quota_us control files. Before we dig into these two knobs, we should first get to know CFS bandwidth control.
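As a rough illustration of that mapping (my own sketch, not actual kubelet source code): with the default 100ms period, a container's CPU limit in millicores determines the quota granted per period.

package main

import "fmt"

// cfsQuotaForLimit is a rough illustration (not the actual kubelet code) of how
// a CPU limit in millicores turns into cpu.cfs_quota_us against the default
// 100ms cpu.cfs_period_us.
func cfsQuotaForLimit(milliCPU int64) (quotaUs, periodUs int64) {
    periodUs = 100000                    // default CFS period: 100ms
    quotaUs = milliCPU * periodUs / 1000 // e.g. a 500m limit -> 50000us per period
    return quotaUs, periodUs
}

func main() {
    q, p := cfsQuotaForLimit(500) // a container CPU limit of "500m"
    fmt.Printf("cpu.cfs_quota_us=%d cpu.cfs_period_us=%d\n", q, p)
}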

In me the tiger sniffs the rose.

This is the first post in a series on CFS bandwidth control. Let's warm up.

What is CFS bandwidth control?

This is probably the first question that comes to mind. To start, we should make it clear that CPU is a manageable resource, and it can be distributed at a very fine granularity. Many scenarios need CFS bandwidth control: a multi-tenant cloud platform with a pay-per-use model, for example, needs hard resource limits. Conversely, a system without CFS bandwidth control may suffer unexpected resource contention or latency.

To solve these problems, Paul Turner and Bharata Rao proposed a series of CFS bandwidth control kernel patches; their work is described in the write-up "CFS bandwidth control". Bharata's paper tells us that CFS bandwidth control sets an upper limit on the CPU a task group may use by configuring a quota and a period: the group is allowed at most quota of CPU bandwidth within each enforcement interval of length period.
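To make quota and period tangible, here is a minimal sketch that sets them by hand on a cgroup v1 cpu group. The group path /sys/fs/cgroup/cpu/demo is hypothetical; you would create it first (e.g. with mkdir) and then add process IDs to its tasks file.

package main

import (
    "os"
    "path/filepath"
)

func main() {
    // Hypothetical cgroup; create it first, e.g. mkdir /sys/fs/cgroup/cpu/demo,
    // then add process IDs to its tasks file.
    group := "/sys/fs/cgroup/cpu/demo"

    // Allow at most 25ms of CPU time in every 100ms period.
    must(os.WriteFile(filepath.Join(group, "cpu.cfs_period_us"), []byte("100000"), 0644))
    must(os.WriteFile(filepath.Join(group, "cpu.cfs_quota_us"), []byte("25000"), 0644))
}

func must(err error) {
    if err != nil {
        panic(err)
    }
}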

Know-how

Before we start, note another scheduling principle: since Linux kernel 2.6, the scheduled entity can be a task group rather than a single task, which addresses the unfairness that arises when processes of different users compete for the CPU. Imagine that user A runs 9 CPU-hungry processes while user B runs only 1: scheduled per process, user A's tasks would grab most of the CPU time, which is unfair to user B. With task group scheduling, each user first gets a fair share of CPU time and then divides it among the tasks in their own group. A task group is represented by the task_group structure in kernel/sched/sched.h:L363:

struct task_group {
#ifdef CONFIG_FAIR_GROUP_SCHED          // fair group scheduling
    /* schedulable entities of this group on each CPU */
    struct sched_entity **se;           // scheduling entity
    /* runqueue "owned" by this group on each CPU */
    struct cfs_rq **cfs_rq;
#endif
    struct cfs_bandwidth cfs_bandwidth; // CPU bandwidth
};

The member se is a sched_entity, which can represent a process, a task group, or even a user. Scheduling entities are organized in a red-black tree ordered by vruntime and managed by a cfs_rq. cfs_rq is likewise a member of task_group, defined in kernel/sched/sched.h:L488:

struct cfs_rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
    struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */
#endif
};

cfs_rq has a member called rq, which points to the main per-CPU runqueue structure that this cfs_rq is attached to.

Getting back to the definition of task_group, we can also see a member cfs_bandwidth, defined in kernel/sched/sched.h:L337:

struct cfs_bandwidth {
#ifdef CONFIG_CFS_BANDWIDTH
    ktime_t period; // period
    u64 quota;      // quota
#endif
};

So period and quota are both members of cfs_bandwidth. Let's wrap it up: CFS bandwidth control operates on a task_group with a configured period and quota, and that quota is then divided among the group's per-CPU cfs_rq runqueues.

Going further, the task_group owns a global quota pool, and every cfs_rq keeps a local quota pool. Whenever a cfs_rq needs runtime, it requests a slice from the global pool (the slice size is configured via /proc/sys/kernel/sched_cfs_bandwidth_slice_us and defaults to 5ms).
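The following toy model (plain Go, not kernel code) sketches this idea: per-CPU runqueues keep requesting 5ms slices until the global pool for the period is drained, at which point the next requester would be throttled.

package main

import "fmt"

const sliceUs = 5000 // default sched_cfs_bandwidth_slice_us: 5ms

// globalPool models the task_group-wide quota pool (cfs_bandwidth.runtime).
type globalPool struct {
    runtimeUs int64
}

// grabSlice hands out at most one slice of runtime to a local (per-CPU) pool.
// A return value of 0 means the quota for this period is exhausted and the
// requesting cfs_rq would be throttled.
func (g *globalPool) grabSlice() int64 {
    if g.runtimeUs <= 0 {
        return 0
    }
    grant := int64(sliceUs)
    if g.runtimeUs < grant {
        grant = g.runtimeUs
    }
    g.runtimeUs -= grant
    return grant
}

func main() {
    pool := &globalPool{runtimeUs: 25000} // 25ms of quota per 100ms period
    for request := 0; ; request++ {
        got := pool.grabSlice()
        if got == 0 {
            fmt.Printf("request %d: global pool empty, cfs_rq throttled\n", request)
            break
        }
        fmt.Printf("request %d: received a %dus slice\n", request, got)
    }
}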

After receiving a slice, the sched_entities on that cfs_rq start running, and the cfs_rq asks for another slice whenever the current one is used up. If no runtime is left in the global pool, the cfs_rq has used up all the quota allowed in this period and gets throttled; its tasks cannot run until the quota is refreshed. The throttling statistics can be checked via the cgroup cpu.stat file. When the next period starts, the global quota pool is refilled, the cfs_rq leaves the throttled state, and its tasks continue running.

Let’s get back to struct cfs_bandwidth:

struct cfs_bandwidth {
#ifdef CONFIG_CFS_BANDWIDTH
    ktime_t period;               // enforcement period
    u64 quota;                    // allowed runtime per period
    u64 runtime;                  // remaining runtime, equals quota at the start of a period

    struct hrtimer period_timer;  // timer that fires once per period
    struct hrtimer slack_timer;   // timer that returns unused local runtime to the global pool
    struct list_head throttled_cfs_rq;

    /* Statistics (data source of cpu.stat): */
    int nr_periods;               // number of enforcement periods elapsed
    int nr_throttled;             // number of times the group has been throttled
    u64 throttled_time;           // total time the group spent throttled
#endif
};

We can see that the period is driven by an hrtimer (period_timer), and throttled cfs_rq instances are added to the throttled_cfs_rq linked list. runtime starts out equal to quota and decreases as tasks run. The cpu.stat file is populated from nr_periods, nr_throttled and throttled_time.
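For a quick look at those counters, a small sketch that dumps a container's cpu.stat could look like this (pass your container's cgroup v1 cpu directory as the argument; the path in the comment is just an example):

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    // Pass the cgroup directory of your container, e.g. on cgroup v1 with Docker:
    //   /sys/fs/cgroup/cpu/docker/<container-id>
    group := os.Args[1]
    data, err := os.ReadFile(filepath.Join(group, "cpu.stat"))
    if err != nil {
        panic(err)
    }
    // cpu.stat reports nr_periods, nr_throttled and throttled_time (in ns).
    fmt.Print(string(data))
}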

Get your hands dirty

For the experiment we use a single-CPU GCP server and cfs.go, a small program that burns CPU by computing sha512 for 5ms, then sleeps, and repeats. The sleep interval and the number of iterations can be set with command-line flags.
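A minimal sketch of what such a program looks like (this is not the actual cfs.go; the flag names and output format are my assumptions):

package main

import (
    "crypto/sha512"
    "flag"
    "fmt"
    "time"
)

// burn hashes repeatedly for roughly 5ms and returns how long the burn took in
// wall-clock time; under CFS throttling this can be far more than 5ms.
func burn() time.Duration {
    start := time.Now()
    buf := make([]byte, 64)
    for time.Since(start) < 5*time.Millisecond {
        sum := sha512.Sum512(buf)
        copy(buf, sum[:])
    }
    return time.Since(start)
}

func main() {
    iterations := flag.Int("iterations", 20, "number of burn/sleep cycles")
    sleep := flag.Duration("sleep", 10*time.Millisecond, "sleep between burns")
    flag.Parse()

    for i := 0; i < *iterations; i++ {
        fmt.Printf("burn %2d took %v\n", i, burn())
        time.Sleep(*sleep)
    }
}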

For convenience, we use the Docker image golang:1.12.5 for the test.

Let's see what happens without bandwidth control, executing docker run --rm -it -v $(pwd):$(pwd) -w $(pwd) golang:1.12.5 go run cfs.go -iterations 20 -sleep 10ms:

Every burn took about 5ms, just as we wrote it.

Then we set the quota to 25ms and the period to 100ms, keeping the slice at the default 5ms, executing docker run --rm -it --cpu-quota 25000 --cpu-period 100000 -v $(pwd):$(pwd) -w $(pwd) golang:1.12.5 go run cfs.go -iterations 20 -sleep 10ms:

As we can see, several burns took about 25ms. How does that happen? The explanation is below:

The first 5 burns each consume 5ms of CPU, which exactly uses up the 25ms quota. Together with the 5 sleeps of 10ms each, roughly 75ms of the 100ms period has elapsed at that point, so the next burn finds no quota left and is throttled for the remaining ~25ms of the period. That is why those burns show up as taking about 25ms.
