Docking A Docker Container — Part 2 : Namespace, cgroup
Previous blog in this series.
In my last post, I concluded by saying that Docker is a platform, so let us understand what that means and how Docker can be used in design. But before we get into the usage of Docker, it is essential to see what is inside it, because that will let us have far better design discussions for any application.
Docker is a lightweight software layer that creates isolated execution environments for applications, where no two applications can interfere with each other and yet each application behaves as if it were running alone on the entire system. Docker is not a virtual machine but a bunch of processes with special attributes running on the plain Linux kernel, and it is more transparent than a virtual machine. Docker does not reside inside the kernel, but ‘namespaces’ and ‘cgroups’ do, and Docker uses them to create a cosy little environment called a container. Doesn’t that sound interesting? So let’s get into the internals: the way Docker leverages existing Linux resources, particularly namespaces. This post talks about how such an environment can be created on Linux.
Linux Namespace and Docker Isolation
Why is isolation required? In a few words: for security, high availability, dependency avoidance, testability, deployment, and so on. The requirement also calls for better control of the application, and this is achieved by breaking the application into small logical subsystems (one inside each container) so that they can be monitored and controlled independently and easily. But do not get carried away and break the application into pieces so small that the latency and overhead of the containers themselves add up. By the way, such breaking down of a large application into small logical parts is called microservices.
Now comes the real meaty part: how are containers made isolated? This is done by running processes inside ‘containers’. That’s right: running UNIX processes in a Docker container is like running them inside a virtual machine. A virtual machine (VM) typically emulates hardware by running a guest OS on top of the host OS to create process isolation, and that is why VMs are so heavy. A Docker container, in contrast, uses a few OS features (including namespaces) to create lightweight isolation.
With the introduction of Linux namespaces, ‘nested’ process trees are possible. This means each process can have its own isolated process tree along with its own view of system resources (process IDs, hostnames, user IDs, network access, interprocess communication, and filesystems). A process in one process tree cannot inspect or kill a process in another process tree.
On every system boot-up, the PID 1 process (also called ‘init’, the root of the tree) starts up, and all other processes start below it in the tree. With PID namespace isolation, processes in the child namespace have no way of knowing of the parent process’s existence, while processes in the parent namespace have a complete view of processes in the child namespace, as if they were any other processes in the parent namespace.
- To create a new namespace for a container, pass specific flags to the clone() system call.
/* CLONE_NEWPID creates a new PID namespace */
clone(cb, stack, CLONE_NEWPID | SIGCHLD, NULL);
- Likewise, each container can have its own network namespace. This essentially means that each container has its own networking stack (device interfaces, TCP/IP protocol stack, routing tables, firewall rules, the /proc/net directory tree, etc.).
/* CLONE_NEWNET creates a new network namespace */
clone(cb, stack, CLONE_NEWPID | CLONE_NEWNET | SIGCHLD, NULL);
- All processes live in a mount namespace; the flag to enable this feature is CLONE_NEWNS.
/* CLONE_NEWNS creates a new mount namespace */
clone(cb, stack, CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | SIGCHLD, NULL);
- Interprocess communication (IPC), UNIX Timesharing System (UTS, i.e. hostname and domain name), and user IDs have their own namespaces too:
/* CLONE_NEWUTS | CLONE_NEWIPC create new UTS and IPC namespaces */
clone(cb, stack, CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | SIGCHLD, NULL);
So far we have learned how containers can be isolated in a system, giving processes all the infrastructure they need to function. Now comes the question of managing these containers: how do we control a container-based system for better resource utilization? Fortunately, this is solved by yet another Linux feature called cgroups (control groups).
Control Groups: cgroups
cgroups are essentially a resource manager that can control the following physical resources in the system, per group of processes:
- CPU consumption
- Memory consumption
- Disk I/O consumption
- Network consumption
- Device limitation
- Resource accounting
There are two different control parameters/rules that can be applied to a group:
- Limits: an absolute value, e.g. how many bytes of memory a process can consume.
- Priorities: the share a process gets from the bucket of a given resource. By default all prioritization switches are set to balanced, i.e. all resources in the system, including CPU and disk I/O, are distributed equally among all the processes in the group. But based on application requirements, resource allocation sometimes needs to change for certain groups.
Limiting memory is easy compared to limiting CPU. Limits come in two types: hard and soft. Hard limit: if the system as a whole runs out of memory, a random process gets killed; but if a Docker container goes above its hard limit, a process inside that container gets killed, not one from some random container. This is another reason to run one service per container. Soft limit: a process going above its soft limit is fine by itself, but if the overall system starves for memory, the kernel will most likely reclaim memory pages from such over-limit processes first. In fact, with oom-notifier (avoiding the OOM killer) we can set up a notification in the cgroup for when the limit is reached: freeze all processes in the group, notify user space, then kill the process, raise the limits, or migrate the container, and unfreeze all processes in the group once the system’s memory pressure clears.
Brief summary of control files (details).
tasks # attach a task(thread) and show list of threads
cgroup.procs # show list of processes
cgroup.event_control # an interface for event_fd()
memory.usage_in_bytes # show current usage for memory
memory.memsw.usage_in_bytes # show current usage for memory+Swap
memory.limit_in_bytes # set/show limit of memory usage
memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
memory.failcnt # show the number of memory usage hits limits
memory.memsw.failcnt # show the number of memory+Swap hits limits
memory.max_usage_in_bytes # show max memory usage recorded
memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded
memory.soft_limit_in_bytes # set/show soft limit of memory usage
memory.stat # show various statistics
memory.use_hierarchy # set/show hierarchical account enabled
memory.force_empty # trigger forced move charge to parent
memory.pressure_level # set memory pressure notifications
memory.swappiness # set/show swappiness parameter of vmscan
memory.move_charge_at_immigrate # set/show controls of moving charges
memory.oom_control # set/show oom controls.
memory.numa_stat # show the number of memory usage per numa node
memory.kmem.limit_in_bytes # set/show hard limit for kernel memory
memory.kmem.usage_in_bytes # show current kernel memory allocation
memory.kmem.failcnt # show the number of kernel memory usage hits limits
memory.kmem.max_usage_in_bytes # show max kernel memory usage recorded
memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory
memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation
memory.kmem.tcp.failcnt # show the number of tcp buf memory usage hits limits
memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded
The CPU controller tracks CPU usage at the granularity of a whole cgroup. It keeps track of the following:
- Keeps track of user/system CPU time.
- Keeps track of usage per CPU.
- Allows setting weights (shares of the CPU).
- Can’t set limits: there is a trade-off here! Say you set a small CPU percentage while plenty of CPU cycles are available; most modern CPUs will then step the clock speed down below the regular speed, which slows everything down, and now your application needs a larger CPU percentage, and so on: a catch-22 situation. We could try allotting CPU cycles instead, but with the variety of processors (RISC vs. CISC) it is difficult to know what optimizations your application will go through on a given CPU, so counting instructions does not make sense either. This is the reason cgroups introduced ‘weights’, or ‘shares of the CPU’. This is a very nice article about cgroup usage.
CPU limiting can be of two types:
1. cgroups on certain CPU cores: at times an application requires a group of tasks to run on a particular CPU core.
2. limiting the actual usage: when a particular task needs more CPU shares for execution relative to others.
Brief summary of control files (details):
cpu.shares # Specify a relative share of CPU time available to the tasks in a cgroup.
cpuset.cpus # list of CPUs in that cpuset
cpuset.mems # list of Memory Nodes in that cpuset
cpuset.memory_migrate # if set, move pages to cpusets nodes
cpuset.cpu_exclusive # is cpu placement exclusive?
cpuset.mem_exclusive # is memory placement exclusive?
cpuset.mem_hardwall # is memory allocation hardwalled
cpuset.memory_pressure # measure of how much paging pressure in cpuset
cpuset.memory_spread_page # if set, spread page cache evenly on allowed nodes
cpuset.memory_spread_slab # if set, spread slab cache evenly on allowed nodes
cpuset.sched_load_balance # if set, load balance within CPUs on that cpuset
cpuset.sched_relax_domain_level # the searching range when migrating tasks
cpuset.memory_pressure_enabled # compute memory_pressure?
Limiting Block IO
There are switches to control block IO too, but they are similar to the CPU parameters, so I will not cover them here and instead leave you with this document.
Next: how Docker container-to-container networking happens...