Kubernetes Mount Propagation

TL;DR Essential reading before using Mount propagation.

Kubernetes has a highly configurable and versatile API system. This allows admins to configure and implement their infrastructure in any way they intend. The flip side is that less prominent but powerful features go unnoticed or, worse, get used in the wrong way.

One such feature is Mount Propagation, introduced in Kubernetes v1.8. Mount Propagation mode can be specified while creating Volume Mounts in Pods. In this article, we’ll explore what Mount Propagation is, how to use it, and more importantly, when to use it.

Figure: Mounts can be propagated through container boundaries (image license: CC BY-SA 3.0)

Mount

Before diving into how Mount Propagation works, let’s take a detour to understand ‘Mounts’ in Linux. The file-system that we browse on our Linux machine is presented to us by the kernel’s Virtual File System, or VFS.

The kernel uses VFS to hide the complexity of reading and writing information at a given physical location on disk. Among other data structures, VFS is made up of a tree of the following data structure, found in mount.h:

struct vfsmount {
    struct list_head mnt_hash;
    struct vfsmount *mnt_parent;    /* fs we are mounted on */
    struct dentry *mnt_mountpoint;  /* dentry of mountpoint */
    struct dentry *mnt_root;        /* root of the mounted tree */
    struct super_block *mnt_sb;     /* pointer to superblock */
    struct list_head mnt_mounts;    /* list of children, anchored here */
    struct list_head mnt_child;     /* and going through their mnt_child */
    atomic_t mnt_count;
    int mnt_flags;
    char *mnt_devname;              /* Name of device e.g. /dev/dsk/hda1 */
    struct list_head mnt_list;
};

In order to understand vfsmount, we need to first understand dentry.

Directory entry (dentry)

A dentry struct ties a file name to its inode and records where that name sits in the directory tree: its parent directory, the other entries in the same directory (siblings), and, if the dentry is for a directory, its children. This struct can be found in dcache.h:

struct dentry {
    struct inode *d_inode;
    struct dentry *d_parent;
    struct qstr d_name;
    struct list_head d_subdirs;  /* sub-directories/files */
    struct list_head d_child;    /* sibling directories/files */
    ...
};

In the vfsmount struct, the following fields are of type dentry:

struct vfsmount {
    ...
    struct dentry *mnt_mountpoint;  /* dentry of mountpoint */
    struct dentry *mnt_root;        /* root of the mounted tree */
    ...
};

When the operating system boots, it creates a vfsmount entry with mnt_mountpoint set to / (encoded in the dentry struct). This operation of creating a vfsmount entry for a specific dentry is what is commonly referred to as Mounting. A tree of dentry structs forms the file-system.

          root
         /    \
       tmp    usr
       /
      m1
     /  \
   tmp  usr
   /
  m1

Note: the kernel walks the dentry structs to resolve a path, for example when the cd command is invoked.
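To make the relationship between dentry and vfsmount more concrete, here is a minimal Python sketch of the tree in the diagram above. It is a toy model, not kernel code: the Dentry and VfsMount classes and the lookup function are hypothetical names that only mirror the fields discussed so far, and the assumption is that a second file-system is mounted at /tmp/m1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Dentry:
    """Toy stand-in for struct dentry: a name plus links to parent and children."""
    name: str
    parent: Optional["Dentry"] = field(default=None, repr=False)
    children: Dict[str, "Dentry"] = field(default_factory=dict, repr=False)

    def add(self, name: str) -> "Dentry":
        child = Dentry(name, parent=self)
        self.children[name] = child
        return child


@dataclass(eq=False)
class VfsMount:
    """Toy stand-in for struct vfsmount."""
    mnt_mountpoint: Optional[Dentry]  # dentry this mount is attached to (None for /)
    mnt_root: Dentry                  # root dentry of the mounted tree


def lookup(root: Dentry, mounts: List[VfsMount], path: str) -> Dentry:
    """Resolve a path one component at a time, crossing into a mounted
    tree whenever the current dentry is some mount's mountpoint."""
    current = root
    for part in filter(None, path.split("/")):
        current = current.children[part]
        for mount in mounts:
            if mount.mnt_mountpoint is current:
                current = mount.mnt_root  # cross into the mounted tree
                break
    return current


# First file-system: /, /tmp, /usr, /tmp/m1
root = Dentry("/")
tmp = root.add("tmp")
root.add("usr")
m1 = tmp.add("m1")

# Second file-system with its own tmp, usr and tmp/m1 (assumed for this sketch)
other_root = Dentry("/")
other_root.add("usr")
inner_m1 = other_root.add("tmp").add("m1")

# "Mounting" is just creating a vfsmount entry for a specific dentry.
mounts = [VfsMount(None, root), VfsMount(m1, other_root)]

found = lookup(root, mounts, "/tmp/m1/tmp/m1")
print(found is inner_m1)  # True: the walk crossed the mount at /tmp/m1
```

The walk never copies anything; it only follows pointers, which is why a mount is cheap to create.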

Bind Mounts

Each Mount operation involves creation of vfsmount and dentry data structures (and a few other structs irrelevant to this discussion) in the kernel. Since it is possible to create these data structures multiple times, it is possible to mount a device multiple times. In fact all of the below operations are valid:

  1. Mount a device at one path
  2. Mount a device multiple times at different paths
  3. Mount a device multiple times at the same path
  4. Mount a directory at another path

The first three operations work by creating a vfsmount and a new dentry pointing to a newly created inode that contains the device information. That is, new instances of the following data structures are created:

  1. vfsmount
  2. dentry
  3. inode

In case of operation 4, a vfsmount is created with the dentry of the original directory.

This operation of creating a vfsmount and pointing its mnt_root (and a few other fields irrelevant to this discussion) to an existing dentry struct is what is commonly referred to as Bind Mounting.
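The idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: Dentry, VfsMount and bind_mount are hypothetical names, and the model ignores every kernel field other than the two dentry pointers.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(eq=False)
class Dentry:
    """Toy stand-in for struct dentry."""
    name: str
    children: Dict[str, "Dentry"] = field(default_factory=dict, repr=False)


@dataclass(eq=False)
class VfsMount:
    """Toy stand-in for struct vfsmount."""
    mnt_mountpoint: Dentry  # where the mount is attached
    mnt_root: Dentry        # what the mount shows


def bind_mount(source: Dentry, target: Dentry) -> VfsMount:
    """Create a new vfsmount at target whose root is the existing source dentry."""
    return VfsMount(mnt_mountpoint=target, mnt_root=source)


data = Dentry("data")  # the original directory
mnt = Dentry("mnt")    # the path we bind-mount it onto
bound = bind_mount(data, mnt)

# A file created under the original directory is visible through the
# bind mount, because both paths lead to the very same dentry object.
data.children["report.txt"] = Dentry("report.txt")
print(bound.mnt_root is data)                      # True
print("report.txt" in bound.mnt_root.children)     # True
```

No dentry or inode is duplicated here; only a new vfsmount is created, which is the difference from operations 1 to 3.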

Containers

When a container is created, a new vfsmount tree is created for it (this is what a Linux mount namespace provides). This tree has no association with the host’s vfsmount tree. This is why we see different files inside the container and outside.

# before creating a container
          A (/)
         /     \
(/proc) B       C (/tmp)

# create a container
docker run -d -it ubuntu

# after creating a container
          A (/)                         # Host vfsmount tree
         /     \
(/proc) B       C (/tmp)
--------------------------------------------
          A'(/)                         # Container's vfsmount tree
         /     \
(/proc) B'      C' (/tmp)

If we bind-mount a directory into a container, then the container’s vfsmount tree gets a new entry, and the dentry of this new entry inside the container will point to the dentry of the directory on the host. It looks like this:

# create a container
docker run -d -it -v /tmp/path:/tmp/path ubuntu
--------------------------------------------------------------
    Host vfsmount tree        |   Container's vfsmount tree
--------------------------------------------------------------
    A (/)                     |                     A'(/)
     \                        |                      \
      C (/tmp)                |                       C'(/tmp)
      |                       |                       |
      '--> (dentry /tmp)      |      (dentry /tmp) <--'
                |             |             |
                '-----> (dentry /path) <----'

Note that any* mounts under /tmp/path on the host which exist during container creation will get a corresponding copy inside the container.

* unbindable mounts are not copied, more information in the next section.

Mount Propagation

When a volume is bind-mounted into a container, we know that the dentry of the vfsmount in the container and outside are one and the same. However, what happens when a new mount (vfsmount) is created in a sub-directory of a bind-mounted directory after the container has started?

For instance, consider this:

# create a container
docker run -d -it -v /tmp/path:/tmp/path ubuntu
--------------------------------------------------------------
    Host vfsmount tree        |   Container's vfsmount tree
--------------------------------------------------------------
    A (/)                     |                     A'(/)
     \                        |                      \
      C (/tmp)                |                       C'(/tmp)
      |                       |                       |
      '--> (dentry /tmp)      |      (dentry /tmp) <--'
                |             |             |
                '-----> (dentry /path) <----'

# then, on the host, create a new mount below the bind-mounted directory
mount --bind /some/path /tmp/path/some/path

Will the contents of /tmp/path/some/path be available inside the container?

The answer is no, because the vfsmount trees are not shared between the host and container, even though the dentry values are shared.

We need to somehow instruct the kernel to share the vfsmount trees between host and container. The kernel provides Mount Propagation for sharing vfsmount sub-trees. The propagation modes fall into two groups:

  1. Shared bind-mount
    * shared
    * rshared
  2. Non-Shared bind-mount
    * slave
    * rslave
    * private
    * rprivate
    * unbindable

Shared Bind Mount

A Shared bind mount denotes that mount events propagate from host to container and from container to host, i.e. bidirectional mount propagation.

rshared stands for recursive-shared, where the shared property is applied recursively to all mounts in any sub-tree of the mount-point.

This is implemented as a shared sub-tree of vfsmounts between the host and container. In the above example, if shared bind mount is enabled, then the view under the hood will be equivalent to the vfsmount entry and its sub-tree being shared between host and the container.

Non-Shared Bind Mount

A non-shared bind mount can be further classified into three types:

  1. Slave, Rslave
  2. Private: prevents any propagation of mount events
  3. Unbindable: does not apply to containers

A Slave bind mount denotes that mount events propagate from host to container but not from container to host.

rslave stands for recursive-slave, where the slave property is applied recursively to all mounts in any sub-tree of the mount-point.
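The difference between shared, slave and private propagation can be illustrated with a small simulation. This is a deliberately simplified sketch, not how the kernel implements peer groups: the Mount class and the link, mount_under and demo functions are hypothetical, and the model only tracks which mount events each side gets to see.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass(eq=False)
class Mount:
    label: str
    submounts: Set[str] = field(default_factory=set)
    peers: List["Mount"] = field(default_factory=list, repr=False)   # two-way links
    slaves: List["Mount"] = field(default_factory=list, repr=False)  # one-way receivers


def link(host: Mount, container: Mount, mode: str) -> None:
    """Connect the host mount and its copy in the container."""
    if mode == "shared":
        host.peers.append(container)
        container.peers.append(host)
    elif mode == "slave":
        host.slaves.append(container)
    elif mode != "private":
        raise ValueError(f"unknown mode: {mode}")


def mount_under(origin: Mount, path: str) -> None:
    """Record a new mount below origin and propagate the event."""
    pending, seen = [origin], set()
    while pending:
        mount = pending.pop()
        if id(mount) in seen:
            continue
        seen.add(id(mount))
        mount.submounts.add(path)
        pending.extend(mount.peers + mount.slaves)


def demo(mode: str) -> Tuple[List[str], List[str]]:
    host, container = Mount("host"), Mount("container")
    link(host, container, mode)
    mount_under(host, "/tmp/path/from-host")
    mount_under(container, "/tmp/path/from-container")
    return sorted(host.submounts), sorted(container.submounts)


for mode in ("shared", "slave", "private"):
    host_view, container_view = demo(mode)
    print(mode, "host sees", host_view, "container sees", container_view)
```

With shared, both sides end up seeing both mounts; with slave, only the container sees both; with private, each side sees only its own mount.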

Note that by default, the docker run -v ... command uses rprivate mounts, i.e. the sample container from above will NOT be able to access the contents of /some/path inside /tmp/path/some/path.

Docker provides an option to configure mount propagation flags:

# create a container
# (the --mount value must be one comma-separated argument, with no spaces)
docker run -d -it \
  --mount type=bind,src=/tmp/path,target=/tmp/path,bind-propagation=shared \
  ubuntu
--------------------------------------------------------------
    Host vfsmount tree        |   Container's vfsmount tree
--------------------------------------------------------------
    A (/)                     |                     A'(/)
     \                        |                      \
      C (/tmp)                |                       C'(/tmp)
      |                       |                       |
      '--> (dentry /tmp)      |      (dentry /tmp) <--'
                |             |             |
                '-----> (dentry /path) <----'

# then, on the host
mount --bind /some/path /tmp/path/some/path

Now that the bind-propagation mode has been set to shared, the contents of /tmp/path/some/path will be the same on the host and in the container, i.e. mounts are propagated from host to container and back.

There is one other criterion for the resulting propagation mode: the existing mount flag on the source mount-point. It is possible that the source was created with a particular mount flag, possibly given by an rshared, rslave, or rprivate mount of a parent.
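One way to check the existing flag on a mount-point is to read /proc/self/mountinfo, where the kernel lists the propagation state in the optional fields of each line (shared:N, master:N, unbindable; no tag means private). The following is a minimal sketch that classifies such lines. The function names and the sample lines are made up for illustration, and the parsing assumes the field layout described in the proc(5) manual page.

```python
from typing import Optional


def propagation_of(mountinfo_line: str) -> str:
    """Classify one mountinfo line by its optional fields."""
    fields = mountinfo_line.split()
    separator = fields.index("-")   # optional fields end at the "-" separator
    optional = fields[6:separator]  # they start after the mount options field
    tags = {item.split(":", 1)[0] for item in optional}
    if "unbindable" in tags:
        return "unbindable"
    if "shared" in tags and "master" in tags:
        return "shared and slave"
    if "shared" in tags:
        return "shared"
    if "master" in tags:
        return "slave"
    return "private"


def mount_propagation(mount_point: str,
                      mountinfo_path: str = "/proc/self/mountinfo") -> Optional[str]:
    """Return the propagation type of mount_point, or None if it is not a mount.
    Linux only; the last matching line wins because later mounts hide earlier ones."""
    result = None
    with open(mountinfo_path) as handle:
        for line in handle:
            if line.split()[4] == mount_point:
                result = propagation_of(line)
    return result


# Hypothetical sample lines, in the format documented in proc(5).
samples = [
    "22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw",
    "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue",
    "40 22 0:35 / /tmp/path rw,relatime - tmpfs tmpfs rw",
]
for line in samples:
    print(line.split()[4], "->", propagation_of(line))
```

On a Linux host, calling mount_propagation("/") reports the state of the root mount; many distributions make it shared at boot, but that is worth checking rather than assuming.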

The table below summarizes the resulting mount propagation mode once this criterion is taken into account:

BIND MOUNT OPERATION
----------------------------------------------------------------
| source      | dest        | result                           |
----------------------------------------------------------------
| shared      | shared      | shared                           |
|             | non-shared  | shared                           |
================================================================
| private     | shared      | shared                           |
|             | non-shared  | private                          |
================================================================
| slave       | shared      | shared and slave                 |
|             | non-shared  | slave                            |
================================================================
| unbindable  | any         | invalid                          |
----------------------------------------------------------------
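The same rules can be written as a small lookup function, which is handy when reasoning about a chain of bind mounts. This is a sketch only: bind_result is a hypothetical helper, and the two slave rows follow the bind-mount table in the Linux kernel's shared-subtree documentation (a slave source becomes "shared and slave" under a shared destination and stays "slave" otherwise).

```python
def bind_result(source: str, dest_is_shared: bool) -> str:
    """Propagation type of a new bind mount.

    source:          propagation type of the mount being bind-mounted
    dest_is_shared:  whether the destination's parent mount is shared
    """
    if source == "unbindable":
        raise ValueError("an unbindable mount cannot be bind-mounted")
    if source == "shared":
        return "shared"
    if source == "private":
        return "shared" if dest_is_shared else "private"
    if source == "slave":
        return "shared and slave" if dest_is_shared else "slave"
    raise ValueError(f"unknown propagation type: {source}")


for source in ("shared", "private", "slave"):
    for dest_is_shared in (True, False):
        dest = "shared" if dest_is_shared else "non-shared"
        print(f"{source:8} -> {dest:10} = {bind_result(source, dest_is_shared)}")
```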

Kubernetes

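Kubernetes supports a subset of the bind-mount propagation modes we’ve discussed. The mode is configured per volume mount when bind-mounting volumes into containers (within a podSpec). As of v1.8 it supports these two modes:

  1. Bidirectional: same as rshared
  2. HostToContainer: same as rslave. Default mode in v1.8.

Later Kubernetes releases add a third mode, None (same as rprivate), and make it the default, so check the documentation for your cluster version.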

Note: I’ve used the cleaner and simpler Koki Short syntax to declare these resources.

deployment:
  containers:
  - image: gcr.io/google_containers/busybox:1.24
    name: reader
    volume:
    - mount: /usr/test-pod
      store: local-vol
      propagation: bidirectional
  name: local-test-reader
  version: extensions/v1beta1
  volumes:
    local-vol: pvc:example-local-claim
---
persistent_volume:
  modes: rw-once
  name: example-local-pv
  path: /mnt/disks/ssd1 # Local dir to bind-mount
  reclaim: delete
  storage: 5Gi
  storage_class: local-storage
  version: v1
  vol_type: local
---
pvc:
  access_modes:
  - rw_once
  name: example-local-claim
  storage: 5Gi
  storage_class: local-storage
  version: v1
The above example shows a deployment with one container that mounts a volume with propagation mode rshared (Bidirectional in Kubernetes speak).

Note: The Kubernetes syntax for the above file (obtained by running short -k -f mount-propagation.short.yaml) is available here.
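For readers who want to see roughly where the setting lives in the native Kubernetes API, here is a hedged sketch that builds a manifest as a Python dictionary and prints it as JSON (which kubectl also accepts). It is not the output of the short tool: it uses a plain Pod rather than a Deployment to stay small, the sleep command is an assumption added so the container keeps running, and the other names are taken from the example above. The privileged flag is included because Kubernetes only allows Bidirectional propagation in privileged containers.

```python
import json

manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "local-test-reader"},
    "spec": {
        "containers": [{
            "name": "reader",
            "image": "gcr.io/google_containers/busybox:1.24",
            "command": ["sleep", "3600"],  # assumption: keep the container alive
            "securityContext": {"privileged": True},  # required for Bidirectional
            "volumeMounts": [{
                "name": "local-vol",
                "mountPath": "/usr/test-pod",
                "mountPropagation": "Bidirectional",  # rshared in Linux terms
            }],
        }],
        "volumes": [{
            "name": "local-vol",
            "persistentVolumeClaim": {"claimName": "example-local-claim"},
        }],
    },
}

print(json.dumps(manifest, indent=2))
```

Saving the output to a file and running kubectl apply -f on it should create the Pod, provided the claim exists and the cluster permits privileged containers.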

Use Cases

Mount propagation mode should be chosen carefully, because the wrong choice can be a security risk. Here are some valid use cases for Bidirectional mount propagation:

  • Attaching a device from inside the container, for instance an iSCSI device. Without bidirectional mount propagation, if the container dies the host will not have the necessary information to flush writes and detach correctly. I’ve run into this problem before.
  • Sharing a device between different pods, where mounts happen inside the pod, but are shared between the pods.

Conclusion

The syntax used for the Kubernetes spec in this article is called the Short syntax (Koki Short). It is designed to be a more readable form of YAML than the machine-oriented Kubernetes API spec that exists today. It provides the following advantages for its users:

  • Tunes out the noise from the manifests and makes them easier to read and write
  • Lets you get a lot more done, a lot more quickly
  • Completely translates from Kubernetes to Short and back to Kubernetes without losing any information
  • Completely free
  • Comes with a Chrome Plugin
  • Completely Open Source

Go try out Koki Short, and stay tuned for more deep dives into Kubernetes!