Kubernetes Mount Propagation
TL;DR Essential reading before using Mount propagation.
Kubernetes has a highly configurable and versatile API system. This allows admins to configure and implement their infrastructure in any way they intend. The flip side is that less prominent but powerful features go unnoticed or, worse, get used in the wrong way.
One such feature is Mount Propagation, introduced in Kubernetes v1.8. Mount Propagation mode can be specified while creating Volume Mounts in Pods. In this article, we’ll explore what Mount Propagation is, how to use it, and more importantly, when to use it.
Mount
Prior to diving into how Mount Propagation works, let’s take a detour to understand ‘Mounts’ in Linux. The file-system that we browse on our Linux machine is known as the Virtual File System or VFS.
The kernel uses VFS to hide the complexity of reading and writing information at a given physical location on disk. Among other data structures, VFS is made up of a tree of the following struct, found in mount.h:
struct vfsmount {
    struct list_head mnt_hash;
    struct vfsmount *mnt_parent;    /* fs we are mounted on */
    struct dentry *mnt_mountpoint;  /* dentry of mountpoint */
    struct dentry *mnt_root;        /* root of the mounted tree */
    struct super_block *mnt_sb;     /* pointer to superblock */
    struct list_head mnt_mounts;    /* list of children, anchored here */
    struct list_head mnt_child;     /* and going through their mnt_child */
    atomic_t mnt_count;
    int mnt_flags;
    char *mnt_devname;              /* Name of device e.g. /dev/dsk/hda1 */
    struct list_head mnt_list;
};
In order to understand vfsmount, we need to first understand dentry.
Directory entry (dentry)
A dentry struct ties together an inode, a file name, the parent directory, the other entries in the same directory (siblings), and the sub-directories and files beneath it (children, if the dentry is for a directory). This struct can be found in dcache.h:
struct dentry {
    struct inode *d_inode;
    struct dentry *d_parent;
    struct qstr d_name;
    struct list_head d_subdirs;  /* sub-directories/files */
    struct list_head d_child;    /* sibling directories/files */
    ...
};
In the vfsmount struct, the following fields are of type dentry:
struct vfsmount {
    ...
    struct dentry *mnt_mountpoint;  /* dentry of mountpoint */
    struct dentry *mnt_root;        /* root of the mounted tree */
    ...
};
When the operating system boots, it creates a vfsmount entry with mnt_mountpoint set to / (encoded in the dentry struct). This operation of creating a vfsmount entry for a specific dentry is what is commonly referred to as Mounting. A tree of dentry structs forms the file-system.
root
/ \
tmp usr
/
m1
/ \
tmp usr
/
m1
Note: the dentry structure is what gets used when the cd command is invoked.
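The idea of a dentry tree that gets walked one path component at a time can be sketched with a toy model. This is only an illustration in Python, not kernel code: the Dentry class, its add method, and the lookup function are hypothetical names invented for this sketch, and the tree mirrors the upper part of the diagram above.

```python
class Dentry:
    """Toy stand-in for the kernel's struct dentry (illustrative only)."""

    def __init__(self, name, parent=None):
        self.name = name        # plays the role of d_name
        self.parent = parent    # plays the role of d_parent
        self.children = {}      # plays the role of d_subdirs

    def add(self, name):
        """Create a child entry under this directory and return it."""
        child = Dentry(name, parent=self)
        self.children[name] = child
        return child


def lookup(root, path):
    """Resolve a path by walking the tree one component at a time."""
    current = root
    for part in path.strip("/").split("/"):
        if not part:
            continue  # handles "/" and doubled slashes
        current = current.children.get(part)
        if current is None:
            return None  # no such entry
    return current


# Build the tree from the diagram: root -> tmp, usr; tmp -> m1
root = Dentry("/")
tmp = root.add("tmp")
usr = root.add("usr")
m1 = tmp.add("m1")

print(lookup(root, "/tmp/m1") is m1)   # True
print(lookup(root, "/usr/missing"))    # None
```

Changing directory with cd boils down to a walk like lookup: each component of the path is resolved against the children of the previous one.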
Bind Mounts
Each Mount operation involves the creation of vfsmount and dentry data structures (and a few other structs irrelevant to this discussion) in the kernel. Since it is possible to create these data structures multiple times, it is possible to mount a device multiple times. In fact, all of the operations below are valid:
- Mount a device at one path
- Mount a device multiple times at different paths
- Mount a device multiple times at the same path
- Mount a directory at another path
The first three operations work by creating a vfsmount and a new dentry pointing to a newly created inode that contains the device information, i.e. new instances of the following data structures are created:
- vfsmount
- dentry
- inode
In the case of operation 4, a vfsmount is created with the dentry of the original directory.
This operation of creating a vfsmount and pointing its mnt_root (and a few other fields irrelevant to this discussion) to an existing dentry struct is what is commonly referred to as Bind Mounting.
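The difference between a regular mount and a bind mount can be sketched in a few lines of Python. This follows the simplified model described above (a fresh dentry per device mount, a reused dentry for a bind mount) and is not how the kernel actually does its bookkeeping. The classes and the mount_device and bind_mount helpers are hypothetical names, and the device name sdb1 is made up for the example.

```python
class Dentry:
    """Toy directory entry; only the name matters for this sketch."""

    def __init__(self, name):
        self.name = name


class VfsMount:
    """Toy stand-in for struct vfsmount with the fields discussed above."""

    def __init__(self, mountpoint, root, parent=None):
        self.mnt_mountpoint = mountpoint  # dentry the mount is attached to
        self.mnt_root = root              # root dentry of the mounted tree
        self.mnt_parent = parent          # mount we are mounted on


def mount_device(device, mountpoint, parent):
    """Regular mount: a new root dentry is created for the device."""
    return VfsMount(mountpoint, Dentry(device + ":/"), parent)


def bind_mount(source, mountpoint, parent):
    """Bind mount: mnt_root points at an existing dentry."""
    return VfsMount(mountpoint, source, parent)


root_dentry = Dentry("/")
rootfs = VfsMount(mountpoint=root_dentry, root=root_dentry)

data_dir = Dentry("data")           # an existing directory
mnt_a, mnt_b = Dentry("a"), Dentry("b")

first = mount_device("sdb1", mnt_a, rootfs)
second = mount_device("sdb1", mnt_b, rootfs)
bound = bind_mount(data_dir, mnt_b, rootfs)

print(first.mnt_root is second.mnt_root)  # False: each mount got a new dentry
print(bound.mnt_root is data_dir)         # True: the existing dentry is reused
```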
Containers
When a container is created, a new vfsmount tree is created. This tree has no association to the host’s vfsmount tree. This is why we see different files inside the container and outside.
# before creating a container
        A (/)
       /     \
(/proc) B     C (/tmp)

# create a container
docker run -d -it ubuntu

# after creating a container
        A (/)                # Host vfsmount tree
       /     \
(/proc) B     C (/tmp)
--------------------------------------------
        A'(/)                # Container's vfsmount tree
       /     \
(/proc) B'    C' (/tmp)
If we bind-mount a directory into a container, then the container’s vfsmount tree gets a new entry, and the dentry of this new entry inside the container will point to the dentry of the directory on the host. It looks like this:
# create a container
docker run -d -it -v /tmp/path:/tmp/path ubuntu

--------------------------------------------------------------
      Host vfsmount tree      |   Container's vfsmount tree
--------------------------------------------------------------
   A (/)                      |                     A'(/)
     \                        |                       \
      C (/tmp)                |                C'(/tmp)
      |                       |                      |
      '--> (dentry /tmp)      |     (dentry /tmp) <--'
                 |            |            |
                 '---> (dentry /path) <----'
Note that any* mounts under /tmp/path on the host which exist during container creation will get a corresponding copy inside the container.
* unbindable mounts are not copied; more information in the next section.
Mount Propagation
When a volume is bind-mounted into a container, we know that the dentry of the vfsmount in the container and outside are one and the same. However, what happens when a new mount (vfsmount) is created in a sub-directory of a bind-mounted directory after the container has started?
For instance, consider this:
# create a container
docker run -d -it -v /tmp/path:/tmp/path ubuntu

--------------------------------------------------------------
      Host vfsmount tree      |   Container's vfsmount tree
--------------------------------------------------------------
   A (/)                      |                     A'(/)
     \                        |                       \
      C (/tmp)                |                C'(/tmp)
      |                       |                      |
      '--> (dentry /tmp)      |     (dentry /tmp) <--'
                 |            |            |
                 '---> (dentry /path) <----'

# then, on the host
mount --bind /some/path /tmp/path/some/path
Will the contents of /tmp/path/some/path be available inside the container?
The answer is no, because the vfsmount trees are not shared between the host and container, even though the dentry values are shared.
We need to somehow instruct the kernel to share the vfsmount trees between host and container. The kernel provides the modality of Mount Propagation for sharing vfsmount sub-trees. There are two modes:
- Shared bind-mount
  * shared
  * rshared
- Non-Shared bind-mount
  * slave
  * rslave
  * private
  * rprivate
  * unbindable
Shared Bind Mount
A Shared bind mount denotes that mount events propagate from host to container and from container to host, i.e. bidirectional mount propagation.
rshared stands for recursive-shared, where the shared property is automatically applied to all subsequent mounts in any sub-tree of the mount-point.
This is implemented as a shared sub-tree of vfsmounts between the host and container. In the above example, if a shared bind mount is enabled, then the view under the hood will be equivalent to the vfsmount entry and its sub-tree being shared between the host and the container.
Non-Shared Bind Mount
A non-shared bind mount can be further classified into three types:
- Slave, Rslave
- Private — Prevents any propagation of mount events
- Unbindable — Does not apply to containers
A Slave bind mount denotes that mount events propagate from host to container but not from container to host.
rslave stands for recursive-slave, where the slave property is automatically applied to all subsequent mounts in any sub-tree of the mount-point.
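The direction in which mount events travel for shared, slave, and private bind mounts can be illustrated with a small simulation. This is a deliberately simplified sketch and not the kernel's peer-group implementation. The Mount class and the bind and mount_under helpers are hypothetical names, and the container names c1, c2, and c3 are invented for the example.

```python
class Mount:
    """Toy mount that remembers its submounts and where events are forwarded."""

    def __init__(self, name, mode="private"):
        self.name = name
        self.mode = mode
        self.submounts = set()       # paths mounted beneath this mount
        self.propagates_to = set()   # mounts that receive our mount events


def bind(source, name, mode):
    """Bind-mount `source` somewhere else (e.g. into a container)."""
    copy = Mount(name, mode)
    copy.submounts = set(source.submounts)  # existing submounts are copied
    if mode in ("shared", "slave"):
        source.propagates_to.add(copy)      # host -> container
    if mode == "shared":
        copy.propagates_to.add(source)      # container -> host
    return copy


def mount_under(mount, path):
    """Create a submount and forward the event along the propagation links."""
    pending, seen = [mount], set()
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        current.submounts.add(path)
        pending.extend(current.propagates_to)


host = Mount("host:/tmp/path", "shared")
shared_c = bind(host, "c1:/tmp/path", "shared")
slave_c = bind(host, "c2:/tmp/path", "slave")
private_c = bind(host, "c3:/tmp/path", "private")

mount_under(host, "some/path")        # mount created on the host
mount_under(slave_c, "from/slave")    # mount created in the slave container
mount_under(shared_c, "from/shared")  # mount created in the shared container

print(sorted(host.submounts))       # ['from/shared', 'some/path']
print(sorted(slave_c.submounts))    # ['from/shared', 'from/slave', 'some/path']
print(sorted(private_c.submounts))  # []
```

The slave container sees everything the host sees plus its own mounts, the host never sees mounts made in the slave container, and the private container sees no new mounts at all.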
Note that by default, the docker run -v ... command uses rprivate mounts, i.e. the sample container from above will NOT be able to access the contents of /some/path inside /tmp/path/some/path.
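To check which propagation mode a mount currently has, Linux lists it in the optional fields of /proc/self/mountinfo: shared:N marks a shared mount, master:N marks a slave, unbindable marks an unbindable mount, and no tag at all means private. The sketch below parses single lines in that format. The propagation_of function is a hypothetical helper, and the sample lines are made up for illustration rather than taken from a real machine.

```python
def propagation_of(mountinfo_line):
    """Return the propagation mode encoded in one /proc/<pid>/mountinfo line."""
    fields = mountinfo_line.split()
    # Fields 0-5 are fixed; optional fields follow until a lone "-" separator.
    separator = fields.index("-", 6)
    tags = {item.split(":")[0] for item in fields[6:separator]}
    if "unbindable" in tags:
        return "unbindable"
    if "shared" in tags and "master" in tags:
        return "shared and slave"
    if "shared" in tags:
        return "shared"
    if "master" in tags:
        return "slave"
    return "private"


# Made-up sample lines in the mountinfo format
samples = [
    "36 25 8:1 / /tmp/path rw,relatime shared:12 - ext4 /dev/sda1 rw",
    "37 25 8:1 / /mnt/slave rw,relatime master:12 - ext4 /dev/sda1 rw",
    "38 25 8:1 / /mnt/private rw,relatime - ext4 /dev/sda1 rw",
]

for line in samples:
    print(line.split()[4], propagation_of(line))
```

On a Linux machine, the same function can be applied to every line of /proc/self/mountinfo, and findmnt -o TARGET,PROPAGATION reports the same information from the command line.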
Docker provides an option to configure mount propagation flags:
# create a container
docker run -d -it \
  --mount type=bind,src=/tmp/path,target=/tmp/path,bind-propagation=shared \
  ubuntu

--------------------------------------------------------------
      Host vfsmount tree      |   Container's vfsmount tree
--------------------------------------------------------------
   A (/)                      |                     A'(/)
     \                        |                       \
      C (/tmp)                |                C'(/tmp)
      |                       |                      |
      '--> (dentry /tmp)      |     (dentry /tmp) <--'
                 |            |            |
                 '---> (dentry /path) <----'

# then, on the host
mount --bind /some/path /tmp/path/some/path
Now that the bind-propagation mode has been set to shared, the contents of /tmp/path/some/path will be the same on the host and in the container, i.e. the mount is propagated from host to container and back.
There is one other criterion for the resulting propagation mode: the existing mount flag on the source mount-point. It is possible that the source was created with a particular mount flag, possibly given by an rshared, rslave, or rprivate mount of a parent.
A summary of the resulting mount propagation mode, while also considering this new criterion, is provided below:
BIND MOUNT OPERATION
----------------------------------------------------------------
|   source   |    dest     |       result       |
----------------------------------------------------------------
| shared     | shared      | shared             |
|            | non-shared  | shared             |
================================================================
| private    | shared      | shared             |
|            | non-shared  | private            |
================================================================
| slave      | shared      | shared and slave   |
|            | non-shared  | slave              |
================================================================
| unbindable | invalid     | invalid            |
----------------------------------------------------------------
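The table can also be read as a lookup keyed on the source and destination types. Below is a minimal sketch of that lookup, where the slave rows follow the bind-mount table in the kernel's shared-subtree documentation (a slave source bound under a shared destination becomes both shared and slave). The RESULT dictionary and the bind_mount_result function are hypothetical names used only for this illustration.

```python
# (source propagation, destination type) -> propagation of the new bind mount
RESULT = {
    ("shared", "shared"): "shared",
    ("shared", "non-shared"): "shared",
    ("private", "shared"): "shared",
    ("private", "non-shared"): "private",
    ("slave", "shared"): "shared and slave",
    ("slave", "non-shared"): "slave",
}


def bind_mount_result(source, dest):
    """Look up the propagation mode that a bind mount ends up with."""
    if source == "unbindable":
        # An unbindable mount cannot be the source of a bind mount at all.
        raise ValueError("an unbindable mount cannot be bind-mounted")
    return RESULT[(source, dest)]


print(bind_mount_result("private", "shared"))      # shared
print(bind_mount_result("private", "non-shared"))  # private
print(bind_mount_result("slave", "non-shared"))    # slave
```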
Kubernetes
Kubernetes supports a subset of the bind-mount propagation modes we’ve discussed. This can be configured while bind-mounting volumes into containers (within a podSpec). It supports these two modes
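- Bidirectional: same as rshared
- HostToContainer: same as rslave. Default mode at the time of writing; newer Kubernetes releases add a None mode (same as rprivate) and make it the default.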
Note: I’ve used the cleaner and simpler Koki Short syntax to declare these resources.
deployment:
  containers:
  - image: gcr.io/google_containers/busybox:1.24
    name: reader
    volume:
    - mount: /usr/test-pod
      store: local-vol
      propagation: bidirectional
  name: local-test-reader
  version: extensions/v1beta1
  volumes:
    local-vol: pvc:example-local-claim
---
persistent_volume:
  modes: rw-once
  name: example-local-pv
  path: /mnt/disks/ssd1 # Local dir to bind-mount
  reclaim: delete
  storage: 5Gi
  storage_class: local-storage
  version: v1
  vol_type: local
---
pvc:
  access_modes:
  - rw_once
  name: example-local-claim
  storage: 5Gi
  storage_class: local-storage
  version: v1
The above example shows a deployment with one container that mounts a volume with propagation mode rshared (Bidirectional in Kubernetes speak).
Note: The Kubernetes syntax for the above file (obtained by running short -k -f mount-propagation.short.yaml) is available here.
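For readers who want a feel for where the setting lives in the native Kubernetes API, here is a rough sketch that builds a plain Pod manifest as a Python dictionary and prints it as JSON. It is not the output of the short tool and uses a Pod rather than a Deployment to stay brief. The volume_mount helper is a hypothetical name; the mountPropagation field on a volume mount and the requirement that Bidirectional is used from a privileged container are part of the Kubernetes API.

```python
import json


def volume_mount(name, mount_path, propagation):
    """Build one volumeMounts entry with an explicit propagation mode."""
    if propagation not in ("HostToContainer", "Bidirectional"):
        raise ValueError("unsupported propagation mode: " + propagation)
    return {
        "name": name,
        "mountPath": mount_path,
        "mountPropagation": propagation,
    }


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "local-test-reader"},
    "spec": {
        "containers": [
            {
                "name": "reader",
                "image": "gcr.io/google_containers/busybox:1.24",
                # Bidirectional propagation is only allowed in privileged containers
                "securityContext": {"privileged": True},
                "volumeMounts": [
                    volume_mount("local-vol", "/usr/test-pod", "Bidirectional")
                ],
            }
        ],
        "volumes": [
            {
                "name": "local-vol",
                "persistentVolumeClaim": {"claimName": "example-local-claim"},
            }
        ],
    },
}

print(json.dumps(pod, indent=2))
```

Since JSON is valid YAML, the printed manifest could be saved to a file and passed to kubectl apply -f, assuming the claim it references already exists.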
Use Cases
Mount propagation mode should be chosen carefully. It can be a security risk if chosen wrong. Here are some valid use cases for Bidirectional mount propagation:
- Attaching a device from inside the container, for instance an iSCSI device. This is because if the container dies, the host will not have the necessary information (unless bidirectional mount propagation is used) to flush writes and detach correctly. I’ve run into this problem before.
- Sharing a device between different pods, where mounts happen inside the pod, but are shared between the pods.
Conclusion
The syntax used in the Kubernetes spec is called the Short Syntax, and it is designed to be a more readable form of YAML than the machine-oriented Kubernetes API spec that exists today. It provides the following advantages for its users
- Tunes out the noise from the manifests and makes them easier to read and write
- Lets you get a lot more done, a lot more quickly
- Completely translates from Kubernetes to Short and back to Kubernetes without losing any information
- Completely free
- Comes with a Chrome Plugin
- Completely Open Source
Go try out Koki Short, and stay tuned for more deep dives into Kubernetes!