Auto-labeling GKE Nodes for XFS support

To use XFS with Persistent Volumes, the host node needs to have the xfs_mkfile command available so that disks can be created and formatted. This is a problem on GKE, where two OS images are available: ubuntu, which has xfsprogs installed by default, and cos, which does not. Before ubuntu was available, container-vm was the image used for XFS, but it required a separate install.

The solution I took was to add an XFS support node label at node pool creation time when using the ubuntu OS image. While this sort of works, there are a couple of problems.

  • The need to remember to add an XFS node label when creating a new node pool
  • The assumption that ubuntu has xfsprogs installed
  • The assumption that cos does not have xfsprogs installed. (It doesn’t right now but you never know)

Detecting XFS host support

Provisioning a PersistentVolume happens on the host before the Pod is running. This complicates things: the support has to exist on the host, but the detection has to be done from a Pod.

For this, we will use nsenter. nsenter runs a process in the namespaces of another process, in our case those of the host's PID 1. To use nsenter this way, we need to set hostPID: true and privileged: true, which lets us break out of the pod into the host.

hostPID: true
volumes:
- name: tmp
  emptyDir: {}
initContainers:
- name: detect
  image: wardsco/nsenter
  command: ["sh", "-eo", "pipefail", "-c"]
  args: ["nsenter -t 1 -m -u -i -n -p -- sh -c 'command -v xfs_mkfile' && touch /tmp/xfs_mkfile || true"]
  securityContext:
    privileged: true
  volumeMounts:
  - name: tmp
    mountPath: /tmp/

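Running command -v xfs_mkfile on the host detects whether xfs_mkfile is available there. If it is, the container creates the marker file /tmp/xfs_mkfile in the shared volume. The trailing || true keeps the init container from failing when the command is missing.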

Labeling the node for support

The /tmp volume mount is used to pass the result of detection to the labeling container. This lets us drop privileged: true as soon as possible.

initContainers:
- name: label
  image: wardsco/kubectl:1.11
  command: ["sh", "-eo", "pipefail", "-c"]
  args: ["kubectl label node --overwrite $NODE_NAME fs.type/xfs=$(test -e /tmp/xfs_mkfile && echo 'true' || echo 'false')"]
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  volumeMounts:
  - name: tmp
    mountPath: /tmp/
    readOnly: true

By using the Downward API, we can pass the node name into the pod. This container then labels the node it is running on with fs.type/xfs=true or fs.type/xfs=false to indicate support. With this label in place, you can schedule pods onto XFS-capable nodes with nodeAffinity.
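As an illustration of the scheduling side, here is a minimal sketch of a Pod that requires a node labeled fs.type/xfs=true. The pod name, container name, image, and command are placeholders made up for this example. Only the label key and value come from the labeling step above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: xfs-consumer              # hypothetical name, for illustration only
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only schedule onto nodes carrying the XFS label.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: fs.type/xfs
            operator: In
            values:
            - "true"              # label values are strings, so quote it
  containers:
  - name: app                     # placeholder container
    image: busybox                # placeholder image
    command: ["sleep", "3600"]
```

A required rule leaves the Pod pending if no labeled node exists. If XFS is only a preference, preferredDuringSchedulingIgnoredDuringExecution may be a better fit.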

Note: kubectl label node requires extra permissions which won’t be covered here. As a shortcut, setting spec.serviceAccountName: node-controller in the kube-system namespace provides these permissions.

Putting it together