Disabling Transparent Huge Pages in Kubernetes

I’ve recently needed to revisit some of our deployments that were created in the earlier days of GKE, when some useful features were not yet available. One component revisited was the piece that disables the kernel setting Transparent Huge Pages (THP), which both mongo and redis recommend turning off.

The solution at the time was a DaemonSet running a startup script with gcr.io/google-containers/startup-script:v1.

There are a few areas that could be improved:

  • hostPID and securityContext seemed excessive
  • No check that the settings actually changed
  • gcr.io/google-containers/startup-script:v1 is a relatively large image
  • Timing conflicts with pod scheduling

hostPID and securityContext

Instead of using hostPID and privileged: true, we can mount the host’s /sys into the pod as a volume.

volumes:
- name: sys
  hostPath:
    path: /sys
volumeMounts:
- name: sys
  mountPath: /rootfs/sys
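
With the host’s /sys mounted at /rootfs/sys, disabling THP is then a plain file write from inside the container (running as root, the container default), with no extra capabilities:

echo never > /rootfs/sys/kernel/mm/transparent_hugepage/enabled
echo never > /rootfs/sys/kernel/mm/transparent_hugepage/defrag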

Checking if the settings applied

This part is straightforward: we grep for the property and let grep’s exit code report the result. Note that the brackets are quoted so the shell cannot treat them as a glob pattern.

grep -q -F '[never]' /sys/kernel/mm/transparent_hugepage/enabled
grep -q -F '[never]' /sys/kernel/mm/transparent_hugepage/defrag
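
Where the check runs is a design choice; one option (my sketch, not part of the original setup) is an exec probe on the tuner pod, so it turns unhealthy if the setting ever reverts:

livenessProbe:
  exec:
    command:
    - sh
    - -c
    # Both files must report [never] or the probe fails
    - grep -q -F '[never]' /sys/kernel/mm/transparent_hugepage/enabled && grep -q -F '[never]' /sys/kernel/mm/transparent_hugepage/defrag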

Large Images

This one is not a critical problem. gcr.io/google-containers/startup-script is 12.5MB, but since we are essentially just running a shell script, it can be swapped for a slimmer image like busybox, which is only 1.15MB. Of course, busybox lacks the startup functionality of gcr.io/google-containers/startup-script; for that we can use initContainers, which were unavailable at the time.
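
As a sketch, the tuner then fits in a single busybox initContainer that reuses the /rootfs/sys mount from earlier (the container name and image tag here are my own choices):

initContainers:
- name: disable-thp
  image: busybox:1.31
  command: ["sh", "-c"]
  args:
  - |
    # Write through the host mount defined earlier
    echo never > /rootfs/sys/kernel/mm/transparent_hugepage/enabled
    echo never > /rootfs/sys/kernel/mm/transparent_hugepage/defrag
  volumeMounts:
  - name: sys
    mountPath: /rootfs/sys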

Pod Scheduling Conflicts

This problem refers to a dependency conflict where redis or mongo can be scheduled on a node where the kernel-tuner has not yet completed. Since the process started before the setting was applied, it will not pick up the updated kernel settings and would need a restart.

For this problem, we can use labels on nodes in conjunction with nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution. This keeps the pod in Pending until a node with the matching labels exists.

To label a node from within a pod, there are some prerequisites.

  • kubectl label node needs RBAC permission (skip this if RBAC is not enforced on your cluster). In my case, I used the service account node-controller, which is created by default in the kube-system namespace on GKE, by setting serviceAccountName: node-controller. A minimal RBAC sketch for other clusters follows the initContainer below.
  • The pod needs to know the name of the node it runs on, via the Downward API:

initContainers:
- name: label-node
  image: swaglive/kubectl:1.11
  command: ["kubectl"]
  args: ["label", "node", "--overwrite", "$(NODE_NAME)", "sysctl/mm.transparent_hugepage.enabled=never", "sysctl/mm.transparent_hugepage.defrag=never"]
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
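
For clusters where no suitable service account exists, a minimal RBAC sketch might look like the following; the node-labeler names are illustrative, and kubectl label needs get and patch on nodes:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-labeler
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-labeler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-labeler
subjects:
- kind: ServiceAccount
  name: node-labeler    # assumption: a dedicated service account in kube-system
  namespace: kube-system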

Now we add the label restriction to the pods that need it.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: sysctl/mm.transparent_hugepage.enabled
          operator: In
          values:
          - "never"
        - key: sysctl/mm.transparent_hugepage.defrag
          operator: In
          values:
          - "never"

Putting it all together

Using it with a redis deployment:
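
A sketch of the assembled kernel-tuner DaemonSet follows; the check loop, names, and image tags are my assumptions. The redis deployment itself only needs the affinity block from the previous section added to its pod spec.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kernel-tuner
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kernel-tuner
  template:
    metadata:
      labels:
        app: kernel-tuner
    spec:
      serviceAccountName: node-controller
      initContainers:
      # Step 1: disable THP through the host's /sys mount
      - name: disable-thp
        image: busybox:1.31
        command: ["sh", "-c"]
        args:
        - |
          echo never > /rootfs/sys/kernel/mm/transparent_hugepage/enabled
          echo never > /rootfs/sys/kernel/mm/transparent_hugepage/defrag
        volumeMounts:
        - name: sys
          mountPath: /rootfs/sys
      # Step 2: label the node so dependent pods can schedule onto it
      - name: label-node
        image: swaglive/kubectl:1.11
        command: ["kubectl"]
        args: ["label", "node", "--overwrite", "$(NODE_NAME)", "sysctl/mm.transparent_hugepage.enabled=never", "sysctl/mm.transparent_hugepage.defrag=never"]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      containers:
      # Keep the pod alive and periodically re-check the setting
      - name: check
        image: busybox:1.31
        command: ["sh", "-c"]
        args:
        - |
          while true; do
            grep -q -F '[never]' /rootfs/sys/kernel/mm/transparent_hugepage/enabled || exit 1
            grep -q -F '[never]' /rootfs/sys/kernel/mm/transparent_hugepage/defrag || exit 1
            sleep 60
          done
        volumeMounts:
        - name: sys
          mountPath: /rootfs/sys
      volumes:
      - name: sys
        hostPath:
          path: /sys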