Disabling Transparent Huge Pages in Kubernetes

Allan Lei
Sep 15, 2018 · 2 min read

I’ve recently needed to revisit some of our deployments that were created in the earlier days of GKE, when some useful features were not yet available. One component revisited was disabling the kernel setting Transparent Huge Pages (THP), which is recommended for mongo and redis.

The solution at the time was to use a DaemonSet running a startup script with gcr.io/google-containers/startup-script:v1.

There are a couple of areas that could be improved:

  • hostPID and securityContext seemed excessive
  • No checks if the setting actually changed
  • gcr.io/google-containers/startup-script:v1 is a relatively large image
  • Timing conflicts with pod scheduling

hostPID and securityContext

Instead of using hostPID and privileged: true, we can mount the host’s /sys into the pod as a volume.

volumes:
- name: sys
  hostPath:
    path: /sys
volumeMounts:
- name: sys
  mountPath: /rootfs/sys

Checking if settings applied

This part is straightforward. We simply grep for the property and let grep’s exit code report the result.

grep -q -F "[never]" /sys/kernel/mm/transparent_hugepage/enabled
grep -q -F "[never]" /sys/kernel/mm/transparent_hugepage/defrag
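On a real sysfs file the kernel renders the active value in brackets, so the same grep can both gate the write and verify it afterward. A minimal sketch (the ensure_never helper is illustrative, not from the original script):

```shell
#!/bin/sh
# Sketch: write "never" only when the current value is not already
# "never" (the kernel shows the active choice in brackets), then
# re-check so the caller gets a meaningful exit code.
ensure_never() {
  f="$1"
  grep -q -F "[never]" "$f" && return 0   # already applied, nothing to do
  echo never > "$f"                        # apply the setting
  grep -q -F "[never]" "$f"                # verify; exit code propagates
}
```

In the init container this would be called as `ensure_never /rootfs/sys/kernel/mm/transparent_hugepage/enabled` (and likewise for defrag), using the mount path from the previous section.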

Large Images

This one is not a critical problem. gcr.io/google-containers/startup-script is 12.5MB, but since we are essentially just running a shell script, it can be swapped for a slimmer image like busybox, which weighs in at 1.15MB. Of course busybox lacks the startup functionality of gcr.io/google-containers/startup-script. For this we can utilize initContainers, which were unavailable at the time.
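As a sketch, the startup logic can move into an initContainer running busybox, writing through the /rootfs/sys mount shown earlier (the container name and image tag are illustrative):

```yaml
initContainers:
- name: disable-thp
  image: busybox:1.29
  command: ["sh", "-c"]
  args:
  - |
    echo never > /rootfs/sys/kernel/mm/transparent_hugepage/enabled
    echo never > /rootfs/sys/kernel/mm/transparent_hugepage/defrag
  volumeMounts:
  - name: sys
    mountPath: /rootfs/sys
```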

Pod Scheduling Conflicts

This problem refers to a dependency conflict where redis or mongo can be scheduled on a node where the kernel tuner has not yet completed. Since the process started before the setting was applied, it will not receive the updated kernel settings and would need a restart.

For this problem, we can use labels on nodes in conjunction with nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution. This keeps the pod pending until a node with the proper labels exists.

To label a node from within a pod, there are some prerequisites:

  • kubectl label node needs RBAC permission (skip this if RBAC is not enabled). In my case, I used the service account node-controller, which is created by default in the kube-system namespace on GKE, by setting serviceAccountName: node-controller
  • Pod needs to know the node name it lives on via Downward API
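For clusters where no suitable service account exists (outside GKE, for example), a minimal RBAC grant for labeling nodes could look like this sketch (the node-labeler names are illustrative; labeling is a patch on the node object):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-labeler
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-labeler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-labeler
subjects:
- kind: ServiceAccount
  name: node-controller
  namespace: kube-system
```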
initContainers:
- name: label-node
  image: swaglive/kubectl:1.11
  command: ["kubectl"]
  args: ["label", "node", "--overwrite", "$(NODE_NAME)", "sysctl/mm.transparent_hugepage.enabled=never", "sysctl/mm.transparent_hugepage.defrag=never"]
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

Now add the label restriction to the pods that need it:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: sysctl/mm.transparent_hugepage.enabled
          operator: In
          values:
          - "never"
        - key: sysctl/mm.transparent_hugepage.defrag
          operator: In
          values:
          - "never"

Putting it all together

Using it with a redis deployment:
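The original embedded manifests are not reproduced here. As a sketch, the pieces above combine into a kernel-tuner DaemonSet along these lines (metadata names and image tags are illustrative), with the affinity block from the previous section added to the redis pod spec:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kernel-tuner
spec:
  selector:
    matchLabels:
      app: kernel-tuner
  template:
    metadata:
      labels:
        app: kernel-tuner
    spec:
      serviceAccountName: node-controller
      initContainers:
      - name: disable-thp
        image: busybox:1.29
        command: ["sh", "-c"]
        args:
        - |
          echo never > /rootfs/sys/kernel/mm/transparent_hugepage/enabled
          echo never > /rootfs/sys/kernel/mm/transparent_hugepage/defrag
        volumeMounts:
        - name: sys
          mountPath: /rootfs/sys
      - name: label-node
        image: swaglive/kubectl:1.11
        command: ["kubectl"]
        args: ["label", "node", "--overwrite", "$(NODE_NAME)", "sysctl/mm.transparent_hugepage.enabled=never", "sysctl/mm.transparent_hugepage.defrag=never"]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      containers:
      - name: pause
        image: gcr.io/google-containers/pause:3.1
      volumes:
      - name: sys
        hostPath:
          path: /sys
```

A DaemonSet pod must have at least one long-running container, hence the pause container after the init containers complete.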

The Adventures of Me