I’ve recently needed to revisit some of our deployments that were created in the earlier days of GKE, when some useful features were not yet available. One component I revisited was the one that disables the kernel setting Transparent Huge Pages (THP), as recommended for MongoDB and Redis.
The solution at the time was to use a DaemonSet running a startup script with gcr.io/google-containers/startup-script:v1.
There are a couple of areas that could be improved:
- No checks if the setting actually changed
- gcr.io/google-containers/startup-script:v1 is a relatively large image
- Timing conflicts with pod scheduling
hostPID and securityContext
Instead of using privileged: true, we can mount the host’s /sys into the pod as a volume.
- name: sys
- name: sys
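As a rough sketch, the sys volume and its mount could fit together as shown here. The pod name, the busybox image, and the command are assumptions for illustration; only the idea of mounting the host’s /sys as a volume named sys comes from the text. The container would still typically need to run as root to write to these files.

```yaml
# Sketch only: pod name, image, and command are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: kernel-tuner-example   # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: disable-thp
    image: busybox
    command:
    - sh
    - -c
    - echo never > /sys/kernel/mm/transparent_hugepage/enabled
    volumeMounts:
    # Mount the host's /sys over the container's /sys.
    - name: sys
      mountPath: /sys
  volumes:
  # hostPath exposes the node's own /sys directory to the pod.
  - name: sys
    hostPath:
      path: /sys
```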
Checking if settings applied
This part is straightforward. We simply grep the property and return an appropriate exit code. The pattern is quoted so that the shell does not treat the brackets as a glob.
grep -q -F '[never]' /sys/kernel/mm/transparent_hugepage/enabled
grep -q -F '[never]' /sys/kernel/mm/transparent_hugepage/defrag
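One way the check could be wired into a container is sketched below: the script writes the setting and then verifies it, so the container exits non-zero if the value did not change. The container name and the busybox image are assumptions, and the fragment expects a volume named sys that exposes the host’s /sys.

```yaml
# Hypothetical container entry; name and image are assumptions.
- name: disable-thp
  image: busybox
  command:
  - sh
  - -c
  - |
    # Stop at the first failing command so the exit code reflects it.
    set -e
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    # The kernel marks the active value with brackets, e.g. "always madvise [never]".
    grep -q -F '[never]' /sys/kernel/mm/transparent_hugepage/enabled
    grep -q -F '[never]' /sys/kernel/mm/transparent_hugepage/defrag
  volumeMounts:
  - name: sys
    mountPath: /sys
```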
The image size is not a critical problem. The gcr.io/google-containers/startup-script:v1 image is 12.5MB, but since we are essentially just running a shell script, it can be swapped for a slimmer image such as busybox, which has an image size of 1.15MB. Of course, busybox lacks the startup functionality of gcr.io/google-containers/startup-script. For this we can use initContainers, which were unavailable at the time.
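A minimal sketch of how the DaemonSet might look with busybox and initContainers follows. The label, the container names, and the idle main container are assumptions; the init container does the one-off tuning, and the main container exists only because a DaemonSet pod needs something that keeps running.

```yaml
# Sketch only: labels, container names, and the idle loop are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kernel-tuner
spec:
  selector:
    matchLabels:
      name: kernel-tuner
  template:
    metadata:
      labels:
        name: kernel-tuner
    spec:
      initContainers:
      # Runs to completion once per pod start, before the main container.
      - name: disable-thp
        image: busybox
        command:
        - sh
        - -c
        - |
          set -e
          echo never > /sys/kernel/mm/transparent_hugepage/enabled
          echo never > /sys/kernel/mm/transparent_hugepage/defrag
        volumeMounts:
        - name: sys
          mountPath: /sys
      containers:
      # Does nothing except keep the pod in the Running state.
      - name: idle
        image: busybox
        command: ["sh", "-c", "while true; do sleep 3600; done"]
      volumes:
      - name: sys
        hostPath:
          path: /sys
```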
Pod Scheduling Conflicts
This problem is a dependency conflict: mongo can be scheduled on a node where the kernel-tuner has not yet completed. Since the process was started before the setting was applied, it will not receive the updated kernel settings and would need a restart.
For this problem, we can use labels on nodes in conjunction with nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution. With this in place, the pod stays pending until a node with the proper labels exists.
To label a node from within a pod, there are some prerequisites:
- kubectl label node needs RBAC permission (skip if it is not required). For my case, I used the service account node-controller, which is created by default in the kube-system namespace on GKE, by setting it as the pod’s service account.
- Pod needs to know the node name it lives on via Downward API
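If no suitable service account exists in your cluster, a dedicated one could be created instead. The sketch below is an assumption rather than what I used: the name node-labeler is hypothetical, and the rule grants only get and patch on nodes, which is what kubectl label node needs.

```yaml
# Hypothetical names throughout; an alternative to reusing node-controller.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-labeler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-labeler
rules:
# Nodes live in the core API group, written as "".
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-labeler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-labeler
subjects:
- kind: ServiceAccount
  name: node-labeler
  namespace: kube-system
```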
- name: label-node
args: ["label", "node", "--overwrite", "$(NODE_NAME)", "sysctl/mm.transparent_hugepage.enabled=never", "sysctl/mm.transparent_hugepage.defrag=never"]
- name: NODE_NAME
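Put together, the labelling step might look roughly like the fragment below. The image is a placeholder for any image whose entrypoint is kubectl, and listing the step under initContainers is an assumption. The args, the NODE_NAME variable, and the node-controller service account are taken from the text.

```yaml
# Fragment of a pod spec running in the kube-system namespace.
serviceAccountName: node-controller
initContainers:
- name: label-node
  # Hypothetical placeholder: any image whose entrypoint is kubectl.
  image: example.invalid/kubectl:latest
  args: ["label", "node", "--overwrite", "$(NODE_NAME)", "sysctl/mm.transparent_hugepage.enabled=never", "sysctl/mm.transparent_hugepage.defrag=never"]
  env:
  # Downward API: inject the name of the node this pod is scheduled on.
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```

Since init containers run in order, placing this step after the tuning step would mean the label is only applied once the setting has been changed successfully.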
Now to add the label restriction to pods that need it.
- key: sysctl/mm.transparent_hugepage.enabled
- key: sysctl/mm.transparent_hugepage.defrag
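For reference, a sketch of the full affinity block for a pod such as mongo is shown below. The label keys are the ones used above; the In operator and the never value are assumed to mirror the labels set by the label-node step.

```yaml
# Fragment of a pod spec for a workload that requires THP to be disabled.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      # Expressions inside one term are ANDed: both labels must match.
      - matchExpressions:
        - key: sysctl/mm.transparent_hugepage.enabled
          operator: In
          values: ["never"]
        - key: sysctl/mm.transparent_hugepage.defrag
          operator: In
          values: ["never"]
```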
Putting it all together
Using it with a