Troubleshooting GitHub Action Runners with dind-rootless on GKE: Deep Dive

Will Sulzer
Google Cloud - Community
9 min read · May 10, 2024

TL;DR: This article explores some of the challenges I encountered when deploying dind-rootless on GKE using the stock GitHub Helm chart template. For readers looking for an installation guide for Action Runners with dind-rootless, please see my previous article, “Deploying GitHub Action Runners on GKE with dind-rootless”.

The Goal

My goal was to run a self-hosted GitHub Action Runner on GKE capable of building Docker containers. I wanted to enhance the Docker daemon's security by using the docker:dind-rootless container, as opposed to docker:dind (both available on Docker Hub). I based my deployment on the dind-rootless configuration for the Runner Scale Set Helm chart maintained by GitHub.

The Problem

Deploying the sample GitHub dind-rootless configuration resulted in an Action Runner environment that would fail when building Docker containers defined in the Workflow steps. When an Action Workflow was triggered and picked up by the self-hosted Action Runner on GKE, the Action Runner logs on GitHub reported a message similar to: “unable to connect to /var/run/docker.sock”. Although the Helm charts installed successfully and the worker pods launched when a Workflow was triggered, the environment could not successfully build Docker containers.

The next section will give a brief overview of the Action Runner components and focus on areas related to the problem.

GitHub Action Runner Components

My previous article describes the two main components of GitHub Action Runners as follows:

GitHub Action Runners consist of two primary components: Action Runner Controller and Runner Scale Set. The Action Runner Controller (ARC) manages the life cycle of the Runner Scale Set, which in turn creates the Pods ultimately responsible for running the Action Runner Workflows, thus fulfilling the CI/CD requirements of self-hosted GitHub Actions runners on the environment of an organization’s choice.

The dind-rootless container is configured in the Runner Scale Set helm chart, so we will focus on that component for the remainder of the article.

Examining the Pod template specification of the Runner Scale Set

The Runner Scale Set Pod template is configured with two init containers and two standard containers. The init containers are responsible for configuring the environment for the dind and runner containers, handling tasks like setting up filesystem permissions for the runner user and providing external libraries required for dind-rootless (as shown in the YAML below). The dind and runner containers are responsible for running the Docker daemon and executing the steps defined in the GitHub Action Workflow, respectively. The runner depends on a working Docker daemon to build Docker containers.

Runner Scale Set Sample YAML (not working on GKE)

Below, I have pasted the Runner Scale Set configuration YAML for the dind-rootless template from the GitHub documentation. Please use it only as a reference, as I’ll be reviewing the relevant pieces in the next section.

Note: This configuration will not work on GKE

## Warning: The top of this configuration has been truncated for brevity. Please see the docs for the full configuration: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller#example-running-dind-rootless

## template is the PodSpec for each runner Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
template:
  spec:
    initContainers:
      - name: init-dind-externals
        image: ghcr.io/actions/actions-runner:latest
        command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
      - name: init-dind-rootless
        image: docker:dind-rootless
        command:
          - sh
          - -c
          - |
            set -x
            cp -a /etc/. /dind-etc/
            echo 'runner:x:1001:1001:runner:/home/runner:/bin/ash' >> /dind-etc/passwd
            echo 'runner:x:1001:' >> /dind-etc/group
            echo 'runner:100000:65536' >> /dind-etc/subgid
            echo 'runner:100000:65536' >> /dind-etc/subuid
            chmod 755 /dind-etc;
            chmod u=rwx,g=rx+s,o=rx /dind-home
            chown 1001:1001 /dind-home
        securityContext:
          runAsUser: 0
        volumeMounts:
          - mountPath: /dind-etc
            name: dind-etc
          - mountPath: /dind-home
            name: dind-home
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
      - name: dind
        image: docker:dind-rootless
        args:
          - dockerd
          - --host=unix:///var/run/docker.sock
        securityContext:
          privileged: true
          runAsUser: 1001
          runAsGroup: 1001
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
          - name: dind-externals
            mountPath: /home/runner/externals
          - name: dind-etc
            mountPath: /etc
          - name: dind-home
            mountPath: /home/runner
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-etc
        emptyDir: {}
      - name: dind-home
        emptyDir: {}

Examining the dind container

The dind container (see YAML below) is responsible for running the Docker daemon. It uses the environment configured by the init containers to start the Docker daemon and write the socket to the default location: unix:///var/run/docker.sock. While Docker supports other socket transports besides unix (fd, tcp), using a unix socket helps simplify and secure the IPC mechanism for the Docker daemon. Kubernetes conveniently supports emptyDir volumes, which a multi-container Pod can use to share socket files (docker.sock). This approach nicely isolates the responsibilities of the Docker daemon and client, allowing one container (dind) to produce the socket and one container (runner) to consume it.

    - name: dind
      image: docker:dind-rootless
      args:
        - dockerd
        - --host=unix:///var/run/docker.sock
      securityContext:
        privileged: true
        runAsUser: 1001
        runAsGroup: 1001
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /var/run
        - name: dind-externals
          mountPath: /home/runner/externals
        - name: dind-etc
          mountPath: /etc
        - name: dind-home
          mountPath: /home/runner

Examining the runner container

The runner container (see YAML below) is configured to run the Action Workflow using the Pod template spec in our values.yaml configuration. The runner container uses a volumeMount (named dind-sock in the YAML below) to mount the same emptyDir volume into which the dind container writes docker.sock. The runner is configured with a DOCKER_HOST environment variable pointing to the Docker daemon socket, in our case /var/run/docker.sock.

    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
        - name: DOCKER_HOST
          value: unix:///var/run/docker.sock
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /var/run

Diagnosing the Problem

The GitHub Action Runner Logs show the following error when the docker build step is run from the Action Workflow:

/usr/bin/docker build -t 2138ed:255530562ad94a888abbfe71b0b5391e -f "[my-github-repo]/Dockerfile" "[my-github-action]"
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Error: Docker build failed with exit code 1

The dind container was running and the logs indicated that the Docker daemon socket had been written to unix:///var/run/docker.sock.

kubectl logs $POD -n arc-runners -c dind

time="****" level=info msg="Daemon has completed initialization"
time="****" level=info msg="API listen on /var/run/docker.sock"

However, upon closer inspection, there was no /var/run/docker.sock socket file written to the container filesystem.

kubectl exec -it $POD -n arc-runners -c dind -- ls -al /var/run/docker.sock
ls: /var/run/docker.sock: No such file or directory
command terminated with exit code 1

As a result, the runner container (responsible for running the Job) could not access the Docker daemon through the shared (emptyDir) volume at /var/run/docker.sock. This condition directly led to the “Is the docker daemon running?” messages in the GitHub Actions logs.

Upon further exploration of the dind container, I noticed that the /var/run directory (symlinked to /run on GKE) is owned by root:root. This means that the dockerd process within the dind container, running as 1001:1001, would not be able to write its unix socket file to /var/run/docker.sock on the filesystem. Unfortunately, there was no indication in the Docker daemon logs that the program failed to write the socket, and the program did not crash.
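You can confirm the ownership yourself while the Pod is running (using the $POD variable captured by the one-liner in the note below); the exact output will vary by environment, but it looks something like this:

# Check ownership of /var/run (a symlink to /run) inside the dind container.
kubectl exec -it $POD -n arc-runners -c dind -- ls -ld /var/run /run
# lrwxrwxrwx    1 root   root      4 ...  /var/run -> /run
# drwxr-xr-x    1 root   root   4096 ...  /run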

Note: The runner Pod lifecycle is short, which can make examining the environment challenging. I have left a few cheap shell one-liners below to help with troubleshooting in your own environment:

# This command sets the POD env var to the last runner pod found in the "arc-runners" namespace.
# Run this command after triggering your GitHub Action so that you can get the latest Pod.
# You will see a "no resource found" error (printed to stderr) until the Pod is found.
POD=""; while [ -z "$POD" ]; do POD="$(kubectl get pods -n arc-runners | awk '{print $1}' | tail -n1)"; done

# Shell into the dind container
kubectl exec -it $POD -n arc-runners -c dind -- /bin/sh

# Shell into the runner container
kubectl exec -it $POD -n arc-runners -c runner -- /bin/sh

Solution

I knew that I needed to work around the filesystem permissions issue that I had discovered with /var/run/docker.sock, so I explored a few options until landing on a working solution on my third attempt.

First Attempt

My initial instinct was to use the securityContext.fsGroup Pod configuration option in the template to ensure that the /var/run volume mounts were owned by group 1001 across the dind and runner containers.

This approach proved to be overly permissive. The securityContext.fsGroup option is configured at the Pod level and applies to all volume mounts in the Pod spec; there is currently no way to apply securityContext.fsGroup to one specific container. I also began to more closely consider the security implications of changing the permissions of the /var/run directory at all, but decided to move forward with it to prove it out.
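For reference, a minimal sketch of what this first attempt looked like in the Pod template (the value 1001 matches the runner user's group):

template:
  spec:
    securityContext:
      fsGroup: 1001 # Pod-wide: applies group 1001 to every volume mounted in the Pod
    # ... initContainers, containers, and volumes unchanged from the stock template ...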

Second Attempt

My second instinct was to use standard shell commands in the init-dind-rootless container to chown the /var/run directory to 1001:1001. While this properly changed the /var/run directory's user:group ownership, the Docker daemon was still unable to write its socket to /var/run/docker.sock. At this time I'm unsure why the write to the /var/run/docker.sock location failed, but I decided to step back from using the /var/run directory at all.
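A sketch of this second attempt, assuming the dind-sock volume is additionally mounted into the init container at a hypothetical /dind-run path (shown only to illustrate the approach; it did not resolve the issue):

- name: init-dind-rootless
  image: docker:dind-rootless
  command:
    - sh
    - -c
    - |
      set -x
      # ... existing /dind-etc and /dind-home setup from the stock template ...
      chown 1001:1001 /dind-run # the dind-sock emptyDir, mounted at /var/run by the dind container
  securityContext:
    runAsUser: 0
  volumeMounts:
    - mountPath: /dind-etc
      name: dind-etc
    - mountPath: /dind-home
      name: dind-home
    - mountPath: /dind-run # hypothetical extra mount of the dind-sock volume
      name: dind-sock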

Changing ownership of the /var/run directory has the potential for unwanted side effects in a k8s Pod container. For example, the /var/run directory may be used by other programs in the container that we do not want user 1001:1001 to access. GKE (and OSS k8s) also uses /var/run to store Pod-specific resources, such as secrets (/var/run/secrets). If user 1001:1001 has access to the /var/run/secrets directory, we begin to degrade the strength of our rootless dind deployment by broadening the surface area that our runner user can access. While it's entirely possible to avoid changing ownership of the /var/run subdirectories, this added to my concern that 1001:1001 ownership of the /var/run directory would still be overly permissive.
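For a concrete example of what lives under /var/run, you can list the default service account mount; if a service account token is automounted into the Pod (the Kubernetes default), you'll see something like the following (output illustrative):

kubectl exec -it $POD -n arc-runners -c runner -- ls /var/run/secrets/kubernetes.io/serviceaccount
# ca.crt  namespace  token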

Third Attempt (success!)

The eventual fix was to relocate the Docker socket to the home directory of the runner user (1001). I chose the path by simply appending var/run to the runner user's home directory (/home/runner), resulting in /home/runner/var/run. I chose this file path to communicate intent (i.e., this is still a var/run directory) and to ensure that the Docker socket would not end up in a directory that could be accessed directly by the code evaluated in the Action Workflow (e.g., /home/runner/_work). Although this path is a bit of a sore thumb in relation to the Unix Filesystem Hierarchy Standard, I believe the ease and security make up for it.

Below, I have provided a sample of my working values.yaml template used with the Runner Scale Set helm chart:

Note: For an installation guide, please see my other article: GitHub Action Runners on GKE with dind-rootless.

## githubConfigUrl is the GitHub url for where you want to configure runners
## ex: https://github.com/myorg/myrepo or https://github.com/myorg
githubConfigUrl: "https://github.com/[user]/[repo]"

## githubConfigSecret is the k8s secrets to use when auth with GitHub API.
## You can choose to use GitHub App or a PAT token
githubConfigSecret: "my-token"

## template is the PodSpec for each runner Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
template:
  spec:
    initContainers:
      - name: init-dind-externals
        image: ghcr.io/actions/actions-runner:latest
        command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
      - name: init-dind-rootless
        image: docker:dind-rootless
        command:
          - sh
          - -c
          - |
            set -x
            cp -a /etc/. /dind-etc/
            echo 'runner:x:1001:1001:runner:/home/runner:/bin/ash' >> /dind-etc/passwd
            echo 'runner:x:1001:' >> /dind-etc/group
            echo 'runner:100000:65536' >> /dind-etc/subgid
            echo 'runner:100000:65536' >> /dind-etc/subuid
            chmod 755 /dind-etc;
            chmod u=rwx,g=rx+s,o=rx /dind-home
            chown 1001:1001 /dind-home
        securityContext:
          runAsUser: 0
        volumeMounts:
          - mountPath: /dind-etc
            name: dind-etc
          - mountPath: /dind-home
            name: dind-home
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///home/runner/var/run/docker.sock
        securityContext:
          privileged: true
          runAsUser: 1001
          runAsGroup: 1001
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /home/runner/var/run
      - name: dind
        image: docker:dind-rootless
        args: ["dockerd", "--host=unix:///home/runner/var/run/docker.sock"]
        securityContext:
          privileged: true
          runAsUser: 1001
          runAsGroup: 1001
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /home/runner/var/run
          - name: dind-externals
            mountPath: /home/runner/externals
          - name: dind-etc
            mountPath: /etc
          - name: dind-home
            mountPath: /home/runner
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-etc
        emptyDir: {}
      - name: dind-home
        emptyDir: {}
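After deploying this configuration and triggering a Workflow, you can verify that the daemon now writes its socket to the new location while the Pod is running (output illustrative):

kubectl exec -it $POD -n arc-runners -c dind -- ls -al /home/runner/var/run/docker.sock
# srw-rw----    1 runner   runner    0 ...  /home/runner/var/run/docker.sock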

Follow-Up Areas

Although both the dind and runner containers run as 1001:1001, their container specifications use the (aggregate) security permission securityContext.privileged: true. I would like to understand more about the specific host resources that the containers need in order to reduce the scope of permissions granted through the privileged configuration. This elevated privilege requirement, aside from being another potential security issue, requires the Action Runners to run on GKE Standard as opposed to GKE Autopilot.
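As one starting point for that investigation, you can ask the rootless daemon which security options it is running with; note the explicit -H flag, since the socket now lives in the non-default location (output illustrative):

kubectl exec -it $POD -n arc-runners -c dind -- \
  docker -H unix:///home/runner/var/run/docker.sock info --format '{{.SecurityOptions}}'
# e.g. [name=seccomp,profile=builtin name=rootless]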

Conclusion

I hope that this article helps others who want to deploy self-hosted GitHub Action Runners on GKE with dind-rootless. Please let me know in the comments if this template worked for you, if there is anything I can explain better or explore further, or if you have other interesting ideas on how to make this all work. I very much appreciate your time, so thank you for reading.
