Resource limits, mlock, and containers, oh my

Jason Gerard
Nov 15, 2018


Problem statement: You need to run a process as a non-root user inside Kubernetes and prevent this process from swapping any memory to disk.

I ran into this exact problem, and it led me down several dead ends before I arrived at the solution that follows. I had a few more stringent requirements that I have left out of this document, but the basic approach is unchanged. TL;DR at the bottom.

The first thing you need to figure out is how to keep the process from swapping. This is easy enough. There are two syscalls for this, mlock and mlockall. The first locks a given range of pages in memory, while the second locks all of the process’s pages. See the man page here.

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Lock all current and future pages of this process into RAM.
	err := unix.Mlockall(unix.MCL_CURRENT | unix.MCL_FUTURE)
	fmt.Println(err)
}

This program simply attempts to lock all current and future memory pages, prints the value of err, and exits.

If you build and run this program on a Linux system you should get the following output:

$ ./app.o
cannot allocate memory

If you got function not implemented, you are running on a platform that does not implement mlockall, such as macOS. Since we’re talking containers, we’re assuming Linux as the runtime environment.

In order for this to succeed you must run as root.

$ sudo ./app.o
<nil>

Success!

OK, now that we know this works, let’s build a container and run it. Here is the Dockerfile:

FROM alpine:3.8
COPY app.o .
CMD ["./app.o"]

Build and run it:

$ GOOS=linux go build -o app.o
$ docker build -t mlockex:latest .
Sending build context to Docker daemon 3.92MB
Step 1/3 : FROM alpine:3.8
... (elided)
Successfully tagged mlockex:latest
$ docker run -it mlockex
cannot allocate memory

What?! Why didn’t this work? Aren’t we root by default in Docker? Let’s add some better error handling to see what’s up.

Ok, let’s try this again. This time after building and running we get:

$ docker run -it mlockex
ENOMEM: nonzero RLIMIT_MEMLOCK soft resource limit
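The message names RLIMIT_MEMLOCK, so it can help to see what that limit actually is inside the container. One way, sketched here under the assumption that /proc is mounted as usual on Linux, is to read the "Max locked memory" row of /proc/self/limits. The function name memlockLimits is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// memlockLimits (hypothetical helper) extracts the soft and hard
// "Max locked memory" values from text shaped like /proc/<pid>/limits.
// Values are returned as strings because they may be "unlimited".
func memlockLimits(limits string) (soft, hard string, ok bool) {
	const label = "Max locked memory"
	for _, line := range strings.Split(limits, "\n") {
		if !strings.HasPrefix(line, label) {
			continue
		}
		fields := strings.Fields(strings.TrimPrefix(line, label))
		if len(fields) < 2 {
			return "", "", false
		}
		return fields[0], fields[1], true
	}
	return "", "", false
}

func main() {
	data, err := os.ReadFile("/proc/self/limits")
	if err != nil {
		fmt.Println("cannot read limits:", err)
		return
	}
	if soft, hard, ok := memlockLimits(string(data)); ok {
		fmt.Printf("RLIMIT_MEMLOCK soft=%s hard=%s\n", soft, hard)
	} else {
		fmt.Println("no Max locked memory row found")
	}
}
```

If the soft value is small compared to the memory the process has mapped, mlockall will hit it unless the process holds CAP_IPC_LOCK.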

Back to my statement about being root in Docker. Yes, if you run whoami in your container you will see that you are root. However, by default you only get a limited set of capabilities. They are documented here. If you need a refresher on Linux capabilities and I have not already bored you to sleep, you can find the man page here.

If we go back to the man page for mlock we see that it states that the RLIMIT_MEMLOCK “is not enforced if the process is privileged (CAP_IPC_LOCK).” Aha! IPC_LOCK is not one of the default capabilities granted by Docker.
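To confirm this from inside the container, one option is to inspect the process's effective capability mask. The sketch below assumes the usual Linux /proc/self/status layout with a hexadecimal CapEff line, and relies on CAP_IPC_LOCK being capability number 14 in linux/capability.h. The helper name hasEffectiveCap is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capIPCLock is the capability number of CAP_IPC_LOCK on Linux.
const capIPCLock = 14

// hasEffectiveCap (hypothetical helper) reports whether bit capBit is set
// in the CapEff line of text shaped like /proc/<pid>/status.
func hasEffectiveCap(status string, capBit uint) (bool, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "CapEff:") {
			continue
		}
		hexMask := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
		mask, err := strconv.ParseUint(hexMask, 16, 64)
		if err != nil {
			return false, err
		}
		return mask&(1<<capBit) != 0, nil
	}
	return false, fmt.Errorf("no CapEff line found")
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	has, err := hasEffectiveCap(string(data), capIPCLock)
	if err != nil {
		fmt.Println("cannot parse status:", err)
		return
	}
	fmt.Println("CAP_IPC_LOCK effective:", has)
}
```

Run as the container's command, this should print false when the capability is missing and true once it has been granted.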

Luckily for us, Docker provides a convenient way to specify capabilities.

$ docker run -it --cap-add IPC_LOCK mlockex
Great Success!

Alright, we’re making some pretty good progress. Time to revisit our Dockerfile and add our unprivileged user.

Let’s run it!

$ docker run -it --cap-add IPC_LOCK mlockex
ENOMEM: nonzero RLIMIT_MEMLOCK soft resource limit

What? Why did this come back? Aren’t we privileged? We added the capability, so what happened?

USER appuser

That line undid our privilege. Referring to the man page for capabilities we can find a critical statement near the bottom.

If the effective user ID is changed from 0 to nonzero, then all capabilities are cleared from the effective set.

The USER appuser statement causes the UID in the container to change, wiping out our IPC_LOCK privilege. So how do we fix this problem?

We must preserve the IPC_LOCK privilege. For that we’ll use setcap.

Our updated Dockerfile adds the libcap package to get access to the setcap binary. Then we set the capability on our executable. +ep means to set the effective and permitted flags for this capability. Note: this will not work in a container using aufs. You must use a storage driver that supports extended attributes, such as overlay2.

If you rebuild and rerun the image you should get success:

$ docker run -it --cap-add IPC_LOCK  mlockex
Great Success!

If you remove the --cap-add flag from the run line you will get an error:

$ docker run -it  mlockex
standard_init_linux.go:190: exec user process caused "operation not permitted"

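So what gives? Why do we have to set the capability on the file AND pass it to Docker, especially when it gets wiped out by the change in UID? The answer lies in how Linux computes capabilities when a program is executed. A file capability is only granted if it is also present in the process’s capability bounding set, and Docker’s --cap-add is what puts IPC_LOCK into that set for the container. Switching to appuser clears the process’s capabilities, but the bounding set survives, so the file capability set with setcap can restore IPC_LOCK when our binary is executed. Without --cap-add, the kernel cannot grant what the file asks for, and because the effective flag is set, the exec fails with “operation not permitted”. More detail can be found here: https://lwn.net/Articles/420624/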
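The interaction between the file capability and the Docker flag can be worked through with the execve rules from the capabilities(7) man page. The sketch below is a simplified model, not kernel code: it ignores ambient capabilities, treats --cap-add as adding a bit to the bounding set, and the function names execPermitted and execAllowed are hypothetical.

```go
package main

import "fmt"

// capIPCLock is the mask for CAP_IPC_LOCK (capability number 14).
const capIPCLock = uint64(1) << 14

// execPermitted applies the permitted-set rule from capabilities(7),
// leaving out ambient capabilities for brevity:
//
//	P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & P(bounding))
func execPermitted(procInh, fileInh, filePerm, bounding uint64) uint64 {
	return (procInh & fileInh) | (filePerm & bounding)
}

// execAllowed models the safety check for files with the effective flag
// set: execve fails with EPERM unless every capability in the file's
// permitted set ended up in the new permitted set.
func execAllowed(filePerm, newPerm uint64, fileEffective bool) bool {
	if !fileEffective {
		return true
	}
	return filePerm&^newPerm == 0
}

func main() {
	filePerm := capIPCLock // what "setcap cap_ipc_lock+ep" requests

	// First without, then with IPC_LOCK in the bounding set.
	for _, bounding := range []uint64{0, capIPCLock} {
		newPerm := execPermitted(0, 0, filePerm, bounding)
		fmt.Printf("bounding has IPC_LOCK=%t -> exec allowed=%t\n",
			bounding&capIPCLock != 0, execAllowed(filePerm, newPerm, true))
	}
}
```

In this model the first case corresponds to running without --cap-add, where the exec is refused, and the second to running with it, where the non-root process ends up holding IPC_LOCK.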

Ok, let’s put this all together and run inside minikube. Minikube install instructions are here.

We need to create a YAML file to define our Pod in Kubernetes.

Once you have minikube up and running, use the following script.

Finally, our output.

$ ./kubeit.sh
Sending build context to Docker daemon 4.03MB
Step 1/8 : FROM alpine:3.8
... (elided)
Successfully tagged mlockex:latest
pod "mlockex" deleted
pod/mlockex created
mlockex 0/1 Completed 0 1s
Great Success!

There you have it. I hope you enjoyed this journey and I hope this post saved you a few days of head banging and hand wringing.

TL;DR: To prevent swapping with mlockall as a non-root user, you must grant the process CAP_IPC_LOCK through Linux capabilities. Example code available in this repo: https://github.com/jasongerard/mlockex
