Linux user namespaces might not be secure enough? a.k.a. subverting POSIX capabilities
I frequently see users asking for user namespaces as a solution to all the perceived problems with container security on Linux. Not all container technologies support user namespaces yet, although support is coming; for instance, Docker just recently merged support into its experimental branch. User namespaces will have great benefits for some users and, long-term, may be the right solution, but they may not offer the significant improvements that users seek today. Furthermore, I do believe there is a perception problem with containers that outweighs the legitimate risks.
It’s quite simple: if you’re using Linux, why would you allow a service (or a user) to elevate to root if it should never be allowed to do so?
The ability to drop capabilities can prevent root escalation. It’s an extreme example, perhaps, but by dropping all capabilities and running as non-root, a container will be denied all capabilities normally reserved for root and, no matter what, should be denied ‘su’, ‘sudo’, ‘ping’, and any number of other potential vectors for root escalation.
In this example, there would be no direct, legitimate means of obtaining root from inside the container. Should an attacker somehow escalate to UID 0, they would be unable to load modules, override directory access controls, or exercise any number of the other privileges normally assumed by UID 0.
I will not say such a container is unbreakable, but such a configuration is safer than running your process with the Linux default bounding set, which includes CAP_SETUID, CAP_SYS_ADMIN, and other capabilities that would enable escalation.
This raises the question of why anyone would not use this feature. If you’re running Linux, I’ll go so far as to say you should always drop capabilities in your application, or before running your application. Container managers do this by default, but it’s not necessary to use Docker, LXC, or Rocket to accomplish this. Without a container manager, you can simply use ‘capsh’ to manage capabilities manually. Nobody seems to do this, and it’s part of why container managers are all the rage today. Best practices are useless unless they’re automatic, immediate, and accessible.
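One way to confirm that capabilities really were dropped is to read the `CapEff` line of `/proc/<pid>/status`, which holds the effective set as a hexadecimal bitmask. The following is a minimal sketch that decodes such a mask, similar in spirit to `capsh --decode`. The helper names `decode_caps` and `read_cap_eff` are made up for illustration, and the table covers only a few capabilities relevant to this article; the bit numbers are those defined in the kernel's `linux/capability.h`.

```python
# Subset of capability bit numbers from <linux/capability.h>.
# The table is deliberately incomplete; unknown bits are ignored.
CAP_BITS = {
    1: "CAP_DAC_OVERRIDE",
    7: "CAP_SETUID",
    13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(mask):
    """Return the sorted names of the known capabilities set in a bitmask."""
    return sorted(name for bit, name in CAP_BITS.items() if mask & (1 << bit))


def read_cap_eff(status_path="/proc/self/status"):
    """Read the effective capability mask of a process from procfs."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                # The value is a hexadecimal string without a 0x prefix.
                return int(line.split()[1], 16)
    raise ValueError("no CapEff line in " + status_path)
```

Calling `decode_caps(read_cap_eff())` in a process that has dropped everything yields an empty list, since its mask is zero.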
Introducing User Namespaces
Containers primarily restrict processes based on features Linux calls ‘capabilities’ and ‘namespaces’. To say a process does not have the capability to set its UID to root (CAP_SETUID) means that the syscall `sys_setuid` and its related syscalls will refuse that process’s attempts to switch to an arbitrary UID.
Namespaces in Linux, which are the de facto container-specific mechanisms in the kernel, do not themselves generally prohibit a process from performing administrative syscalls, but rather control which resources those syscalls manage. For instance, after a process joins a network namespace, the kernel’s network syscalls and ioctls will only see and manage the network interfaces belonging to that namespace. Likewise, the mount namespace controls the resources available to the mount syscall.
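The namespaces a process belongs to can be inspected through the symbolic links under `/proc/<pid>/ns`, whose targets look like `net:[4026531992]`. Two processes share a namespace when the numbers match. Below is a small sketch of reading those links; the function names `parse_ns_link` and `namespaces_of` are illustrative assumptions, not part of any library.

```python
import os
import re

# Link targets under /proc/<pid>/ns have the form "type:[inode]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'net:[4026531992]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError("unexpected namespace link: " + target)
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self", proc_root="/proc"):
    """Map each entry in /proc/<pid>/ns to the inode of its namespace."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    for entry in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result
```

Comparing `namespaces_of()` with `namespaces_of(1)` shows which namespaces the current process shares with PID 1, provided it has permission to read those links.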
User namespaces, however, are in a way a namespacing of capabilities and, rather than being subtractive, actually grant non-root users increased access to system capabilities. When a user namespace is created, the CAP_SYS_ADMIN capability is granted within that namespace. Restrictions are applied such that this allows some arguments to ‘mount’, but not all, and denies loading kernel modules.
The theory of the user namespaces mechanism is that instead of dropping privileges from ‘root’, we provide a limited elevation of privileges to a non-root user. This kind of default-deny mechanism is generally preferred, but it’s tricky when we look at how it’s implemented in Linux.
If a (real) root user has had the CAP_SYS_ADMIN capability removed, but then creates a user namespace, this capability is restored for the (fake) root user. That is, before creating the namespace, ‘mount’ would be denied, but following the creation of the user namespace, the ‘mount’ syscall would magically work again, albeit in a limited fashion. While limited in function, it’s significant enough that, given a (real) root user and a kernel with user namespaces, Linux capabilities may be completely subverted.
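The effect described above can be observed by comparing the effective capability set before and after creating a user namespace. What follows is a rough sketch under stated assumptions: it calls `unshare(2)` through `ctypes`, the constants are taken from the kernel headers, and the function names are hypothetical. Because `unshare` changes the calling process permanently, it should only be run in a throwaway, single-threaded process.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # from <linux/sched.h>
CAP_SYS_ADMIN = 21          # bit number from <linux/capability.h>


def parse_cap_eff(status_text):
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, bit):
    """Return True if the given capability bit is set in the mask."""
    return bool(mask & (1 << bit))


def sys_admin_before_and_after():
    """Report CAP_SYS_ADMIN in the effective set before and after unshare."""
    def current():
        with open("/proc/self/status") as status:
            return has_cap(parse_cap_eff(status.read()), CAP_SYS_ADMIN)

    before = current()
    libc = ctypes.CDLL(None, use_errno=True)
    # Move this process into a new user namespace; fails with an errno
    # if user namespaces are unavailable or not permitted.
    if libc.unshare(CLONE_NEWUSER) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return before, current()
```

On a kernel with user namespaces enabled, a process that had dropped CAP_SYS_ADMIN would be expected to see the capability reappear after the call, scoped to the new namespace. If user namespaces are disabled or restricted, the call raises `OSError` instead.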
User Namespaces: to the rescue! or not!
There are two solutions to this problem:
- User namespaces everywhere. Don’t run code as (real) root. Your fake-root user, if she creates a user namespace, will only be able to elevate to her fake-root user. I know that Docker is introducing user-namespaces in a fashion sufficient to resolve this issue, at least for its containers. Any other container system which spawns its containers in a user-namespace would also eliminate this risk. The problem could remain, however, in any use of capability-restrictions which are not bundled with a container manager, such as DIY attempts at building secure systems.
- Disable user namespaces. Compile it out of your kernel and let POSIX capabilities do what they’ve always done.
There are distributions which intentionally ship without user namespaces, presumably for these reasons. The security-focused Grsecurity kernel also disables user namespaces by default.
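To see which situation a given host is in, the kernel's procfs can be queried. The sketch below makes several assumptions that go beyond this article: that `/proc/self/ns/user` is present only when user namespaces are compiled in, that `/proc/sys/user/max_user_namespaces` exists (it was added in later kernels, Linux 4.9 and up), and that `kernel.unprivileged_userns_clone` is available (a distribution-specific patch found on Debian-derived kernels). The function name `user_ns_status` is made up for illustration.

```python
import os


def _read_int(path):
    """Return the integer stored in a procfs file, or None if unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def user_ns_status(proc_root="/proc"):
    """Summarize whether user namespaces appear usable on this host."""
    return {
        # The ns/user link is assumed to exist only with CONFIG_USER_NS.
        "compiled_in": os.path.lexists(
            os.path.join(proc_root, "self", "ns", "user")),
        # A value of 0 means no user namespaces may be created.
        "max_user_namespaces": _read_int(
            os.path.join(proc_root, "sys", "user", "max_user_namespaces")),
        # Distribution-specific knob; None when the kernel lacks it.
        "unprivileged_userns_clone": _read_int(
            os.path.join(proc_root, "sys", "kernel",
                         "unprivileged_userns_clone")),
    }
```

A value of `None` only means the file was missing or unreadable, so the result is a hint rather than proof of how the kernel was built.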
Linux Security Modules such as AppArmor and SELinux may be used to tighten down the system such that, in the above exploitation scenario, CAP_SYS_ADMIN and various syscalls are denied to the (real) root user, preventing circumvention of the capabilities mechanism. Docker does this today.
Seccomp may also be used to filter system calls, but writing sufficient filters is like rolling a boulder uphill.
Is this a vulnerability?
The kernel developers will probably say “no”. It’s working as designed. Arguably, POSIX capabilities have been broken by user namespaces, and that could be a breaking of the userland contract. However, applications still run; they’re just less secure.
Also, the ability to circumvent capabilities still requires that the user be “root”. Thoughts on this are divided. Some believe that if a “root” user has no capabilities, they are not really root, despite their UID. Others believe that this UID is so special that if you simply have it, you’ve already lost.
I do think that POSIX capabilities for root-owned processes offer, today, a false sense of security that’s in contrast with how the Linux kernel has historically functioned. POSIX capabilities used to work a certain way, but no longer do. I think that’s a problem worth, at least, knowing about.
Eric Windisch is the CEO and founder of IOpipe, Inc. where he is working on tooling and services for developers of serverless applications.
(P.S. Some notes on what this article is NOT: This article is NOT “VMs are better” or “Containers are insecure”. I am focusing specifically on issues with a particular code path inside of Linux, as an issue with Linux. I also believe that “VMs do not contain”, and that they are host to a whole class of their own vulnerabilities tied to both their own OS and their compute architecture.)