“Huge Dirty COW” (CVE-2017–1000405)

The incomplete Dirty COW patch

Eylon Ben Yaakov
Nov 27, 2017 · 10 min read

The “Dirty COW” vulnerability (CVE-2016–5195) is one of the most hyped and branded vulnerabilities published. Every Linux version from the last decade, including Android, desktops and servers was vulnerable. The impact was vast — millions of users could be compromised easily and reliably, bypassing common exploit defenses.

Plenty of information was published about the vulnerability, but its patch was not analyzed in detail.

We at Bindecy were interested to study the patch and all of its implications. Surprisingly, despite the enormous publicity the bug had received, we discovered that the patch was incomplete.

“Dirty COW” recap

First, we need a full understanding of the original Dirty COW exploit. We’ll assume basic understanding of the Linux memory manager. We won’t recover the original gory details, as talented people have already done so.

The original vulnerability was in the get_user_pages function. This function is used to get the physical pages behind virtual addresses in user processes. The caller has to specify what kind of actions he intends to perform on these pages (touch, write, lock, etc…), so the memory manager could prepare the pages accordingly. Specifically, when planning to perform a write action on a page inside a private mapping, the page may need to go through a COW (Copy-On-Write) cycle — the original, “read-only” page is copied to a new page which is writable. The original page could be “privileged” — it could be mapped in other processes as well, and might even be written back to the disk after it’s modified.

Let’s now take a look at the relevant code in __get_user_pages:

The while loop’s goal is to fetch each page in the requested page range. Each page has to be faulted in until our requirements are satisfied — that’s what the retry label is used for.

follow_page_mask’s role is to scan the page tables to get the physical page for the given address (while taking into account the PTE permissions), or fail in case the request can’t be satisfied. During follow_page_mask’s operation the PTE’s spinlock is acquired— this guarantees the physical page won’t be released before we grab a reference.

faultin_page requests the memory manager to handle the fault in the given address with the specified permissions (also under the PTE’s spinlock). Note that after a successful call to faultin_page the lock is released — it’s not guaranteed that follow_page_mask will succeed in the next retry; another piece of code might have messed with our page.

The original vulnerable code resided at the end of faultin_page:

The reason for removing theFOLL_WRITE flag is to take into account the case the FOLL_FORCE flag is applied on a read-only VMA (when the VM_MAYWRITE flag is set in the VMA). In that case, the pte_maybe_mkwrite function won’t set the write bit, however the faulted-in page is indeed ready for writing.

If the page went through a COW cycle (marked by the VM_FAULT_WRITE flag) while performing faultin_page and the VMA is not writable, the FOLL_WRITE flag is removed from the next attempt to access the page — only read permissions will be requested.

If the first follow_page_mask fails because the page was read-only or not present, we’ll try to fault it in. Now let’s imagine that during that time, until the next attempt to get the page, we’ll get rid of the COW version (e.g. by using madvise(MADV_DONTNEED)).

The next call to faultin_page will be made without the FOLL_WRITE flag, so we’ll get the read-only version of the page from the page cache. Now, the next call to follow_page_mask will also happen without the FOLL_WRITE flag, so it will return the privileged read-only page — as opposed to the caller’s original request for a writable version of the page.

Basically, the aforementioned flow is the Dirty COW vulnerability — it allows us to write to the read-only privileged version of a page. The following fix was introduced in faultin_page:

And a new function, which is called by follow_page_mask, was added:

Instead of reducing the requested permissions, get_user_pages now remembers the fact the we went through a COW cycle. On the next iteration, we would be able to get a read-only page for a write operation only if the FOLL_FORCE and FOLL_COW flags are specified, and that the PTE is marked as dirty.

This patch assumes that the read-only privileged copy of a page will never have a PTE pointing to it with the dirty bit on — a reasonable assumption… or is it?

Transparent Huge Pages (THP)

Normally, Linux usually uses a 4096-bytes long pages. In order to enable the system to manage large amounts of memory, we can either increase the number of page table entries, or use larger pages. We focus on the second method, which is implemented in Linux by using huge pages.

A huge page is a 2MB long page. One of the ways to utilize this feature is through the Transparent Huge Pages mechanism. While there are other ways to get huge pages, they are outside of our scope.

The kernel will attempt to satisfy relevant memory allocations using huge pages. THP are swappable and “breakable” (i.e. can be split into normal 4096-bytes pages), and can be used in anonymous, shmem and tmpfs mappings (the latter two are true only in newer kernel versions).

Usually (depending on the compilation flags and the machine configuration) the default THP support is for anonymous mapping only. Shmem and tmpfs support can be turned on manually, and in general THP support can be turned on and off while the system is running by writing to some kernel’s special files.

An important optimization opportunity is to coalesce normal pages into huge pages. A special daemon called khugepaged scans constantly for possible candidate pages that could be merged into huge pages. Obviously, to be a candidate, a VMA must cover a whole, aligned 2MB memory range.

THP is implemented by turning on the _PAGE_PSE bit of the PMD (Page Medium Directory, one level above the PTE level). The PMD thus points to a 2MB physical page, instead of a directory of PTEs. Each time the page tables are scanned, the PMDs must be checked with the pmd_trans_huge function, so we can decide whether the PMD points to a pfn or a directory of PTEs. On some architectures, huge PUDs (Page Upper Directory) exist as well, resulting in 1GB pages.

THP is supported since kernel 2.6.38. On most Android devices the THP subsystem is not enabled.

The bug 🐞

Delving into the Dirty COW patch code that deals with THP, we can see that the same logic of can_follow_write_pte was applied to huge PMDs. A matching function called can_follow_write_pmd was added:

However, in the huge PMD case, a page can be marked dirty without going through a COW cycle, using the touch_pmd function:

This function is reached by follow_page_mask, which will be called each time get_user_pages tries to get a huge page. Obviously, the comment is incorrect and nowadays the dirty bit is NOT meaningless. In particular — when using get_user_pages to read a huge page, that page will be marked dirty without going through a COW cycle, and can_follow_write_pmd’s logic is now broken.

At this point, exploiting the bug is straightforward — we can use a similar pattern of the original Dirty COW race. This time, after we get rid of the copied version of the page, we have to fault the original page twice — first to make it present, and then to turn on the dirty bit.

Now comes the inevitable question — how bad is this?

Bug implications

In order to exploit the bug, we have to choose an interesting read-only huge page as a target for the writing. The only constraint is that we need to be able to fetch it after it’s discarded with madvise(MADV_DONTNEED). Anonymous huge pages that were inherited from a parent process after a fork are a valuable target, however once they are discarded they are lost for good — we can’t fetch them again.

We found two interesting targets that should not be written into:

  • The huge zero page
  • Sealed (read-only) huge pages

The zero page

When issuing a read fault on an anonymous mapping before it was ever written, we get a special physical page called the zero page. This optimization prevents the system from having to allocate multiple zeroed out pages in the system, which might never be written to. Thus, the exact same zero page is mapped in many different processes, which have different security levels.

The same principle applies to huge pages as well — there’s no need to create another huge page if no write fault has occurred yet — a special page called the huge zero page will be mapped, instead. Note that this feature can be turned off as well.

THP, shmem and sealed files

shmem and tmpfs files can be mapped using THP as well. shmem files can be created using the memfd_create syscall, or by mmaping anonymous shared mappings. tmpfs files can be created using the mount point of the tmpfs (usually /dev/shm). Both can be mapped with huge pages, depending on the system configuration.

shmem files can be sealed — sealing a file restricts the set of operations allowed on the file in question. This mechanism allows processes that don’t trust each other to communicate via shared memory without having to take extra measures to deal with unexpected manipulations of the shared memory region (see man memfd_create() for more info). Three types of seals exist -

  • F_SEAL_SHRINK: file size cannot be reduced
  • F_SEAL_GROW: file size cannot be increased
  • F_SEAL_WRITE: file content cannot be modified

These seals can be added to the shmem file using the fcntl syscall.

POC

Our POC demonstrates overwriting the huge zero page. Overwriting shmem should be equally possible and would lead to an alternative exploit path.

Note that after the first write page-fault to the zero page, it will be replaced with a new fresh (and zeroed) THP. Using this primitive, we successfully crash several processes. A likely consequence of overwriting the huge zero page is having improper initial values inside large BSS sections. A common vulnerable pattern would be using the zero value as an indicator that a global variable hasn’t been initialized yet.

The following crash example demonstrates that pattern. In this example, the JS Helper thread of Firefox makes a NULL-deref, probably because the boolean pointed by %rdx erroneously says the object was initialized:

This is another crash example — gdb crashes while loading the symbols for a Firefox debugging session:

Link to our POC

Summary

This bug demonstrates the importance of patch auditing in the security development life-cycle. As the Dirty COW case and other past cases show, even hyped vulnerabilities may get incomplete patches. The situation is not reserved for closed source software only; open source software suffers just as much.

Feel free to comment with any question or idea about the issue ☺

Disclosure timeline

The initial report was on the 22.11.17 to the kernel and distros mailing lists. The response was immediate and professional with a patch ready in a few days. The patch fixes the touch_pmd function to set the dirty bit of the PMD entry only when the caller asks for write access.

Thanks to the Security team and the distros for their time and effort of maintaining a high standard of security.

22.11.17 — Initial report to security@kernel.org and linux-distros@vs.openwall.org
22.11.17 — CVE-2017–1000405 was assigned
27.11.17 — Patch was committed to mainline kernel
29.11.17 — Public announcement

Bindecy

Security research lab blog