Meltdown / Spectre attacks in-depth

Adrien Mahieux — @Saruspete

While trying to be as accurate as possible, there may still be inconsistencies, erroneous statements or typos. Please feel free to report them to me: adrien.mahieux [at] gmail.com

There are already multiple papers explaining these 2 attacks in broad outline, so I’ll try to keep this one technical and concise, focusing on the x86_64 architecture and the Linux kernel.

A few words on these attacks: like all hyped security issues, they have a name, a website [1] [2] (it’s the same) and a logo. You’ll find the official papers of Meltdown [3] and Spectre [4] there. Their CVEs are:

  • CVE-2017–5753 [6]: bounds check bypass (Spectre variant 1)
  • CVE-2017–5715 [5]: branch target injection (Spectre variant 2)
  • CVE-2017–5754 [7]: rogue data cache load (Meltdown)

The end result of these 2 attacks is the same: an unprivileged application can read (not write) all memory of any affected system, regardless of the isolation technique (virtualization, container, namespace…) or the OS (Linux, Windows, MacOS…).

This is effectively a problem when storing secrets (pgp, root password, certificates…) but does not provide privilege escalation per se, although it can help exploit issues in software (eg, by defeating KASLR, reading a buffer-overflow canary value, or getting a password hash and brute-forcing it).

They target the CPU’s internal design, and thus are not trivial to mitigate, requiring a joint effort from the OS kernel, the hardware vendor and the applications.

Attack vector

A complete index of vendor advisories is available on the official attack sites [11].

A very extensive explanation and code are available on the Google Project Zero website [12].

Processor internal structure

While we usually refer to the CPU as a black box, its internals are quite standard among manufacturers. Among the tens of building blocks assembled together to efficiently process data, the following parts of an x86 CPU are involved in these 2 attacks:

  • Memory Management Unit [18]: Programs use Virtual Addresses [19], which must be translated into different Physical Addresses [20]. The whole memory space is split into Pages (on x86, their default size is 4KB, which can cohabit with “HugePages” of 2MB or 1GB), the smallest memory unit that can be allocated. Because different programs, or instances of the same program, may use the same virtual address, addresses are only meaningful within a context, which is associated with its dedicated Page-Table Entries (the mapping between the process’s virtual and physical addresses). This translation is done by the MMU, which is configured by the Kernel for its processes and other isolated elements. As this translation (known as Page Table Walking) requires multiple memory reads (precious time for something as common as a memory access), its results are saved in the Translation Lookaside Buffer.
    The MMU is also in charge of checking access authorization, and will trigger an exception if code running in its context tries to access memory it shouldn’t have access to.
  • Level 1/2/3 Cache [21]: Data mostly lives in RAM, which is now quite large, but slow compared to CPU frequencies. To speed up data access, a copy of the accessed data (and nearby memory, as per spatial and temporal locality) is kept in very fast memory inside the processor, called caches, ranging from small & fast (L1) to large & slow (L3, aka LLC / Last Level Cache) but still faster than RAM.
    L1 and TLB caches are often split in half, between data (L1D / DTLB) and instructions (L1I / ITLB).
    L2 and L3 are often unified, and contain both data and instructions (and, for L2, TLB entries as well, in the STLB).
  • Translation Lookaside Buffer [22]: When the MMU has decoded a virtual address, it keeps the result in a dedicated cache, the TLB. As virtual addresses are not unique across the system, the content of this cache is either completely flushed upon a context switch (when switching from one process to another, or processing an interrupt), or selectively invalidated (if the processor can figure out which entries to drop).
  • Branch Prediction [23]: When the CPU comes to a branch (a test with 2 possible results: true or false), it must know the result to decide which code path to take. To avoid wasting execution time waiting on slow operations (RAM access, complex calculation results…), modern CPUs try to guess the path and execute instructions ahead of knowing the real result. This is done by the “branch prediction unit”, which keeps track of previous results to make educated guesses and increase its hit ratio.
  • Pipeline [24] [25]: the main CPU processing queue, composed of:
  • Front-End: loads the instructions, decodes them into µops, and tries to predict the branches.
  • Execution-Engine (aka Back-End): the Reorder-Buffer queues the µops and sends them to the Scheduler, which in turn sends them to the dedicated Execution-Units to be processed. Among the Execution-Units, some are in charge of loading and storing data, and are connected to the Memory-Subsystem.
  • Memory-Subsystem: contains the Load & Store buffers, the L1 data cache and the L2 cache.
  • Out-Of-Order execution [26]: To avoid pipeline stalls (where sequential processing would block on an instruction waiting for data), the execution-engine of the CPU can reorder µops (if they have no dependency on the waiting instruction), execute them in the modified order, then let the Reorder-Buffer retire (commit) them in the initial order. While the µops were not executed in the order requested by the application (assembly), the results leave the system in the original order and provide the expected behaviour.
Simplified view of the CPU pipeline, as shown in the official Meltdown paper [3]

With branch prediction, the CPU takes a checkpoint (saves its current internal state) and follows the guessed path.

  • Upon a successful guess, the pre-calculated results are committed / retired (made visible to the rest of the system), the checkpoint is discarded, and computation continues with these values, saving significant waiting time.
  • Upon a bad guess, the CPU pipeline is flushed (specifically the buffers and queues) and the checkpoint is restored. The calculations made during the speculation window are discarded, but the memory accessed and brought into the CPU caches is not evicted.

To ensure coherency between the multiple processes, the caches use indexes, tags and hints to differentiate the origin of entries (handling synchronization problems and homonym / synonym addresses), and a communication protocol (often MOESI [27]) to keep the caches synchronized between multiple cores.

On recent processors, the L1 cache is Virtually Indexed, Physically Tagged (VIPT), which requires some work from the MMU, but the index lookup and the address translation can be done in parallel. The TLB adds a valuable speedup here.
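These latency differences are easily measurable from userland, which is the entire basis of the attacks’ covert channel. As a rough illustration (my own sketch, not taken from the PoCs: buffer sizes, iteration counts and the use of clock_gettime are arbitrary choices), a pointer-chasing loop shows the average load latency jumping every time the working set outgrows a cache level:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase pointers through a random cycle so each load depends on the
 * previous one and the hardware prefetcher cannot help. Average latency
 * jumps as the working set outgrows L1, then L2, then L3. */
int main(void)
{
    for (size_t kb = 4; kb <= 64 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(size_t);
        size_t *buf = malloc(n * sizeof(*buf));
        if (!buf)
            return 1;

        /* Sattolo's algorithm: build a single random cycle */
        for (size_t i = 0; i < n; i++)
            buf[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % i;
            size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
        }

        const long M = 1 << 22;          /* number of dependent loads */
        volatile size_t idx = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long k = 0; k < M; k++)
            idx = buf[idx];              /* serialized load chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%6zu KB : %.1f ns/load\n", kb, ns / M);
        free(buf);
    }
    return 0;
}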

Common ground

Both attacks exfiltrate data through the same covert channel: the secret is encoded into which cache line gets loaded during speculation, and recovered afterwards by timing memory accesses. This is why the proof-of-concept code uses 2 arrays:

  • “array1” will ultimately access memory that is not within the process’s memory space (the memory address being attacked / leaked), and use this value to calculate the address of an element in “array2”, which is valid and accessible within the process.
  • “array2” is a simple array, without any meaningful data. The interesting point is the time taken to access each possible entry of array2: if the one being tested is noticeably faster than the others, it was in the cache, so we deduce it was the entry accessed speculatively through array1.

To avoid being killed by the MMU for trying to access memory it doesn’t have access to (Segmentation Fault), the program adds the usual bounds test “if (x < array1_size)”, but by the time the test is resolved, the CPU has already speculatively executed the memory access, loading the data into the cache.

When the CPU realizes it mispredicted the branch, it discards the out-of-bounds read value, but leaves the array2 entry in the cache.

Now, the attacking process scans all “array2” entries and measures how long the CPU takes to load each one. The entry that loads significantly faster is in the cache, so it is the one that was speculatively accessed. The out-of-bounds value itself was discarded, so we don’t have it directly; but since each possible byte value was mapped to a different index of array2, knowing which index is cached tells us the value of the leaked memory.
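Putting it all together, here is a minimal simplified sketch of this common pattern (array names follow the convention of the spectre.c PoC; the branch-predictor mistraining loop, threshold calibration and retry logic of the real PoCs are elided, and the 80-cycle threshold is an illustrative assumption):

#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_lfence, __rdtscp */

#define PAGE 4096
uint8_t array1[16];                  /* victim array, speculatively over-read */
unsigned int array1_size = 16;
uint8_t array2[256 * PAGE];          /* probe array: one page per byte value */

/* Victim gadget: the bounds check is correct, but the CPU may run the
 * body speculatively while the check is still being resolved. */
void victim(size_t x)
{
    if (x < array1_size)
        (void)*(volatile uint8_t *)&array2[array1[x] * PAGE]; /* cache trace */
}

/* Time a single access: a cache hit is noticeably faster than a miss. */
static uint64_t probe(const volatile uint8_t *addr)
{
    unsigned int aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*addr;                     /* the memory access being timed */
    _mm_lfence();                    /* wait for the load to complete */
    return __rdtscp(&aux) - t0;
}

int main(void)
{
    const uint64_t THRESHOLD = 80;   /* illustrative hit/miss cutoff (cycles) */

    for (int i = 0; i < 256; i++)
        _mm_clflush(&array2[i * PAGE]);   /* start from a cold probe array */

    /* Mistraining of the branch predictor with in-bounds calls, and the
     * actual out-of-bounds call leaking a secret byte, are elided here. */
    victim(0);

    for (int guess = 0; guess < 256; guess++)
        if (probe(&array2[guess * PAGE]) < THRESHOLD)
            return guess;            /* the cached index reveals the byte */
    return -1;
}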

Meltdown — CVE-2017–5754 — Cache-Data Leak

Intel processors using out-of-order execution do not check the privilege bit of the page-table entry before speculatively executing the instructions. The illegal read does generate an exception (handled in the original sequential order), but it is only raised at retirement, which then interrupts the flow. So even though the application can never architecturally see the read value, the data has already leaked into the cache.
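The transient sequence at the heart of meltdown.c [47] looks roughly like this (a simplified sketch: the exception handling or suppression, via a signal handler or TSX, and the retry logic are elided; probe_array is recovered with the same timed probe as above):

#include <stdint.h>

/* Transiently leak one byte from a kernel address into the cache.
 * The first load faults architecturally, yet on affected Intel CPUs the
 * dependent access below may still execute before the fault is raised
 * at retirement. */
void transient_leak(const uint8_t *kernel_addr, const uint8_t *probe_array)
{
    uint8_t value = *(volatile uint8_t *)kernel_addr;       /* faults here */
    (void)*(volatile uint8_t *)&probe_array[value * 4096];  /* cache encode */
    /* never reached architecturally: a SIGSEGV is delivered instead */
}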

AMD processors are believed not to be affected by this attack because they check the page access rights before the speculative read executes, so the cache is never polluted and no data is leaked.

Being able to see kernel-space memory, a process can also see all the data the kernel holds or caches there: files opened with mmap, shared memory… This way, an attacker can leak secrets such as SSH keys, certificates, shadow passwords, private emails, etc.

The advertised memory read speed of Meltdown on an Intel Core i7–6700K was 122KB/s with exception handling, and 502KB/s with exception suppression.

When Simultaneous Multi-Threading (SMT, known as Hyper-Threading on Intel CPUs) is enabled, each hardware thread only gets its own set of registers, while most other internal components (caches, branch prediction unit…) are shared. It is then easier to mistrain the branch predictor of one thread when the malicious program executes on the sibling thread.

Spectre — CVE-2017–5753 — Bounds Check Bypass

The Linux kernel itself ships a JIT compiler, eBPF, since version 3.18, which allows users to have the kernel generate such a vulnerable pattern, making kernel memory readable within a 4GB range.

Note that modern browsers (like Chrome, Firefox, Safari and Edge) also use a JIT compiler to execute JavaScript code, so they are easy and valuable targets for this attack (sites visited by the user, credit card info, passwords, cookies…).

The initial unoptimized code was able to read approximately 10KB/s on a Haswell i7 Surface Pro 3.
Google Project Zero reported 2KB/s after a 4-second startup. [12]

Spectre — CVE-2017–5715 — Branch Target Injection

In this variant, the attacker mistrains the branch predictor so that an indirect branch in the victim speculatively jumps to an attacker-chosen code sequence (a “gadget”) that accesses the targeted memory. As with the other variants, the cache then contains the read entry, which can be recovered by a timing attack.

Google Project Zero reported a read speed of 1.5KB/s with a 10 to 30 minute startup, with room for optimization [12].

Mitigation patches and performance impact

Meltdown — CVE-2017–5754 — Cache-Data Leak

The fix consists in unmapping the kernel from the user-space page tables: the kernel keeps seeing the full user-space memory, but the application no longer sees the kernel memory at all.

On Linux, this separation is called KPTI (Kernel Page-Table Isolation). The first implementation was called KAISER, and aimed at limiting the leakage of KASLR (Kernel Address Space Layout Randomization) offsets and the bugs abusing memory-management errors in programs and drivers.

As we now have two different sets of page tables, the MMU must flush the TLB on each context switch (like a syscall, an interrupt or an exception) to keep the cache coherent. These two flushes (one for each direction of the switch) have a severe impact on processes that perform frequent kernel entries/exits (figures of up to 30% on I/O-heavy workloads). Even for lighter workloads, the complete TLB flush on each context switch incurs a lot of cache misses. And the kernel scheduler uses a timer (usually configured at 1000Hz) to distribute CPU time across processes, which implies context switches.
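To get a feel for this per-switch cost on your own machine, a rough micro-benchmark sketch (my own illustration, not an official tool; getpid is called through syscall() to bypass any libc caching) can time raw kernel round-trips, to be compared between pti=off and pti=on boots:

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Time N minimal kernel round-trips: with KPTI enabled, each entry/exit
 * pays for the page-table switch and TLB effects, so the per-syscall
 * cost rises noticeably compared to a pti=off boot. */
int main(void)
{
    const long N = 1000000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        syscall(SYS_getpid);         /* raw syscall, bypasses libc caching */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per syscall\n", ns / N);
    return 0;
}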

To lessen the impact a bit, x86_64 processors have a feature called Processor-Context ID (cpuflag PCID) [28] [29], similar to the Address Space ID (ASID) available on other architectures. It tags TLB entries with a context identifier, so that only the entries belonging to the current context are used, without flushing the others.

There is also Invalidate-PCID (cpuflag INVPCID) [30], which allows fast invalidation of TLB entries based on the context value.

As PCID was not seen as an important nor critical addition, its adoption is relatively recent. That is:

  • Its use was implemented in Kernel 4.14. If you have the cpuflag but the kernel doesn’t use it, you won’t benefit from it.
  • PCID has been supported in hardware since 2010 (Intel Westmere).
  • INVPCID only arrived with Intel Haswell.
  • When using virtualization, it must be supported by host AND guest.

Before Meltdown, this feature had benefits only on highly stressed systems. Now, it’s a performance and security requirement to avoid huge TLB-miss penalties.

If you cannot bear this impact, the isolation can be disabled by adding the “nopti” or “pti=off” kernel boot option.
You can also disable it during runtime, by mounting debugfs and setting the pti_enabled option to 0:

mount -t debugfs debugfs /sys/kernel/debug
echo 0 > /sys/kernel/debug/x86/pti_enabled

Please note that you should only do that if you have a very secure environment, with only trusted users and no unknown code that might run on the server.

Spectre — CVE-2017–5753 — Bounds Check Bypass

This variant of Spectre is currently mitigated by multiple manual kernel patches, inserting serialization instructions (like lfence [31]) that force the speculation to be resolved instead of guessed.
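In C, the principle looks like the following sketch (read_element is a hypothetical victim function; the actual kernel patches use dedicated macros rather than this bare inline lfence):

#include <stddef.h>
#include <stdint.h>

/* The serializing lfence prevents the CPU from speculating past the
 * bounds check, so the out-of-bounds load can no longer happen
 * transiently. */
uint8_t read_element(const uint8_t *array1, size_t array1_size,
                     const uint8_t *array2, size_t x)
{
    if (x < array1_size) {
        __asm__ __volatile__("lfence" ::: "memory");
        return array2[array1[x] * 4096];
    }
    return 0;
}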

It’s not possible to disable this patch.

Spectre — CVE-2017–5715 — Branch Target Injection

Kernel + microcode [32] features, and a compiler feature, have been added to mitigate this flaw [33]:

  • Retpoline (Return Trampoline) [34]: compiler options (-mretpoline for LLVM [34], -mindirect-branch=thunk-extern for GCC [35]) generate assembly code that traps the speculative execution in an infinite loop, should this path be taken (see the sketch after this list). Once the real branch target arrives, the infinite loop is discarded and no impact should be seen.
  • Indirect Branch Restricted Speculation (ibrs): configures the newly provided MSRs (Model Specific Registers, configuration options specific to each CPU model) to restrict indirect-branch speculation directly in the CPU.
  • Indirect Branch Prediction Barriers (ibpb): like ibrs, but acts on the leakage of branch-predictor state across contexts.
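For reference, a retpoline thunk for an indirect jump through %r11 looks roughly like this (GNU assembler syntax embedded in C for illustration, following the pattern described in [34]; the thunk name follows the kernel’s convention):

/* Speculation is captured by the 'call': the Return Stack Buffer predicts
 * that 'ret' goes back to the pause/lfence loop, so a poisoned Branch
 * Target Buffer is never consulted. The real target in %r11 is written
 * over the return address, so architectural execution jumps correctly. */
__asm__(
    ".globl __x86_indirect_thunk_r11\n"
    "__x86_indirect_thunk_r11:\n"
    "   call 2f\n"            /* push address of the capture loop */
    "1: pause\n"              /* speculative execution spins here */
    "   lfence\n"
    "   jmp 1b\n"
    "2: mov %r11, (%rsp)\n"   /* replace return address with real target */
    "   ret\n"
);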

The 2 new kernel features have an important impact on system performance, so for CPUs older than Intel Skylake, retpoline was proposed in the compiler toolchain to mitigate the issue with a lower performance impact.

On Intel Skylake processors, the target of a ‘ret’ (return) instruction is also cached and predicted (using the Return Stack Buffer, falling back to the Branch Target Buffer), so retpoline is not 100% effective there; on these CPUs the kernel will enable both the ibrs and ibpb features by default [37].

From the RedHat Advisory [38], the values to use for ibpb and ibrs are automatically detected at kernel initialization:

  • ibrs = 1: only the kernel runs with indirect branch restricted speculation.
  • ibrs = 2: both userland and kernel run with indirect branch restricted speculation (the default on AMD processor families 10h, 12h and 16h).
  • ibpb = 1: an IBPB barrier, which flushes the contents of the indirect branch predictor, is run on user-mode and guest-mode context switches, to prevent user and guest mode from attacking other applications or virtual machines on the same host. In order to protect virtual machines from other virtual machines, ibpb = 1 is needed even if ibrs = 2.
  • ibpb = 2: indirect branch prediction barriers are used instead of IBRS at all kernel and hypervisor entry points (in fact, this setting also forces ibrs_enabled to 0). ibpb_enabled=2 is the default on CPUs that only have the IBPB_SUPPORT feature and not SPEC_CTRL. ibpb_enabled=2 doesn’t protect the kernel against attacks based on simultaneous multi-threading (SMT, also known as hyperthreading); therefore, it provides less complete protection unless SMT is also disabled.

To fix your system against this attack, you’ll need to:

  • update the microcode / firmware of your Intel [39] and AMD [40] CPUs (check with your OEM)
  • update the kernel with one compiled with the Retpoline mitigation enabled (if you have a pre-Skylake processor)
  • update the kernel with one including the ibpb and ibrs patches (if you have a Skylake or newer processor)

If you cannot bear the performance impact, these patches can be selectively disabled by adding the “noibpb” or “noibrs” kernel boot options, or at runtime, by mounting debugfs and setting the ibpb_enabled and ibrs_enabled options to 0:

mount -t debugfs debugfs /sys/kernel/debug
echo 0 > /sys/kernel/debug/x86/ibpb_enabled
echo 0 > /sys/kernel/debug/x86/ibrs_enabled

Retpoline can also be disabled with kernel boot option “noretpoline”.

Proof of Concept

Remember to NEVER run already-compiled binaries: malicious people often take advantage of the fear surrounding these flaws to infect your system by making you run malware.
Only download source code, check that it doesn’t do anything weird (obfuscated data, embedded commands…), and compile the binary yourself.

Spectre (last 2 pages of [4])
Meltdown.c [47]
Spectre-Meltdown-Checker.sh [48]
Spectre-Meltdown-PoC [49]

Thoughts about optimization

Branch hints such as the Linux kernel’s likely()/unlikely() macros do not program the CPU’s branch predictor directly; they give a hint to the compiler to reorder the generated code so that the likely path sits in a packed block, fitting in fewer cache lines. [41] [42]
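For instance, here is a minimal sketch using GCC/Clang’s __builtin_expect, which the Linux kernel wraps as likely()/unlikely() (the process function is a made-up example):

#include <stdio.h>

/* Hint the compiler about the expected truth value of a condition so it
 * lays out the hot path as a packed, fall-through block of code. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int fd)
{
    if (unlikely(fd < 0)) {          /* error path moved out of the hot block */
        fprintf(stderr, "bad fd\n");
        return -1;
    }
    /* hot path: kept contiguous, fitting in fewer cache lines */
    return 0;
}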

Other branch-prediction optimizations may be done in the code, without needing low-level architecture knowledge.

As an example, assume you have an array of int values and run multiple tests based on their value (higher / lower than…). If the data is randomly distributed, it will be very hard for the branch predictor to know which branch will be taken, as every value may be higher or lower than the previous one.
However, if you sort the array beforehand (eg, in ascending order), the branch prediction will be much more accurate, leading to fewer pipeline flushes and higher throughput.
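The classic demonstration is to sum only the values above a threshold, with and without sorting first (a self-contained sketch; array size, pass count and the 128 threshold are arbitrary):

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* With random data, the 'data[i] >= 128' branch is taken unpredictably
 * (~50%), causing constant mispredictions. Once sorted, the branch is
 * false for the first half and true for the second: near-perfect
 * prediction, and a markedly faster loop. */
int main(void)
{
    enum { N = 1 << 20 };
    static int data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    qsort(data, N, sizeof(int), cmp_int);  /* comment out to compare timings */

    long sum = 0;
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)
                sum += data[i];
    printf("%ld\n", sum);
    return 0;
}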

In 2007, when Theo de Raadt refused to support Intel Core hardware, he made some statements about the hardware bugs [43]:

These processors are buggy as hell, and some of these bugs don’t just
cause development/debugging problems, but will *ASSUREDLY* be
exploitable from userland code.
[ … ]
- Basically the MMU simply does not operate as specified/implimented
in previous generations of x86 hardware. It is not just buggy, but
Intel has gone further and defined “new ways to handle page tables”
(see page 58).
- Some of these bugs are along the lines of “buffer overflow”; where
a write-protect or non-execute bit for a page table entry is ignored.
Others are floating point instruction non-coherencies, or memory
corruptions — outside of the range of permitted writing for the
process — running common instruction sequences.

Note that some errata like AI65, AI79, AI43, AI39, AI90, AI99 scare
the hell out of us. Some of these are things that cannot be fixed in
running code, and some are things that every operating system will do
until about mid-2008, because that is how the MMU has always been
managed on all generations of Intel/AMD/whoeverelse hardware. Now
Intel is telling people to manage the MMU’s TLB flushes in a new and
different way. Yet even if we do so, some of the errata listed are
unaffected by doing so.

As I said before, hiding in this list are 20–30 bugs that cannot be
worked around by operating systems, and will be potentially
exploitable. I would bet a lot of money that at least 2–3 of them are.

Now that we went through all this, a small XKCD [44] for fun :-)

https://xkcd.com/1938/

As a last word: a lot of security analysis should be added to performance-related tools and features, like PCM, MSRs and DMA.
For the time being, PCM can be used to detect attacks [45], but these features are so powerful that I wouldn’t be surprised if they got used as an attack vector.

Links