Meltdown / Spectre attacks in-depth
Adrien Mahieux — @Saruspete
While I try to be as accurate as possible, there may still be inconsistencies, erroneous statements or typos. Please feel free to report them to me: adrien.mahieux [at] gmail.com
There are already multiple papers explaining these 2 attacks in broad strokes, so I’ll try to keep this one technical and concise, focusing on the x86_64 architecture and the Linux kernel.
A few words on these attacks: like all hyped security issues, they have a name, a website (it’s the same for both) and a logo. You’ll find the official papers of Meltdown and Spectre on it. Their CVEs are :
- CVE-2017–5753 : bounds check bypass (Spectre variant 1)
- CVE-2017–5715 : branch target injection (Spectre variant 2)
- CVE-2017–5754 : rogue data cache load (Meltdown)
The end result of these 2 attacks is the same: an unprivileged application can read (but not write) all memory of an affected system, regardless of the isolation technique (virtualization, container, namespace…) or the OS (Linux, Windows, macOS…).
This is a real problem when storing secrets (pgp, root password, certificates…) but does not provide privilege escalation per se, although it can help exploit other software issues (eg, by defeating KASLR, reading a buffer-overflow canary value, or getting a password hash and brute-forcing it…).
They target the internal design of CPU implementations, and are thus not trivial to mitigate, requiring a joint effort of the OS kernel, the hardware vendors and the applications.
These attacks abuse performance-related features of CPUs. Meltdown seems to be effective only on Intel processors, while Spectre can be used against all x86 vendors (Intel, AMD) and some ARM processors.
A complete index of vendor advisories is available on the official attack sites .
A very extensive explanation and code are available on the Google Project Zero website .
Processor internal structure
In the past, a processor’s behaviour was fixed once and for all by its manufacturing process (called Photolithography). After some catastrophic CPU bugs (like the Cyrix Coma, Pentium FDIV or Pentium F00F) requiring hardware replacement of the affected units, Intel added an internal decoding sequence. Instead of processing x86 opcodes directly, these CISC instructions are translated into an internal language called “µops” (“micro-operations”). This language (private to each processor vendor) is meant to be very simple, and to allow more optimizations at the various stages of the processing sequence.
By updating this firmware (called “micro-code”), vendors can now change the behavior of some parts of their CPU (to a certain extent), without having to physically replace them. Unlike BIOS updates (which are persistent), micro-code updates are temporary (reset to the factory version at reboot). This requires the kernel, BIOS or another system component to re-apply the updated micro-code at every boot, but it avoids bricking your costly processor if a firmware update renders it unusable.
While we usually refer to the CPU as a black box, its internals are quite standard among manufacturers. Among the tens of building blocks assembled together to efficiently process data, the following parts of an x86 CPU are involved in these 2 attacks :
- Memory Management Unit : Programs use Virtual Addresses, which must be translated into different Physical Addresses. The whole memory space is split into Pages (on x86, their default size is 4KB, which can cohabit with “HugePages” of 2MB or 1GB), the smallest memory unit that can be allocated. Because different programs, or instances of the same program, may use the same virtual @, an address is only meaningful within a context, which associates it with a dedicated Page-Table-Entry (containing the process mapping between Virtual @ and Physical @). This PTE translation is done by the MMU, which is configured by the Kernel for its processes and other isolated elements. As this translation (known as Page Table Walking) requires multiple memory reads (thus precious time for something as common as a memory read), the results are saved in a Translation Lookaside Buffer.
The MMU is also in charge of checking access authorization, and will trigger an exception if code running in its context tries to access memory it shouldn’t have access to.
- Level 1/2/3 Cache : Data mostly lives in RAM, which is now quite large, but also slow compared to CPU frequencies. To speed up data access, a copy of the accessed data (and nearby memory, as per spatial and temporal locality) is kept in very fast memory inside the processor, called caches, ranging from small & fast (L1) to large & slow (L3, aka LLC / Last Level Cache), which is still faster than RAM.
The L1 and TLB caches are often split in two, between data (L1D / DTLB) and instructions (L1I / ITLB).
L2 and L3 are often unified, and contain both data and instructions (and, for L2, TLB entries as well, aka STLB).
- Translation Lookaside Buffer : When the MMU has decoded a virtual address, it keeps the result in a dedicated cache, the TLB. As virtual addresses are not unique across the system, the content of this cache is either completely flushed upon a context switch (when switching from one process to another, or processing an interrupt), or selectively trashed (if the processor can figure out which entries to evict).
- Branch Prediction : When the CPU comes to a branch (a test with 2 possible results: true or false), it must have the result to know which code path to take. To avoid wasting execution time waiting on slow operations (RAM access, a complex calculation result…), modern CPUs try to guess the path and execute instructions ahead of knowing the real result. This is done by the “branch prediction unit”, which keeps track of previous results to make educated guesses and increase its hit-ratio.
- Pipeline  : the main CPU processing queue, composed of :
- Front-End. Loads the instructions, decodes them to µops, and tries to predict the branches.
- Execution-Engine (aka Back-End). The Reorder-Buffer reorders the µops and sends them to the Scheduler, which in turn sends them to the dedicated Execution-Unit to be processed. Among the Execution-Units, some are in charge of loading and storing data, and are connected to the Memory-Subsystem.
- Memory-Subsystem. Contains the Load & Store buffers, the L1 data cache and the L2 cache.
- Out-Of-Order execution : To avoid pipeline stalls (where sequential processing would block on an instruction waiting for data), the execution-engine of the CPU can reorder µops (if there is no dependency on the waiting instruction), execute them in the modified order, then let the Unified Reservation Station retire them in the initial order. While the µops were not executed in the order requested by the application (assembly), the results leave the system in the original order and provide the expected behaviour.
With branch-prediction, the CPU will issue a checkpoint (saving its current internal state) and take the guessed path.
- Upon a successful guess, the pre-calculated results are committed / retired (made visible to the rest of the system), the checkpoint is discarded, and computation continues with these values, saving significant waiting time.
- Upon a bad guess, the CPU pipeline is flushed (specifically the buffers and queues), and the checkpoint is restored. The calculations made during the speculation window are discarded, but the memory accessed and loaded into the CPU caches is not evicted.
To ensure coherency between multiple processes, the caches use indexes, tags and hints to differentiate the origin (problems of synchronization, homonym and synonym @), and a communication protocol (often MOESI) to synchronize the caches between multiple cores.
On recent processors, the L1 is Virtually Indexed, Physically Tagged (VIPT), which requires some work from the MMU, but the index lookup and the translation can be done in parallel. The TLB adds a valuable speedup here.
These 2 attacks rely on indexed loads or indirect branches: the accessed memory address is not “fixed” in the instruction code (like a simple variable) but rather read from a register and then dereferenced.
This is why the proof of concept code is using 2 arrays:
- “array1” is used to ultimately access memory that is not within the process’ memory space (the memory address being attacked / leaked); this value is then used to calculate the address of an element in “array2”, which is valid and accessible within the process.
- “array2” is a simple array, without any meaningful data. The interesting point is the time taken to access each possible entry of array2: if the one being tested is noticeably faster than the others, it was in the cache, so we deduce it was the entry accessed speculatively through array1.
To avoid being killed by the MMU for trying to access memory it doesn’t have access to (Segmentation Fault), the program adds the usual bounds test “if (counter < array1.size)”, but the CPU has already speculatively executed the memory access code, loading the data into the cache.
When the CPU realizes it mis-predicted the branch, it discards the out-of-bound read value, but leaves the array2 value in the cache.
Now, the attacking process scans all “array2” entries, and measures how long the CPU takes to fetch each one. The entry that is significantly faster to load is in cache, so it was the wrongly pre-executed one. As the speculative result was discarded, we don’t have its value directly. But since each possible value was uniquely mapped to a different index, knowing which index of array2 was cached lets us deduce the value of the accessed memory.
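Putting the two arrays together, the vulnerable pattern looks like the following sketch (mirroring the array1 / array2 / victim_function naming used in the Spectre paper’s proof of concept; the train/flush/probe loop around it is omitted):

```c
#include <stddef.h>
#include <stdint.h>

unsigned int array1_size = 16;
uint8_t array1[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                      9, 10, 11, 12, 13, 14, 15, 16};
uint8_t array2[256 * 512];   /* probe array: one cache line per possible byte value */
uint8_t temp;                /* keeps the reads from being optimized away */

/* The bounds check is architecturally correct, but the CPU may
 * speculatively execute the body with x out of bounds, caching the
 * array2 line indexed by the (secret) byte array1[x]. */
void victim_function(size_t x) {
    if (x < array1_size)
        temp &= array2[array1[x] * 512];
}
```

An attacker first calls victim_function repeatedly with in-bounds values to train the predictor, flushes array2 (and array1_size) from the cache, then calls it once with a malicious out-of-bounds x; timing each array2[i * 512] afterwards reveals which line was speculatively loaded, i.e. the secret byte.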
Meltdown — CVE-2017–5754 — Cache-Data Leak
Aside from translating Virtual @ to Physical @, the MMU is also in charge of checking memory access rights. Kernel-owned pages are mapped into every process’ address space (allowing features like vdso, shared memory and efficient transitions between user and kernel space, especially when handling interrupts).
As the kernel pages are flagged as privileged, processes cannot access them without triggering an exception. This exception can either be handled (code specified by the offending program is executed to handle it) or suppressed (the control flow is redirected using memory transactions and rolled back), which gives the attacking program a way to check the cache after the speculative execution, then guess the value of the read memory.
Intel processors using Out-of-Order execution do not check this privileged bit before executing the instructions. While the illegal read generates an exception (handled in the original sequential order), it is only handled at the end, by the reservation-station, which interrupts the flow. Even though the application cannot get the read value architecturally, the data was already leaked into the cache.
AMD processors are believed not to be affected by this attack, because they check page accessibility before the speculative read is executed, so the cache is never polluted and data is not leaked.
Being able to see kernel-space memory, a process can read any data the kernel has cached, like files opened with mmap, shared memory… This way, you can leak secrets like ssh keys, certificates, shadow passwords, private emails, etc.
The advertised memory read speed of Meltdown on an Intel Core i7–6700K was 122KB/s with exception handling, and 502KB/s with exception suppression.
When Simultaneous Multi-Threading (SMT, known as Hyper-Threading on Intel CPUs) is enabled, each thread only gets its own set of registers, while most other internal components (caches, branch-prediction unit…) are shared. It’s then easier to mistrain the branch prediction of one thread if the malicious program executes on the other thread.
Spectre — CVE-2017–5753 — Bounds Check Bypass
This attack can only target programs that present a specific read pattern, and not all applications provide it.
However, since version 3.18 the Linux kernel provides a JIT compiler named eBPF, which allows users to generate such a pattern and read kernel memory in a 4GB range.
The initial unoptimized code was able to read approximately 10KB/s on a Haswell i7 Surface Pro 3.
Google Zero Project reported 2KB/sec after a 4 sec startup. 
Spectre — CVE-2017–5715 — Branch Target Injection
This variant trains the branch-prediction unit to expect a certain result, then changes some variable to invert the outcome of the trained branch. Upon this change, the processor is tricked into taking the usual path again, executing the read instructions (which are now out of the process boundaries and not allowed), then realizing its prediction error and rolling back.
As usual, the cache now contains the read entry, which can be guessed by timing-attack.
Google Zero Project reported read speed of 1.5KB/sec with 10 to 30 min startup, with room for optimization .
Mitigation patches and performance impact
Meltdown — CVE-2017–5754 — Cache-Data Leak
The only effective way to mitigate Meltdown is to separate the Page-Tables of user-space and kernel-space. The previous setup, where kernel and user shared the same memory mapping, made kernel entry/exit easier and faster to handle, while the MMU was doing the security checks.
Now, while the kernel keeps seeing the full user-space memory, the application won’t see the kernel memory at all.
On Linux, this separation is called KPTI (Kernel Page-Table Isolation). The first implementation was called KAISER, and aimed to limit the leaking of KASLR (Kernel Address Space Layout Randomization) offsets and the bugs abusing memory-management errors in programs and drivers.
As we now have different Page-Table Entries, the MMU must flush the TLB at each context switch (like a syscall, an interrupt or an exception) to keep the cache coherent. These two flushes (one for each direction of the switch) have a severe impact on processes that do frequent kernel entries/exits (numbers of up to 30% on I/O-heavy workloads). Even for lighter workloads, the complete TLB flush at each context switch incurs a lot of cache misses. The kernel scheduler also uses a timer (usually configured at 1000Hz) to distribute CPU time across processes, which implies a context switch.
To lessen the impact a bit, x86_64 processors have a feature called Processor-Context ID (cpuflag PCID), similar to the Address Space ID (ASID) available on other architectures. It tags TLB entries with a context ID, so that only the entries belonging to the current context are matched, avoiding a full flush.
There is also Invalidate-PCID (cpuflag INVPCID), which allows fast invalidation of TLB entries based on the context value.
As PCID was not seen as an important or critical addition, its adoption is relatively recent. That is :
- Kernel support was implemented in Linux 4.14. If you have the cpuflag but your kernel does not use it, you won’t benefit from it.
- PCID has been supported in hardware since 2010 (Intel Westmere).
- INVPCID only arrived with Intel Haswell.
- When using virtualization, it must be supported by both host AND guest.
Before Meltdown, this feature had benefits only on highly stressed systems. Now, it’s a performance and security requirement to avoid huge TLB-miss penalties.
If you cannot bear this impact, the isolation can be disabled by adding the “nopti” or “pti=off” kernel boot options.
You can also disable it during runtime, by mounting debugfs and setting the pti_enabled option to 0:
mount -t debugfs debugfs /sys/kernel/debug
echo 0 > /sys/kernel/debug/x86/pti_enabled
Please note that you should only do that if you have a very secure environment, with only trusted users and no unknown code that might run on the server.
Spectre — CVE-2017–5753 — Bounds Check Bypass
Spectre attacks are more complicated to mitigate, as they do not rely on a single software / kernel behaviour.
This variant of Spectre is currently mitigated by multiple manual kernel patches inserting serialization instructions (like lfence), forcing the speculation to be resolved instead of guessed.
It’s not possible to disable this patch.
Spectre — CVE-2017–5715 — Branch Target Injection
As branch-prediction is a very central performance feature of current processors, we can’t just disable it. It’s also not possible to totally control it from software, so several mitigations were added to limit its effects.
- The Retpoline (Return Trampoline) compiler option (-mretpoline for LLVM, -mindirect-branch=thunk-extern for GCC) generates assembly code that traps the speculative execution of an indirect branch in an infinite loop, should that path be taken. Upon arrival of the real branch target, the speculated loop is discarded and no impact should be seen.
- Indirect Branch Restricted Speculation (ibrs) will configure the newly provided MSR (Model Specific Register, configuration options specific to each CPU model) to limit the indirect-branch speculation directly by the CPU.
- Indirect Branch Prediction Barriers (ibpb) like ibrs, but acts on the leakage of branch-predictor state across contexts.
These 2 new kernel features have an important impact on system performance for CPUs older than Intel Skylake, so retpoline was proposed in the compilation toolchain to mitigate the issue with a lower performance impact.
On Intel Skylake processors, the target of a ‘ret’ (return) instruction is also cached and predicted (using the Return Stack Buffer, falling back to the Branch Target Buffer), making retpoline not 100% effective, so the kernel enables both the ibrs and ibpb features by default.
From the RedHat Advisory , the values to use for ibpb and ibrs are automatically detected at kernel initialization:
- ibrs = 1 : only the kernel runs with indirect branch restricted speculation
- ibrs = 2 : both userland and kernel run with indirect branch restricted speculation (the default on AMD processor families 10h, 12h and 16h)
- ibpb = 1 : the IBPB barrier that flushes the contents of the indirect branch predictor is run across user-mode or guest-mode context switches, to prevent user and guest mode from attacking other applications or virtual machines on the same host. In order to protect virtual machines from other virtual machines, ibpb = 1 is needed even if ibrs = 2.
- ibpb = 2 : indirect branch prediction barriers are used instead of IBRS at all kernel and hypervisor entry points (in fact, this setting also forces ibrs_enabled to 0). ibpb_enabled=2 is the default on CPUs that only have IBPB_SUPPORT and not the SPEC_CTRL feature. ibpb_enabled=2 doesn’t protect the kernel against attacks based on simultaneous multi-threading (SMT, also known as hyperthreading); therefore, ibpb_enabled=2 provides less complete protection unless SMT is also disabled.
To fix your system against this attack, you’ll need to :
- update the microcode / firmware of your Intel  and AMD  CPUs (check with your OEM)
- Update the kernel with one compiled with Retpoline mitigation enabled (if you have a pre-Skylake processor)
- Update the kernel with one with the ibpb and ibrs patches included (if you have a Skylake or newer processor)
If you cannot bear the performance impact, these patches can be selectively disabled by adding the “noibpb” or “noibrs” kernel boot options, or at runtime by mounting debugfs and setting the ibpb_enabled and ibrs_enabled options to 0:
mount -t debugfs debugfs /sys/kernel/debug
echo 0 > /sys/kernel/debug/x86/ibpb_enabled
echo 0 > /sys/kernel/debug/x86/ibrs_enabled
Retpoline can also be disabled with kernel boot option “noretpoline”.
Proof of Concept
If you want to check whether your systems are vulnerable, a lot of examples are available on the Internet.
Remember to NEVER run already-compiled binaries: malicious people often take advantage of the fear caused by these flaws to infect your system by making you run malware.
Only download source code, check that it does not do anything weird (obfuscated data, embedded commands…), and compile the binary yourself.
Thoughts about optimization
Branch prediction is sometimes associated with the “likely/unlikely” macros in C, but there seem to be misconceptions about their usage. While Intel NetBurst indeed introduced a dedicated opcode prefix to pass path hints to the CPU branch-predictor, this prefix has been ignored since the Core2 architecture, so emitting it just wastes instruction space. The macros themselves still help, though, by letting the compiler lay out the expected path as the fall-through code.
Other branch-prediction optimizations may be done in the code, without needing low-level architecture knowledge.
As an example, assume you have an array of int values and trigger multiple tests based on their value (higher / lower than…). If your data is randomly distributed, it’ll be very hard for the branch predictor to know which branch will be taken, as every value may be higher or lower than the previous one.
However, if you sort your array beforehand (eg, in ascending order), the branch prediction will be much more accurate, leading to fewer pipeline flushes and higher throughput.
In 2007, when Theo de Raadt refused to support Intel Core hardware, he made some statements about the hardware bugs :
These processors are buggy as hell, and some of these bugs don’t just
cause development/debugging problems, but will *ASSUREDLY* be
exploitable from userland code.
[ … ]
- Basically the MMU simply does not operate as specified/implimented
in previous generations of x86 hardware. It is not just buggy, but
Intel has gone further and defined “new ways to handle page tables”
(see page 58).
- Some of these bugs are along the lines of “buffer overflow”; where
a write-protect or non-execute bit for a page table entry is ignored.
Others are floating point instruction non-coherencies, or memory
corruptions — outside of the range of permitted writing for the
process — running common instruction sequences.
Note that some errata like AI65, AI79, AI43, AI39, AI90, AI99 scare
the hell out of us. Some of these are things that cannot be fixed in
running code, and some are things that every operating system will do
until about mid-2008, because that is how the MMU has always been
managed on all generations of Intel/AMD/whoeverelse hardware. Now
Intel is telling people to manage the MMU’s TLB flushes in a new and
different way. Yet even if we do so, some of the errata listed are
unaffected by doing so.
As I said before, hiding in this list are 20–30 bugs that cannot be
worked around by operating systems, and will be potentially
exploitable. I would bet a lot of money that at least 2–3 of them are.
Now that we went through this, a small XKCD  for the fun :-)
As a last word, a lot of security analysis should be added to performance-related tools and optimizations, like PCM, MSRs and DMA.
For the time being, PCM can be used to detect the attacks, but these facilities are so powerful that I won’t be surprised if they get used as an attack vector.