Time-travelling exploits with Meltdown

This past week there’s been a lot of speculation over two new processor vulnerabilities called Meltdown and Spectre. In this post I’ll talk about Meltdown. I’ll talk about Spectre elsewhere later.

So what’s going on, and what do you need to know?

First things first: if you’re a normal user at home, there’s nothing special you need to do. Install updates for your OS and software and you’ll be fine. You don’t need to buy a new computer or throw your processor away. The fixes slow down your computer but in such a tiny way you won’t notice. Install the update.

If you work as a technology person in a company, your cloud VMs will have already rebooted by now. That was your cloud provider fixing the cloud hypervisors, but you should still patch the OS on your cloud VMs too. Apparently Azure did some special witchcraft so even if your guest OS isn’t updated you will be immune from this bug, but Amazon definitely didn’t, and if you’re not installing security updates on your Azure VMs you’re a mad person and it’s only a matter of time before your customer’s data is on pastebin. So suck it up and install the updates.

If you’re a software developer, you also probably don’t need to care. The main exceptions are:

  • You‘re a software developer who writes the memory/trap code for virtual machine hypervisors. My condolences to you.
  • You’re a software developer who writes the memory manager for a major operating system (hi Landy!)
  • You’re a software developer who works on the JIT component of a web-browsers’ JavaScript execution engine.

But it’s still a cool bug even if you don’t need to do anything special to fix it. So how does it work?

If you prefer code to words, you’re a crazy person, but sure, here’s the exploit I wrote for it. If you prefer words, keep reading.

Zoomed-in view of an out-of-order speculative-execution pipeline

Time-travelling with modern processors

Modern processors are pretty complicated. They have a huge array of tricks up their sleeve to try and squeeze extra performance out of software applications.

Long ago, processor designers started turning application code into a “pipeline” of activity. If the processor comes across an instruction that’s going to take a long time — fetching memory from RAM, for example — they can start work on the next instruction immediately so as to avoid standing idle waiting for it to complete.

The problem comes if the first instruction does something wrong, like trying to access a memory address that isn’t there, after it already started work on the instruction after it.

What would happen if we were running instructions in the correct order? This is pretty simple. The first instruction would “fault”, the second would never run, and the processor would tell the operating system, which then gets to choose how to respond.

Despite the name, these “faults” are normal behavior. If the error occurs during a normal program’s execution, the OS can see if the program can handle the error, and if it can’t, the OS will cause the process to “crash”. The crash might be the end of the road for the program that contained the error, but the OS and processor happily keep going. That’s why you see the program evaporate and the desktop reappear, rather than watching your whole machine reboot.

But this all starts to get a bit complicated if the processor is running multiple instructions at once. If the processor sees a memory fetch instruction, begins to run the instruction after it, and only later discovers that the first memory load instruction should have faulted, what happens then?

The answer is the processor has to “unwind” the instructions that started after the instruction where the fault logically occurred. That way, the OS sees the program as though each instruction occurred in sequence. So far as the OS is concerned, the first instruction ran and that the second one didn’t.

Modern processors don’t just run one extra instruction at a time. They run dozens of instructions all simultaneously “in flight”, each one squeezing extra performance out of the tiny fractions of a second the processor might be otherwise standing idle waiting for previous instructions to complete. This means processors are always running huge numbers of instructions in a “pipeline” of activity that only stops when the processor discovers a fault on one of the instructions and then has to “undo” all the speculatively executed instructions after it.

After a while, processor designers discovered they could use this principle for another clever trick to squeeze even more performance out of a program. Code sequences happen in small sequences of instructions (called “basic blocks”) followed by jumps to other bits of code. The pipeline is really good at running these blocks. But what happens when the processor gets to the end of a block and needs to jump somewhere else? Waiting for the pipeline to complete is costly; we don’t like to wait.

If the jump is unconditional, the processor just steamrollers through, loading the new instructions from the next block and adding them to the end of pipeline.

But what if the jump is conditional on a result still working its way through the pipeline? In a sane world, the processor would wait. It would then discover whether the conditional jump should or should not be taken, and then take the correct path based on that result. But the land of processors is not a sane world, and standing idle is verboten.

So what do processors do? They guess. When the condition value eventually resolves in the pipeline, the processor can then go back and ask whether the guess was right. If it was, then huzzah! We avoided the wait and squeezed extra performance out of our processor by doing so. If we guessed wrong, well we just undo all the instructions currently in fight from the wrong branch and then go take the correct one. If we were correct in our guess, we’ve squeezed extra performance out of the program. If were wrong, we just wasted a bunch of time that we’d otherwise have been standing idle.

Obviously the more often we guess the condition right, the more times we “win” and eke out extra performance. It turns out that when it comes to branch prediction, past performance is a pretty reliable indicator of future performance. So modern processors have a branch-prediction unit whose job is to “learn” whether branches were taken based on empirical data, and use that information to predict future branches better.

The Oracle at Delphi is the earliest known prototype of a branch predictor

Cool. Now we know enough about modern processors to understand and exploit the Meltdown processor bug.

Meltdown

The “Meltdown” bug happens if an access to invalid memory occurs during a speculative path that has to be unwound. Consider, for example, the following pseudo-code:

Let’s first look at what happens in a sane world where the processor runs each instruction one-by-one.

First, the processor evaluates expr(), and assigns its result to register $A. Next the processor asks if $A holds the value 1. If it does, it begins executing the instructions in the block. The first instruction loads some address into $B. Next $B is treated as an address and its fetched from RAM into $C.

What if address is invalid? Well in that case, the processor will fault on line 4. Line 5 will never run.

But we don’t live in a sane world where instructions are run one-by-one, and this isn’t how modern Intel processors run this code.

Modern processors start the same way, by beginning to evaluate expr(). Then we come across the condition on line 2, but the processor doesn’t wait for expr() to complete before deciding whether or not to take the conditional path. It takes a guess, and its guess is based on the branch-prediction logic of the processor.

If the guess is to assume the branch will be taken, the processor then kicks off the new block by issuing the memory read for address on line 4. The processor then queues the memory read on line 5 behind it — it’s dependent on the line 4 read that’s still completing. Importantly, the address of that second read is based on the value of the memory fetched in the first read.

What happens if address is invalid? Well, in that case, we should fault on line 4. But wait! We don’t even know if we’re actually taking this branch yet. If expr() eventually resolves and we discover we should never have taken this branch in the first place, we will have to unwind past both line 5 and 4, including the fault that occurred on line 4. We’ll have time-traveled our way backwards and the error will never have occurred.

In practice, what this means is that we queue the fault on line 4 pending the resolution of expr() on line 1.

The next complication, as if there weren’t enough already, is that memory fetches can be invalid for one of two reasons. Either the address is actually invalid, and you’re trying to fetch memory from an address that points to nowhere, or the address is valid, but the program simply doesn’t have permission to read from it. This is usually because that memory belongs to the operating system if you’re an application, or because it belongs to the hypervisor if you’re a guest VM.

In Intel processors, if address is valid, but privileged, the processor queues up a fault, but the execution engine still keeps steamrollering through. This shouldn’t matter; the instructions beyond the privileged read should never be visible. Eventually the processor will unwind all of these instructions either back to the fault, or back to the branch if the whole branch was mispredicted.

It shouldn’t matter. But it does.

The time-travel only unwinds direct side-effects. But while the second memory access might not be directly visible to program code, it is indirectly visible to it because it affects the latency of other memory reads.

The Exploit for Meltdown

We can now put all of this together to understand how to exploit “Meltdown” to read kernel memory from an application, or to read hypervisor memory from a guest VM (assuming it’s not been mitigated, which we’ll get to later).

First, we do a bit of upfront work to “train” the branch predictor that a particular conditional branch will always be taken. Then we run a rigged version of it where the branch isn’t taken, but will be mispredicted to be taken by the processor. We design it so that the branch does something completely invalid during that mispredicted execution that leaks information we can see when the misprediction unwinds.

In our mispredicted speculative execution, we arrange for a memory fetch of an address that is valid, but privileged. In practice this means either accessing hypervisor memory from a guest VM, or accessing the operating system’s memory from inside an application. If we tried to do this outside of speculative execution, it would fault. But because we are executing speculatively the processor queues the fault and the speculative execution keeps going.

Next we arrange for a memory fetch whose address is dependent on the content of the privileged memory we fetched. If we’re careful and clear the memory caches before we do this, this second read leaves a detectable signature of what that value is in the processor’s memory cache, and we can detect this later.

Now all we need to do is wait for the speculative execution engine to catch up. Eventually it will finish computing expr and discover that it mispredicted the branch. When it does so it will unwind; going backwards in time, unwinding all of the instructions it took, including the fault, and then resume execution pretending nothing happened.

The processor might look like it’s gone back in time — we can’t see any of the direct results of the speculative execution — but we can measure the indirect ones. What we do is perform a series of very high-precision timing measurements of memory accesses to see which cache-line got loaded in the second memory access. That tells us what the value of the privileged memory was during the speculative execution.

By repeating this several times, we can slowly leak any memory from the hypervisor host, or the operating system’s kernel.

Mitigating the Meltdown

Microscope view of a processor unwinding speculatively executed instructions

Preventing attackers from exploiting “Meltdown” is a PITA. Speculative execution is architectural — it’s not something Intel can just update with new microcode. And in any case, doing so would slash the performance of the processor.

The trick that Windows and Linux both do is rather clever. Recall there’s two ways for a memory access to be invalid. It can either be an address that points nowhere, or it can be an address that points to valid data, but which the application doesn’t have permission to view.

The trick is to make the addresses that point to privileged memory also be invalid. They do this by evicting the entire operating system from memory just before the program is scheduled to run, and then warping suddenly back into view whenever the hardware or software makes a call that requires the OS to do something.

This is what Linux calls KPTI.

Now when an attacker tries to access a privileged address from inside a speculative block, the pipeline queues a fault, but it’s a fault because the address is asking for memory which isn’t there, not because the address is pointing to memory that is privileged. Importantly, because the address isn’t there, the CPU never gets told what the privileged memory contains, and so subsequent speculative instructions can’t leak it.

In other words, KPTI is a software mitigation that works around a hardware problem that can’t be directly fixed.

It’s also the reason for the “slowdown” you’ll have heard about in the press. Warping the kernel in and out of view with KPTI is expensive. Of course, how expensive depends on the exact processor architecture, and how needy the application is in constantly asking questions of the operating system.

Two Microsoft engineers test the updated system-call interface that warps the kernel in and out of view

Over the long-term, KPTI will likely eventually disappear. Non-Intel processors like AMD don’t do speculative memory loads, so aren’t affected, and next generation Intel processors will find ways to mitigate the issue in hardware in order to avoid the performance cost of KPTI.

In the meantime, if you install OS updates, your OS will quietly implement a bunch of awesome mitigations so attackers can’t exploit your processor, even if the processor itself can’t be remotely fixed.

It’s a pretty cool bug.