How to explain the Spectre bug to non-techies?

My colleague asked me to explain the recent Spectre CPU bug and speculative execution to a non-technical person. It felt like a challenge.

For speculative execution, I immediately thought about this real-life example:

source: https://www.wkbw.com/news/the-newsweek-cover-you-wont-see-madam-president

Newsweek, like other magazines, prepared a different issue in advance for each presidential candidate. This is their strategy for delivering the election result in a timely manner. Of course, only one candidate would win, so part of the labor would be wasted.

This is speculative execution. Before you even know whether the prerequisites for an action or a response will be fulfilled, you carry out the action anyway. If the conditions your action responds to do indeed occur, you end up delivering the results faster. Otherwise, you need to throw away your work and restore everything to its initial state, as if nothing had changed.
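To make this concrete, here is a minimal sketch in Python. It is only a toy model of the idea, not how a CPU is actually built, and every name in it (run_speculatively, print_cover, the state dictionary) is made up for illustration.

```python
def run_speculatively(state, condition_is_true, action):
    """Apply `action` before the condition is known; undo it if the bet was wrong."""
    snapshot = dict(state)       # remember the initial state
    action(state)                # do the work early, betting that the condition holds
    if condition_is_true():      # the slow check finishes only now
        return "kept"            # the bet paid off: the result is already there
    state.clear()
    state.update(snapshot)       # the bet failed: restore the initial state
    return "rolled back"


def print_cover(state):
    # Hypothetical "action": prepare the cover for one possible outcome.
    state["cover"] = "Madam President"


state = {"cover": None}
print(run_speculatively(state, lambda: False, print_cover), state)
# prints: rolled back {'cover': None}
```

When the condition turns out to be true, the same call returns "kept" and the cover is ready without any further work, which is the whole point of speculating.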

How to explain the Spectre bug then? The core idea of the Spectre bug is that you can use the speculative execution feature described above to acquire data you have no permission to access.

Let me use an analogy. Assume you are a private investigator. Your client has asked you to find out whether a certain politician was actually born in Hawaii. You don’t have the resources to carry out this investigation yourself, and you have few ways to get access to this information. What do you do?

One way (the Spectre way) is to start a rumor that the politician will run for office. You feed this rumor to a large newspaper that has tremendous resources for digging up any information about this politician. The newspaper is very interested in making a dedicated issue to introduce this politician, so it forms a whole team to check on him, including where he was born. This is a kind of speculative execution, because the action is based on a rumor that will eventually be proven wrong. When the newspaper finds out that the rumor was just a rumor, it has to throw away its work on the special issue.

One journalist at this newspaper also happens to be your best friend from high school, and he is part of the investigation team. He has learned a lot about the politician. You can just buy your journalist friend a beer and ask about anything you are interested in, pretending that it is just a casual chat about work.

This is roughly the idea of Spectre. You basically trick the CPU into fetching some information that you normally don’t have access to. Based on that information, the speculatively executed code performs another data access, this time to memory locations you are allowed to access. When the CPU realizes that it has been tricked, it discards the results and restores the values it has changed. But it does not undo what was loaded into the cache, so you can still get at the information indirectly by measuring time: if a piece of data was fetched by the second data access, it will be faster to fetch the next time. In the analogy above, the newspaper is the CPU, and the information of interest is where the politician was born. You tricked the CPU into loading the information by starting the rumor. The rumor turned out to be just a rumor, so the newspaper stopped working on its articles about the politician. But what was learned during the process is cached, and you can simply get it from your journalist friend.
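The timing trick can be sketched as a small simulation. This is not a real exploit and it measures no real time: the ToyCPU class, its methods, and the FAST and SLOW costs are all invented for illustration, assuming a cache that simply remembers which slots of an attacker-owned probe array were touched.

```python
FAST, SLOW = 1, 100  # made-up access costs: cached vs. not cached


class ToyCPU:
    def __init__(self, secret):
        self._secret = secret  # a byte the attacker is not allowed to read
        self.cached = set()    # probe slots currently sitting in the cache

    def run_tricked_branch(self):
        """Model a mispredicted branch: the result is thrown away,
        but the probe slot selected by the secret stays cached."""
        self.cached.add(self._secret)  # speculative access to probe[secret]
        return None                    # rollback: the caller gets no value

    def timed_read(self, slot):
        """Read an attacker-owned probe slot and report what it cost."""
        cost = FAST if slot in self.cached else SLOW
        self.cached.add(slot)          # reading a slot caches it
        return cost


def recover_secret(cpu):
    cpu.cached.clear()         # start with none of the probe slots cached
    cpu.run_tricked_branch()   # trick the CPU into touching probe[secret]
    timings = [cpu.timed_read(slot) for slot in range(256)]
    return timings.index(min(timings))  # the one fast slot gives the secret away


print(recover_secret(ToyCPU(secret=72)))  # prints: 72
```

Note that recover_secret never reads the secret itself. It only looks at how long its own, permitted reads take, which is the beer with the journalist friend.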

I don’t know if you are satisfied with my attempt at explaining the Spectre bug with the above analogy. I think the most ingenious part of the Spectre bug is that people can guess the content of a piece of inaccessible data by looking at the access time of some other, accessible data. Who would have known that access time could reveal the content of some data!

This reminds me of another bug I encountered while working at NVIDIA, my favorite one. It was a driver crash that happened only on a prototype notebook after running a graphics stress test for days. Eventually the root cause was identified as a bit flip in system memory. A bit at a random location would flip from 1 to 0, or from 0 to 1, for no apparent reason. We concluded that this bug was caused by memory quality issues. My colleague even joked that it was due to stronger sunspot activity, which was indeed happening at the time. (Cosmic radiation can affect the stability of memory chips.)

Anyway, having been trained as a pure software engineer, I am often under the illusion that computer hardware is trustworthy, reliable and rigid. You tell it 0, it won’t remember it as 1. This is the benefit of abstraction: we are shielded from unnecessary details. But sometimes those details matter, as in the cases above. The Spectre PoC code looks logically correct from a pure software point of view. But when you run it on hardware, that’s a different story.

I guess the more you work with computers, the more you feel the organic side of machines. When I’m frustrated with machines over problems or bugs, I often feel that my computers are in a bad temper and that I have to placate them somehow, as if they were pets.