Who will pay for Spectre? Probably you.

Owen Rogers
Jan 24, 2018

This Toblerone feels lighter?

In 2016, Mondelez, the maker of Toblerone — the Swiss chocolate bar famed for its triangular prism shape — faced a challenge. One of its biggest markets, the UK, was experiencing a weak exchange rate as a result of the Brexit vote, which was increasing the costs of the raw materials needed to make the nougaty chocolate.

The company could raise prices, but this might reduce demand. It came up with a novel solution: it removed sections of the prism. The chocolate bar still sat in the same box and kept roughly the same shape, but big chunks had been carved out of it. By cutting the weight from 400g to 360g while charging the same price, the company reduced its costs. Buyers, who naturally assess value on a total rather than per-gram basis, were expected to feel less penalized than they would by a price rise. The change faced a backlash, but, over time, people accepted it.

Today, the cloud and other sectors face a similar problem as a result of Spectre and Meltdown: a sudden rise in the cost of the raw material used to provide their products, namely CPU capacity. The short-term mitigations for these major security vulnerabilities involve changes to BIOS, processor microcode and operating system software that restrict access to memory shared between different privilege levels (kernel memory space in most operating systems). This exacts a performance penalty, because transfers of information between privilege levels now require more checking and work.
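To get an intuition for where that overhead comes from, one rough and unscientific check is to time a syscall-heavy loop on a patched versus an unpatched kernel: each call now crosses the user/kernel boundary with extra page-table work. The sketch below is for illustration only; absolute numbers depend heavily on the hardware, kernel version and the Python interpreter’s own overhead.

```python
# Rough illustration: time a burst of cheap system calls. Run on a patched and
# an unpatched kernel to feel the extra cost of each user/kernel transition.
import os
import time

def syscall_loop(n: int = 1_000_000) -> float:
    """Issue n one-byte reads from /dev/zero and return the elapsed seconds."""
    fd = os.open("/dev/zero", os.O_RDONLY)
    start = time.perf_counter()
    for _ in range(n):
        os.read(fd, 1)       # each read is a user -> kernel -> user round trip
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed

if __name__ == "__main__":
    n = 1_000_000
    t = syscall_loop(n)
    print(f"{n:,} single-byte reads took {t:.2f}s "
          f"(~{t / n * 1e9:.0f} ns per call, including interpreter overhead)")
```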

At a fundamental level, a compute cloud service sells CPU hardware leased by the second. Cloud providers buy this hardware in bulk, then divvy it up into rentable pieces: a continuum of virtual machines, containers and ‘serverless’ functions. If, as a result of Spectre or Meltdown, each server can deliver fewer CPU cycles to lease than before, then each compute workload costs the cloud provider more to serve.

Cloud providers are particularly at risk: the performance impact will be worst where virtualization or containerization is in use, because these environments involve far more context switching in the processor, and each switch is subject to the performance hit of the mitigations. There hasn’t been time for comprehensive characterization, but initial results indicate the penalty could be as high as 30%. Social media platforms are awash with stories and screenshots showing CPU utilization increasing by double digits overnight, with the patches mooted as the main cause. What can cloud providers do to account for this significant impact?

Many are already taking steps to reduce the impact of the patches. Google, for example, has come up with a means of hardening executables (its Retpoline technique) that minimizes the potential for problems and carries a very small performance impact. Providers agree that steps must be taken to patch the vulnerabilities; fewer have talked about who will pay for the resulting reduction in performance, even if it’s just a few percentage points.

Do nothing

One option is for the provider to patch its servers and do nothing to its product or pricing: the Toblerone option. Users get less capability for the same price. For the sake of argument, let’s say the patch adds 25% CPU overhead. A script that previously took one hour at full virtual CPU (vCPU) utilization would now take 25% longer: one hour and 15 minutes. Another way to look at it is that the script would need 1.25 vCPUs (rounded up to 2, since vCPUs come in whole units) to process the same task in the same amount of time. In the first case, the buyer would have to spend 25% more as a result of the additional time; in the second, the buyer would have to purchase an additional virtual machine, potentially doubling costs.
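To make that arithmetic explicit, here is a minimal sketch of the two ways the overhead can show up on a bill. The 25% figure, the hourly price and the whole-vCPU rounding are assumptions carried over from the example above, not measured values.

```python
import math

# Illustrative cost arithmetic for the "do nothing" option, assuming a 25% CPU
# overhead from the patches and a simple hourly, per-vCPU billing model.
OVERHEAD = 0.25              # assumed patch overhead
PRICE_PER_VCPU_HOUR = 0.05   # hypothetical list price in $/vCPU-hour

baseline_hours = 1.0         # the script saturates one vCPU for one hour today
baseline_cost = baseline_hours * PRICE_PER_VCPU_HOUR

# Option A: keep one vCPU and let the script run longer.
patched_hours = baseline_hours * (1 + OVERHEAD)               # 1.25 hours
cost_same_vcpu = patched_hours * PRICE_PER_VCPU_HOUR

# Option B: keep the runtime and buy more vCPUs, rounded up to whole units.
vcpus_needed = math.ceil(1 * (1 + OVERHEAD))                  # 1.25 -> 2 vCPUs
cost_more_vcpus = vcpus_needed * baseline_hours * PRICE_PER_VCPU_HOUR

print(f"baseline:                 ${baseline_cost:.3f}")
print(f"same vCPU, longer run:    ${cost_same_vcpu:.3f} (+{cost_same_vcpu / baseline_cost - 1:.0%})")
print(f"same runtime, more vCPUs: ${cost_more_vcpus:.3f} (+{cost_more_vcpus / baseline_cost - 1:.0%})")
```

The doubling in the second case comes purely from rounding up to whole vCPUs; a workload that could spread across fractional units would see only the 25% increase.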

Cloud providers (AWS, Google, Microsoft, IBM and all the others) don’t really promise a level of performance. They promise access to a certain size of virtual machine. The quantity of resources assigned to a given size of virtual machine is potentially fluid. What exactly is a vCPU, and has it been guaranteed? How contended is a gigabyte of storage? No guarantee has been made about how much computational power (in real terms, such as instructions per second) is assigned to a virtual machine.

As such, the virtual machine’s dimensions are up for redesign. This degradation in capacity is akin to reducing the resources assigned to a vCPU. Before, a CPU core might have processed two billion instructions a second; now, it might only manage 1.5 billion. But no one ever guaranteed it would process all those instructions in the first place, so why would a cloud provider take the hit? Cloud buyers have little redress.

Our assumption in this worst case is that the vCPU is running at full utilization, so that any reduction in underlying resources reduces the performance of the script. In practice, most users overprovision capacity on virtual machines (the wiggle room) so that if demand bursts, there is capacity to absorb it immediately, rather than having to spin up another virtual machine on the spot (which takes time).

So let’s say our script usually consumes 60% of the virtual machine’s capacity, and we’ve configured a trigger to spin up an additional virtual machine if utilization rises above that level (perhaps driven by HTTP requests or user interaction). As a result of Spectre and Meltdown, the virtual machine’s typical utilization would increase to 75%, and we’d have to raise our trigger threshold to take account of this. In this case, there is no cost implication beyond what we were already paying, because our virtual machine can still cover the load.
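A minimal sketch of that adjustment, using the 60% and 25% figures from the example; the gap between steady-state utilization and the scale-out trigger is a hypothetical parameter chosen for illustration, not something specified here.

```python
# Illustrative sketch of how an assumed 25% patch overhead moves a VM's typical
# utilization and the autoscaling trigger that sits above it. The 15-point gap
# between steady state and trigger is a hypothetical choice.
OVERHEAD = 0.25
TRIGGER_MARGIN = 0.15                              # hypothetical gap above steady state

typical_util = 0.60                                # pre-patch steady-state utilization
patched_util = typical_util * (1 + OVERHEAD)       # 0.75 after the patches

old_trigger = typical_util + TRIGGER_MARGIN        # 0.75
new_trigger = patched_util + TRIGGER_MARGIN        # 0.90

print(f"typical utilization:    {typical_util:.0%} -> {patched_util:.0%}")
print(f"scale-out trigger:      {old_trigger:.0%} -> {new_trigger:.0%}")
print(f"headroom to saturation: {1 - old_trigger:.0%} -> {1 - new_trigger:.0%}")
```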

However, there is increased risk exposure because the wiggle room we previously had has shrunk. We now have less time to spin up another virtual machine when the current one becomes saturated. Sometimes this reduction in wiggle room won’t matter; other times, where demand is growing rapidly or provisioning is taking longer than usual, the application might be unable to handle the load. Even with containers, serverless or any other cloud service, the rationale is the same — each instruction now costs more than it did before.

Just as with Toblerone, cloud providers here are disguising the loss in value by doing nothing, and buyers are getting less bang for their buck. The question is, will they notice?

The nuclear option

Another option for providers is to increase prices and give users the same capability they had before. This would be bold indeed, which is exactly why Mondelez avoided it. Such a move would instantly raise per-vCPU prices by 25%, even for users who weren’t experiencing any issues. It would also be a PR disaster for the provider and the cloud industry. But there would be some positives: administrators wouldn’t need to rearchitect or rescale their systems, they wouldn’t have to reconfigure alarms and triggers, and life would carry on as usual (albeit at a 25% higher price).

Don’t forget, however, that this problem would exist even if the app weren’t hosted in the cloud: the buyer would lose 25% efficiency on their own hardware too, so the cloud provider wouldn’t be acting unreasonably by passing the cost on. However, as the Toblerone example shows, buyers aren’t entirely rational. We think this option is highly unlikely, especially given the deep competition between the hyperscalers and the fact that providers never guaranteed any level of CPU performance.

Absorb it

One option is for cloud providers to absorb the increased costs and cope with slimmer margins. This is a feasible plan. The Cloud Price Index previously demonstrated how cloud providers’ gross margins on virtual machines are still pretty healthy, with 28% being the very worst case. So cloud providers could say: we’re giving you more capacity on our hardware to account for the loss from Spectre, and we’ll charge you the same price.

This could result in margins dropping by 25 points, however: a significant amount that could impact investor confidence. And don’t be fooled into thinking the impact would be confined to virtual machines and other forms of compute. Ultimately, it would affect every cloud service that relies on CPU cycles to do work, from storage to databases. The Cloud Price Index has found that cloud prices are coming down 2% a year (figure below), so providers could potentially absorb this loss by not passing future cost savings on to consumers. But this option is unlikely: why give up margin when the provider is under no contractual obligation to do so?
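A back-of-the-envelope way to arrive at a margin hit of roughly that size is sketched below. It assumes the worst-case 28% gross margin quoted above, that the full cost of delivering a virtual machine scales with CPU capacity, and that losing 25% of cycles means provisioning one-third more hardware to sell the same capacity; these are simplifying assumptions, not Cloud Price Index figures.

```python
# Back-of-the-envelope margin arithmetic for the "absorb it" option. Assumes the
# entire cost of goods scales with CPU capacity, which overstates the hit: real
# cost structures also include networking, storage, power, property and staff.
gross_margin = 0.28                    # worst-case VM gross margin cited above
cycle_loss = 0.25                      # assumed share of CPU cycles lost to the patches

revenue = 1.0                          # normalize revenue to 1
cost = revenue * (1 - gross_margin)    # 0.72

# Selling the same capacity with 25% fewer cycles per server means provisioning
# 1 / (1 - 0.25) = 1.33x the hardware.
new_cost = cost / (1 - cycle_loss)     # 0.96
new_margin = (revenue - new_cost) / revenue

print(f"gross margin: {gross_margin:.0%} -> {new_margin:.0%} "
      f"(a drop of about {100 * (gross_margin - new_margin):.0f} points)")
```

Under these assumptions the drop lands close to the figure quoted above; gentler assumptions about how much of the cost base scales with CPU would produce a smaller hit.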

Relative decline in the cost of US public cloud across the top 85% of the IaaS market since October 2014, for two Cloud Price Index benchmark cloud applications, each defined as a “basket” of cloud services. The large application, consisting of compute, multiple storage and database services, support, and load balancers, has declined at a much slower rate than the small application, consisting of just compute and storage.

Conclusion

This analysis was made intentionally simple to present an economic argument. In reality, there will likely be a broad spectrum of cost increases and performance losses across applications, operating systems, hardware and vendors.

Giving buyers more capacity is a generous but risky strategy. As such, we think the major cloud providers will take a wait-and-see approach. After all, they haven’t guaranteed performance or capacity, so why should they care? On paper, most decision-makers would rather pay more for things to work properly, which should drive price rises or cost absorption. But decision-makers are human, and when faced with a potential spend increase of 25%, would naturally like to mitigate the impact — even if it means waiting to see how things go, and taking on a bit of risk through increased server utilization.

Our advice for cloud buyers is not to panic. This cost increase would have existed regardless of your choice of venue, and it probably won’t be as bad as you think. See how things go, but be prepared to scale if performance is impacted. It might be time to reassess what levels of performance are required, and what price is worth paying for this performance. For many, cost increases will be trivial because their current low utilization workloads have wiggle room to grow. Ironically, the biggest increases will be for those that rightsized their workloads, and built highly scalable architectures that respond to granular changes in demand — so organizations that squeezed costs and lack that wiggle room will likely face bill increases.

Cloud providers are in a tricky spot, but one solution is to help buyers optimize their own usage to save money. Now could be the time to engage them on better architectures, use of advance or variable purchasing options, rightsizing virtual machines, or turning off unused resources. After all, this isn’t the providers’ fault, but it presents an opportunity to build bridges with the customers they serve.

Originally published to 451 Research subscribers. @owenrog


Owen Rogers

Research Director, Digital Economics Unit at 451 Research. Architect of the Cloud Price Index, PhD in cloud economics, whinger.