Meltdown & Spectre… or sometimes the solution is not “let’s chuck more compute power at the problem”

by Alex Kinnane, Rob Boulton, Mark Barnes

FT Product & Technology
Feb 14, 2018


You will probably have heard of the Meltdown and Spectre vulnerabilities and their associated fixes.

You may also have heard of the performance impact of those fixes.

At the FT we have a number of monitoring tools that tell us about system health and performance.

With a degree of irony not lost on the infrastructure delivery team, the systems that suffered worst from that performance impact were the very tools we use to understand other systems’ performance.

Now the dust has settled on what we called “Black Thursday”, the lessons we at the FT (responsible for the affected monitoring tools) can draw are actually more about system and software architecture than about processors being built for speed rather than security.

Let us illustrate with a couple of graphs.

Pingdom graph 1

This Pingdom graph shows the performance and downtime of the SaaS tool we use to understand the health of other services. The tool is hosted on AWS (Amazon Web Services). You can see from the blue line that response times suddenly jumped, to the point where the tool occasionally became completely unavailable. Working with the SaaS provider’s operations team, we disabled some functionality so that we could restore a level of service. Performance only really settled down once our provider had resized the EC2 instances involved (the last red bars on the graph). We were then able to re-enable the functionality we had stopped earlier.

So far so ‘let’s chuck compute power at the problem’…

The next set of graphs tells a more interesting story.

These are from our Grafana instance and show the performance of our Graphite servers.

Grafana graph 1
Grafana graph 2

These servers are hosted on AWS in two different regions for resiliency. You can see that disk performance, measured in IOPS (input/output operations per second), fell off a cliff on 4th January: that is the drop towards the left of each graph. It happens at slightly different times in Germany and Ireland, as we think AWS patched those regions at slightly different times.

This fall in IOPS led to…

Grafana graph 3

…dropped metrics.

Now, dropped metrics are BAD. They mean our delivery and operations teams do not get all the information they need about the health of their systems.

You can see from the troughs in Grafana graphs 1 and 2 that we battled to get some improvements (this included the “let’s chuck compute power at the problem” solution).

However, the real solution lay in improving the software architecture. Engineer Alex had the idea of increasing the number of internal Python processes to take better advantage of the number of CPUs in the EC2 instances involved. He worked with engineer Rob to put the change in place, and the result was dramatic. Looking again at Grafana graphs 1 and 2, this time towards the right, you can see IOPS leapt to roughly twice their previous levels. And looking at Grafana graph 3, we stopped dropping metrics.
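For a stock Graphite installation, that kind of change boils down to running several carbon-cache instances (a single carbon-cache process is Python and will only really keep one core busy) with a carbon-relay in front of them sharding incoming metrics across the instances. Below is a minimal sketch of such a carbon.conf, using Graphite’s default ports and illustrative instance names rather than our production settings:

```ini
# carbon.conf: illustrative sketch, not our production configuration.
# One carbon-cache instance per CPU core; the relay shards metrics across them.

[cache]
# Default instance, known as "a"
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_PORT = 2004
CACHE_QUERY_PORT = 7002

[cache:b]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102

[cache:c]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202

[cache:d]
LINE_RECEIVER_PORT = 2303
PICKLE_RECEIVER_PORT = 2304
CACHE_QUERY_PORT = 7302

[relay]
# The relay listens on its own ports and spreads incoming metrics across
# all four caches, and therefore across all four CPUs.
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2004:a, 127.0.0.1:2104:b, 127.0.0.1:2204:c, 127.0.0.1:2304:d
```

Each instance is then started separately (e.g. carbon-cache.py --instance=b start), and graphite-web is pointed at all of the CACHE_QUERY_PORTs via its CARBONLINK_HOSTS setting so that queries can still see metrics buffered in memory.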

So, apart from having similar root causes, are there other similarities here? Well, yes. We are familiar with the architecture of the SaaS tool: the EC2 host that provides the user interface was also running processes doing some of the heavy lifting of user requests. Those heavy-lifting jobs really did not need to be there; they should have been on their own EC2 instance. They were the pieces of functionality we had to disable to restore service.

The lesson for us is clear: it is so easy to get used to Moore’s law that we can become lazy about system and software architecture. Sometimes the solution is not “chuck more compute power at the problem”. We need to think a little harder about the way our systems are designed.
