AWS vs GCP vs on-premises CPU performance comparison

Published in Infrastructure adventures · 13 min read · Mar 22, 2018

Recently I had the chance to participate in a project where we had to evaluate the price/value ratio of different cloud providers and compare it to existing on-premises hardware. During our research on the Internet, we found surprisingly few useful benchmarks of raw CPU performance, so we decided to make our own.

The goal: gather data that can support a decision about which cloud provider to choose, and help determine exactly how many vCPUs you need to buy in the cloud when you already know how many you normally use in a physical server in your own bare-metal environment.

This round of testing does not intend to be perfect and thorough; there are professional IT magazines that do that. We wanted quick and reliable benchmark data that fits our needs. If you have more time, it would be interesting to see detailed benchmarks with different kernels, before/after Meltdown-Spectre tests with different thread/CPU core counts, etc.

The method

As a reference, I’m going to use a self-hosted physical server with a recent model of Intel Xeon. All the participants will be different Xeon models. Both on Amazon and Google you can only find Intel Xeon CPUs, literally nothing else, and this trend is pretty much the same in datacenters.

I made the tests using a Docker image of the well-known sysbench tool, but as a comparison, I did the same measurement with the binary, without using Docker. I found a <0.5% difference in multiple runs, so to make the testing procedure easier and ensure we use the exact same sysbench version with the same libraries (sysbench 1.0.13 (using bundled LuaJIT 2.1.0-beta2)), we decided to go all-in on Docker (CE 17.xx stable).

The following test commands were used:

docker run --cpus 1 --rm -ti severalnines/sysbench sysbench cpu --cpu-max-prime=20000 --threads=1 --time=900 run
docker run --cpus 2 --rm -ti severalnines/sysbench sysbench cpu --cpu-max-prime=20000 --threads=2 --time=900 run
docker run --cpus 8 --rm -ti severalnines/sysbench sysbench cpu --cpu-max-prime=20000 --threads=8 --time=900 run

Measurement time will be

10 seconds to see spike-performance and
15 minutes to see actual long-term performance.
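The full set of runs described above (three thread counts, two durations) can be enumerated with a small script. This is only a sketch: the article lists the 900-second commands, so using `--time=10` for the short runs is an assumption, and the helper names `sysbench_command` and `benchmark_matrix` are made up for illustration.

```python
from itertools import product

IMAGE = "severalnines/sysbench"  # Docker image named in the article
THREAD_COUNTS = (1, 2, 8)        # thread/CPU counts used in the article
DURATIONS_S = (10, 900)          # spike test and long-term test (10 s flag value is assumed)


def sysbench_command(threads: int, seconds: int) -> list[str]:
    """Build the docker argv for one benchmark run (hypothetical helper)."""
    return [
        "docker", "run", "--cpus", str(threads), "--rm", "-ti", IMAGE,
        "sysbench", "cpu", "--cpu-max-prime=20000",
        f"--threads={threads}", f"--time={seconds}", "run",
    ]


def benchmark_matrix() -> list[list[str]]:
    """Return one argv per (thread count, duration) combination."""
    return [sysbench_command(t, s) for t, s in product(THREAD_COUNTS, DURATIONS_S)]


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch has no side effects.
    for argv in benchmark_matrix():
        print(" ".join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try on any machine; piping the output into a shell would execute the runs one after another.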

We’re going to compare CPU speed using the events-per-second values from the test results.
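If you collect many runs, extracting that number by hand gets tedious. The sketch below pulls it out of a saved report, assuming the sysbench 1.0.x CPU report contains a line of the form `events per second: <number>`; the sample string is illustrative, not a captured log, and `events_per_second` is a hypothetical helper name.

```python
import re

# Assumed report line format: "events per second:   321.84"
_EVENTS_RE = re.compile(r"events per second:\s*([0-9]+(?:\.[0-9]+)?)")


def events_per_second(report: str) -> float:
    """Extract the events-per-second value from a sysbench CPU report."""
    match = _EVENTS_RE.search(report)
    if match is None:
        raise ValueError("no 'events per second' line found in report")
    return float(match.group(1))


if __name__ == "__main__":
    # Illustrative sample only; the value is the article's reference result.
    sample = "CPU speed:\n    events per second:   321.84\n"
    print(events_per_second(sample))
```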

On bare-metal, I made several tests to see if there’s a significant difference based on the operating system (and therefore, the kernel) used: I tested the same machine with CoreOS Container Linux stable (1632.3.0, kernel 4.14.19), Ubuntu 14.04 LTS and CentOS 7. Again, the difference was within measurement error, so we are going to use the following operating systems:

on bare-metal: CentOS 7 and CoreOS 1632.3.0
on Amazon Web Services: Amazon Linux
on Google Cloud Platform: CoreOS 1632.3.0

Group 1: physical servers

The reference machine: a 2016-model Intel(R) Xeon(R) CPU E5–2690 v4 @ 2.60GHz.

On a single-core, single-thread setup, during a short 10-second test we get 303.13 events/second, while the long-duration test showed a slightly better performance with 321.84 e/s. We will take the 15-min result as 100% and compare everything else to this value.

Next we’re going to do the benchmark on 2 dedicated CPU cores, using 2 parallel threads. Interestingly, the difference between the 10-second and 900-second benchmarks now seems to be very small: 670.61 vs 672.89 e/s. These results show that 2 CPU cores together are 4.54% more performant than twice the single-core result (2*1 CPU cores) on this specific Intel Xeon model.

Similarly, on 8 cores-8 threads, we get 2716.31 events per second, which gives us a +5.50% (or 105.50%) of the 8*1 CPU core performance.
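The "n cores vs n*1 core" figures can be reproduced from the numbers quoted above. This is a minimal sketch of that arithmetic; the function name `scaling_efficiency` is chosen here for illustration.

```python
def scaling_efficiency(multi_eps: float, single_eps: float, cores: int) -> float:
    """Return the n-core throughput as a percentage of n times the 1-core throughput."""
    return 100.0 * multi_eps / (cores * single_eps)


SINGLE_CORE_EPS = 321.84  # 15-minute, 1-core result of the reference machine

if __name__ == "__main__":
    # 2-core and 8-core 15-minute results from the article.
    print(round(scaling_efficiency(672.89, SINGLE_CORE_EPS, 2), 2))   # 104.54
    print(round(scaling_efficiency(2716.31, SINGLE_CORE_EPS, 8), 2))  # 105.5
```

Both values match the +4.54% and +5.50% quoted in the text.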

So let’s compare this to other physical machines!

Competitors:

2014-model of Intel(R) Xeon(R) CPU E5–2660 v3 @ 2.60GHz
2013-model of Intel(R) Xeon(R) CPU E5–2658 v2 @ 2.40GHz
and for some fun, a 2009-model of Intel(R) Xeon(R) CPU X3460 @ 2.80GHz

As expected, the older the CPU, the slower it will be:

2016 → 2014 → 2013: 321.84 → 308.67 → 284.93 on the single core benchmark

Or in percentages, compared to the 2016 Xeon:

100.00% → 95.91% → 88.53% (1-core)
100.00% → 96.36% → 86.55% (2-core)
100.00% → 95.14% → 86.53% (8-core)

As you can see, on physical servers the CPU performance is linear with the number of cores and threads. The performance of n core vs. n*1 core is between 102–105%, similarly to the first tested model.

But hey, didn’t you mention 4 Xeons in the comparison?!

*drumroll* The nearly 10-year-old Xeon X3460 caused some unexpected surprises: it beat the crap out of all its newer brothers on the single-thread synthetic benchmark by scoring an unbelievable 431.13 e/s, which is 133.96% of the 2016 reference model. Yeah, back then multi-threading was not really a thing for the average application.

Of course, as expected, this advantage melts down very quickly as we increase the thread count first to 2, later to 8: while on the dual-core setup we still achieve a sparkling 127.71% of the 2016 reference, on 8-cores we’re already at only 73.52% performance of the big brother (1996.96 e/s vs 2716.31 e/s). This CPU has 8 logical cores, so we cannot go any further with the tests.
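For readers who want to check the percentages, here is a small sketch that recomputes them from the events-per-second values quoted in this section. The dictionary layout and the name `relative_percent` are illustrative choices, not part of the original test setup.

```python
def relative_percent(eps: float, reference_eps: float) -> float:
    """Express a result as a percentage of the reference machine's result."""
    return 100.0 * eps / reference_eps


REFERENCE_1CORE = 321.84   # E5-2690 v4, 1 core, 15 minutes
REFERENCE_8CORE = 2716.31  # E5-2690 v4, 8 cores

# Single-core results quoted in the article.
SINGLE_CORE = {
    "E5-2660 v3 (2014)": 308.67,
    "E5-2658 v2 (2013)": 284.93,
    "X3460 (2009)": 431.13,
}

if __name__ == "__main__":
    for model, eps in SINGLE_CORE.items():
        print(model, round(relative_percent(eps, REFERENCE_1CORE), 2))
    # The 2009 model on 8 threads, compared to the 8-core reference result.
    print("X3460 8-thread", round(relative_percent(1996.96, REFERENCE_8CORE), 2))
```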

The 10-second spike benchmark results, on premises

The 15-minute benchmark results, on premises

By the way, interestingly the benchmark showed the same results on the 20-core E5–2658 v2 (40 logical cores with Hyper Threading) with 40 threads, 60 threads, 80 threads or 160 threads. Up to 40 threads it increased linearly: 10 threads gave 25% of the 40-thread result, 20 threads 50%, 30 threads 75%, etc. So it looks like once you match the actual number of logical CPU cores, increasing the thread count above that doesn’t gain you anything in the long term.
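That observation amounts to a simple saturation model: throughput grows linearly with the thread count until it reaches the number of logical cores, then stays flat. The sketch below encodes only this idealized model of the reported behaviour, not a measurement; `relative_throughput` is a hypothetical name.

```python
def relative_throughput(threads: int, logical_cores: int) -> float:
    """Idealized throughput as a fraction of the fully loaded machine.

    Linear up to the number of logical cores, flat beyond it.
    """
    if threads < 1 or logical_cores < 1:
        raise ValueError("threads and logical_cores must be positive")
    return min(threads, logical_cores) / logical_cores


if __name__ == "__main__":
    for threads in (10, 20, 30, 40, 60, 80, 160):
        print(threads, relative_throughput(threads, logical_cores=40))
```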

Takeaways from the physical machine tests

performance scales linearly with the number of cores: if you put more cores, you get linearly more performance
there seems to be about +5% gain each year in the new Xeon model, compared to the previous year’s
the old 2009-model Xeon is significantly stronger on single-thread workloads, but quickly loses as multiple threads appear

Relative performance compared to the 2016 Xeon E5–2690 v4

Multi-thread optimization vs. single-thread workflows, on premises

Group 2: Amazon EC2 instances

On the AWS platform, you have a ton of different instance types you can tailor for your needs, so we made tests with quite a lot of them. I also included here the suggested use-case of these instance types by Amazon:

reference: on-premises Intel(R) Xeon(R) CPU E5–2690 v4 @ 2.60GHz
t2 (basic): Intel(R) Xeon(R) CPU E5–2676 v3 @ 2.40GHz
m5 (generic): Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
c5 (high CPU): Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
r4 (high mem): Intel(R) Xeon(R) CPU E5–2686 v4 @ 2.30GHz
i3 (high IOPS): Intel(R) Xeon(R) CPU E5–2686 v4 @ 2.30GHz

Except for the base t2 type (2015), all the CPUs are 2016 or latest 2017 models, so they are all comparable to our reference. An interesting side note: these specific Xeon Platinum models are actually tailor-made for Amazon, you cannot buy them on the market.

Amazon is selling vCPUs, which, according to the fine print, are logical CPU cores with Hyper Threading enabled, not actual physical cores. These cores are normally not over-provisioned; while they are not shared “best effort” CPU cores, there’s no guarantee that no optimisations happen between the different users on the same host. (With the micro instances, you have the option to buy partial cores shared between multiple tenants, for a much smaller price.)

So let’s go for the tests! After doing the same sysbench measurements, we arrived at the following values in the 10-second short test:

The 10-second spike benchmark results, AWS

You can already see:

the single-core performance is much better than our reference, with only 1 exception
but already with 2 threads, you start losing 10–25% compared to self-hosted physical hardware
the t2 seems like a very reliable, stable instance with bare-metal performance

Don’t forget Amazon might allow temporary spikes in your workload without rate-limiting your CPU performance. That’s why we did the 15-min benchmarks:

On the long-term, the physical instances showed a constant 105% performance compared to the single-thread results.

Again, the t2 acts like our own self-hosted servers, with a very predictable performance.

The rest is not so appealing: even in the best case we lose ~17%, which goes up to ~27% with the m5 general-purpose instances. If each vCPU delivers only 73% of a physical core, then for 100 CPU cores in your data center you need to buy about 137 vCPUs in Amazon (100 / 0.73) to match the same performance.
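The sizing rule can be worked through under one stated assumption: "losing 27%" means each vCPU delivers 73% of a physical core under the same multi-thread load. Under that reading the required count is the physical core count divided by 0.73, which is somewhat more than simply adding 27% on top. The helper name `vcpus_needed` is made up for this sketch.

```python
import math


def vcpus_needed(physical_cores: int, loss_fraction: float) -> int:
    """Estimate vCPUs needed to match a number of physical cores.

    Assumes each vCPU delivers (1 - loss_fraction) of a physical core.
    """
    if not 0.0 <= loss_fraction < 1.0:
        raise ValueError("loss_fraction must be in [0, 1)")
    return math.ceil(physical_cores / (1.0 - loss_fraction))


if __name__ == "__main__":
    print(vcpus_needed(100, 0.17))  # best case in the AWS results: 121
    print(vcpus_needed(100, 0.27))  # worst case (m5): 137
```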

AWS relative performance compared to the 2016 Xeon E5–2690 v4

Multi-thread optimization vs. single-thread workflows, AWS

Update: one of my colleagues pointed out that the t2 is a burstable type, unlike the others; it works with so-called “CPU credits”: https://aws.amazon.com/ec2/instance-types/#burst

So in general, this means that a workload like this synthetic benchmark (100% CPU usage) will either be throttled after 2 consecutive hours, or you will need to pay at least an extra 5 cents per hour for the unlimited CPU burst feature of the t2. Unless you know your application’s characteristics very well, this could lead to unpredictable costs.

I’m wondering whether it would be feasible to destroy and recreate all my t2 instances every 23 hours, so I can stay on the fixed-price, cheap, high-performance instance…? (Of course, only if the application and the infrastructure support it.)

Group 3: Google Compute Engine instances

In contrast to Amazon, Google offers a very simplified portfolio of instances: you buy either standard or CPU-optimized virtual machines, and that’s it. Even CPU-optimized means you get the same standardized hardware, just with more CPU cores allocated instead of more RAM, for example.

Seems like they use a very simple, very flat hardware park and it probably helps them a lot with the maintenance. They don’t actually tell you what hardware is running in your VM when you do a cat /proc/cpuinfo, but by the frequency you can have a guess, because they claim to have the following portfolio:

2.6 GHz Intel Xeon E5 (Sandy Bridge)
2.5 GHz Intel Xeon E5 v2 (Ivy Bridge)
2.3 GHz Intel Xeon E5 v3 (Haswell)
2.2 GHz Intel Xeon E5 v4 (Broadwell)
2.0 GHz Intel Xeon (Skylake)

On all of my tests I always received a 2.5 GHz model, the CPU info only said the following: Intel(R) Xeon(R) CPU @ 2.50GHz. This seems to be a 2013 model.

Since there are basically only 2 kinds of instances, the test was very quick and easy. I chose the n1-standard and the n1-highcpu types.

Let’s crunch the numbers!

The 10-second spike benchmark results, GCP

All the single-core results were better than our physical hardware (2016 Xeon), but only slightly. If it’s really the 2013 Xeon, then wow, all my respect to the Google optimization engineers!

As a reminder: Amazon had a 10–24% performance loss as we increased the number of cores. (Except for the very constant t2 instance.) Seems like Google is more or less the same so far.

Surprisingly, the high CPU instance was actually slower than the standard. But as I mentioned above, this is the same type of hardware, it’s just more cores than RAM compared to the standard instance.

Again, similarly to Amazon, Google allows you to have temporary spikes in your CPU usage without throttling your available computing capacity. So let’s see the long-term benchmarks:

Apparently, as we increase the workload, we consistently lose 15–22% of performance. On Amazon it was 17–27%.

Here unfortunately I didn’t see a t2 equivalent instance, it’s supposed to be the n1-standard, but it definitely does not perform like our physical machines.

GCP relative performance compared to the 2016 Xeon E5–2690 v4

Multi-thread optimization vs. single-thread workflows, GCP

Summary: AWS vs GCP

When you look only at the raw performance, Amazon seems to be very strong in the competition:

Relative CPU performance, AWS vs GCP compared to the 2016 Xeon E5–2690 v4

However, such a dumbed-down comparison is never really useful: Amazon offers a lot of different instance types, which might have a weak CPU, but you get lightning-fast NVMe storage, etc. Sometimes that’s exactly what you need. Still, this article is only about raw CPU performance, so let’s see where the bill ends up:

On-demand prices for **8 vCPU** cores, Amazon vs Google

Now you can see it’s much more balanced! You get what you pay for.

In case you need smaller machines, the diagram might look slightly different — let’s say for dual core instances:

On-demand prices for **2 vCPU cores**, Amazon vs Google

Of course you can save a ton of money by using Amazon spot instances (a stock-exchange-style bidding on spare computing capacity) or the preemptible Google instances (which Google can turn off at any random time, and after 24 hours at the latest). For a real production workload, I don’t find it realistic that you could reserve all your capacity through risky bargaining just to win discounts of 20–90%.

A realistic scenario might be to buy fixed on-demand instances for your usual core workload, then auto scale with cheap spot/preemptible instances when there’s a peak of traffic. Also, for your QA environment the cheap instances should be perfectly fine: just adapt all your tools to correctly handle suddenly disappearing virtual machines and to re-allocate resources dynamically. And of course, cloud is all about auto scaling: when you don’t have so many visitors during the night, you don’t need to pay for a lot of running instances. This is one of the things where you can have a big gain compared to traditional on-premises infrastructures. (You don’t need to buy +200 physical machines with maintenance contracts, etc. only because you have a 2-hour peak every day, after which those machines only consume electricity with 40% idle CPU…)

An additional option: both providers also offer long-term discounts if you commit to 12 or 36 months of continuous usage.

The cost of solution A or B is far more complex than just checking random hourly instance prices once you start considering custom networking, storage requirements, bandwidth, etc. This article intended to focus only on the raw computing capacity comparison, as I found a lack of up-to-date information on the Internet.

Key differences: cloud vs on-premises CPU performance

There are a few key things we definitely realized by making this comparison:

on physical machines: if you add more CPU cores, you get linearly bigger performance
on the cloud providers, this was only partially true: performance increases linearly with more vCPUs, but you still tend to get only ~80% of the performance of a physical machine (= you need to buy more CPUs in the cloud)
on single-thread, single CPU workflows the cloud providers win hands-down because they have the most expensive, biggest CPUs which are very strong on single thread

Update: feedback from the cloud providers

One of the two cloud providers gave us direct feedback on the results we achieved. They said the performance loss is due to using Hyper-Threading logical cores instead of real physical ones, as in a bare-metal test: on the physical machine, when you restrict Docker to 8 CPU cores, you still have maybe 12 more installed, ready for the OS to use for interrupts, etc.

So they suggested that if we need 8 real cores to compare to physical machines, we should opt for a 16-core instance to get 8 true physical CPU cores reserved for us. On one hand, it absolutely makes sense; on the other hand, it still means I need to buy 2x the size (and the price) of the instance to achieve/surpass the actual on-premises performance…

To validate their claims, we did the same benchmarks on our on premises KVM cluster, assigning 8, 2, 1 vCPU cores, just like in the cloud. Then just to test what they suggested, we also did a round with +2 extra vCPUs, left only for the OS.

The results were consistent with our previous measurements from the non-KVM, on-premises hardware tests:

The 15-minute benchmark results, KVM on-premises

As you can see, it’s the exact same result: if you put 8x more virtual cores in KVM, you get 8x more performance. Not 6x more only or so.

Due to lack of time, I then did just a quick test in Google Cloud, using the above-mentioned method: overprovision the available cores by a lot. So basically I need only 2 cores for my application, but I will buy 8:

The 15-minute benchmark results, GCP with overprovisioned resources

Yes, it’s true, here I got a linear performance increase, just like with bare metal, but for the price of buying 2x, 8x, etc. more than what I originally wanted to pay, while with the physical machines I did not have this limit, even with KVM virtualization.

The next step would be to do a real Java benchmark or some other more realistic performance test, but these results can already be used in planning and calculations.

Thank you for taking the time to read this; I hope you also found it useful. Please feel free to share your thoughts, or, if you made a similar benchmark, it would be nice to see how it compares with these results.