More CPU cores are seldom better, and here’s why

Reduce cloud costs by 55–77% by making sure you only pay for the CPU power you need. For many Python workloads, this comes at no performance penalty!

Ariel Lubonja
4 min read · Jan 10, 2023

Why you’re bound to lose out by increasing core count

Most Python programs, and a significant amount of non-dedicated code in other languages, run on only a single core. Why? Writing a program that runs in parallel is much more complicated and exposes you to subtle bugs such as deadlocks, lost writes, and out-of-memory errors. For many applications, the significantly longer development time, plus the exposure to these bugs, is not worth the performance benefit.

The vanilla data science stack (Python, NumPy, scikit-learn), like the majority of Python packages, runs on only one core. It is safe to assume that any package whose documentation doesn’t mention the “Global Interpreter Lock”, “releasing the GIL”, or “parallel” runs on only a single vCPU. Examples that do run in parallel include TensorFlow and PyTorch, as well as packages that make writing parallel code easier: joblib, Dask, and Ray (see the sketch below). The same applies to a variety of other applications, from web apps to gaming.
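To give a flavor of how low the barrier is, here is a minimal joblib sketch (the slow_square function is a made-up stand-in for your own per-item workload):

```python
# Parallelize an embarrassingly parallel loop with joblib.
from joblib import Parallel, delayed

def slow_square(x):
    # Stand-in for an expensive, independent per-item computation
    return x ** 2

# n_jobs=4 spreads the calls across 4 workers
results = Parallel(n_jobs=4)(delayed(slow_square)(i) for i in range(100))
print(results[:5])  # [0, 1, 4, 9, 16]
```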

I have seen many cases where clients rent big and expensive AWS instances with dozens of cores, when the code being run on them benefits very little from the additional cores. AWS instances are priced in USD/hr, which makes the cost seem deceptively small, but it adds up to hundreds of USD per month. This waste was my motivation for writing this article.

If you scale up because you need more memory, try the memory-optimized instances, or even consider the x2gd instances that use Amazon’s own Graviton ARM chips. These offer the best memory-capacity-to-price ratio on AWS while also delivering strong single-core performance.

How can you check?

Run your program, then run htop in a new terminal. Look at the bars next to each core. If they stay mostly empty, or if the load average stays below 2 (meaning fewer than 2 cores on average are working flat-out), you may want to downscale your instance.
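If you prefer to check from code, here is a minimal sketch using the standard library (Unix-only; os.getloadavg raises OSError on Windows):

```python
# Compare the 1-minute load average to the core count while your program runs.
import os

load_1min, _, _ = os.getloadavg()  # avg number of runnable tasks (busy cores)
cores = os.cpu_count()             # logical cores (vCPUs)

print(f"1-min load average: {load_1min:.2f} across {cores} logical cores")
if load_1min < 2:
    print("Fewer than 2 cores busy on average; a smaller instance may suffice.")
```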

Why — Amdahl’s Law

Amdahl’s Law models a program by splitting its runtime into a Serial part and a Parallel part.

1. The Serial part represents the portion of a program that cannot utilize multiple CPUs. Examples: reading from and writing to disk, and algorithms where each iteration depends on the result of the previous one, such as building Decision Trees.

2. The Parallel part is the portion of the program that scales to multiple cores. Examples: vector and matrix operations, and programs in which the order of operations doesn’t matter, e.g. Stochastic Gradient Descent, where it does not matter which data point (or mini-batch) you select first. Aggregates such as sum(), max(), and avg() can also be parallelized.

Implications

Say you have a very well-optimized program where 90% of its runtime can be parallelized perfectly. How many cores do we need to get an 8x speedup?

Let’s use the Amdahl’s Law equation:
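In its standard form, where p is the fraction of the runtime that can be parallelized and s is the speedup of the parallel part (here, the number of cores):

$$S_{\text{latency}}(s) = \frac{1}{(1 - p) + \frac{p}{s}}$$

Setting $S_{\text{latency}} = 8$ and $p = 0.9$ and solving for $s$:

$$s = \frac{p}{\frac{1}{S_{\text{latency}}} - (1 - p)} = \frac{0.9}{0.125 - 0.1} = 36$$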

So we need 36 cores to get an 8x speedup, even though our program is very well written (90% is parallelizable). Looking at Amazon’s pricing, a machine with at least 36 physical cores (vCPUs are hyperthreads, so we need 72 vCPUs) is the c5.18xlarge, starting at $3.06/hr.

We can save some money by aiming for only a 6x speedup. Plugging s_latency = 6 and p = 0.9 into the formula above, we get that we need 13.5 cores. The cheapest general-purpose machine on AWS with at least 14 cores is the m6a.8xlarge, costing $1.38/hr. That is a 55% decrease in your cloud costs, if you’re OK with 25% less speedup (6x instead of 8x)!


We can go even further if a 4x speedup is acceptable: in that case, we only need 6 cores. The m6a.4xlarge has 8 cores (16 vCPUs) and costs $0.69/hr. That is a 77% reduction in cloud costs, but at a 50% performance penalty.

This clearly shows the sharply diminishing returns in performance versus the linear increase in cloud costs as you scale up your instances! And this assumes you have very efficient code, where 90% is perfectly parallelizable. Another point to note: our code cannot get more than a 10x speedup, no matter how many CPUs you throw at it. That’s because in our example, 10% of the code is inherently serial (e.g. loading data from disk), and that part doesn’t improve with more CPU cores.
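You can reproduce all the numbers above with a small sketch that inverts Amdahl’s Law (the helper name cores_for_speedup is my own, and it assumes the parallel part scales perfectly):

```python
# How much parallel speedup s (roughly, how many cores) do we need
# to reach a target overall speedup, given parallel fraction p?
def cores_for_speedup(target, p):
    serial = 1 - p
    max_speedup = 1 / serial  # hard ceiling: 10x when p = 0.9
    if target >= max_speedup:
        raise ValueError(f"With p={p}, speedup is capped at {max_speedup:.1f}x")
    return p / (1 / target - serial)

for target in (8, 6, 4):
    print(f"{target}x overall speedup needs "
          f"{cores_for_speedup(target, p=0.9):.1f} cores")
# 8x needs 36.0, 6x needs 13.5, 4x needs 6.0 cores
```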

(Figure: Price vs. Performance Benefit)

How to find the portion of your code that runs in parallel

This is the 90% figure above, sometimes also referred to as Amdahl’s Number. You can estimate it by timing your code on two instances of different sizes, plugging the measured values of s and s_latency into the equation above, and solving for p, as sketched below.
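Here is a minimal sketch of that estimate (the helper estimate_parallel_fraction and the example timings are illustrative), obtained by writing Amdahl’s Law for both machines and solving for p:

```python
# Estimate the parallel fraction p from runtimes measured on two
# instance sizes. From Amdahl's Law: T(s) = T_1 * ((1 - p) + p / s),
# where T_1 is the single-core runtime.
def estimate_parallel_fraction(t1, s1, t2, s2):
    """t1, t2: runtimes measured on machines with s1 and s2 cores (s2 > s1)."""
    r = t1 / t2  # observed speedup going from s1 to s2 cores
    return (r - 1) / (r * (1 - 1 / s2) - (1 - 1 / s1))

# Example: 1000 s on 1 core, 125 s on 36 cores -> p = 0.9
print(estimate_parallel_fraction(1000, 1, 125, 36))
```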

Why — the GIL (Python Specific)

Python is fundamentally designed for single-core execution. This is because of the Global Interpreter Lock (GIL): each Python process (each terminal window in which you run the python command) can only execute one bytecode instruction at a time, no matter how many threads it spawns. In other words, 47 cores of your fancy 48-core CPU are sitting idle.
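A minimal sketch makes the GIL visible (the busy-loop workload is made up for illustration): the same CPU-bound task gains almost nothing from threads, but scales with processes, because each process gets its own GIL.

```python
# CPU-bound work: threads are serialized by the GIL, processes are not.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n=10_000_000):
    # Pure-Python busy loop; holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i
    return total

def timed(executor_cls, workers=4):
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        list(ex.map(busy, [10_000_000] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"threads:   {timed(ThreadPoolExecutor):.2f} s")   # roughly serial
    print(f"processes: {timed(ProcessPoolExecutor):.2f} s")  # roughly parallel
```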

Even if you’re using a language without this limitation, e.g. MATLAB, which offers automatic parallelization for most operations, you still want to reconsider scaling up. This is due to the harsh realities of Amdahl’s Law.

Appendix

If you’re an AWS/Cloud user and you need more memory (RAM) without needing the extra CPU, make sure to select the memory-optimized instances. AWS charges quite a lot for its instances, even though the cents-per-hour price seems insignificant.


Ariel Lubonja

I am a PhD student in Computer Science at Johns Hopkins University. Areas: High-Performance Computing, Graph Machine Learning