The Reasons Behind Numpy's Speed Are Often Misunderstood — Part 2
The actual reason Numpy is so fast, and an introduction to Numba — a trick that can impress interviewers!
As always, I’ve uploaded the notebook I used to Colab and GitHub
This is a follow-up to Part 1, where I discuss the misconceptions found on this site and on the web in general. From last time, we had one remaining theory for why Numpy could be so fast: compilation. Let's test that theory:
Sum in a for loop test
0. Basics — create a big array of numbers (~8GB in this case)
import numpy as np

my_array = np.random.randint(low=0, high=1000, size=(1000000000))
1. Sum using np.sum()
%%timeit -n 3
np.sum(my_array)
- 164 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
Numpy can sum 1B numbers in literally a fraction of a second. For scale, one billion is the same order of magnitude as the number of seconds in 100 years! Very impressive!
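To see why this is about as fast as it gets, here is some back-of-the-envelope arithmetic (my numbers, not part of the original benchmark): the array is ~8 GB, so summing it in ~164 ms means reading memory at roughly 49 GB/s, which is in the ballpark of typical DRAM bandwidth. In other words, np.sum here is essentially memory-bound.

array_gb = 8             # ~8 GB of int64 values
seconds = 0.164          # measured np.sum time from above
print(f"{array_gb / seconds:.0f} GB/s")  # ~49 GB/s, near typical DRAM bandwidth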
2. Sum using Python loops
%%timeit -r 2
my_sum = 0
for i in range(len(my_array)):
    my_sum += my_array[i]
- 1min 8s ± 889 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Pretty slow, as expected: every iteration goes through the Python interpreter, which boxes and unboxes each element.
But we learned last time that one of the reasons Numpy is fast is that it's compiled! Let's try that on our loop code.
Introducing Numba, a trivially easy-to-use JIT (just-in-time) compiler for your Python code!
3. Sum using Numba on Python code
from numba import jit

@jit(nopython=True)
def fast_sum(my_array):
    my_sum = 0
    for i in range(len(my_array)):
        my_sum += my_array[i]
    return my_sum
As you can see, the code is literally the same loop code as above, just in its own function and with the @jit thing on top, called a decorator. That’s all that Numba asks you to do. Let’s see the results:
%%timeit
fast_sum(my_array)
- 160 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Same speed as Numpy! Not bad for literally one simple trick!
We got all the compiler benefits of fast execution, vectorization, instruction pipelining etc. by just importing a library.
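One caveat worth knowing (my note, not part of the benchmark above): the very first call to a @jit function includes the compilation time, so always warm the function up before timing it. A minimal sketch:

import time

start = time.perf_counter()
fast_sum(my_array)                     # first call: includes JIT compilation
print(f"first call: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
fast_sum(my_array)                     # second call: cached machine code only
print(f"second call: {time.perf_counter() - start:.2f} s")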
4. Going even faster than Numpy — a nice coding interview trick!
If libraries are allowed, that is.
from numba import njit, prange  # prange = parallel range

@njit(parallel=True)
def parallel_fast_sum(my_array):
    my_sum = 0
    for i in prange(len(my_array)):
        my_sum += my_array[i]
    return my_sum
This is the same loop, parallelized across the loop counter i (most likely using OpenMP in the background):
%%timeit
parallel_fast_sum(my_array)
- 68.9 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2–3x faster than Numpy, and again all it took was a function decorator. This may come in handy, especially if you're being judged on the speed of your code ;)
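Note that Numba recognizes the my_sum += ... pattern as a reduction, so the parallel loop is race-free. If you want control over how many cores prange uses (for example, on a shared machine), Numba exposes the thread count; a minimal sketch, assuming Numba ≥ 0.49:

import numba

numba.set_num_threads(4)               # cap the parallel region at 4 threads
print(numba.get_num_threads())         # -> 4
parallel_fast_sum(my_array)            # prange now uses at most 4 threads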
A more realistic (and complicated) example
I'll use the Radial Basis Function (RBF) Neural Network. It is essentially a 1-hidden-layer NN with the RBF (Gaussian) function as its activation. For demonstration purposes, I will implement only the training step.
A crucial part of RBF NN training is calculating the matrix H (often written as Φ), which is a square (n x n) matrix defined as
H[i, j] = exp(-||X[i,:] - X[j,:]||² / (2σ²))
The math above implies two for loops. Why?
Because both i and j are integers in the range [1, n], and to fill up H, we need to do the above calculation O(n²) times.
For the more math-oriented: the above calculation computes the (unnormalized) probability of observing data point (vector) X[i,:], assuming the mean is data point µ = X[j,:] and the standard deviation/spread is σ (which we choose).
Because every data point gets its turn being the mean μ, and every choice of μ needs to compare all n data points ⇒ O(n²)
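For reference, here is the same definition written out in LaTeX (my rendering; the original article showed it as an image):

H_{ij} = \exp\left(-\frac{\lVert X_{i,:} - X_{j,:} \rVert^2}{2\sigma^2}\right), \qquad i, j \in \{1, \dots, n\}

Up to normalization, each entry is the Gaussian density of X[i,:] under mean µ = X[j,:] and covariance σ²I.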
Finally some Code! Difference between two very similar-looking implementations
0. Basics — Setup
Because the RBF kernel is so expensive to compute, we'll use the tiny Iris dataset.
import numpy as np
from sklearn import datasets
from numba import jit

iris_data = datasets.load_iris().data
iris_label = datasets.load_iris().target
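A quick sanity check on what we just loaded (these shapes are standard for Iris):

print(iris_data.shape)   # (150, 4): 150 samples, 4 features
print(iris_label.shape)  # (150,)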
1. Let’s implement the RBF kernel H generation and run some tests
def generate_H_slow(data, spread):
    # Double for-loop (slow) version
    # data - label-free dataset
    # spread - sigma, the standard deviation
    n = data.shape[0]
    H = np.zeros((n, n))
    for j in range(n):
        W = data[j, :]
        for i in range(n):
            H[i, j] = np.exp(-((data[i, :] - W) @ (data[i, :] - W).T) / (2 * spread**2))
    return H
Results:
%%timeit
generate_H_slow(iris_data, 1)
- 42.3 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not great, not terrible. Actually, it is quite terrible: Iris has just 150 rows, with 4 features. What happens when you want to run this kernel over, say, image data, where each pixel becomes a data point?
Nothing happens, because you'll be waiting weeks for the algorithm to finish. And a faster computer won't help!
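To put a rough number on "weeks", scale the measured Iris timing by the O(n²) cost. This is my back-of-the-envelope extrapolation, not a measurement:

iris_n, iris_seconds = 150, 0.0423       # measured above: 150 rows in ~42.3 ms
image_n = 1024 * 1024                    # one data point per pixel of a 1-megapixel image
estimate = iris_seconds * (image_n / iris_n) ** 2
print(f"~{estimate / 86400:.0f} days")   # roughly 24 days, i.e. several weeks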
Well, we can redesign the whole approach to be optimized around Numpy, with fewer loops.
2. Let's reimplement everything cleanly in Numpy
def generate_H_fast(data, spread):
    n = data.shape[0]
    H = np.zeros((n, n))
    for j in range(n):
        # First, let's convert the inner loop to a matrix subtraction:
        # we used to subtract W from one row of data per step; instead, subtract W from all rows at once
        W = data[j, :]
        # data - W is equivalent to data[i,:] - W for every i, thanks to Numpy broadcasting:
        # https://numpy.org/doc/stable/reference/generated/numpy.broadcast.html#numpy.broadcast
        H[:, j] = np.exp(-np.linalg.norm(data - W, ord=2, axis=1)**2 / (2 * spread**2))
    return H
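It's worth verifying the two implementations agree before trusting the speedup (a quick check I'd add; not in the original):

H_slow = generate_H_slow(iris_data, 1)
H_fast = generate_H_fast(iris_data, 1)
print(np.allclose(H_slow, H_fast))  # True, up to floating-point rounding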
Results:
%%timeit
generate_H_fast(iris_data, 1)
- 1.14 ms ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Super impressive, 40x speedup!
But nobody has time to redesign their whole approach!
That requires deep knowledge of the algorithm you're optimizing (not common in industry, where you're trying many approaches), and of Numpy.
Let's pull the same trick as before and just throw Numba at it:
3. Trick — original algorithm + Numba
@jit(nopython=True)
def generate_H_numba(data, spread):
    # Same double for-loop (slow) version, just JIT-compiled
    # data - label-free dataset
    # spread - sigma, the standard deviation
    n = data.shape[0]
    H = np.zeros((n, n))
    for j in range(n):
        W = data[j, :]
        for i in range(n):
            H[i, j] = np.exp(-((data[i, :] - W) @ (data[i, :] - W).T) / (2 * spread**2))
    return H
Results:
%%timeit
generate_H_numba(iris_data, 1)
- 1.91 ms ± 9.89 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
A 20–30x speedup with zero effort on our part, and only ~1.5x slower than the completely redesigned Numpy version! I love Numba!
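And if you want to chase the Numpy version, the two tricks compose. Here is a hypothetical sketch combining the untouched double loop with parallel=True (generate_H_parallel is my name for it; timings will vary by machine):

from numba import njit, prange

@njit(parallel=True)
def generate_H_parallel(data, spread):
    n = data.shape[0]
    H = np.zeros((n, n))
    for j in prange(n):                  # parallelize the outer loop over columns of H
        W = data[j, :]
        for i in range(n):
            diff = data[i, :] - W
            H[i, j] = np.exp(-(diff @ diff) / (2 * spread**2))
    return H

Each thread writes to its own column of H, so the parallel loop is race-free.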
In conclusion, we can see that being compiled is the main reason why Numpy is fast. Compilers nowadays are very advanced, and can implement smart execution, vectorization (SIMD), and a ton of other CPU optimizations without the programmer's input!
References
Numpy SIMD optimizations: https://numpy.org/doc/stable/reference/simd/index.html