Random Sampling using SciPy and NumPy: Part II
Fancy algorithms, source code walkthrough and potential improvements
In Part I we went through the basics of Inverse Transform Sampling (ITS) and created our own ITS pure python implementation to sample numbers from a standard normal distribution. We then compared the speed of our somewhat optimised function to that of the built-in SciPy function and found ourselves somewhat lacking — to the tune of being 40x slower.
In this part the aim is to explain why that is the case by digging through the relevant bits of the SciPy and NumPy code base to see where those speed improvements manifest themselves. In general we will find that the speedup comes from a combination of:
- faster functions, written either in Cython or straight C
- newer, faster sampling algorithms compared to our tried and tested Inverse Transform Sampling
How do we generate normally distributed random samples in SciPy?
The following is the code to generate 1,000,000 random numbers from a standard normal distribution.
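With n for the sample size and snorm for the frozen distribution (the names used throughout the rest of the post), the setup looks something like this:

# sketch of the timing setup; variable names match those used later in the post
from scipy import stats

n = 1_000_000          # number of samples to draw
snorm = stats.norm()   # a 'frozen' standard normal distribution

%timeit snorm.rvs(size=n)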
43.5 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So the function rvs generates 1,000,000 samples in just over 40ms. For comparison, we were able to achieve this on average in 2.3s using our algorithm based on the principle of inverse transform sampling. To understand the speed differences we're going to have to dive into that rvs method.
It’s worth noting that (in general) with SciPy the core of the logic is contained in underscore methods — so when we want to have a look into rvs, really we want to see the code for _rvs. The non-underscore methods generally implement some argument type checking or defaulting before handing over to the underscore methods.
Before working our way through let’s just do a brief overview of the way SciPy organises distribution functionality in the library.
rv_generic and rv_continuous
SciPy distributions are created from a neat inheritance structure with:
- rv_generic as the top level class providing methods like get_support and mean
- rv_continuous and rv_discrete inheriting from it with more specific methods
So in the above case where we initiated our normal distribution class snorm as stats.norm(), what that is really doing is creating an instance of rv_continuous which inherits a lot of functionality from rv_generic. To be even more specific, we actually create an rv_frozen instance, which is a version of rv_continuous but with the params of the distribution fixed (e.g. the mean and variance).
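A quick way to see this structure for ourselves (the exact class names vary a little between SciPy versions, so take the printed names as indicative):

# inspect the class structure behind a frozen normal distribution
from scipy import stats

snorm = stats.norm()

print(type(snorm))        # the frozen wrapper, e.g. rv_frozen
print(type(snorm.dist))   # the underlying distribution class, e.g. norm_gen
print(isinstance(snorm.dist, stats.rv_continuous))   # True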
With that in mind, let's now peer inside the rvs method.
rvs
When we run the ?? magic on snorm.dist._rvs we see the following code snippet:
# ?? snorm.dist._rvs
def _rvs(self, size=None, random_state=None):
return random_state.standard_normal(size)
So it seems that somewhere in the distribution class we created, a random_state object has been assigned, and that random_state object contains a method that can return numbers distributed according to a standard normal distribution.
It turns out that the random_state object that spits out these random numbers is actually from NumPy. We see this by looking at the source code for rv_generic, which contains in its __init__ method a call to a SciPy util method called check_random_state which, if no seed is passed already, will set random_state to an instance of np.random.RandomState.
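The check_random_state helper lives in scipy._lib._util. Paraphrased (the real version also accepts a few other seed types), it does roughly the following:

# paraphrased sketch of scipy._lib._util.check_random_state, not the verbatim source
import numbers
import numpy as np

def check_random_state(seed):
    # no seed supplied: fall back to NumPy's global RandomState instance
    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    # an integer seed: build a fresh RandomState from it
    if isinstance(seed, (numbers.Integral, np.integer)):
        return np.random.RandomState(seed)
    # already a RandomState: use it as-is
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState instance")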
Over to NumPy
So it seems like the ‘magic’ that delivers such blazing fast sampling actually sits in NumPy, not SciPy. This shouldn’t be all that shocking as SciPy is deliberately built on top of NumPy to prevent duplication and inconsistencies where the two libraries may provide identical features. This is explicitly stated in the first line of the SciPy Intro documentation here:
“SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python.”
To see what is going on we can have a look at the np.random.RandomState class here. We can see from the use of:
- cdef instead of def for function declarations
- a .pyx file extension instead of .py

which both indicate that the code is written using Cython — a language very similar to Python that allows functions to be written in almost-Python syntax, but then compiled into optimised C/C++ code for efficiency. As they put it themselves in the documentation:
“The source code gets translated into optimised C/C++ code and compiled as Python extension modules. This allows for both very fast program execution and tight integration with external C libraries, while keeping up the high programmer productivity for which the Python language is well known.”
Within this class there are two things we need to look at to understand the sampling process:
- what it is doing to generate the uniformly distributed random numbers (the PRNG)
- what algorithm it is using to convert these uniformly distributed numbers into normally distributed numbers
The PRNG
As mentioned in Part I, generating a random sample requires some form of randomness. Almost always this isn’t true randomness, but a series of numbers generated by a ‘pseudo-random number generator’ (PRNG). Just as with sampling algorithms, there are a variety of PRNGs available and the specific implementation used here is detailed in the __init__
method of np.random.RandomState
:
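The Cython __init__ itself isn't reproduced here, but its effect is easy to verify from Python (note that _bit_generator is an internal attribute, so this detail may differ across NumPy versions):

# check which bit generator a fresh RandomState is wired up to
import numpy as np

rs = np.random.RandomState()
# with no explicit seed or bit generator supplied, RandomState defaults to
# the Mersenne Twister (MT19937) bit generator
print(type(rs._bit_generator))   # <class 'numpy.random._mt19937.MT19937'>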
As the above shows, when the class is instantiated the default PRNG is set to be an implementation of the Mersenne Twister algorithm — so named because its period (the number of random numbers it can generate before it starts to repeat itself) is a Mersenne prime.
The Sampling Process
Some way down the code for the class np.random.RandomState
we see the definition of standard_normal
making a call to something called legacy_gauss
. The C code for the legacy_gauss
function is here and for ease of viewing we'll show it here:
As can be seen on Wiki in the implementation section, this is none other than a C implementation of the Marsaglia Polar Method for generating random samples from a normal distribution given a stream of uniformly distributed input numbers.
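The C source isn't reproduced here, but the algorithm is compact enough to sketch in pure Python. The following illustrates the method itself rather than NumPy's actual implementation (which also caches the second value so the rejection loop only runs every other call):

# pure-python sketch of the Marsaglia polar method (illustrative, not NumPy's C code)
import math
import random

def marsaglia_polar_pair():
    # draw points uniformly from the square [-1, 1] x [-1, 1] until one lands
    # strictly inside the unit circle (and isn't the origin itself)
    while True:
        x = 2.0 * random.random() - 1.0
        y = 2.0 * random.random() - 1.0
        s = x * x + y * y
        if 0.0 < s < 1.0:
            break
    # transform the accepted point into two independent standard normal draws
    f = math.sqrt(-2.0 * math.log(s) / s)
    return x * f, y * f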
Recap
We’ve gone through a lot there so it’s worth stepping back through and making sure everything is crystal clear. We’ve gone from:
- a SciPy function called _rvs, written in python, initiates
- a NumPy class np.random.RandomState, written in Cython, which
- generates uniformly distributed numbers using the Mersenne Twister algorithm and then
- feeds these numbers into a function legacy_gauss, written in C, which churns out normally distributed samples using the Marsaglia Polar method
The above highlights the lengths that the clever people building SciPy and NumPy have gone to in order to generate efficient code. We have a top layer callable by users (like you and me) that is written in python (for that famed ‘programmer productivity’), while deeper layers of the infrastructure are increasingly written as close to C as possible (for speed).
Why is SciPy calling a NumPy function deemed ‘legacy’?
Because sampling is a branch of maths / computer science that is still moving forward. Unlike other areas where certain principles were agreed upon centuries ago and haven’t seen change since, efficiently sampling various distributions is still seeing fresh developments. As new developments get tested, we would like to update our default processes to incorporate these advancements.
This is exactly what happened in July 2019 with NumPy 1.17.0 when they introduced 2 new features that impact sampling:
- the implementation of a new default pseudo-random number generator (PRNG): Melissa O’Neill’s PCG family of algorithms
- the implementation of a new sampling process: the Ziggurat algorithm
Due to the desire for backward compatibility of PRNGs however, instead of creating a breaking change they introduced a new way to initiate PRNGs and switched the old way over to reference the ‘legacy’ code.
The backward compatibility referenced here is the desire for a PRNG to generate the same stream of random numbers given the same seed. Two different algorithms will not produce the same random numbers even when given the same seed, and this reproducibility is especially important for testing.
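To make that concrete:

# legacy RandomState: same seed, same algorithm -> identical stream every time
import numpy as np

print(np.random.RandomState(42).standard_normal(3))
print(np.random.RandomState(42).standard_normal(3))   # same three numbers again

# the new Generator uses a different PRNG (PCG64) and a different sampling
# algorithm, so the same seed produces a completely different stream
print(np.random.default_rng(42).standard_normal(3))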
It appears SciPy hasn’t been upgraded yet to make use of these new developments.
Can we beat SciPy?
Given we know what we know now about how normal distribution sampling is implemented in SciPy, can we beat it?
The answer is yes — by making use of the latest developments in sampling implemented for us in NumPy. Below is an implementation of sampling where we:
- use the latest PRNG
- use the new ziggurat algorithm for converting these numbers into a normally distributed sample
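The setup only needs one extra line on top of what we had before: np.random.default_rng() returns a Generator backed by the PCG64 bit generator, and its normal / standard_normal methods use the ziggurat algorithm. Here nnorm is simply the name given to that generator, matching the timings below:

import numpy as np
from scipy import stats

n = 1_000_000

# SciPy's frozen standard normal, as before (legacy RandomState machinery underneath)
snorm = stats.norm()

# NumPy's new-style Generator: PCG64 PRNG + ziggurat-based normal sampling
nnorm = np.random.default_rng()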
# test scipy speed
%timeit snorm.rvs(size=n)
51 ms ± 5.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test numpy speed
%timeit nnorm.normal(size=n)
24.3 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it seems like we’re around 2x as fast as SciPy now - something that is in the expected 2-10x bracket as NumPy highlights in their release here.
Conclusion: how useful is this?
When it comes to implementing custom distribution sampling: very useful. We now understand what it takes to achieve SciPy-esque sampling speed and can make an informed choice about how to implement custom distribution sampling. We can either:
- stick with the pure python inverse transform sampling implementation from Part I (after all, 2s isn't bad for a sample of 1,000,000 in most contexts)
- write our own sampling procedure — and preferably write this sampling procedure in C or Cython — which is no small ask
In the next part we’ll look at doing just that — implementing an efficient custom distribution sampling function within the SciPy infrastructure. This gives us the best of both worlds — the flexibility to implement the exact distribution of our choice along with making use of the efficient and well written methods that we inherit from the rv_generic and rv_continuous SciPy classes.