NumPy Functions Composed

Compare Fast Inverse Square Root Method to NumPy ufuncs, Numba JIT, and Cython — Which One Wins?

Bob Chesebrough
5 min read · Dec 7, 2023
Motion-blur Einstein: image by author using DALL-E and Photoshop, prompt: “cartoon einstein running fast with motion lots of blur”

Efficient variations in how we compute the reciprocal square root can run as much as three to four orders of magnitude faster than older methods.

Which one you choose depends on your need for accuracy, speed, reliability, and maintainability.

See which one scores highest in your needs assessment!

What is reciprocal sqrt?

It is a function composed of two operations:

  1. Reciprocal (Multiplicative Inverse)
  2. Square root of a value or vector
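
Put together, that is rsqrt(x) = 1 / sqrt(x). A minimal NumPy sketch (the example values are my own):

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0], dtype=np.float32)  # example values of my own choosing

# Reciprocal of the square root: rsqrt(x) = 1 / sqrt(x)
rsqrt = np.reciprocal(np.sqrt(x))
print(rsqrt)  # [1.  0.5  0.33333334  0.25]
```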

Where is it used?

Physics / Engineering

  • Partial derivative of the distance formula with respect to one dimension, as follows:
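
For the Euclidean distance d = √(x² + y² + z²), differentiating with respect to x gives

∂d/∂x = x / √(x² + y² + z²) = x · (x² + y² + z²)^(-1/2),

which contains exactly this reciprocal square root.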

Special Relativity Lorentz transformation

  • The gamma coefficient is used to compute time dilation or length contraction among other calculations in special relativity
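
In standard notation the gamma factor is itself a reciprocal square root: γ = 1 / √(1 − v²/c²).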

3D Graphics or ML applications needing normalization in space

  • Vector normalization in video games, which is used heavily in 3D programming; this is why the Fast Inverse Square Root was invented for Quake III (a minimal sketch follows)
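
As a concrete illustration (my own sketch, not code from Quake), normalizing a vector multiplies each component by the reciprocal square root of its squared length:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length using the reciprocal square root of its squared length."""
    inv_len = 1.0 / np.sqrt(np.dot(v, v))  # rsqrt of |v|^2
    return v * inv_len

print(normalize(np.array([3.0, 4.0, 0.0])))  # [0.6 0.8 0. ]
```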

This is an interesting topic because it intersects the recent history of algorithms, hardware, and software library implementations. Methods for computing this quantity quickly, even as an approximation, at first outstripped the instruction sets on x86 architecture. Two Berkeley researchers, William Kahan and K.C. Ng, wrote an unpublished paper in May 1986 describing how to calculate the square root using bit-fiddling techniques followed by Newton iterations. This was picked up by Cleve Moler and company of MATLAB* fame, and Cleve's associate Greg Walsh devised the now-famous constant and the fast inverse square root algorithm. Read the fascinating history on Wikipedia.

Searching the web today, you will still find many articles advising the use of this algorithm. It is an ingenious application of the Newton-Raphson method to get fast approximations of 1/sqrt().

Then Intel's introduction of the Pentium III in 1999 brought a new instruction that computes the reciprocal square root as part of the SSE instruction set, a vectorized instruction!

What is the best algorithm now?

That all depends on perspective. The Fast Reciprocal Sqrt is a clever trick, and from what I see its accuracy is within about 1% of the actual value. SSE and AVX are vectorized instructions based on the IEEE floating-point standard, but they also produce approximations. Their accuracy depends on the data type you choose, but it is typically far better than 1%. When dealing with floating-point calculations such as these, the order of operations and how you accumulate partial sums matter. So let me provide the code and a little discussion of what I found, and you be the judge!

Take a look at this article I found, which gives the edge to NumPy: Is Fast Inverse Square Root still Fast?

My results are a lot more dramatic — but I leave open the possibility that I have made a coding mistake.

Several implementations in Python can be found on the web.

See the article by acjr.net: "The Fast Inverse Square Root method in Python,"

or this article by Doug Foo: "Death of Quake 3's Inverse Square Root Hack."

What do I test in this notebook? (See the GitHub link at the end.)

In this notebook we will test the following approaches (rough sketches of these variants follow the list):

PyTorch rsqrt

  • Use the PyTorch built-in function torch.rsqrt()

NumPy_Compose RecipSqrt

  • Use NumPy's np.reciprocal(np.sqrt(a))

NumPy_Simple

  • Use NumPy implicit vectorization: b = a**-.5

Cython_Simple

  • Use a Cython variant of the simple a**-.5

Numba_Simple

  • Use a Numba njit variant of the simple a**-.5

BruteForceLoopExact

  • Brute-force loop approach, no vectorization at all

Fast Reciprocal Sqrt Newton-Raphson Simple Loop

  • Fast Reciprocal Sqrt using the Newton-Raphson and Quake III approach

Fast Reciprocal Sqrt Newton-Raphson Vectorized

  • Fast Reciprocal Sqrt using the Newton-Raphson and Quake III approach, vectorized with np.apply

Fast Reciprocal Sqrt Newton-Raphson Cython

  • Fast Reciprocal Sqrt using the Newton-Raphson and Quake III approach in Cython.
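
Here is a rough sketch of what these variants look like in Python. It is not the exact notebook code (the array size, dtype, and helper names are my assumptions, the Cython variants are omitted, and the np.apply-style vectorization is approximated with a list comprehension), but it captures the idea of each approach:

```python
import struct

import numpy as np
import torch
from numba import njit

a = np.random.rand(100_000).astype(np.float32) + 0.01  # assumed test data, strictly positive

# PyTorch rsqrt: built-in reciprocal square root
rs_torch = torch.rsqrt(torch.from_numpy(a)).numpy()

# NumPy_Compose RecipSqrt: compose two ufuncs
rs_compose = np.reciprocal(np.sqrt(a))

# NumPy_Simple: implicit vectorization of a single expression
rs_simple = a ** -0.5

# Numba_Simple: the same expression, JIT compiled
@njit
def numba_simple(x):
    return x ** -0.5

rs_numba = numba_simple(a)

# BruteForceLoopExact: plain Python loop, no vectorization at all
def brute_force(x):
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = 1.0 / (x[i] ** 0.5)
    return out

rs_brute = brute_force(a)

# Fast Reciprocal Sqrt: Quake III bit hack plus one Newton-Raphson step
def fast_inv_sqrt(x):
    i = struct.unpack('<i', struct.pack('<f', x))[0]  # reinterpret float32 bits as int32
    i = 0x5F3759DF - (i >> 1)                         # the famous magic constant
    y = struct.unpack('<f', struct.pack('<i', i))[0]  # back to float32
    return y * (1.5 - 0.5 * x * y * y)                # one Newton-Raphson refinement

rs_fast = np.array([fast_inv_sqrt(float(v)) for v in a], dtype=np.float32)

# Sanity check: the bit hack lands within roughly 0.2% of the ufunc result
print(np.max(np.abs(rs_fast - rs_compose) / rs_compose))
```

The struct pack/unpack pair reproduces the bit-level reinterpretation that the original C code performs with pointer casting, which is what makes the magic-constant trick work at all.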

Where did the results land?

It depends on the platform.

On the new Intel Developer Cloud (Ubuntu 22), I see the following results (Intel(R) Xeon(R) Platinum 8480L, 224 cores, 503 GB RAM):

Chart: Testing various algorithms and optimizations for inverse square root

My results tend to align with the observations in Doug Foo's article, and I see orders of magnitude speedup using the built-in functions in NumPy and PyTorch. But let's still celebrate the ingenuity of the creators of the Fast Reciprocal Sqrt. Here's a toast to them!

Code

The code for this article and the rest of the series is located on GitHub. For this article, experiment with the file Inverse_Sqrt.ipynb.

Related Articles:

Article 1:

Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library. Explore the reasons why replacing inefficient Python loops with NumPy or PyTorch constructs is a great idea.

Article 2:

Python Loop Replacement: NumPy Optimizations Simple Stuff — ND array creation using NumPy, PyTorch, DPCTL. Explore simple ways of creating, converting, and transforming lists into NumPy ndarrays; a very basic getting-started guide.

Article 3:

Introduction to NumPy* Universal functions (ufuncs). How I learned to stop worrying and let smart developers help me.

Article 4:

Replacing Python loops: Aggregations and Reductions. How to replace slow Python loops with strategic function equivalents for aggregating data.

Article 5:

Replacing Python loops: Fancy Slicing and Broadcasting. Here I address fancy slicing and broadcasting to take advantage of key optimizations for loop replacement.

Article 6:

Python Loop Replacement: PyTorch & NumPy Optimizations. Not your SQL Select clause — Using Where and Select to vectorize conditional logic.

Article 7:

Loop Replacement Strategies: Applications to Pandas Apply. Examine how to accelerate Pandas Apply.

Article 8: (Current article)

NumPy Functions Composed. Compare Fast Inverse Square Root Method to NumPy ufuncs, Numba JIT, and Cython — Which One Wins?

Intel Developer Cloud System Configuration as tested:

x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0–223
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8480+
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 56
Socket(s): 2
Stepping: 8
CPU max MHz: 3800.0000
CPU min MHz: 800.0000


Bob Chesebrough

Robert Chesebrough is currently a Solution Architect in the Intel Developer Academy, where he teaches others to apply optimizations to data science algorithms.