Introduction to NumPy* Universal functions (ufuncs)

How I learned to stop worrying and let smart developers help me.

Bob Chesebrough
6 min read · Dec 8, 2023
Image source: Production line icons (macroventor via freepik)

What are Universal Functions?

In NumPy, Universal Functions (ufuncs) are simply functions that operate on NumPy ND arrays in a conceptually element-by-element fashion, supporting array broadcasting, type casting, and several other standard features. In reality, many ufuncs are accelerated via a mechanism called vectorization: the application of single instruction, multiple data (SIMD) registers and hardware-level instructions, perhaps in combination with high-level memory optimizations provided by a library or a vectorizing C compiler, all baked into a pre-built stock version of NumPy*. See my previous article, “Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library,” for more detail on this.

Let’s Explore Some Examples of “ufuncs”

Universal Functions span a wide variety, likely thousands, of commonly used functions: math “transcendental” functions such as sines, cosines, and tangents; hyperbolic sines, cosines, and the like; inverse trig and inverse hyperbolic functions; and logarithm- and exponential-related algorithms, just to get started.
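For instance, here is a minimal sketch (the array values are arbitrary, chosen just for illustration) exercising a few of these transcendental ufuncs:

import numpy as np

x = np.array([0.0, 0.5, 1.0])  # illustrative values
print(np.sin(x))               # elementwise sine
print(np.arcsin(x))            # elementwise inverse sine
print(np.sinh(x))              # elementwise hyperbolic sine
print(np.exp(x))               # elementwise exponential
print(np.log1p(x))             # elementwise log(1 + x)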

Math operations are included too, such as reciprocal, matmul, and many, many more. Some of these functions can be invoked as “operators”, such as “+” invoking addition, or “-” invoking subtraction, and so on (a quick sketch of this follows the example below). Functions can be daisy-chained too, such as:

np.reciprocal(np.sqrt(a))
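And here is that quick sketch of the operator forms: the arithmetic operators on NumPy arrays dispatch to the equivalent ufuncs (the arrays x and y below are hypothetical):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])
print(x + y)              # operator form
print(np.add(x, y))       # the same addition as an explicit ufunc call
print(x - y)              # operator form
print(np.subtract(x, y))  # the same subtraction as an explicit ufunc call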

Many bit-twiddling functions are included to make it fast and efficient to compare or perform bitwise operations on one or two vectors: for example, invoking bitwise AND or bitwise OR, bitwise shifts of numbers, exclusive-or operations, etc.
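Here is a minimal sketch of a few of these bit-twiddling ufuncs on two small integer arrays (the values are arbitrary):

import numpy as np

u = np.array([0b1100, 0b1010], dtype=np.uint8)
v = np.array([0b1010, 0b0110], dtype=np.uint8)
print(np.bitwise_and(u, v))  # elementwise AND
print(np.bitwise_or(u, v))   # elementwise OR
print(np.bitwise_xor(u, v))  # elementwise exclusive or
print(np.left_shift(u, 1))   # shift each element left by one bit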

Similarly, there are logical functions that compare each position of two vectors to identify where vector A is greater than or less than vector B, as well as related functions for finding the locations where two vectors are equal, or for performing a logical NOT on a single vector.
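For example (the arrays A and B below are illustrative):

import numpy as np

A = np.array([1, 5, 3])
B = np.array([2, 4, 3])
print(np.greater(A, B))       # where A > B:  [False  True False]
print(np.less(A, B))          # where A < B:  [ True False False]
print(np.equal(A, B))         # where A == B: [False False  True]
print(np.logical_not(A > B))  # logical NOT of a boolean mask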

Floating-point operations and tests are included as well, such as checking which locations in a vector are finite, where the “Not a Number” values (NaNs) are located, finding where the elements of a vector are infinite (inf), etc.
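A small sketch, using a vector with a NaN and infinities deliberately planted:

import numpy as np

v = np.array([1.0, np.nan, np.inf, -np.inf, 0.0])
print(np.isfinite(v))  # True where the element is finite
print(np.isnan(v))     # True where the element is NaN
print(np.isinf(v))     # True where the element is +/- infinity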

Table 1 shows a partial summary of the functions included as ufuncs.

Table 1. Partial summary of Universal Functions

In the code below, we see some simple-minded code to round each element of a Python list of floats:

import random

random.seed(42)
num = 10_000_000
a = []
for i in range(num):               # assume we are given a random list, so don't time this
    a.append(random.random()*100)  # don't time this

b = []
for i in range(num):
    b.append(round(a[i], 5))

Compare the above code to the faster and more succinct list comprehension below:

b = [round(x,5) for x in a] # assume we are given a random list

Note that in all of the following code examples, I convert the Python list to a NumPy array. In actual practice, this conversion happens one time near the top of the code, and then we freely reuse the NumPy array for all manipulations after that; the speedups attained are even more impressive when we take this into consideration. Now compare to the NumPy equivalent below:

b = np.round(np.array(a), 5)  # assume we start with a Python list

The acceleration gained from applying NumPy to replace the loop is remarkable, and the readability of the NumPy code is on par with that of the list comprehension. See Figure 1.
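The timings shown in the figures were measured on the Intel Developer Cloud. As a rough, illustrative sketch of how such a comparison can be reproduced on your own machine (this harness is an assumption on my part, not the exact benchmark used for the figures), something like the following works:

import time
import numpy as np

t0 = time.perf_counter()
b_loop = []
for x in a:                      # naive loop
    b_loop.append(round(x, 5))
t1 = time.perf_counter()
b_lc = [round(x, 5) for x in a]  # list comprehension
t2 = time.perf_counter()
a_np = np.array(a)               # one-time conversion, done once near the top in practice
b_np = np.round(a_np, 5)         # NumPy ufunc
t3 = time.perf_counter()
print(f"loop: {t1-t0:.3f}s  comprehension: {t2-t1:.3f}s  NumPy: {t3-t2:.3f}s")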

Figure 1. Wall-clock time of different methods to round a list of values. Measured on a shared node on the Intel Developer Cloud: Ubuntu 22, Intel(R) Xeon(R) Platinum 8480L, 224 CPUs, 503 GB RAM.

Next I’ll examine composition of functions, also known as nesting or daisy-chaining. Consider computing the reciprocal of the square root of a value, which is common in physics and engineering. Compare the following three approaches for readability and speed:

Naive Python Loop:

import math

b = []
for i in range(len(a)):
    b.append(1.0/math.sqrt(a[i]))

Python List Comprehension:

b = [1.0/math.sqrt(x) for x in a]

NumPy composed functions of reciprocal and sqrt:

b = np.reciprocal(np.sqrt(a))
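One caveat worth noting here: np.reciprocal preserves the dtype of its input, so on integer arrays it truncates toward zero. In the composition above, np.sqrt has already produced a floating-point array, so the result is what you would expect:

import numpy as np

print(np.reciprocal(np.array([2, 4])))      # integer dtype: [0 0]
print(np.reciprocal(np.array([2.0, 4.0])))  # float dtype:   [0.5  0.25]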

In Figure 2 we see the acceleration provided by combining NumPy ufuncs over the naive loop and list comprehension methods.

Figure 2. Nested NumPy ufuncs are faster than the naive loop and list comprehension methods. Measured on a shared node on the Intel Developer Cloud: Ubuntu 22, Intel(R) Xeon(R) Platinum 8480L, 224 CPUs, 503 GB RAM.

Computing AI-focused operations such as mean and std

Below is code to compute the mean and standard deviation (std) of an array, first by rolling my own Python loops. Here is the naive loop:

S = 0
for i in range(len(a)):    # compute the sum
    S += a[i]
mean = S/len(a)            # compute the mean
std = 0
for i in range(len(a)):    # center the data around the mean
    d = a[i] - mean
    std += d*d
std = np.sqrt(std/len(a))  # compute the std

Below is the NumPy approach. Compare how much more readable and maintainable the NumPy code is. It is also faster on existing hardware and will likely get faster still with future hardware and software library updates.

print(a.mean())  # here a is already a NumPy array, converted once near the top
print(a.std())

Figure 3. Comparison of computing mean and std using a naive loop versus using NumPy. Measured on a shared node on the Intel Developer Cloud: Ubuntu 22, Intel(R) Xeon(R) Platinum 8480L, 224 CPUs, 503 GB RAM.
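One detail to be aware of: by default, ndarray.std computes the population standard deviation (ddof=0), which is exactly what the naive loop above computes. If you need the sample standard deviation instead, pass ddof=1:

print(a.std(ddof=0))  # population std, matches the naive loop above
print(a.std(ddof=1))  # sample std, divides by N-1 instead of N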

How can you determine if the Intel oneMKL library is accelerating your NumPy?

First of all, the stock version of NumPy pre-installed on the Intel Developer Cloud has the Intel(R) oneMKL optimizations built in already. But if you wish to double-check, run this Python command:

np.show_config()

You should see output such as this — revealing the Intel oneAPI libraries referenced by NumPy:

lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/glob/development-tools/versions/oneapi/2023.1.2/oneapi/pytorch/1.13.10.2/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/glob/development-tools/versions/oneapi/2023.1.2/oneapi/pytorch/1.13.10.2/include']

Thousands of such examples could be provided, but my hope is that I have motivated the use of ufuncs sufficiently for you to try them.

Code

The code for this article and the rest of the series is located on GitHub, primarily in the notebook 08_02_NumPy_UFUNCS.ipynb.

Related Articles:

Article 1:

Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library. Explore the reasons why replacing inefficient Python loops with NumPy or PyTorch constructs is a great idea.

Article 2:

Python Loop Replacement: NumPy Optimizations Simple Stuff — ND array creation using NumPy, PyTorch, DPCTL. Explore simple ways of creating, converting, and transforming lists into NumPy ND arrays; a very basic getting-started guide.

Article 3: (Current article)

Introduction to NumPy* Universal functions (ufuncs). How I learned to stop worrying and let smart developers help me.

Article 4:

Replacing Python loops: Aggregations and Reductions. How to replace slow python loops by strategic function equivalents for aggregating data.

Article 5:

Replacing Python loops: Fancy Slicing and Broadcasting. Here I address fancy slicing and broadcasting to take advantage of key optimizations for loop replacement.

Article 6:

Python Loop Replacement: PyTorch & NumPy Optimizations. Not your SQL Select clause — Using Where and Select to vectorize conditional logic.

Article 7:

Loop Replacement Strategies: Applications to Pandas Apply. Examine how to accelerate Pandas Apply.

Article 8:

NumPy Functions Composed. Compare Fast Inverse Square Root Method to NumPy ufuncs, Numba JIT, and Cython — Which One Wins?

Intel Developer Cloud System Configuration as tested:

x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0–223
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8480+
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 56
Socket(s): 2
Stepping: 8
CPU max MHz: 3800.0000
CPU min MHz: 800.0000


Bob Chesebrough

Robert Chesebrough is currently a Solution Architect in the Intel Developer Academy where he teaches others to apply optimizations to data science algorithms.