Python Loop Replacement: NumPy & PyTorch Optimizations and other tensor topics

Bob Chesebrough
7 min readDec 7, 2023

--

Simple Stuff — ND array creation using NumPy, PyTorch, Intel DPCTL

Using Python for loops for numerical work is like running in chest deep water — Image: by author: Caption used: “the persons head chest and arms should be visible but the legs and waist should be hidden in the green water” — subsequently used photoshop to push person behind the waves

In a previous article, Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library, I assumed that readers were familiar with moving into and out of NumPy arrays and were familiar with various ways of converting, creating, transforming NumPy arrays to and from other forms such as lists, pytorch tensors, dpctl tensors and so on.

In this article I want to address these topics head on and show how and why you might want to convert to and from NumPy nd arrays.

To start, N Dimensional arrays or ND arrays are a gateway into fast optimizations provided under the under the hood via oneMKL, through NumPy ufuncs, aggregations, broadcasting, fancy slicing, sorting and so on. If you are not using these methods with you NumPy arrays or equivalent PyTorch ones your progress is limited much like that of a person “running” in chest deep water. My goal is to show you how to get dry with NumPy arrays in preparation for running with NumPy ufuncs, aggregations and the like in the rest of these articles.

ND array from a python list

Below is a quick way to populate a ND array with a series of numbers from a pre-populated python list. It is likely that much of you legacy code is repeat with python lists. Here I assume you wish to preserve the random values generated by Python Random Uniform for testing purposes. Here is a way to convert them.

import random
N = 100_000_000 # Number of records to process
random.seed(42)
L = [random.uniform(0,100) for _ in range(N)]

a = np.array(L)


output: [63.942679845788376, 2.5010755222666936, 27.502931836911927,
22.321073814882276,…]

I urge you to strive for converting python lists early in the code and use the NumPy array through the rest of your code.

ND random array from scratch

Observe that the whole process of creating a list and then converting it can be a a bit slow. How can I create a random ND array of the right size immediately?

import numpy as np
np.random.seed(42)
a = np.random.uniform(low=0, high=100, size=(N,))

output: array([37.45401188, 95.07143064, 73.19939418, …, 37.52065796,
59.14032368, 35.31581811])

Notice that even though I specified the same seed value, 42, for random as well as np.random, I get different values assigned!

Create a zero filled ND array of a specified size

One of the key reasons NumPy is faster than numerical approaches in Python is that a data type is assigned at array creation time, and that data type can be specified: float, int32, or what have you.

np.empty([N, 2])
# or
np.ndarray(shape=(N,2), dtype=float)
# or
np.zeros([N, 2])

output: array([[0., 0.],
[0., 0.],
[0., 0.],
…,
[0., 0.],
[0., 0.],
[0., 0.]])

Create a filled ND array of a specified size

Similar to the above, you can create an array and fill it with ones or by extensions fill it easily with any constant. Also, you can create an “Identity matrix” very easily.

np.ones([N, 2])

output: array([[1., 1.],
[1., 1.],
[1., 1.],
…,
[1., 1.],
[1., 1.],
[1., 1.]])

np.ones([N,2])*np.pi

output: array([[3.14159265, 3.14159265],
[3.14159265, 3.14159265],
[3.14159265, 3.14159265],
…,
[3.14159265, 3.14159265],
[3.14159265, 3.14159265],
[3.14159265, 3.14159265]])

np.eye(100)

output: array([[1., 0., 0., …, 0., 0., 0.],
[0., 1., 0., …, 0., 0., 0.],
[0., 0., 1., …, 0., 0., 0.],
…,
[0., 0., 0., …, 1., 0., 0.],
[0., 0., 0., …, 0., 1., 0.],
[0., 0., 0., …, 0., 0., 1.]])

Fill an array with a sequence of values

Now we use the np.linspace to fill a vector with a sequence of values.

np.linspace(0, 10, num=10 + 1) # create a list os floats 0.0, 1.0, 2.0 …

output: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])

# prepare sine wave, one period, 13 points
N =12
np.sin(2* np.pi * np.linspace(0, N, num=N + 1)/N)

output: array([ 0.00000000e+00, 5.00000000e-01, 8.66025404e-01, 1.00000000e+00,
8.66025404e-01, 5.00000000e-01, 1.22464680e-16, -5.00000000e-01,
-8.66025404e-01, -1.00000000e+00, -8.66025404e-01, -5.00000000e-01,
-2.44929360e-16])

# prepare a vector of values sampled within a given function
def f(x):
return 2*x**2 -5*x
N = 100_000_000
f( np.linspace(0, N, num=2*N + 1))

Replace loops with NumPy equivalent functions

Beware — replace the functionality of the **loop** not the functionality of the functions within the loop.

Replace the following loop, creating a vector of sin values of a range of 10,000 elements.

# Simple minded brute force loop method
a = []
for i in range(10_000):
a.append(a, np.sin(i))

# SILLY — just replace the python append with NumPy append: BAD IDEA!
for i in range(10_000): # oops did NOT replace the loop!
a = np.append(a, np.sin(i)) #SLOWER than code above

# Best: replace the entire loop!
t = np.linspace(0, 10_000, num=10_000 + 1)
a= np.sin(t) # more readible, MANY times faster

Reshape and transpose arrays

Reshaping and transposing a matrix can be applied without using loops

L = [1,2,3,4,5,6,7,8] # original list
print(“Original List\n”,L)
a = np.array(L).reshape((4,2)) # transformed into 4x2 ND array
print(“reshape array tp have 4 rows, 2 columns\n”,a) # 4 rows x 2 columns
print(“Compute Transpose\n”,a.T) # 2 columns x 4 rows

Expanding dimensions and squeezing dimensions helps to make the arrays multiply or add together properly.

np.array(L).shape
print(“Dimensions of a\n”, np.array(L).shape) # dimensions of a
print(“Expand dimensions\n”, np.expand_dims(np.array(L), axis = 1).shape) #epand dimensions
print(“squeeze empty dimensions\n”, np.expand_dims(np.array(L), axis = 1).squeeze().shape) #squeeze dimensions

Convert to and from PyTorch tensor

a = [1, 2, 3]
b = torch.Tensor(a) # better for neural nets

output: tensor([1., 2., 3.])



a = [1, 2, 3]
b = torch.FloatTensor(a) # better for neural nets

output: tensor([1., 2., 3.])


a = [1, 2, 3]
b = torch.IntTensor(a)

output: tensor([1, 2, 3], dtype=torch.int32)


a = [1, 2, 3]
b = torch.Tensor(a).to(torch.int32)

output: tensor([1, 2, 3], dtype=torch.int32)

a = [1, 2, 3]
b = torch.Tensor(a).to(torch.half)

outputL tensor([1., 2., 3.], dtype=torch.float16)


a = [1, 2, 3]
b = torch.Tensor(a).to(torch.bfloat16) # Best used on HW that supports it natively

tensor([1., 2., 3.], dtype=torch.bfloat16)

Convert a PyTorch tensor on an XPU to a NumPy ND array on the CPU

x_np = x.detach().cpu() # detach gets rid of autograd, # cpu specifies destiantion

Convert to and from Intel dpctl tensor

The dpctl library is a library from Intel used to create a dpctl tensor that can be used with with Data Parallel Python*. Use this for heterogeneous programming using Data Parallel Extension for Numba* for AI and HPC and compute “follows data” paradigm with certain optimized algorithms in Intel(r) Extension for Scikit-learn accelerate these algorithms on Intel XPUs.

The dptctl conversion to and from NumPy (or from a python list) is easy.

 import dpctl
X = np.array([[1, 2], [1, 4], [1, 0],
[10, 2], [10, 4], [10, 0]])
device = dpctl.select_default_device() # set device to XPU if one is available
x_device = dpctl.tensor.asarray(X, device = device) # data now on XPU in variable X_device
#calculation done on x_device are done on XPU

To convert dpctl tensor back to ND array is simple

#… calulations done with Intel Extensions for scikit-learn 
# assume function returns dpctl tensor called predictedGPU
dpctl.tensor.to_numpy(predictedGPU) # converts tensor to NumPy ND array

I spent a bit of time covering the basics of NumPy arrays, how to create them and manipulate them, convert them to and from PyTorch and dpctl tensors for use in Data Parallel Extension for Numba* and Intel Extensions for scikit-learn.

In upcoming articles, I will be exploring NumPy Universal Functions, Aggregations, Broadcasting and fancy slicing, NumPy Where/Select (these are NOT the same as sql Where and SELECT), applications of vectorizing Apply statements in Pandas, and more… including an article on comparing the Newton Raphson Fast reciprocal Square root (of Quake III) fame with NumPy and PyTorch built ins — you may be surprised at the results.

Code

The code for this article and the rest of the series is located on github

Related Articles:

Article 1:

Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library. Explore the reasons why replacing inefficient Python loops with NumPy or PyTorch constructs is a great idea.

Article 2: (Current article)

Python Loop Replacement: NumPy Optimizations Simple Stuff — ND array creation using NumPy, PyTorch, DPCTL. Explore simple ways of creating , converting and transforming Lists into NumPy NDarrays — a very basic getting started.

Article 3:

Introduction to NumPy* Universal functions (ufuncs). How I learned to stop worrying and let smart developers help me.

Article 4:

Replacing Python loops: Aggregations and Reductions. How to replace slow python loops by strategic function equivalents for aggregating data.

Article 5:

Replacing Python loops: Fancy Slicing and Broadcasting. Here I address fancy slicing and broadcasting to take advantage of key optimizations for loop replacement.

Article 6:

Python Loop Replacement: PyTorch & NumPy Optimizations. Not your SQL Select clause — Using Where and Select to vectorize conditional logic.

Article 7:

Loop Replacement Strategies: Applications to Pandas Apply. Examine how to accelerate Pandas Apply s

Article 8:

NumPy Functions Composed. Compare Fast Inverse Square Root Method to NumPy ufuncs, Numba JIT, and Cython — Which One Wins?

Intel Developer Cloud System Configuration as tested:

x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0–223
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8480+
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 56
Socket(s): 2
Stepping: 8
CPU max MHz: 3800.0000
CPU min MHz: 800.0000

--

--

Bob Chesebrough

Robert Chesebrough is currently a Solution Architect in the Intel Developer Academy where he teaches others to apply optimizations to data science algorithms.