Replacing Python loops: Aggregations and Reductions

How to replace slow Python loops with strategic function equivalents for aggregating data

Bob Chesebrough
5 min readDec 20, 2023
Image by Dall-E: cartoon boiling flasks connected to glassware “coiled condenser” doing distillation

In a previous article, “Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library”, I address why NumPy constructs, powered by Intel oneAPI, can achieve outsized performance, readability, and maintainability advantages over a typical “roll your own” Python code segment.

In this article I want to address common concepts for aggregating or reducing data dimensions using functions such as sums, means, and other aggregations implemented as Universal Functions in NumPy.

Aggregations/Reductions:

Aggregations or reductions are means of compressing or shrinking the dimensionality of data into a summarized state using mathematical functions.

Some examples would be computing a sum, an average, the median, the standard deviation, or the mode of a column or row of data, as seen in the table below:

Table 1: Examples of common aggregations
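The Table 1 aggregations are one-liners in NumPy. Here is a minimal sketch on a small, made-up 2-D array (4 rows of samples, 3 columns of features), aggregating each column:

```python
import numpy as np

# A small made-up "table": 4 rows (samples) x 3 columns (features)
data = np.array([[1.0,  2.0,  3.0],
                 [4.0,  5.0,  6.0],
                 [7.0,  8.0,  9.0],
                 [10.0, 11.0, 12.0]])

print(data.sum(axis=0))         # column sums    -> [22. 26. 30.]
print(data.mean(axis=0))        # column means   -> [5.5 6.5 7.5]
print(np.median(data, axis=0))  # column medians
print(data.std(axis=0))         # column standard deviations
```

Each call collapses the row dimension and returns one summary value per column; dropping the axis argument summarizes the whole array down to a single scalar.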

But there are more ways than these to aggregate data that are commonly used in data science and statistics.

Other common aggregations include min, max, argmin, and argmax.

It is worthwhile exploring the wide set of methods provided by NumPy ufuncs, which include many of these aggregations.

Below is a table of aggregations commonly used in machine learning:

Table 2: UFuncs commonly used in machine learning.
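The min/max family mentioned above deserves a quick illustration, since argmin and argmax return the *index* of the extreme value rather than the value itself (the array here is made up for illustration):

```python
import numpy as np

a = np.array([3.2, 1.5, 9.7, 0.4, 6.1])

print(a.min(), a.max())        # smallest and largest values: 0.4 9.7
print(a.argmin(), a.argmax())  # their positions in the array: 3 2
```

This index-returning behavior is what makes argmin/argmax so common in machine learning, e.g. picking the class with the highest predicted probability.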

Benefits of replacing “roll your own” Python loops with NumPy Aggregations:

As I explained in “Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library”, NumPy, powered by Intel oneAPI, can achieve higher levels of readability, maintainability, and performance than a straightforward “roll your own” loop, while also future-proofing your code for new hardware and software library optimizations.

I’ll provide an example below of my “roll your own” Python code for computing the mean and standard deviation, since this is done frequently in machine learning, and compare the readability and performance of my code versus a straightforward use of NumPy to achieve the same computation.

import time
import numpy as np

rng = np.random.default_rng(2021)
# default_rng is the recommended method for generating random numbers
# see the blog "Stop using numpy.random.seed()" for the reasoning
# https://towardsdatascience.com/stop-using-numpy-random-seed-581a9972805f

a = rng.random((10_000_000,))
timing = {}  # collect timing information
t1 = time.time()
S = 0
for i in range(len(a)):  # compute the sum
    S += a[i]
mean = S / len(a)  # compute the mean
std = 0
for i in range(len(a)):  # accumulate squared differences from the mean
    d = a[i] - mean
    std += d * d
std = np.sqrt(std / len(a))  # compute the standard deviation
timing['loop'] = time.time() - t1
print("mean", mean)
print("std", std)

print(timing)

Compare the code above to the much more readable and maintainable code below:

t1 = time.time()
print(a.mean())
print(a.std())

timing['numpy'] = time.time() - t1
print(timing)
print(f"Acceleration {timing['loop']/timing['numpy']:4.1f} X")

As you can see, the NumPy code is much more readable (3 lines of code versus 16), more maintainable, and faster too. The timing comparison is as follows: ‘naive loop’: 2.8 seconds, NumPy mean/std: 0.04 seconds. This is a speedup over the naive code by a factor of over 60X; see my article, “Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library”, for the underlying reasons.

Chart 1: speedup of NumPy mean/std code over my naive Python implementation

Flexibility of NumPy Aggregations

Another aspect of many NumPy aggregations/reductions is the ability to specify an axis over which to apply the reducing function. As an example, consider the NumPy sum() function. Without an axis specifier it adds ALL the values in an N-dimensional array to arrive at a scalar value. But if you specify an axis — in the example below I specify axis=0 — it returns a row vector that sums each column of the matrix.

Example 1: Summing on axis = 0

Alternatively, if I specify axis=1, I get a column vector whose entries are the sums of each row.

Example 2: Summing on axis = 1

Flexibility is provided by parameters on each aggregation function that accommodate variations in how the aggregation should be done, as we see in the sum example above.
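The axis behavior described above can be sketched in a few lines with a tiny matrix (the array here is made up for illustration):

```python
import numpy as np

m = np.arange(6).reshape(2, 3)  # [[0 1 2],
                                #  [3 4 5]]

print(m.sum())         # all elements   -> 15
print(m.sum(axis=0))   # down each column -> [3 5 7]
print(m.sum(axis=1))   # across each row  -> [3 12]
```

A handy mnemonic: the axis you pass is the axis that gets collapsed, so axis=0 collapses the rows (leaving one value per column) and axis=1 collapses the columns (leaving one value per row).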

Take the opportunity to play with replacing loop bound aggregations in your own code with NumPy aggregation functions instead.

For a sandbox to play in, register for a free account on the Intel Developer Cloud (cloud.intel.com), sign in, and play by clicking on the icon in the lower left:

Then launch JupyterLab on the shared access node via the icon on the right — see below:

Code

The code for this article and the rest of the series is located on github. For this article experiment with the file: 08_03_NumPy_Aggregations.ipynb

Related Articles:

Article 1:

Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library. Explore the reasons why replacing inefficient Python loops with NumPy or PyTorch constructs is a great idea.

Article 2:

Python Loop Replacement: NumPy Optimizations Simple Stuff — ND array creation using NumPy, PyTorch, DPCTL. Explore simple ways of creating, converting, and transforming Lists into NumPy NDarrays — a very basic getting started.

Article 3:

Introduction to NumPy* Universal functions (ufuncs). How I learned to stop worrying and let smart developers help me.

Article 4: (Current Article)

Replacing Python loops: Aggregations and Reductions. How to replace slow python loops by strategic function equivalents for aggregating data.

Article 5:

Replacing Python loops: Fancy Slicing and Broadcasting. Here I address fancy slicing and broadcasting to take advantage of key optimizations for loop replacement.

Article 6:

Python Loop Replacement: PyTorch & NumPy Optimizations. Not your SQL Select clause — Using Where and Select to vectorize conditional logic.

Article 7:

Loop Replacement Strategies: Applications to Pandas Apply. Examine how to accelerate Pandas Apply.

Article 8:

NumPy Functions Composed. Compare Fast Inverse Square Root Method to NumPy ufuncs, Numba JIT, and Cython — Which One Wins?

Intel Developer Cloud System Configuration as tested:

x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0–223
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8480+
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 56
Socket(s): 2
Stepping: 8
CPU max MHz: 3800.0000
CPU min MHz: 800.0000

--

--

Bob Chesebrough

Robert Chesebrough is currently a Solution Architect in the Intel Developer Academy where he teaches others to apply optimizations to data science algorithms.