Loop Replacement Strategies: Applications to Pandas Apply

Bob Chesebrough
7 min read · Dec 21, 2023


Image: cartoon of a speeding pandas bear, edited by author

In a previous article, “Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library”, I addressed why NumPy constructs, powered by Intel oneAPI, can achieve outsized performance, readability, and maintainability advantages over a typical “roll your own” Python code segment.

In this article I want to apply the NumPy select tricks we learned in a previous article to accelerate a Pandas apply() statement hindered by conditional logic.

I will address the following:

  • Apply where or select in NumPy, powered by Intel oneAPI, to dramatically speed up certain common Pandas bottlenecks
  • Achieve good performance using numpy.select on a Pandas data frame
  • Achieve even better performance by converting the data frame to NumPy arrays

It is important to know how to speed up Pandas natively via its dependence on NumPy. Pandas is powered by oneAPI via NumPy!

When the opportunity arises, it is often highly profitable to leverage the NumPy way of solving a Pandas apply() performance issue. Given the size of many data frames, it is often better to find a way to apply NumPy instead.

While not yet a part of this course, AI Tools from Intel (formerly referred to as the Intel® AI Analytics Toolkit) has a component called Modin*, a drop-in replacement for Pandas that can dramatically speed up Pandas operations. Modin can be used for problems larger than can fit in your laptop's memory, for example, and can distribute computations across a cluster of nodes. Our aim is to include Modin as a component of training in the future.

There are a number of excellent references regarding speeding up NumPy, or more specifically Pandas using NumPy, and I encourage you to review these resources.


I will be experimenting with a contrived example of weird conditional logic applied to columns of a Pandas data frame, with an expensive log function thrown in for good measure.

import numpy as np

def my_function(x):
    return np.log(1 + x)

def func(a, b, c, d, e):
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 7):
        return my_function(c + d)
    elif e < 7:
        return my_function(a + b + 100)
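The timing cells below rely on a data frame df with numeric columns a through e and a timing dictionary, neither of which the article shows being built. Here is a minimal sketch of that setup; the column contents and row count are my assumptions, chosen so that every branch of func can fire:

```python
import time
import numpy as np
import pandas as pd

# Assumed setup (not shown in the article): random numeric columns a-d,
# and integer column 'e' in 0..10 so all three branches of func occur.
rng = np.random.default_rng(42)
N = 1_000_000
df = pd.DataFrame({
    'a': rng.random(N),
    'b': rng.random(N),
    'c': rng.random(N),
    'd': rng.random(N),
    'e': rng.integers(0, 11, N),
})
timing = {}  # label -> elapsed seconds, filled in by each experiment below
```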

Now we use our old friend Pandas apply to establish a timing baseline.

%%time
# naive baseline: Pandas apply, row by row
import time
import numpy as np

# each iteration of the loop requires an interpretation of the instructions being used, and this decoding takes time

t1 = time.time()

df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)

t2 = time.time()
print("time : {:5.2f}".format(t2 - t1))
baseTime = t2 - t1
timing['Pandas Apply'] = t2 - t1
df.head()

output:
time : 8.37
CPU times: user 8.36 s, sys: 15.9 ms, total: 8.37 s
Wall time: 8.38 s

Hmmm, I thought it would be faster. It vectorizes, right?

Oh — conditional logic can hamper vectorization.

Can I do something about it?

Maybe you read one time about a trick called masking. We will turn the conditional logic into boolean masks for our data frame and use a different mask for each condition.

# masked approach
t1 = time.time()
df['new'] = df['c'] * df['d'] # default case: e == 10
mask = (df['e'] < 10) & (df['e'] >= 7)
df.loc[mask,'new'] = (df['c'] + df['d']).apply(lambda x : my_function(x))
mask = df['e'] < 7
df.loc[mask,'new'] = (df['a'] + df['b']).apply(lambda x : my_function(x + 100))
t2 = time.time()
print("time :", t2-t1)
fastest_time = t2-t1
Speedup = baseTime / fastest_time
print("Speed up: {:4.0f} X".format(Speedup))
timing['unrolled with masks on df'] = t2 - t1
df.head()

output:
time : 1.6978118419647217
Speed up: 5 X

WOW! Masking to the rescue!

Still — I wonder if I could do better?

I watched a cool video on YouTube by a guy introducing me to the NumPy select function. He had so many great tips, but off the top of my head I can remember the select trick.

If you want to get serious about speeding up your Python, check these two references out!

Seriously: look up Nathan Cheever's talk 1000x faster data manipulation: vectorizing with Pandas and Numpy. His advice is prescient!

While you are at it, look up Jake VanderPlas's talk Losing Your Loops: Fast Numerical Computing with NumPy. I also recommend that you buy his book Python Data Science Handbook.

Let’s try the NumPy select function.

Notice that it cleans the code up a lot!

  1. You create a list of boolean arrays containing your conditions.
  2. You create another list of arrays containing the operations you wish to apply.
  3. You call np.select(condlist, choicelist, default=0).
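As a minimal standalone illustration of the three steps above, using toy arrays rather than the article's data frame:

```python
import numpy as np

x = np.array([3, 8, 10, 5])
condlist = [(x < 10) & (x >= 7), x < 7]            # step 1: boolean condition arrays
choicelist = [x * 2, x + 100]                       # step 2: matching choice arrays
out = np.select(condlist, choicelist, default=-1)   # step 3: first true condition wins
print(out)  # [103  16  -1 105]
```

Each element takes the value from the first condition that is true for it; elements matching no condition (here, 10) get the default.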

# np.select(condlist, choicelist, default=0)
t1 = time.time()
condition = [ (df['e'] < 10) & (df['e'] >= 7),
              (df['e'] < 7) ]
choice = [ (df['c'] + df['d']).apply(lambda x: my_function(x)),
           (df['a'] + df['b']).apply(lambda x: my_function(x + 100)) ]
default = (df['c'] * df['d'])
df['new'] = np.select(condition, choice, default=default)
t2 = time.time()
print("time :", t2 - t1)
timing['Numpy Select on Pandas df'] = t2 - t1
df.head()

output:
time : 1.684462308883667

Using numpy.select with Pandas data frame operations yields a speedup of about 5X!

Not bad!

But I am still using numpy.select and applying it to Pandas data frames.

Could we speed it up more if we drop Pandas and go completely with NumPy? Maybe try masking; it is usually faster.


# Convert Pandas to NumPy entirely
t1 = time.time()
npArr = df.to_numpy()  # convert to NumPy
idx = {}  # initialize a column-name-to-index dictionary
for index, value in enumerate(df.columns):
    idx[value] = index
df.loc[:, 'new'] = npArr[:, idx['c']] * npArr[:, idx['d']]  # default case: e == 10
mask = (npArr[:, idx['e']] < 10) & (npArr[:, idx['e']] >= 7)
df.loc[mask, 'new'] = my_function(npArr[mask, idx['c']] + npArr[mask, idx['d']])
mask = npArr[:, idx['e']] < 7
df.loc[mask, 'new'] = my_function(npArr[mask, idx['a']] + npArr[mask, idx['b']] + 100)
t2 = time.time()
print("time :", t2 - t1)
fastest_time = t2 - t1
Speedup = baseTime / fastest_time
print("Speed up: {:4.0f} X".format(Speedup))
timing['unrolled with masks on NumPy'] = t2 - t1
df.head()

output:
time : 0.06982827186584473
Speed up: 120 X

WOW!! Now we are talking — something over 120X speedup!

Code looks a little messy though.

How about if we try the numpy.select trick again?

# np.select(condlist, choicelist, default=0)
# Convert Pandas to NumPy entirely
t1 = time.time()
npArr = df.to_numpy()  # convert to NumPy
# idx maps column names to array indices (built in the previous cell)

condition = [ (npArr[:, idx['e']] < 10) & (npArr[:, idx['e']] >= 7),
              (npArr[:, idx['e']] < 7) ]

choice = [ my_function(npArr[:, idx['c']] + npArr[:, idx['d']]),
           my_function(npArr[:, idx['a']] + npArr[:, idx['b']] + 100) ]

tmp = np.select(condition, choice, default=(npArr[:, idx['c']] * npArr[:, idx['d']]))

df.loc[:, 'new'] = tmp
t2 = time.time()

print("time :", t2 - t1)

fastest_time = t2 - t1
Speedup = baseTime / fastest_time
print("Speed up: {:4.0f} X".format(Speedup))
timing['Numpy Select Pure'] = t2 - t1
df.head()

output:
time : 0.041852712631225586
Speed up: 200 X

200X speedup! Wow!

Plot the results

Chart: Speedups plotted using code referenced below on Xeon system referenced below
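The plotting code itself is not shown in the article; here is a minimal sketch that charts speedups from a timing dictionary. The hard-coded seconds are placeholders copied from the outputs above, and the shortened labels are my own, so substitute your measured timing dictionary:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Placeholder timings (seconds) taken from the outputs above.
timing = {
    'Pandas Apply': 8.37,
    'masks on df': 1.70,
    'np.select on df': 1.68,
    'masks on NumPy': 0.070,
    'np.select on NumPy': 0.042,
}
base = timing['Pandas Apply']
speedups = {label: base / t for label, t in timing.items()}

plt.figure(figsize=(10, 4))
plt.bar(speedups.keys(), speedups.values())
plt.ylabel('Speedup over Pandas apply (X)')
plt.xticks(rotation=20, ha='right')
plt.tight_layout()
plt.savefig('speedups.png')
```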

WOW!!! Hundreds of times faster than Pandas apply AND the code is cleaner!

SHIP IT!

Play with these concepts on the Intel Developer Cloud:

Take the opportunity to play with replacing loop bound aggregations in your own code with NumPy aggregation functions instead.
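As a small illustration of that exercise, here is a Python loop sum next to the equivalent NumPy aggregation, on toy data:

```python
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

# loop-bound aggregation: every iteration is interpreted by Python
total = 0.0
for v in data:
    total += v

# NumPy aggregation: one call into optimized, compiled code
total_np = data.sum()

print(total == total_np)  # True: both equal 499999500000.0
```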

For a sandbox to play in, register for a free account on the Intel Developer Cloud (cloud.intel.com), sign in, and play by clicking on the icon in the lower left:

Image: Screenshot from cloud.intel.com

Then launch JupyterLab on the shared access node using the icon on the right (see below):

Image: Screenshot from cloud.intel.com

Code

The code for this article and the rest of the series is located on GitHub. For this article, experiment with the file 08_05_NumPy_Where_Select.ipynb.

Related Articles:

Article 1:

Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library. Explore the reasons why replacing inefficient Python loops with NumPy or PyTorch constructs is a great idea.

Article 2:

Python Loop Replacement: NumPy Optimizations Simple Stuff — ND array creation using NumPy, PyTorch, DPCTL. Explore simple ways of creating, converting, and transforming lists into NumPy ndarrays — a very basic getting started.

Article 3:

Introduction to NumPy* Universal functions (ufuncs). How I learned to stop worrying and let smart developers help me.

Article 4:

Replacing Python loops: Aggregations and Reductions. How to replace slow Python loops with strategic function equivalents for aggregating data.

Article 5:

Replacing Python loops: Fancy Slicing and Broadcasting. Here I address fancy slicing and broadcasting to take advantage of key optimizations for loop replacement.

Article 6:

Python Loop Replacement: PyTorch & NumPy Optimizations. Not your SQL Select clause — Using Where and Select to vectorize conditional logic.

Article 7: (current article)

Loop Replacement Strategies: Applications to Pandas Apply. Examine how to accelerate a Pandas apply statement containing conditional logic.

Article 8:

NumPy Functions Composed. Compare Fast Inverse Square Root Method to NumPy ufuncs, Numba JIT, and Cython — Which One Wins?

Intel Developer Cloud System Configuration as tested:

x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0–223
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8480+
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 56
Socket(s): 2
Stepping: 8
CPU max MHz: 3800.0000
CPU min MHz: 800.0000


Bob Chesebrough

Robert Chesebrough is currently a Solution Architect in the Intel Developer Academy where he teaches others to apply optimizations to data science algorithms.