High performance boolean indexing in Numpy and Pandas

Published in

The Startup

5 min readMay 6, 2020

In my hobby-ism with data science for the past few years, I’ve come to learn that there are many roads to the same destination. When you’re working with a small dataset, the road you follow doesn’t really matter, but when datasets go upwards in the gigabyte-terabyte range, speed becomes mission critical.

In Python, Numpy has made data manipulation really fast and easy using vectorization, and the drag caused by for loops have become a thing of the past. Boolean indexing (called Boolean Array Indexing in Numpy.org) allows us to create a mask of True/False values, and apply this mask directly to an array. Let’s look at a quick example

Creating the sample data

import numpy as np
np.random.seed(419)
x = np.random.randint(1,10, (5,16))
print(x)''' Output
array(
[
    [3, 9, 3, 8, 5, 7, 4, 4, 9, 2, 7, 6, 7, 3, 3, 1],
    [1, 9, 7, 3, 2, 8, 2, 1, 5, 1, 7, 7, 8, 2, 9, 3],
    [3, 5, 6, 9, 7, 4, 6, 5, 2, 4, 2, 7, 2, 5, 4, 6],
    [1, 9, 7, 7, 8, 7, 5, 5, 3, 7, 3, 2, 3, 7, 9, 8],
    [5, 4, 7, 6, 8, 4, 1, 8, 5, 2, 7, 5, 3, 8, 1, 1]
])
'''

The above code generates a 5 x 16 array of random integers between 1 (inclusive) and 10 (exclusive).

Boolean masking

A boolean mask allows us to check for the truthiness/falseness of values within the array, for example, the below code tells us that only the last item in the first row (index 0) is not greater than 1

x[0] > 1
''' Output
array([ True,  True,  True,  True,  True,  True,  True,  True,  True, True,  True,  True,  True,  True,  True, False])
'''

We can also extend the indexing to row/column selection, so that if we want to check if each value in ALL (represented by :) rows in the column with index 5 is equal to 8, we write

x[:,5] == 8
''' Output
array([False,  True, False, False, False])
'''

only the second item in the 6th row (index = 5) equals 8

The above True/False array is called a BOOLEAN MASK

Furthermore, we can return all values where the boolean mask is True, by mapping the mask to the array.

For example, to return the row where the boolean mask (x[:,5] == 8) is True, we use

x[x[:,5]==8]
''' Output
array([[1, 9, 7, 3, 2, 8, 2, 1, 5, 1, 7, 7, 8, 2, 9, 3]])
'''

And to return the 15th-indexed column item using this mask, we use

x[x[:,5]==8, 15]
''' Output
array([3])
'''

returning item in column index 15 where boolean mask is true

Boolean Indexing Assignment

We can change the value of items of an array that match a specific boolean mask too. For example, to change the value of all items that match the boolean mask (x[:5] == 8) to 0, we simply apply the mask to the array like so

x[x[:,5]==8] = 0
print(x)
''' Output
array([[3, 9, 3, 8, 5, 7, 4, 4, 9, 2, 7, 6, 7, 3, 3, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [3, 5, 6, 9, 7, 4, 6, 5, 2, 4, 2, 7, 2, 5, 4, 6],
       [1, 9, 7, 7, 8, 7, 5, 5, 3, 7, 3, 2, 3, 7, 9, 8],
       [5, 4, 7, 6, 8, 4, 1, 8, 5, 2, 7, 5, 3, 8, 1, 1]])
'''

And to change the value in column index 15 using the same approach, we use (note that I had to ‘recreate the original x array before doing the below):

x[x[:,5]==8, 15] = 0
print(x)''' Output
array([[3, 9, 3, 8, 5, 7, 4, 4, 9, 2, 7, 6, 7, 3, 3, 1],
       [1, 9, 7, 3, 2, 8, 2, 1, 5, 1, 7, 7, 8, 2, 9, 0],
       [3, 5, 6, 9, 7, 4, 6, 5, 2, 4, 2, 7, 2, 5, 4, 6],
       [1, 9, 7, 7, 8, 7, 5, 5, 3, 7, 3, 2, 3, 7, 9, 8],
       [5, 4, 7, 6, 8, 4, 1, 8, 5, 2, 7, 5, 3, 8, 1, 1]])
'''

So to perform a boolean assignment of this nature, we simply

apply the mask
filter the array
perform the assignment

A slightly different approach

But then, what if we could do this same boolean indexing assignment using another approach, and I’ll show you in a moment…

To do the exact same thing we have done above, what if we reversed the order of operations by:

filtering the array
applying the mask
performing the assignment

Can it work? Would it be any faster?

Filtering the array is quite simple, we can get the 15th indexed column from the array by

x[:,15]

And then we apply the mask to the column

x[:,15][x[:,5]==8]

And then we perform the assignment

x[:,15][x[:,5]==8] = 0
print(x)
''' Output
array([[3, 9, 3, 8, 5, 7, 4, 4, 9, 2, 7, 6, 7, 3, 3, 1],
       [1, 9, 7, 3, 2, 8, 2, 1, 5, 1, 7, 7, 8, 2, 9, 0],
       [3, 5, 6, 9, 7, 4, 6, 5, 2, 4, 2, 7, 2, 5, 4, 6],
       [1, 9, 7, 7, 8, 7, 5, 5, 3, 7, 3, 2, 3, 7, 9, 8],
       [5, 4, 7, 6, 8, 4, 1, 8, 5, 2, 7, 5, 3, 8, 1, 1]])
'''

Looks like it works as well!

So, which is faster? The first approach, or this latest approach?

Testing with timeit

There are more efficient ways to test execution speed, but let’s use timeit for simplicity.

The timeit module allows us to pass a complete codeblock as a string, and it computes by default, the time taken to run the block 1 million times

For our first approach

import timeit
s = '''
import numpy as np
np.random.seed(419)
x = np.random.randint(1,10, (5,16))
x[x[:,5]==8, 15] = 0
'''timeit.timeit(s)
'''
74.233470999985
'''

And for the second

import timeit
s = '''
import numpy as np
np.random.seed(419)
x = np.random.randint(1,10, (5,16))
x[:,15][x[:,5]==8] = 0
'''timeit.timeit(s)
'''
67.96598569999333
'''

Looks like the second method is faster than the first

Conclusion

This is by no means a conclusive study of efficiency of data manipulation, so if you have any comments, additions, or even more efficient ways of item assignment in numpy, please leave a comment below, it is really appreciated!!!