High performance boolean indexing in Numpy and Pandas
In my hobby-ism with data science for the past few years, I’ve come to learn that there are many roads to the same destination. When you’re working with a small dataset, the road you follow doesn’t really matter, but when datasets go upwards in the gigabyte-terabyte range, speed becomes mission critical.
In Python, Numpy has made data manipulation really fast and easy using vectorization, and the drag caused by for loops have become a thing of the past. Boolean indexing (called Boolean Array Indexing in Numpy.org) allows us to create a mask of True/False values, and apply this mask directly to an array. Let’s look at a quick example
Creating the sample data
import numpy as np
np.random.seed(419)
x = np.random.randint(1,10, (5,16))
print(x)''' Output
array(
[
[3, 9, 3, 8, 5, 7, 4, 4, 9, 2, 7, 6, 7, 3, 3, 1],
[1, 9, 7, 3, 2, 8, 2, 1, 5, 1, 7, 7, 8, 2, 9, 3],
[3, 5, 6, 9, 7, 4, 6, 5, 2, 4, 2, 7, 2, 5, 4, 6],
[1, 9, 7, 7, 8, 7, 5, 5, 3, 7, 3, 2, 3, 7, 9, 8],
[5, 4, 7, 6, 8, 4, 1, 8, 5, 2, 7, 5, 3, 8, 1, 1]
])
'''
The above code generates a 5 x 16 array of random integers between 1 (inclusive) and 10 (exclusive).
Boolean masking
A boolean mask allows us to check for the truthiness/falseness of values within the array, for example, the below code tells us that only the last item in the first row (index 0) is not greater than 1
x[0] > 1
''' Output
array([ True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False])
'''
We can also extend the indexing to row/column selection, so that if we want to check if each value in ALL (represented by :) rows in the column with index 5 is equal to 8, we write
x[:,5] == 8
''' Output
array([False, True, False, False, False])
'''
The above True/False array is called a BOOLEAN MASK
Furthermore, we can return all values where the boolean mask is True, by mapping the mask to the array.
For example, to return the row where the boolean mask (x[:,5] == 8) is True, we use
x[x[:,5]==8]
''' Output
array([[1, 9, 7, 3, 2, 8, 2, 1, 5, 1, 7, 7, 8, 2, 9, 3]])
'''
And to return the 15th-indexed column item using this mask, we use
x[x[:,5]==8, 15]
''' Output
array([3])
'''
Boolean Indexing Assignment
We can change the value of items of an array that match a specific boolean mask too. For example, to change the value of all items that match the boolean mask (x[:5] == 8) to 0, we simply apply the mask to the array like so
x[x[:,5]==8] = 0
print(x)
''' Output
array([[3, 9, 3, 8, 5, 7, 4, 4, 9, 2, 7, 6, 7, 3, 3, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[3, 5, 6, 9, 7, 4, 6, 5, 2, 4, 2, 7, 2, 5, 4, 6],
[1, 9, 7, 7, 8, 7, 5, 5, 3, 7, 3, 2, 3, 7, 9, 8],
[5, 4, 7, 6, 8, 4, 1, 8, 5, 2, 7, 5, 3, 8, 1, 1]])
'''
And to change the value in column index 15 using the same approach, we use (note that I had to ‘recreate the original x array before doing the below):
x[x[:,5]==8, 15] = 0
print(x)''' Output
array([[3, 9, 3, 8, 5, 7, 4, 4, 9, 2, 7, 6, 7, 3, 3, 1],
[1, 9, 7, 3, 2, 8, 2, 1, 5, 1, 7, 7, 8, 2, 9, 0],
[3, 5, 6, 9, 7, 4, 6, 5, 2, 4, 2, 7, 2, 5, 4, 6],
[1, 9, 7, 7, 8, 7, 5, 5, 3, 7, 3, 2, 3, 7, 9, 8],
[5, 4, 7, 6, 8, 4, 1, 8, 5, 2, 7, 5, 3, 8, 1, 1]])
'''
So to perform a boolean assignment of this nature, we simply
- apply the mask
- filter the array
- perform the assignment
A slightly different approach
But then, what if we could do this same boolean indexing assignment using another approach, and I’ll show you in a moment…
To do the exact same thing we have done above, what if we reversed the order of operations by:
- filtering the array
- applying the mask
- performing the assignment
Can it work? Would it be any faster?
Filtering the array is quite simple, we can get the 15th indexed column from the array by
x[:,15]
And then we apply the mask to the column
x[:,15][x[:,5]==8]
And then we perform the assignment
x[:,15][x[:,5]==8] = 0
print(x)
''' Output
array([[3, 9, 3, 8, 5, 7, 4, 4, 9, 2, 7, 6, 7, 3, 3, 1],
[1, 9, 7, 3, 2, 8, 2, 1, 5, 1, 7, 7, 8, 2, 9, 0],
[3, 5, 6, 9, 7, 4, 6, 5, 2, 4, 2, 7, 2, 5, 4, 6],
[1, 9, 7, 7, 8, 7, 5, 5, 3, 7, 3, 2, 3, 7, 9, 8],
[5, 4, 7, 6, 8, 4, 1, 8, 5, 2, 7, 5, 3, 8, 1, 1]])
'''
Looks like it works as well!
So, which is faster? The first approach, or this latest approach?
Testing with timeit
There are more efficient ways to test execution speed, but let’s use timeit for simplicity.
The timeit module allows us to pass a complete codeblock as a string, and it computes by default, the time taken to run the block 1 million times
For our first approach
import timeit
s = '''
import numpy as np
np.random.seed(419)
x = np.random.randint(1,10, (5,16))
x[x[:,5]==8, 15] = 0
'''timeit.timeit(s)
'''
74.233470999985
'''
And for the second
import timeit
s = '''
import numpy as np
np.random.seed(419)
x = np.random.randint(1,10, (5,16))
x[:,15][x[:,5]==8] = 0
'''timeit.timeit(s)
'''
67.96598569999333
'''
Looks like the second method is faster than the first
Conclusion
This is by no means a conclusive study of efficiency of data manipulation, so if you have any comments, additions, or even more efficient ways of item assignment in numpy, please leave a comment below, it is really appreciated!!!
Further reading (SQL)
How filtered indexes could be a more powerful feature (Aaron Bertrand): https://sqlperformance.com/2013/04/t-sql-queries/filtered-indexes
Partial Indexes (Data School): https://dataschool.com/sql-optimization/partial-indexes/