Accelerate Numerical Calculations in NumPy With Intel oneAPI Math Kernel Library

Bob Chesebrough
7 min read · Jul 28, 2023


Image by rizz13 on Freepik: https://www.freepik.com/free-vector/speedometer-background-design_1075143.htm#query=speed%20dial&position=6&from_view=keyword&track=ais

In this article, I will describe why some kinds of numerically intensive tasks in Python run very slowly, and I will explore steps to dramatically accelerate numerical calculations using the NumPy* package. But the magic is not simply using the NumPy ndarray; that is not really where the dramatic acceleration comes from. It comes from replacing Python loops with single NumPy function calls built on top of the Intel® oneAPI Math Kernel Library (oneMKL). These changes streamline the code and can make it:

1) Simpler,
2) More readable and maintainable,
3) Run faster on existing hardware, and
4) Ready for future hardware advances.

Why I am a big fan of Python

Over the years I have programmed professionally in Fortran, C/C++, Java, and Python. But for the last several years, Python has become my home base, my go-to language for crunching numbers, manipulating GPS data for dinosaur bone hunting, handling text-based chores such as finding commonalities in research papers, and more. The reasons I love Python are:

1) It is great for rapidly prototyping ideas
2) You can tackle almost every imaginable coding task with a few lines of code
3) Examples of code snippets are everywhere
4) It is easy and fast to code in: dynamic typing makes programming easy.
5) It is an interpreted language, so there is no need to change gears to compile, link, and then execute.
6) Pythonic code is very intuitive; even when I do not know the exact syntax, I can guess at a parameter and it often just works.
7) For AI, it is fast to implement, with little to no porting across x86 (Intel or AMD) or even IBM* Power systems.

When and why is Python slow at massively repeated low-level tasks?

Python is slow for some things:
• Repeated low-level tasks
• Large loops
• Nested loops
• List comprehensions (if large)

Since Python is dynamically typed, every time we touch a new element of a list, for example, Python must determine what type of object sits at that position and which operations are valid for that object, and it must wrangle with the object's reference counters. Furthermore, there is no guarantee that what we think should be consecutive objects in the list are actually consecutive in the memory layout. In Figure 1 below, I am adding two lists element by element, loop iteration after loop iteration. Every read or write incurs this overhead.

Figure 1. Adding two lists in Python
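To make the figure concrete, here is a minimal sketch of that pattern (the list values are hypothetical placeholders):

# Minimal sketch of the Figure 1 pattern: pure-Python element-wise
# addition. Every a[i] and b[i] access pays the dynamic-typing,
# dispatch, and reference-counting overhead described above.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
c = []
for i in range(len(a)):
    c.append(a[i] + b[i])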

Compare the efficiency of the above approach to that illustrated in Figure 2 below, which uses SIMD operations (utilizing Intel® oneMKL) accessible via NumPy functions and operators. As illustrated, eight elements are processed per iteration, so the work is consumed eight times faster.

Figure 2. Adding two NumPy arrays with built-in NumPy operators, using oneMKL SIMD vectorization.
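And a minimal sketch of the Figure 2 pattern, using the same hypothetical values as above:

import numpy as np

# One NumPy operator replaces the whole loop, dispatching to
# vectorized SIMD code under the hood.
a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
b = np.array([8, 7, 6, 5, 4, 3, 2, 1])
c = a + b  # element-wise addition in a single call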

Since the data type is specified for the entire NumPy ndarray, it naturally lays out in a "C"-like memory pattern. This allows fast referencing of the data and efficient use of the underlying cache lines and SIMD registers on modern x86 architectures. Specifically, the SIMD registers on Intel® Xeon® Scalable processors are 512 bits wide. If the data elements above are F32 (32-bit floats), this means that 16 concurrent floating-point operations can be carried out per loop iteration (512 bits divided by 32 bits = 16 units); only 8 are shown above due to spatial limitations on a written page. This reduces the number of loop iterations I need by a factor of 16 as well. And this is NOT threading! This benefit occurs per core.
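To make the lane arithmetic concrete, here is a small illustrative sketch (the array is just a stand-in; 512 bits is the register width quoted above):

import numpy as np

# Illustrative arithmetic only: how many float32 lanes fit in one
# 512-bit SIMD register.
a = np.ones(16, dtype=np.float32)
bits_per_element = a.itemsize * 8  # 4 bytes * 8 = 32 bits
lanes = 512 // bits_per_element    # 512 / 32 = 16 elements per register
print(lanes)                       # 16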

Figure 3 below shows the memory layout of 8 integers stored in a Python list. The memory locations are likely scattered across memory in a random fashion.

Figure 3. Fictionalized memory layout of 8 integers stored in a Python list

In Figure 4 below, we contrast the random memory layout from Figure 3 with an ndarray laid out in N contiguous memory locations.

Figure 4. Showing an ndarray with a contiguous memory layout
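You can inspect this layout yourself; a quick sketch using NumPy's introspection attributes (the array contents are arbitrary):

import numpy as np

# Checking the memory layout of an ndarray.
a = np.arange(8, dtype=np.int64)
print(a.flags['C_CONTIGUOUS'])  # True: elements sit side by side
print(a.strides)                # (8,): each int64 is 8 bytes from the next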

Because the memory is now contiguous, we can efficiently access these items with a single fetch from memory of 8 or 16 items, or however many fit in a 512-bit register. This is possible for each thread running on a CPU core!

Now consider what happens to cache utilization, contrasted between the two cases. The cache story is analogous to the following:

Imagine you have several guests staying with you overnight. Each guest, in turn, comes out for breakfast and requests a single egg. You go to the refrigerator, grab a carton of eggs (representing what is known as a single cache line containing, say, 16 elements), and fry one random egg from the carton.

The next guest drops in for breakfast and requests a single egg. But in your harried state of mind from having so many guests, you forget you already have a mostly unused carton of eggs on the countertop and dutifully head back to the refrigerator to grab another carton, only to fry a single random egg from it! See Figure 5.

Figure 5. Showing a single egg being retrieved from each carton

You repeat this procedure for each guest until the countertop is full of mostly unused egg cartons and they crash to the floor, resulting in broken eggs. This is similar to a cache line eviction! See Figure 6 below.

Figure 6. Showing the cartons knocked off the countertop, analogous to a cache line eviction

Now consider how a more efficient chef might prepare the eggs. As each guest arrives and asks for an egg, you wisely use up all the eggs in one carton before going back to the refrigerator to grab another.

Figure 7. Showing a wise use of egg cartons analogous to a wise use of cache lines

The combined effect of wide cache utilization together with SIMD vectorization comprises the bulk of the acceleration provided by NumPy functions and operators.

The question now becomes: how do I coax my Python code, with all those loops, into the more efficient NumPy vectorized form? You do it by "loop replacement"! Find loops with large trip counts, or nested loops with a large overall trip count, and replace them with one or a few vectorized NumPy calls instead.

Search and replace loops with NumPy functions and operators

Consider the simple loop-centric approach to computing the sine of each element of a list of 10,000 elements in Figure 8 below.

import time
import numpy as np

a = []
timing = {}
t1 = time.time()
for i in range(10_000):
    a.append(np.sin(i))
t2 = time.time()
print(f"With for loop and appending it took {t2 - t1} seconds")
timing['loop'] = t2 - t1
# Figure 8. Computing the sine of each element of a list

One way, though not a recommended one, of using NumPy is to replace the code above with what is shown in Figure 9 below.

a = np.array([])
t1 = time.time()
for i in range(10_000):
    a = np.append(a, np.sin(i))  # ndarrays have no .append; np.append copies the whole array each time
t2 = time.time()
print(f"With for loop and np.append it took {t2 - t1} seconds")
timing['badNumPyLoop'] = t2 - t1
# Figure 9. Naïve and slow approach to replacing the "functionality" of the code in Figure 8.

In Figure 9, we merely translated each line of code from Figure 8 into its NumPy counterpart. This is not the goal. Instead, the goal is to replace the loop.

On my system, the code in Figure 9 was more than 3x slower than the original Python loop: np.append allocates a new array and copies every existing element on each iteration.

Now let us achieve the goal of replacing the loop in Figure 8 with a SIMD-friendly vectorized version.

In Figure 10 below, I show how to write more efficient NumPy code that replaces the original loop (or rather, hides it) inside the optimized "C" oneMKL routines underpinning NumPy.

a = np.linspace(0, 10_000, num=10_000 + 1)
t1 = time.time()
a = np.sin(a)  # a single ufunc call replaces the entire loop
t2 = time.time()
print(f"With linspace and ufunc it took {t2 - t1} seconds")
timing['numpy'] = t2 - t1
# Figure 10. Removing the loop is more readable and faster.

Try it yourself. Create two Jupyter notebook cells, one with the code from Figure 8 and the other with the code from Figure 10. Plot your acceleration (hint: you might want to use a semi-log plot to capture your performance gains). You will be happily surprised by your results.
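As a starting point, here is one possible sketch of such a plot, assuming the timing dictionary populated by the cells above and that matplotlib is installed:

import matplotlib.pyplot as plt

# Sketch of a semi-log comparison; assumes the `timing` dict from the
# cells above (keys 'loop', 'badNumPyLoop', 'numpy').
plt.bar(list(timing.keys()), list(timing.values()))
plt.yscale('log')  # a log axis captures large speedups
plt.ylabel('seconds (log scale)')
plt.title('Loop vs. NumPy vectorized timing')
plt.show()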

In future articles in this series, I will demonstrate the value of various NumPy functions and operators which accelerate and simplify code in a similar manner by taking advantage of Intel® oneMKL under the hood!

In the new series, I plan to discuss the following topics:

1) NumPy ufuncs
2) NumPy aggregations
3) NumPy broadcasting and fancy slicing
4) NumPy sorting
5) NumPy where and select to vectorize loops with conditional logic
6) Applications to Pandas apply statements
7) What to do when the loops get large and complex


Bob Chesebrough

Robert Chesebrough is currently a Solution Architect in the Intel Developer Academy, where he teaches others to apply optimizations to data science algorithms.