Understanding Vectorization in NumPy and Pandas

Published in Analytics Vidhya, Mar 1, 2021

Just days into hands-on learning of data manipulation with Pandas, my instructor paused to make a point. “Do yourself a favor,” he said to the class, with more intention than ever before, “before going too much further in learning Pandas, watch this talk on vectorization.”

The value of vectorization seemed apparent, both from our instructor’s affect when he directed us to the clip and from the presenter’s claim: vectorize your code to manipulate data 1000 times faster. The video breaks down several examples of manipulation operations (Python for-loops, NumPy array vectorization, and a variety of Pandas methods) and compares how quickly each one returns its output. The results are clear: using techniques that take advantage of vectorization in Pandas results in, just as the video’s click-attracting headline suggests, staggeringly faster data manipulation.

As a person who appreciates learning the best technique for any new skill, I immediately knew that I would need to adapt my approach. My infatuation with Python for-loops was going to need to play second fiddle to these enticing vector methods. But first…

What the hell is a vector?

With a background in animation, I originally understood vectors not in the context of math or programming, but rather in creating digital imagery. Design programs such as Adobe Illustrator and Adobe Animate are vector-based: the artwork you create in these programs is built from a series of points, lines, curves and shapes, which allows the artwork to scale without quality deterioration, with the added benefit of keeping the file size small.

In contrast, Adobe Photoshop is raster-based: a grid of small pixels makes up the image. Painting in Photoshop felt natural, not like building an image out of shapes. Of course, creating art in such a program has its drawbacks: the ability to work on large-format art with fine detail depends heavily on the computer, and scaling art up to a larger format at the cost of degraded quality is a non-starter for most artists.

That did little to explain precisely what vectorization is. The talk describes what vectorized operations do (operate on the elements of an array or series all at once) and also suggests that NumPy relies on the programming language C to carry them out. In mathematics, a vector is something that has magnitude and direction. In programming and computer science, vectorization is the process of applying an operation to an entire set of values at once.
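
As a minimal sketch of that definition, the same calculation can be written element by element or as a single operation on the whole array. The `prices` array and the 1.5 multiplier are made-up values used only for illustration:

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
prices = np.array([10.0, 20.0, 30.0, 40.0])

# Loop version: one element is handled per iteration.
looped = []
for p in prices:
    looped.append(p * 1.5)

# Vectorized version: one expression is applied to every element.
vectorized = prices * 1.5

print(looped)
print(vectorized)
```

Both versions produce the same numbers; the difference is where the looping happens, in Python or inside NumPy.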

These definitions still don’t quite offer a clear explanation for how this happens, which is worth investigating in order to connect a bit deeper with the process.

Why It’s Faster

While for-loop syntax in Python is flexible and provides wonderful utility, each iteration over an element is essentially a single step in the route through all elements of the container object. This step-through processing is useful when the order of operation matters (e.g., returning the first item in a list that meets a certain condition).

Vectorized processing, in contrast, may be applied when the order of processing does not matter. As suggested above, the core routines in NumPy, and the Pandas methods built on them, are implemented in compiled C, which moves the loop out of the Python interpreter. Vectorized code is almost always faster: execution time still grows with the number of elements, but each element costs far less than one iteration of a Python for-loop.
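
One rough way to see the difference is to time both approaches on the same data. The sketch below uses the standard library’s `timeit`; the array size and the function names `loop_square` and `vectorized_square` are arbitrary choices for this example, and the actual timings will vary by machine:

```python
import timeit

import numpy as np

# Arbitrary test data: 200,000 floats.
values = np.arange(200_000, dtype=np.float64)


def loop_square(arr):
    """Square each element with an explicit Python for-loop."""
    out = np.empty_like(arr)
    for i in range(arr.size):
        out[i] = arr[i] * arr[i]
    return out


def vectorized_square(arr):
    """Square every element with a single NumPy expression."""
    return arr * arr


loop_time = timeit.timeit(lambda: loop_square(values), number=1)
vec_time = timeit.timeit(lambda: vectorized_square(values), number=1)

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```

A single run like this is only a rough indication; repeating the measurement several times gives a steadier comparison.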

Parallel Processing

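The parallelism at work here is mostly data-level parallelism inside a single core, rather than work spread across all of your computer’s cores. For ordinary element-wise operations, NumPy and Pandas generally run on one thread; the speedup comes from the compiled loop, which can process several array elements per instruction using the SIMD hardware described below. Some routines, such as linear algebra calls that NumPy hands off to a BLAS library, may use multiple cores, depending on how that library was built.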

Like-Datatypes

NumPy arrays are set to a single datatype. Likewise with series in Pandas: each column has one dtype, such as int64, float64, bool, or datetime64. This allows for optimization of data processing, as every element in the container is certain to be manipulated in the same way. (A column that mixes types, or holds plain Python strings, typically falls back to the generic object dtype and loses much of this benefit.)
This is not the case with Python’s built-in container data types, such as lists, sets, and dictionaries. These types allow you to store a variety of types within them at the same time. A list may contain strings, ints, floats, other lists, etc.
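
The difference is easy to inspect. In this small sketch (the variable names and values are made up for illustration), the array and the uniform series each report one dtype, while a series of mixed values falls back to `object`:

```python
import numpy as np
import pandas as pd

# A NumPy array has exactly one dtype for all of its elements.
arr = np.array([1, 2, 3])
print(arr.dtype)  # a platform integer type, typically int64

# A Python list can hold any mix of types.
mixed_list = [1, "two", 3.0, [4]]
print([type(x).__name__ for x in mixed_list])

# A Pandas series of floats gets a numeric dtype...
floats = pd.Series([1.5, 2.5, 3.5])
print(floats.dtype)

# ...while a series of mixed values is stored as generic Python objects.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```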

Locality

NumPy stores the contents of your array in one contiguous block of memory. Because the values sit next to each other, the processor can read and operate on them faster.
In contrast, a Python list stores only references to its items, and the objects themselves may be scattered throughout memory.
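
NumPy exposes this layout through an array’s `flags` and `strides` attributes. A minimal sketch, using a small made-up 3x4 array of 8-byte integers:

```python
import numpy as np

# A 3x4 array of 8-byte integers, stored row by row in one block.
grid = np.arange(12, dtype=np.int64).reshape(3, 4)

# True when the elements are laid out contiguously in row-major (C) order.
print(grid.flags["C_CONTIGUOUS"])

# Bytes to step to reach the next row and the next column:
# 4 elements * 8 bytes = 32 for a row, 8 for a column.
print(grid.strides)

# Taking every other column yields a view that skips over memory,
# so it is no longer contiguous.
every_other = grid[:, ::2]
print(every_other.flags["C_CONTIGUOUS"])
```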

The Mechanism Behind Vectorization — SISD vs SIMD

Computer architectures are commonly grouped into classifications, two of which are relevant to understanding vectorization:

SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data

Figure: Visualizing SISD vs SIMD component processes

SISD: This is how a Python for-loop is processed: one instruction, on one data element, at one moment in time, producing one result. The neat thing about this is that it is flexible, since you may implement any operation on your data. The drawback is that it is not optimal for processing large amounts of data.
SIMD: This is the structure that NumPy and Pandas vectorized operations can take advantage of: one instruction applied to several data elements (as many as fit in a vector register) at one moment in time, producing multiple results. Contemporary CPUs have SIMD units in each of their cores, which is what provides this form of parallel processing.
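
The contrast can be modeled in plain Python. This is purely a conceptual sketch: real SIMD happens in hardware instructions that Python code does not issue directly, and the lane width of 4 (`LANES`) is a hypothetical value picked for the example. It only counts how many "instructions" each style would need to add two arrays of eight numbers:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.int32)
b = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=np.int32)

# SISD-style model: one add per pair of elements.
sisd_result = np.empty_like(a)
sisd_steps = 0
for i in range(a.size):
    sisd_result[i] = a[i] + b[i]
    sisd_steps += 1

# SIMD-style model: one add per group of LANES elements.
# LANES = 4 is a hypothetical register width, not a measured one.
LANES = 4
simd_result = np.empty_like(a)
simd_steps = 0
for start in range(0, a.size, LANES):
    stop = start + LANES
    simd_result[start:stop] = a[start:stop] + b[start:stop]
    simd_steps += 1

print(sisd_steps, simd_steps)
```

The results are identical; only the number of steps needed to reach them differs.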

Now that we know the mechanism and concepts behind the speed of vectorization, knowing how to approach data manipulation with Pandas might still require a little direction. The Zen of Pandas Optimization, from Sofia Heisler’s talk at PyCon 2017, offers a concise approach:

The Zen of Pandas Optimization

- Avoid loops, if you can
- If you must loop, use apply, not iteration functions
- If you must apply, use Cython to make it faster
- Vectorization is usually better than scalar operations
- Vector operations on NumPy arrays are more efficient than on native Pandas series
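
To make the list concrete, here is one small calculation written four ways, from least to most preferred. This is a sketch under assumptions: the DataFrame and its `price` and `quantity` columns are invented for illustration, and the Cython step is left out because it requires a separate compilation setup.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# 1. Looping with an iteration function (slowest of the four).
totals_iter = [row["price"] * row["quantity"] for _, row in df.iterrows()]

# 2. apply: still calls a Python function per row, but with less overhead.
totals_apply = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# 3. Vectorized operation on Pandas series.
totals_series = df["price"] * df["quantity"]

# 4. Vectorized operation on the underlying NumPy arrays.
totals_numpy = df["price"].to_numpy() * df["quantity"].to_numpy()

print(totals_series.tolist())
```

All four produce the same totals; on a DataFrame this small the speed difference is negligible, and it only becomes noticeable as the number of rows grows.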