Improve data processing speeds while using numpy arrays in Python through vectorization

Shivaditya Meduri
Published in Analytics Vidhya
5 min read · Sep 10, 2021


Many internet companies today use Python in their backend server applications. When dealing with huge amounts of data, for example an application that builds dashboards from large user-supplied datasets, data processing is often the bottleneck in serving fast responses to the user. Likewise, in data science applications, running feature generation or cleaning operations throughout the records of a dataset can be time-consuming when the number of records is large. Vectorization is what gives Python performance speeds paralleling the C programming language, quite literally, because numpy’s vectorized functions use optimized precompiled C code under the hood.

What is vectorization?

Python is a dynamically typed programming language, meaning that type checking of data happens at runtime. C is a statically typed programming language, so type checking happens at compile time. Type checking during execution is a time-consuming process and is a major reason behind Python’s slow performance compared to C, especially in for-loops. Numpy’s vectorized functions don’t perform explicit type checks on each iteration, saving valuable CPU time. Vectorization essentially means that a function is applied to many values of the iterable in parallel, unlike a traditional for-loop that handles one element at a time. CPUs use SIMD (Single Instruction, Multiple Data) instructions to achieve these speeds by exploiting data-level parallelism.
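As a quick illustration of this idea (a minimal sketch I am adding here, not one of the article’s own benchmarks), compare summing a large array with a Python loop against numpy’s vectorized np.sum:

```python
import time
import numpy as np

arr = np.random.randn(1_000_000)

# Python loop: every iteration re-checks types and dispatches dynamically
tic = time.time()
total_loop = 0.0
for x in arr:
    total_loop += x
loop_ms = (time.time() - tic) * 1000

# Vectorized: a single call into precompiled C code
tic = time.time()
total_vec = np.sum(arr)
vec_ms = (time.time() - tic) * 1000

print("Loop: {0:.2f}ms, np.sum: {1:.2f}ms".format(loop_ms, vec_ms))
```

On a typical machine the loop is two orders of magnitude slower, even though both compute the same sum.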


In this article I will discuss the most popular vectorized functions available in the numpy library, compare their speeds on my local computer against traditional for-loops, and show how we can vectorize a custom function.

Vectorization of transformation operations

Transforming data is an essential part of data preprocessing: for example, scaling image pixel values between 0 and 1 for neural nets, log-transforming certain features of a dataset before applying a regression model, normalizing the features of a dataset, and many more.
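To make the normalization case concrete (a minimal sketch with an illustrative matrix, not code from the article’s notebook), min-max normalization of every feature column can be written as a single vectorized expression:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Min-max scale each column (feature) to [0, 1] without any explicit loop
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
```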

Let us compare the speeds of traditional for-loops and vectorization using the MNIST digits dataset from the sklearn library. The MNIST data in sklearn has the shape 1797x64, with each image stored as a 1-d numpy array of length 64.

# Scaling images between 0 and 1 for deep learning applications
import time
from sklearn.datasets import load_digits

mnist = load_digits().data  # shape (1797, 64)

# Using traditional for-loops
s = mnist.shape
tic = time.time()
for i in range(s[0]):
    for j in range(s[1]):
        mnist[i][j] = mnist[i][j]/255
toc = time.time()
print("Time taken with for-loop is : {0}ms".format((toc-tic)*1000))
# Output - Time taken with for-loop is : 96.99487686157227ms

# Using vectorization
tic = time.time()
mnist = mnist/255
toc = time.time()
print("Time taken with vectorization is : {0}ms".format((toc-tic)*1000))
# Output - Time taken with vectorization is : 6.717443466186523ms

We can clearly see a roughly 10x improvement in the time taken for scaling. The exact speed-up will depend on your local computer, but you will definitely witness a significant improvement in execution time.

Vectorized versions of the +, -, *, / operations can be performed by applying the operator directly to the numpy array:

import numpy as np

arr = np.random.randn(5)
print(arr * 3)            # Multiply each element by 3
print((arr/4 + 100) * 3)  # Divide each element by 4, add 100, then multiply by 3

# Performing vectorized operations on 2 arrays
arr1 = np.random.randn(5)
arr2 = np.random.randn(5)
print(arr1 + arr2)       # Element-wise addition
print(arr1 * arr2)       # Element-wise multiplication
print((arr1 - arr2)**2)  # Element-wise squared difference (as used in Euclidean distance)
# Ensure the 2 arrays have the same shape

We can save lines of code and execution time for these basic operations by applying them directly between arrays and scalars (int and float objects).
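Strictly speaking, numpy relaxes the same-shape requirement through its broadcasting rules. A minimal sketch (the arrays here are illustrative) of combining arrays of different shapes without a loop:

```python
import numpy as np

matrix = np.arange(6.0).reshape(2, 3)  # shape (2, 3)
row = np.array([10.0, 20.0, 30.0])     # shape (3,)

# The 1-d row is "broadcast" across each row of the 2-d matrix
result = matrix + row
print(result)
```

Broadcasting avoids materializing the smaller array repeatedly, so it keeps both the loop-free style and the memory savings.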

Now let’s look at log transformations. Log transformations are essential when performing regression, especially in the case of skewed data.

Boston housing data features pairwise plot

Let’s apply log transformations to the Boston housing data (which can be found on Kaggle). A log transformation simply means applying log to each and every item in the dataset.

# Using traditional for-loop
import math
import time
import numpy as np

# `data` is the Boston housing feature matrix as a numpy array
s = data.shape
tic = time.time()
for i in range(s[0]):
    for j in range(s[1]):
        data[i][j] = math.log(data[i][j])
toc = time.time()
print("Time taken with for-loop: {0}ms".format((toc-tic)*1000))
# Output - Time taken with for-loop: 2.0296573638916016ms

# Using vectorized numpy function
tic = time.time()
data = np.log(data)
toc = time.time()
print("Time taken with vectorization: {0}ms".format((toc-tic)*1000))
# Output - Time taken with vectorization: 0.0ms

The vectorized version of log barely took any time; it was so fast that the resolution of time.time() could not even measure the difference. Try it on your own computer to see the difference.

Numpy provides a whole range of vectorized transformation functions. All of them can be found at https://numpy.org/doc/stable/reference/routines.math.html
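A few examples from that collection (an illustrative sketch I am adding; the input array is made up), all of which apply element-wise without any explicit loop:

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])

print(np.sqrt(a))            # element-wise square root
print(np.exp(a))             # element-wise exponential
print(np.clip(a, 2.0, 8.0))  # bound every value to the range [2, 8]
```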

Vectorization of Linear Algebra operations

Linear algebra operations like matrix multiplication are used in various applications such as neural networks.

Passing inputs through a layer of a neural net is nothing but a matrix multiplication.

In neural net applications, using vectorized functions will significantly reduce training time, and execution time in production as well.

import time
import numpy as np

A = np.random.randn(10000).reshape((100, 100))
B = np.random.randn(10000).reshape((100, 100))
C = np.zeros((100, 100))

# Using traditional for-loops
tic = time.time()
for i in range(100):
    for j in range(100):
        for k in range(100):
            C[i][j] += A[i][k]*B[k][j]
toc = time.time()
print("Time taken by for-loop: {0}ms".format((toc - tic)*1000))
# Output - Time taken by for-loop: 1195.0178146362305ms

# Using the vectorized function of numpy
tic = time.time()
C = np.dot(A, B)
toc = time.time()
print("Time taken by vectorization: {0}ms".format((toc - tic)*1000))
# Output - Time taken by vectorization: 1.0304450988769531ms

That is a huge improvement from using the vectorized function. We can clearly see why deep learning applications prefer vectorized code: it saves a lot of time in both model training and deployment.

Visit https://numpy.org/doc/stable/reference/routines.linalg.html to see all the linalg functions provided by the numpy library
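Beyond matrix multiplication, routines like np.linalg.solve replace hand-written elimination loops entirely. A minimal sketch (the system below is illustrative):

```python
import numpy as np

# Solve the linear system Ax = b without writing any elimination loop
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)
print(x)  # the solution vector
```

The result can be checked by multiplying back: A @ x should reproduce b.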

Vectorizing a custom function

Though the numpy library provides a lot of vectorized functions, sometimes we need to apply a custom function throughout the dataset that is not available in the numpy library. There are ways to vectorize such custom functions using numpy.

The first option is to break the custom function down into vectorized functions that are available in the numpy library. For example,

import numpy as np

A = np.random.randn(1000)
B = np.zeros(1000)

def func(a):
    return a**2 + 5*a + 35

for i in range(1000):
    B[i] = func(A[i])

# We can vectorize the above func as follows
B = A**2 + 5*A + 35

If we are unable to break the custom function down into vectorized numpy functions, we can also use the numpy.vectorize function:

import time
import numpy as np

A = np.random.randn(10000)
C = np.zeros(10000)

def f(a):
    if a > 10:
        return a**2
    else:
        return a

# Using for-loops
tic = time.time()
for i in range(10000):
    C[i] = f(A[i])
toc = time.time()
print("Time taken: {0}ms".format((toc-tic)*1000))
# Output - 10ms

# Vectorizing the function
v = np.vectorize(f)
tic = time.time()
C = v(A)
toc = time.time()
print("Time taken: {0}ms".format((toc-tic)*1000))
# Output - 2ms
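One caveat worth noting: numpy’s documentation states that np.vectorize is provided primarily for convenience, not performance, and is essentially a for-loop under the hood. For a simple branch like f above, the condition itself can be vectorized with np.where (a sketch, reusing the same f):

```python
import numpy as np

A = np.random.randn(10000)

# Equivalent to applying f element-wise: a**2 where a > 10, else a
C = np.where(A > 10, A**2, A)
```

Because np.where evaluates the condition and both branches as whole-array operations, this version runs entirely in precompiled C code.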

Conclusion

While preprocessing numerical data, where speed is a priority, it is recommended to use numpy functions, which are generally vectorized and use precompiled C code for superior performance. This will significantly improve your backend APIs’ response times when huge amounts of data are involved.

Find the jupyter notebook for this article in the following link: https://github.com/shivaditya-meduri/Articles/blob/ac614ff77bad275834b211813ae0fa2f946aa3e1/Vectorization.ipynb
