Choose the super speed iteration to process large datasets - Part 2

Arunkumar N
Published in Variablz Academy
4 min read · Nov 22, 2022
Super Speed Iteration — Part 2 (Credits: Aatomz Research)

This is the continuation of Part 1 of the article “Choose the super speed iteration to process large datasets.” The six methods below are the most efficient ones, and all of them outperform the methods covered in Part 1. So let’s go and enjoy the process.

1. Eval

Eval is one of the dynamic functions of Pandas: it evaluates an expression passed as a string, which suits arithmetic and comparison operations in particular. It is only faster when the data frame contains more than about 10k rows, so this method is most beneficial when dealing with huge datasets. Because each execution is short, the timeit function ran the operation 7 * 100 times (7 runs of 100 loops each).
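As a minimal sketch of what an Eval call might look like, the snippet below builds a hypothetical data frame filled with random numbers. The column names "price" and "usd_price" are assumptions standing in for the article's columns; the underscore lets the name be used directly inside the expression string.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows of random prices (an assumption, not the article's dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(100, 1000, size=100_000),
    "usd_price": rng.uniform(70, 85, size=100_000),
})

# eval parses the string and computes the expression over the whole columns.
ratio = df.eval("price / usd_price")

# An assignment inside the string returns a new data frame with the extra column;
# the original df is left unchanged because inplace defaults to False.
df_with_ratio = df.eval("ratio = price / usd_price")
```

The string form is what makes Eval "dynamic": the expression can be built at run time, at the cost of some parsing overhead that only pays off on larger frames.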

2. Assign

This method is a little faster than Eval. The assign function adds one or more new columns to an existing data frame and returns the result as a new data frame. The timeit function ran both Eval and Assign 7 * 100 times, and the two have approximately similar run times.
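A small sketch of Assign, assuming a toy data frame with hypothetical "price" and "usd_price" columns. It adds two columns in one call, the second one built from the first.

```python
import pandas as pd

# Toy data with hypothetical column names (assumptions for illustration).
df = pd.DataFrame({
    "price": [100.0, 250.0, 400.0],
    "usd_price": [80.0, 80.0, 80.0],
})

# assign returns a new data frame; each keyword becomes a column.
# A callable receives the frame built so far, so later columns can use earlier ones.
result = df.assign(
    ratio=lambda d: d["price"] / d["usd_price"],
    ratio_pct=lambda d: d["ratio"] * 100,
)
```

Because assign does not modify the original frame, it chains well with other Pandas calls.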

3. Div

We have come to the final four functions. In my research and experimentation, the timeit function ran each of these four methods 7 * 1000 times, and they gave the best mean run times of all the methods tested. Div is a Pandas data frame method that performs ordinary arithmetic division, either of a column by a scalar or between two columns.
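The two uses of Div described above can be sketched as follows, again on a toy frame whose column names are assumptions.

```python
import pandas as pd

# Toy data with hypothetical column names (assumptions for illustration).
df = pd.DataFrame({
    "price": [100.0, 250.0, 400.0],
    "usd_price": [80.0, 80.0, 80.0],
})

# Division between two columns, element by element.
ratio = df["price"].div(df["usd_price"])

# Division of a column by a scalar.
halved = df["price"].div(2)
```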

4. Pandas Vectorization

For numerical operations on a Pandas data frame with millions of rows, vectorization is the way to go, and it is especially useful for homogeneous operations. In this case, the "price" and "USD price" columns have the same datatype, so the division is applied to the two columns as whole arrays in a single operation instead of row by row in a Python loop. This makes the method more efficient and gives us faster results. The timeit function ran it 7 * 1000 times.
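To make the contrast concrete, here is a sketch that computes the same ratio twice: once with a row-by-row Python loop and once as a single vectorized expression. The data and column names are hypothetical.

```python
import pandas as pd

# Toy data with hypothetical column names (assumptions for illustration).
df = pd.DataFrame({
    "price": [100.0, 250.0, 400.0],
    "usd_price": [80.0, 80.0, 80.0],
})

# Row by row: one Python-level division per row, slow on large frames.
looped = [p / u for p, u in zip(df["price"], df["usd_price"])]

# Vectorized: one expression over the whole columns.
df["ratio"] = df["price"] / df["usd_price"]
```

Both produce the same numbers; the difference only shows up in run time as the row count grows.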

5. NumPy Vectorization

If the vectorization happens on NumPy arrays, it is NumPy vectorization. NumPy array operations are faster than the equivalent Pandas operations because they run in compiled C code and skip the index and label handling that Pandas adds on top. The execution time of the code is in microseconds, which tells you how effective this method is.
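A sketch of this step, assuming that NumPy vectorization here means passing the columns to a NumPy function such as np.divide. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names (assumptions for illustration).
df = pd.DataFrame({
    "price": [100.0, 250.0, 400.0],
    "usd_price": [80.0, 80.0, 80.0],
})

# np.divide is a NumPy universal function; it divides the columns element-wise
# in compiled code. When given Pandas Series, it returns a Series.
ratio = np.divide(df["price"], df["usd_price"])
```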

6. Values

This method returns the values of the columns as a NumPy array, ignoring the row and column labels. NumPy vectorization and values do similar work, but the values method is a bit faster.
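The following sketch shows the idea on a toy frame with hypothetical column names: the labels are dropped first, and the division then runs on plain NumPy arrays.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names (assumptions for illustration).
df = pd.DataFrame({
    "price": [100.0, 250.0, 400.0],
    "usd_price": [80.0, 80.0, 80.0],
})

# .values strips the index and returns the underlying NumPy array.
price_arr = df["price"].values
usd_arr = df["usd_price"].values

# The division now involves no Pandas objects at all.
ratio = price_arr / usd_arr

# The plain array can be written back as a new column.
df["ratio"] = ratio
```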

Conclusion

Here is the performance comparison for all methods.

Performance comparison of Iteration Methods

There are many methods for doing the same operation. But which one is powerful and efficient enough to save your precious time? Through this research and experimentation, I have found that the “.values” method has the fastest execution time of all the methods compared. However, I would suggest you use “.to_numpy()” instead, because the official Pandas documentation recommends it over “.values”.
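If you want to repeat a comparison like this on your own machine, a rough sketch using the standard timeit module follows. The data, the column names and the choice of four candidates are assumptions; the 7 runs of 100 loops mirror the 7 * 100 executions mentioned earlier.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows of random prices (not the article's dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(100, 1000, size=10_000),
    "usd_price": rng.uniform(70, 85, size=10_000),
})

# Each candidate computes the same ratio in a different way.
candidates = {
    "pandas vectorization": lambda: df["price"] / df["usd_price"],
    "div": lambda: df["price"].div(df["usd_price"]),
    ".values": lambda: df["price"].values / df["usd_price"].values,
    ".to_numpy()": lambda: df["price"].to_numpy() / df["usd_price"].to_numpy(),
}

RUNS, LOOPS = 7, 100  # 7 runs of 100 loops each
timings = {}
for name, fn in candidates.items():
    # timeit.repeat returns the total time of each run (LOOPS calls per run).
    totals = timeit.repeat(fn, repeat=RUNS, number=LOOPS)
    # Store the mean time per single call, averaged over the runs.
    timings[name] = sum(t / LOOPS for t in totals) / RUNS

# Print the candidates from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {seconds * 1e6:10.1f} µs per loop")
```

The absolute numbers depend on the hardware, the Pandas version and the size of the data, so treat the ranking on your machine as the result that matters.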

I have done this research only on numerical values. There is still much to learn about how vectorization handles strings, and various other operations are yet to be explored. Query the universe and learn more. Never give up on anything. Until the next article, take care.

For more data science insights, connect with me on LinkedIn.

https://www.linkedin.com/in/arunkumar-data-scientist/

Thanks & Regards

Arun Kumar
