Choose the super speed iteration to process large data sets.

Arunkumar N
Published in Variablz Academy
5 min read · Nov 14, 2022

Data science is a fantasy job until you enter big data analytics 😅😅😅. I faced many challenges while working on my first big data project, and I realized that the most critical one was run time. So I researched ways to handle some processes more efficiently and decided to share some of what I found here.

In this article, we will see different methods for iterating over a pandas data frame and explore which way is faster and more efficient while handling larger datasets.

You can take any dataset that requires mathematical operations and functions. I have imported pandas for data analysis and NumPy for numerical operations through arrays.
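The article's actual dataset is not shown, so here is a minimal sketch of the setup with made-up data. The column names 'price' and 'USD price' come from the text further down; the 'name' column, the row count, and the exchange rate of 80 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the article's real dataset is not available.
rng = np.random.default_rng(seed=0)
n_rows = 100_000
usd_price = rng.uniform(1.0, 500.0, size=n_rows)

df = pd.DataFrame({
    "name": [f"item_{i}" for i in range(n_rows)],  # assumed extra column
    "price": usd_price * 80.0,                     # assumed price in Indian rupees
    "USD price": usd_price,
})

print(df.head())
```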

I checked the number of rows and columns using shape and filtered the needed columns for easy operations and visibility.
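As a rough sketch of that step (the 'name' and 'brand' columns are hypothetical, added only so that there is something to filter out):

```python
import pandas as pd

# Hypothetical raw data with more columns than we need.
raw = pd.DataFrame({
    "name": ["a", "b", "c"],
    "brand": ["x", "y", "z"],
    "price": [8000.0, 1600.0, 410.0],
    "USD price": [100.0, 20.0, 5.0],
})

# shape is an attribute (not a method) holding (rows, columns).
print(raw.shape)

# Keep only the columns needed for the ratio calculation.
df = raw[["price", "USD price"]]
print(df.shape)
```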

I created a function for the division operation because I needed the ratio between the Indian price and the USD price. We will create a ratio column with each method and compare the run times.
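The original function is not shown; a minimal version, assuming it simply divides its first argument by its second, might look like this:

```python
def divide(numerator, denominator):
    """Return numerator / denominator (assumed behavior of the article's helper)."""
    return numerator / denominator


# Example: a price of 8000 rupees against 100 USD gives a ratio of 80.
ratio = divide(8000.0, 100.0)
print(ratio)
```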

For clarity, I made a separate copy of the original data frame (by value, not by reference) for each method.
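A small sketch of copying by value. The name df_iter_rows appears later in the article; the sample data is made up. DataFrame.copy() makes a deep copy by default, so changes to the copy do not touch the original.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [8000.0, 1600.0, 410.0],
    "USD price": [100.0, 20.0, 5.0],
})

# copy() defaults to deep=True: data and index are duplicated.
df_iter_rows = df.copy()

# Adding a column to the copy leaves the original untouched.
df_iter_rows["ratio1"] = 0.0
print(list(df.columns))
print(list(df_iter_rows.columns))
```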

%timeit

%timeit is one of the magic commands of the Jupyter (IPython) notebook. It measures the execution time of a statement by running it in several loops over several runs, then reports the mean and standard deviation of the time per loop.
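%timeit only works inside IPython or Jupyter (for example, `%timeit my_function()`). Outside a notebook, a rough approximation can be sketched with the standard library's timeit module. The workload function and the choice of 7 runs of 10 loops are assumptions for illustration.

```python
import statistics
import timeit


def workload():
    # Hypothetical stand-in for the code being timed.
    return sum(i * i for i in range(10_000))


runs, loops = 7, 10

# Each entry is the total time for `loops` calls; there are `runs` entries.
run_times = timeit.repeat(workload, repeat=runs, number=loops)
per_loop = [t / loops for t in run_times]

mean_per_loop = statistics.mean(per_loop)
std_per_loop = statistics.pstdev(per_loop)

print(
    f"{mean_per_loop * 1e3:.3f} ms ± {std_per_loop * 1e3:.3f} ms per loop "
    f"(mean ± std. dev. of {runs} runs, {loops} loops each)"
)
```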

Iterrows

In my run, the iterrows method had one of the longest run times I have seen, even after optimization. This method iterates over the rows and returns (index, Series) pairs. Here I have ignored the index by using an underscore (_). Be careful with this method: because each row is returned as a single Series, it does not preserve dtypes across the row.
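The dtype caveat can be seen in a small sketch. The 'quantity' column is hypothetical, added only to have an integer column next to a float one.

```python
import pandas as pd

df = pd.DataFrame({"quantity": [1, 2], "price": [1.5, 2.5]})

# In the data frame, 'quantity' is an integer column.
print(df["quantity"].dtype)

# iterrows packs each row into one Series, which has a single dtype,
# so the integer is upcast to float to sit alongside 'price'.
_, first_row = next(df.iterrows())
print(first_row.dtype)
print(first_row["quantity"])
```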

Here I have used iterrows inside a list comprehension to iterate over the 'price' and 'USD price' columns, apply the divide function mentioned earlier, and create a 'ratio1' column in the df_iter_rows data frame. I ran it twice and created two columns for experimental purposes. Because iterrows is so slow, %timeit executes it for only 7 runs of 1 loop each.
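The original code is not shown; a sketch of what this step might look like, with made-up sample data and an assumed divide helper, is:

```python
import pandas as pd


def divide(numerator, denominator):
    # Assumed behavior of the article's helper.
    return numerator / denominator


df_iter_rows = pd.DataFrame({
    "price": [8000.0, 1600.0, 410.0],
    "USD price": [100.0, 20.0, 5.0],
})

# iterrows yields (index, Series) pairs; the index is discarded with "_".
df_iter_rows["ratio1"] = [
    divide(row["price"], row["USD price"])
    for _, row in df_iter_rows.iterrows()
]
print(df_iter_rows)
```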

Apply

You can use the apply function in a variety of ways. Here I have used a lambda inside apply to perform the operation between two columns. Usually, apply is used for convenience rather than performance, as its execution time shows. It is faster than iterrows, but not fast enough.
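A sketch of the apply variant, assuming a row-wise call (axis=1). The data frame name df_apply, the 'ratio' column name, and the sample data are hypothetical.

```python
import pandas as pd


def divide(numerator, denominator):
    # Assumed behavior of the article's helper.
    return numerator / denominator


df_apply = pd.DataFrame({
    "price": [8000.0, 1600.0, 410.0],
    "USD price": [100.0, 20.0, 5.0],
})

# axis=1 passes each row to the lambda as a Series.
df_apply["ratio"] = df_apply.apply(
    lambda row: divide(row["price"], row["USD price"]),
    axis=1,
)
print(df_apply)
```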

itertuples

In my test, this method was about 100 times faster than iterrows and about 15 times faster than apply(), because it yields each row as a lightweight tuple instead of building a Series for every row.
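A sketch of the itertuples variant, with hypothetical names and data. One detail to watch: a column name containing a space, such as 'USD price', is not a valid attribute name, so the sketch asks for plain tuples (name=None) and unpacks them by position.

```python
import pandas as pd


def divide(numerator, denominator):
    # Assumed behavior of the article's helper.
    return numerator / denominator


df_itertuples = pd.DataFrame({
    "price": [8000.0, 1600.0, 410.0],
    "USD price": [100.0, 20.0, 5.0],
})

# index=False drops the index; name=None yields plain tuples
# in column order: (price, USD price).
df_itertuples["ratio"] = [
    divide(price, usd_price)
    for price, usd_price in df_itertuples[["price", "USD price"]].itertuples(
        index=False, name=None
    )
]
print(df_itertuples)
```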

Zip

Wherever possible, I use a for ... in loop inside a list comprehension, since it is concise and fast (here, in is part of the loop syntax rather than a membership test). I have used the zip function to perform the division between the two columns. This method gave faster results than the previous ones, and because it responds much more quickly, %timeit executes it for 7 runs of 10 loops each.
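A sketch of the zip variant, with hypothetical names and data. zip pairs up the two columns element by element, so no row object is ever built.

```python
import pandas as pd


def divide(numerator, denominator):
    # Assumed behavior of the article's helper.
    return numerator / denominator


df_zip = pd.DataFrame({
    "price": [8000.0, 1600.0, 410.0],
    "USD price": [100.0, 20.0, 5.0],
})

# zip walks both columns in lockstep and yields (price, usd_price) pairs.
df_zip["ratio"] = [
    divide(price, usd_price)
    for price, usd_price in zip(df_zip["price"], df_zip["USD price"])
]
print(df_zip)
```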

Map

map is one of Python's efficient built-in functions: it is concise, fast, and, because it returns a lazy iterator, light on memory. It also works very well for handling strings. It was much quicker than the earlier methods. In my run, zip and map had similar run times, with map slightly faster.
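A sketch of the map variant, with hypothetical names and data. When map is given two iterables, it calls the function with one element from each.

```python
import pandas as pd


def divide(numerator, denominator):
    # Assumed behavior of the article's helper.
    return numerator / denominator


df_map = pd.DataFrame({
    "price": [8000.0, 1600.0, 410.0],
    "USD price": [100.0, 20.0, 5.0],
})

# map is lazy in Python 3, so list() is needed to materialize the results.
df_map["ratio"] = list(map(divide, df_map["price"], df_map["USD price"]))
print(df_map)
```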

Vectorize

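np.vectorize is a NumPy utility that wraps a Python function so that it can be mapped over arrays, applying NumPy's broadcasting rules to its inputs. In my run, it performed well compared with the previous methods. Keep in mind that the NumPy documentation notes that np.vectorize is provided primarily for convenience, not performance, and is implemented essentially as a for loop. itertuples, zip, map, and vectorize all fall in a similar run-time range, and %timeit executes each of them for 7 runs of 10 loops each.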
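A sketch of the np.vectorize variant, with hypothetical names and data. The columns are converted to NumPy arrays first; that conversion is my choice for the sketch, not something the article states.

```python
import numpy as np
import pandas as pd


def divide(numerator, denominator):
    # Assumed behavior of the article's helper.
    return numerator / denominator


df_vectorize = pd.DataFrame({
    "price": [8000.0, 1600.0, 410.0],
    "USD price": [100.0, 20.0, 5.0],
})

# np.vectorize returns a callable that accepts arrays and
# calls divide once per pair of elements.
vectorized_divide = np.vectorize(divide)
df_vectorize["ratio"] = vectorized_divide(
    df_vectorize["price"].to_numpy(),
    df_vectorize["USD price"].to_numpy(),
)
print(df_vectorize)
```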

Conclusion

From the above methods, we can conclude that vectorize was the fastest for numerical operations. zip and map will be the go-to methods in most cases, such as applying other functions that work on strings instead of numbers. itertuples is also fast, but not as fast as zip and map. I won't recommend using iterrows or apply when the data frame is large, since they are too slow.
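To check these conclusions on your own machine, a small harness along the following lines could be used. It is only a sketch: the data is random, the row count is kept small so that iterrows finishes quickly, and absolute timings will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd


def divide(numerator, denominator):
    # Assumed behavior of the article's helper.
    return numerator / denominator


# Hypothetical data: 2,000 rows of random prices.
rng = np.random.default_rng(seed=0)
usd = rng.uniform(1.0, 500.0, size=2_000)
df = pd.DataFrame({"price": usd * 80.0, "USD price": usd})


def with_iterrows():
    return [divide(r["price"], r["USD price"]) for _, r in df.iterrows()]


def with_apply():
    return df.apply(
        lambda r: divide(r["price"], r["USD price"]), axis=1
    ).tolist()


def with_itertuples():
    return [divide(p, u) for p, u in df.itertuples(index=False, name=None)]


def with_zip():
    return [divide(p, u) for p, u in zip(df["price"], df["USD price"])]


def with_map():
    return list(map(divide, df["price"], df["USD price"]))


def with_vectorize():
    return np.vectorize(divide)(
        df["price"].to_numpy(), df["USD price"].to_numpy()
    ).tolist()


methods = [with_iterrows, with_apply, with_itertuples,
           with_zip, with_map, with_vectorize]

# Best of 3 single-loop runs per method, in seconds.
timings = {m.__name__: min(timeit.repeat(m, repeat=3, number=1))
           for m in methods}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:<16} {seconds * 1e3:8.2f} ms")
```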

I hope these methods help you when iterating over pandas data frames. There are even faster and more efficient methods still to cover; we will look at them in an upcoming article.

For more data science insights, connect with me on LinkedIn. https://www.linkedin.com/in/arunkumar-data-scientist/

Thanks & Regards

Arun Kumar
