How to efficiently loop through Pandas DataFrame

Wei Xia
Published in The Startup
10 min read · Dec 9, 2019


If working with data is part of your daily job, you will likely run into situations where you have to loop through a pandas DataFrame and process each row. I recently found myself in this situation: I needed to loop through each row of a large DataFrame, do some complex computation on each row, and create a new DataFrame based on the results. Savvy data scientists know immediately that this is a bad situation to be in, as looping through a pandas DataFrame can be cumbersome and time-consuming.

However, when you have no option other than to loop through a DataFrame row by row, what is the most efficient way? Let’s look at the usual suspects:

  • for loop with .iloc
  • iterrows
  • itertuples
  • apply
  • python zip
  • pandas vectorization
  • numpy vectorization
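
To make the list concrete, here is a minimal sketch of what each option looks like on the same toy problem. The DataFrame, its columns a and b, and the add function are hypothetical stand-ins chosen for illustration; they are not the computation from my actual project, and the sketch shows only the syntax of each approach, not its speed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: columns "a" and "b" are assumptions for illustration.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})


def add(a, b):
    """Stand-in for a per-row computation (assumed, deliberately trivial)."""
    return a + b


# 1. for loop with .iloc: one positional lookup per cell, per iteration
col_a = df.columns.get_loc("a")
col_b = df.columns.get_loc("b")
via_iloc = [add(df.iloc[i, col_a], df.iloc[i, col_b]) for i in range(len(df))]

# 2. iterrows: yields (index, Series) pairs, building a Series for every row
via_iterrows = [add(row["a"], row["b"]) for _, row in df.iterrows()]

# 3. itertuples: yields one namedtuple per row
via_itertuples = [add(row.a, row.b) for row in df.itertuples(index=False)]

# 4. apply with axis=1: calls the function once per row
via_apply = df.apply(lambda row: add(row["a"], row["b"]), axis=1).tolist()

# 5. Python zip over the columns
via_zip = [add(a, b) for a, b in zip(df["a"], df["b"])]

# 6. pandas vectorization: operate on whole Series at once
via_pandas = (df["a"] + df["b"]).tolist()

# 7. numpy vectorization: operate on the underlying arrays
via_numpy = (df["a"].to_numpy() + df["b"].to_numpy()).tolist()

results = [
    via_iloc, via_iterrows, via_itertuples, via_apply,
    via_zip, via_pandas, via_numpy,
]
```

All seven variants compute the same values. The last two only work when the per-row logic can be expressed as operations on whole columns, which is not always possible for complex computations.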

When I wrote my code, I had a vague sense that I should stay away from iloc and iterrows and try pandas built-in functions or apply instead. But I ended up using itertuples because what I was trying to do was fairly complex and could not be written in a form that uses apply. It turned out that itertuples did not meet my time constraint, so I Googled around and found the above list of candidates. I decided to try each of them out and record my findings as well as the reason why some options are more…
