Stop using df.iterrows()
This morning I came across an article with tips for using Pandas better. One of the claims was that df.itertuples() should be used instead of df.iterrows(). Since I had not used (or heard of) df.itertuples(), I thought I’d crack open my Jupyter Notebook and try it out.
Warning: the following was done on my train commute as a proof of concept. It may not be the most scientific explanation for why you too should prefer df.itertuples().
TL;DR: a common way to iterate over a Pandas dataframe is df.iterrows(), which yields a tuple of the row index and the row as a Series. Although so-called Pandas experts will tell you this is much better (and more “pythonic”?) than for i in range(df.shape[0]):, it turns out there’s a better way: df.itertuples().
My experiment:
Given a dataframe with a few thousand rows, I iterate each row and perform some silly operation per iteration:
import time

# Assumes df is an existing dataframe with a string column named firstName.
times_r = []
times_t = []
n = 1000  # number of rows to iterate over
for i in range(250):
    # Iterate using iterrows()
    begin = time.time()
    data = {}
    for row in df[:n].iterrows():
        row = row[1]  # iterrows() yields (index, Series); keep the Series
        key = row.firstName[:2]
        if key not in data:
            data[key] = [0]
        data[key][0] = data[key][0] + 1
    end = time.time()
    times_r.append({'begin': begin, 'end': end, 'diff': end - begin})
    # Iterate using itertuples()
    begin = time.time()
    data = {}
    for row in df[:n].itertuples():
        key = row.firstName[:2]
        if key not in data:
            data[key] = [0]
        data[key][0] = data[key][0] + 1
    end = time.time()
    times_t.append({'begin': begin, 'end': end, 'diff': end - begin})
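For anyone who wants to try this without my data, here is a minimal self-contained sketch of the same idea. The sample names and the 1000-row size are made up, and it swaps the dict-of-lists counter for collections.Counter and time.time() for time.perf_counter(), so treat it as an approximation of the experiment rather than the exact notebook code.

```python
import time
from collections import Counter

import pandas as pd

# Hypothetical stand-in data: only the firstName column name comes from the post.
names = ["Alice", "Albert", "Bob", "Bonnie", "Carol"]
df = pd.DataFrame({"firstName": names * 200})  # 1000 rows


def count_with_iterrows(frame):
    """Count two-letter name prefixes using iterrows()."""
    counts = Counter()
    for _, row in frame.iterrows():
        counts[row.firstName[:2]] += 1
    return counts


def count_with_itertuples(frame):
    """Count two-letter name prefixes using itertuples()."""
    counts = Counter()
    for row in frame.itertuples():
        counts[row.firstName[:2]] += 1
    return counts


def timed(func, frame):
    """Return the function's result and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(frame)
    return result, time.perf_counter() - start


counts_r, seconds_r = timed(count_with_iterrows, df)
counts_t, seconds_t = timed(count_with_itertuples, df)
print(f"iterrows:   {seconds_r:.4f}s")
print(f"itertuples: {seconds_t:.4f}s")
```

Both functions should produce identical counts, so any timing difference comes from the iteration method alone.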
The results:
A possible explanation for this:
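Looking at these last two plots, it seems apparent that df.iterrows() takes longer and longer as the size of the dataframe grows, while df.itertuples() stays nearly flat by comparison. Strictly speaking, both are linear in the number of rows, O(n), since each one visits every row once. The difference is the cost per row: df.iterrows() builds a new Series for every row, while df.itertuples() only builds a lightweight named tuple, so its running time grows far more slowly.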
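One way to check how each method scales is to time a full pass at a few dataframe sizes and compare. The sketch below does that. The sizes, the data, and the helper name seconds_to_iterate are all made up for illustration, and a single run per size is noisy, so repeat it a few times before drawing conclusions.

```python
import time

import pandas as pd


def seconds_to_iterate(frame, method_name):
    """Time one full pass over the frame with the named iteration method."""
    start = time.perf_counter()
    for _ in getattr(frame, method_name)():
        pass
    return time.perf_counter() - start


# Hypothetical sizes; the post does not list the sizes it used.
sizes = [1_000, 2_000, 4_000]
timings = []
for size in sizes:
    frame = pd.DataFrame({"firstName": ["Alice"] * size})
    timings.append({
        "rows": size,
        "iterrows": seconds_to_iterate(frame, "iterrows"),
        "itertuples": seconds_to_iterate(frame, "itertuples"),
    })

for entry in timings:
    print(entry)
```

Dividing each timing by its row count gives a rough per-row cost, which is the number that best separates the two methods.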
Final words
Iterating over a Pandas dataframe using df.itertuples() seems like a simple and effective alternative to df.iterrows() because its running time grows far more slowly with the size of the dataframe.
Worth mentioning: df.iterrows() yields a Pandas Series (paired with the row index), and df.itertuples() yields a named tuple. Attribute access such as row.firstName works the same way on both objects, but only the Series supports lookup by label, such as row['firstName'].
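To make the difference concrete, here is a small sketch of what each method hands back for the first row. The sample data is made up, and only the firstName column name comes from the experiment above.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"firstName": ["Alice", "Bob"], "age": [30, 25]})

# iterrows() yields (index, Series) pairs.
idx, series_row = next(df.iterrows())

# itertuples() yields one named tuple per row, with the index as its first field.
tuple_row = next(df.itertuples())

# Attribute access works on both objects.
name_from_series = series_row.firstName
name_from_tuple = tuple_row.firstName

# Only the Series also supports lookup by column label.
name_by_label = series_row["firstName"]

# The named tuple exposes the row index as .Index.
tuple_index = tuple_row.Index
```

So code that sticks to attribute access can usually switch from iterrows() to itertuples() by dropping the tuple unpacking, while code that indexes rows by label needs a small rewrite.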
Also, wrapping the call in enumerate() doesn’t affect the performance of df.itertuples(), though I didn’t take the time to post the plots for my experiments with that.
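For reference, a minimal sketch of what wrapping with enumerate looks like. The data and the non-default index are made up, chosen to show that the enumerate counter and the row's own index are two separate things.

```python
import pandas as pd

# Hypothetical data with a non-default index, for illustration only.
df = pd.DataFrame({"firstName": ["Alice", "Bob", "Carol"]}, index=[10, 20, 30])

positions = []
for position, row in enumerate(df.itertuples()):
    # position counts from 0; row.Index is the dataframe's own index label.
    positions.append((position, row.Index, row.firstName))

print(positions)
```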