Stop using df.iterrows()

Rodrigo Silveira
Aug 9, 2019

This morning I came across an article with tips for using Pandas better. One of the claims was that df.itertuples() should be used instead of df.iterrows(). Since I had not used (or heard of) df.itertuples(), I thought I’d crack open my Jupyter Notebook and try it out.

Warning: the following was done on my train commute as a proof of concept. May not be the most scientific explanation for why you too should prefer df.itertuples().

TL;DR: a common way to iterate over a Pandas dataframe is df.iterrows(), which yields a tuple of the row index and the row as a Series. Although so-called Pandas experts will tell you this is much better (and more “pythonic”?) than for i in range(df.shape[0]):, it turns out there’s a better way: df.itertuples().
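
To make the three styles concrete, here is a minimal sketch on a made-up three-row dataframe. The firstName and age values are invented for illustration and are not the data from my experiment. Each loop collects the same list of names.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({"firstName": ["Ana", "Bob", "Cid"], "age": [31, 45, 27]})

# 1. Positional loop over the row count.
names_range = []
for i in range(df.shape[0]):
    names_range.append(df.iloc[i]["firstName"])

# 2. iterrows() yields (index, Series) pairs.
names_rows = []
for index, row in df.iterrows():
    names_rows.append(row["firstName"])

# 3. itertuples() yields one named tuple per row.
names_tuples = []
for row in df.itertuples():
    names_tuples.append(row.firstName)

print(names_range, names_rows, names_tuples)
```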

My experiment:

Given a dataframe with a few thousand rows, I iterate each row and perform some silly operation per iteration:

import time

# Assumes df is an existing dataframe with a string column named firstName.
times_r = []
times_t = []
n = 1000  # number of rows to iterate per run
for i in range(250):
    # Iterate using iterrows(), which yields (index, Series) pairs
    begin = time.time()
    data = {}
    for _, row in df[:n].iterrows():
        key = row.firstName[:2]
        if key not in data:
            data[key] = [0]
        data[key][0] = data[key][0] + 1
    end = time.time()
    times_r.append({'begin': begin, 'end': end, 'diff': end - begin})

    # Iterate using itertuples(), which yields named tuples
    begin = time.time()
    data = {}
    for row in df[:n].itertuples():
        key = row.firstName[:2]
        if key not in data:
            data[key] = [0]
        data[key][0] = data[key][0] + 1
    end = time.time()
    times_t.append({'begin': begin, 'end': end, 'diff': end - begin})

The results:

df.iterrows() took a mean of 19ms to complete
df.itertuples() took a mean of 0.03ms to complete
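
For reference, a mean like the ones above can be computed from the lists of timing records along these lines. This is only a sketch: mean_ms is a helper name I made up, and the sample records are placeholder values in the same shape as times_r and times_t, not my measurements.

```python
from statistics import mean

def mean_ms(records):
    """Mean of the 'diff' field (seconds) across timing records, in milliseconds."""
    return mean(r["diff"] for r in records) * 1000

# Placeholder records, NOT real measurements.
sample = [
    {"begin": 0.0, "end": 0.002, "diff": 0.002},
    {"begin": 1.0, "end": 1.004, "diff": 0.004},
]
print(f"{mean_ms(sample):.2f} ms")
```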

A possible explanation for this:

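Looking at these last two plots, it appears that df.iterrows() takes longer and longer as the size of the dataframe grows. In other words, its running time is linear in the number of rows, O(n). In contrast, df.itertuples() looked nearly flat across the sizes I tried. Strictly speaking it cannot be O(1), since it still has to visit every row, but its per-row cost is so small that the growth barely shows at this scale. The likely reason for the gap is that df.iterrows() builds a new Pandas Series for every row, while df.itertuples() only builds a lightweight named tuple.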
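
If you want to check the scaling yourself, something like the following would do as a rough sketch rather than a rigorous benchmark. The time_loop helper, the sizes, and the synthetic data are all my own assumptions here, and the numbers will differ from machine to machine.

```python
import time
import pandas as pd

def time_loop(make_iter, repeats=3):
    """Best-of-`repeats` wall time, in seconds, to exhaust the iterator."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _row in make_iter():
            pass
        best = min(best, time.perf_counter() - start)
    return best

# Synthetic dataframe; the column name mirrors the experiment, the values are made up.
df = pd.DataFrame({"firstName": ["name%d" % i for i in range(4000)]})

results = []
for size in (500, 1000, 2000, 4000):
    subset = df[:size]
    results.append({
        "size": size,
        "iterrows": time_loop(subset.iterrows),
        "itertuples": time_loop(subset.itertuples),
    })

for r in results:
    print(r)
```

Plotting the two timing columns against size should show how each method grows.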

Final words

Iterating a Pandas dataframe using df.itertuples() seems like a simple and effective alternative to df.iterrows() because its running time stayed low and nearly flat across the dataframe sizes I tried.

Worth mentioning: df.iterrows() yields an (index, Series) pair for each row, and df.itertuples() yields a named tuple. Attribute access (row.firstName) works the same on both, but label lookup (row['firstName']) works only on the Series.
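
Here is a small sketch of what each method hands back, using a made-up two-row dataframe (the column names and values are illustrative only). Attribute access works on both objects, while lookup by a string label only works on the Series.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({"firstName": ["Ana", "Bob"], "age": [31, 45]})

_, series_row = next(df.iterrows())   # (index, Series) pair
tuple_row = next(df.itertuples())     # named tuple

# Attribute access works on both objects.
same_name = series_row.firstName == tuple_row.firstName

# By default the named tuple carries the index as its first field.
fields = tuple_row._fields

# Label lookup works on the Series but raises TypeError on the named tuple.
try:
    tuple_row["firstName"]
    label_lookup_ok = True
except TypeError:
    label_lookup_ok = False

print(same_name, fields, label_lookup_ok)
```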

Also, wrapping df.itertuples() in enumerate() didn’t noticeably affect its performance in my experiments, though I didn’t take the time to post the plots for that.
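
For completeness, this is roughly what the enumerate variant looks like. It is a sketch on made-up data with a non-default index, to show that the counter from enumerate and the dataframe's own index label are separate things.

```python
import pandas as pd

# Hypothetical toy data with a custom index, only for illustration.
df = pd.DataFrame({"firstName": ["Ana", "Bob", "Cid"]}, index=[10, 20, 30])

pairs = []
for position, row in enumerate(df.itertuples()):
    # position counts from 0; row.Index is the dataframe's own index label.
    pairs.append((position, row.Index, row.firstName))

print(pairs)
```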
