Stop using df.iterrows()

Rodrigo Silveira
Aug 9, 2019 · 3 min read

This morning I came across an article with tips for using Pandas better. One of the claims was that df.itertuples() should be used instead of df.iterrows(). Since I had not used (or heard of) df.itertuples(), I thought I’d crack open my Jupyter Notebook and try it out.

Warning: the following was done on my train commute as a proof of concept. It may not be the most scientific explanation for why you too should prefer df.itertuples().

TL;DR: a common way to iterate over a Pandas dataframe is df.iterrows(), which yields a tuple of the row index and the row as a Series. Although so-called Pandas experts will tell you this is much better (and more “pythonic”?) than for i in range(df.shape[0]):, it turns out there’s a better way: df.itertuples().
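To make the three loop styles concrete, here is a minimal sketch that runs each of them over the same toy dataframe. The data and the age column are made up for illustration; only the firstName column name comes from my experiment.

```python
import pandas as pd

# Hypothetical toy data, just to show the shape of each loop.
df = pd.DataFrame({"firstName": ["Ana", "Bob", "Cy"], "age": [31, 45, 27]})

# 1) Positional loop over range(df.shape[0])
names_range = [df.iloc[i]["firstName"] for i in range(df.shape[0])]

# 2) iterrows() yields (index, Series) pairs
names_rows = [row["firstName"] for _, row in df.iterrows()]

# 3) itertuples() yields one named tuple per row
names_tuples = [row.firstName for row in df.itertuples()]

# Peek at what itertuples() hands back for the first row
first = next(df.itertuples())
print(first)
```

All three lists end up identical; the difference is in what each loop has to build per row to get there.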

My experiment:

Given a dataframe with a few thousand rows, I iterate over the rows and perform some silly operation on each one:

import time

# Assumes `df` is an existing dataframe with a string column named firstName.
times_r = []  # timings for iterrows()
times_t = []  # timings for itertuples()
n = 1000      # number of rows to iterate over
for i in range(250):
    # Iterate using iterrows(); each item is an (index, Series) pair
    begin = time.time()
    data = {}
    for _, row in df[:n].iterrows():
        key = row.firstName[:2]
        if key not in data:
            data[key] = [0]
        data[key][0] = data[key][0] + 1
    end = time.time()
    times_r.append({'begin': begin, 'end': end, 'diff': end - begin})

    # Iterate using itertuples(); each item is a named tuple
    begin = time.time()
    data = {}
    for row in df[:n].itertuples():
        key = row.firstName[:2]
        if key not in data:
            data[key] = [0]
        data[key][0] = data[key][0] + 1
    end = time.time()
    times_t.append({'begin': begin, 'end': end, 'diff': end - begin})

The results:

df.iterrows() took a mean of 19ms to complete
df.itertuples() took a mean of 0.03ms to complete
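For reference, a mean like the ones above can be computed from the timing lists roughly as follows. This is a sketch: mean_ms is a helper name I am making up here, and the sample list holds placeholder values, not my measurements.

```python
from statistics import mean

def mean_ms(times):
    """Mean of the 'diff' entries (in seconds), converted to milliseconds."""
    return mean(t["diff"] for t in times) * 1000

# Placeholder entries shaped like those appended to times_r / times_t.
# These numbers are illustrative only, not measured results.
sample = [
    {"begin": 0.0, "end": 0.002, "diff": 0.002},
    {"begin": 1.0, "end": 1.004, "diff": 0.004},
]

print(f"{mean_ms(sample):.2f} ms")
```

Calling mean_ms(times_r) and mean_ms(times_t) after the experiment loop gives the two figures to compare.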

A possible explanation for this:

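Looking at the last two plots, it is apparent that df.iterrows() takes longer and longer as the size of the dataframe grows: its run time is clearly linear in the number of rows. In contrast, df.itertuples() looked almost flat across the sizes I tried. Strictly speaking it cannot be O(1), since it still visits every row, so it is linear too. The difference is the cost per row: df.iterrows() builds a new Series object for every row, while df.itertuples() only builds a lightweight named tuple, so its growth is barely visible at this scale.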
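If you want to check how both methods scale on your own machine, something like the sketch below should do. The sizes, the generated names, and the helper functions are all assumptions of mine for illustration; the per-row operation is the same prefix count as in my experiment.

```python
import timeit

import pandas as pd

def count_prefixes_iterrows(df):
    """Count two-letter firstName prefixes using iterrows()."""
    counts = {}
    for _, row in df.iterrows():
        key = row["firstName"][:2]
        counts[key] = counts.get(key, 0) + 1
    return counts

def count_prefixes_itertuples(df):
    """Count two-letter firstName prefixes using itertuples()."""
    counts = {}
    for row in df.itertuples(index=False):
        key = row.firstName[:2]
        counts[key] = counts.get(key, 0) + 1
    return counts

def make_df(n):
    """Build a hypothetical dataframe with n rows of repeating names."""
    names = ["Alice", "Bruno", "Carla", "Diego"]
    return pd.DataFrame({"firstName": [names[i % len(names)] for i in range(n)]})

results = []
for n in (100, 1_000, 5_000):
    df = make_df(n)
    # Best of three single runs, in seconds
    t_rows = min(timeit.repeat(lambda: count_prefixes_iterrows(df), number=1, repeat=3))
    t_tuples = min(timeit.repeat(lambda: count_prefixes_itertuples(df), number=1, repeat=3))
    results.append((n, t_rows, t_tuples))
    print(f"n={n:>6}  iterrows={t_rows * 1e3:8.2f} ms  itertuples={t_tuples * 1e3:8.2f} ms")
```

The absolute numbers will depend on your hardware and pandas version, so treat them as a rough comparison rather than a benchmark.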

Final words

Iterating over a Pandas dataframe using df.itertuples() seems like a simple and effective alternative to df.iterrows(), because its run time stayed low and consistent across the dataframe sizes I tested.

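Worth mentioning: df.iterrows() yields a Pandas Series for each row, and df.itertuples() yields a named tuple. Attribute access (row.firstName) works the same way on both, but only the Series supports label lookup such as row['firstName'], and the named tuple carries the row index as its first field, Index.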
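Here is a small sketch of where the two row objects behave alike and where they differ. The one-row dataframe and the "home city" column are made up; I picked a column name with a space in it to show how itertuples() handles names that are not valid Python identifiers.

```python
import pandas as pd

# Hypothetical one-row dataframe; "home city" is deliberately not a valid identifier.
df = pd.DataFrame({"firstName": ["Ana"], "home city": ["Porto"]})

_, series_row = next(df.iterrows())   # (index, Series) pair
tuple_row = next(df.itertuples())     # named tuple

# Attribute access works on both
same_name = series_row.firstName == tuple_row.firstName

# Only the Series supports lookup by column label
city_from_series = series_row["home city"]

# In the named tuple, invalid column names are replaced by positional names.
# Fields here are Index, firstName, and _2 (the column at position 2).
city_from_tuple = tuple_row._2

# Label lookup on the named tuple fails, because it is a plain tuple underneath
try:
    tuple_row["firstName"]
    label_lookup_works = True
except TypeError:
    label_lookup_works = False

print(tuple_row._fields, city_from_series, city_from_tuple, label_lookup_works)
```

So code written with row.columnName carries over unchanged, while code that uses row['columnName'] needs a small edit when switching to df.itertuples().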

Also, wrapping df.itertuples() in enumerate() doesn’t affect its performance, though I didn’t take the time to post the plots for my experiments with that.
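For completeness, this is the kind of enumerate wrapping I mean. It is a sketch with made-up data; the string index labels are there to show that the enumerate counter and the dataframe index are two separate things.

```python
import pandas as pd

# Hypothetical dataframe with a non-numeric index
df = pd.DataFrame({"firstName": ["Ana", "Bob"]}, index=["x", "y"])

positions = []
for pos, row in enumerate(df.itertuples()):
    # pos is the running counter from enumerate; row.Index is the dataframe index label
    positions.append((pos, row.Index, row.firstName))

print(positions)
```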

