# Stop using df.iterrows()

This morning I came across an article with tips for using Pandas better. One of the claims was that `df.itertuples()` should be used instead of `df.iterrows()`. Since I had not used (or heard of) `df.itertuples()`, I thought I’d crack open my Jupyter Notebook and try it out.

Warning: the following was done on my train commute as a proof of concept. It may not be the most scientific explanation for why you too should prefer `df.itertuples()`.

**TL;DR**: a common way to iterate over a Pandas dataframe is `df.iterrows()`, which yields a tuple of the row index and the row as a Series. Although so-called Pandas experts will tell you this is much better (and more “pythonic”?) than `for i in range(df.shape[0]):`, it turns out there’s a better way: `df.itertuples()`.
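For reference, here are the three iteration styles mentioned above, side by side. This is only a minimal sketch on a made-up three-row dataframe: the `firstName` column matches my experiment, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
df = pd.DataFrame({"firstName": ["Alice", "Bob", "Carol"]})

# 1. Positional loop over range(df.shape[0])
by_range = [df.iloc[i].firstName for i in range(df.shape[0])]

# 2. iterrows(): yields (index, Series) pairs
by_iterrows = [row.firstName for _, row in df.iterrows()]

# 3. itertuples(): yields one named tuple per row
by_itertuples = [row.firstName for row in df.itertuples()]

# All three visit the same rows in the same order.
print(by_range == by_iterrows == by_itertuples)
```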

# My experiment:

Given a dataframe with a few thousand rows, I iterate each row and perform some silly operation per iteration:

```python
import time

# `df` is the dataframe under test; it needs a `firstName` column.
times_r = []
times_t = []
n = 1000  # number of rows to iterate per run

for i in range(250):
    # Iterate using iterrows()
    begin = time.time()
    data = {}
    # The original sliced with an undefined `sz`; `n` is assumed to be the intended size.
    for row in df[:n].iterrows():
        row = row[1]  # iterrows() yields (index, Series); keep the Series
        key = row.firstName[:2]
        if key not in data:
            data[key] = [0]
        data[key][0] = data[key][0] + 1
    end = time.time()
    times_r.append({'begin': begin, 'end': end, 'diff': end - begin})

    # Iterate using itertuples()
    begin = time.time()
    data = {}
    for row in df[:n].itertuples():
        key = row.firstName[:2]
        if key not in data:
            data[key] = [0]
        data[key][0] = data[key][0] + 1
    end = time.time()
    times_t.append({'begin': begin, 'end': end, 'diff': end - begin})
```

# The results:

# A possible explanation for this:

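Looking at these last two plots, it appears that `df.iterrows()` takes longer and longer as the size of the dataframe grows. In other words, its running time is roughly linear in the number of rows. In contrast, `df.itertuples()` looks almost flat at these sizes. It can't literally be O(1), since it still has to visit every row, so strictly speaking both are O(n); the difference is the cost per row. `df.iterrows()` builds a new Series for every row, while `df.itertuples()` only builds a lightweight named tuple, so its line grows too slowly to notice on the same plot.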
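To check how each method scales, one could time the same loop at a few sizes and divide by the row count. The sketch below does that on synthetic data; `time_loop`, the name list, and the sizes are all hypothetical choices of mine, not part of the experiment above. If the per-row figure stays roughly constant as the size doubles, the total time is growing linearly with the number of rows.

```python
import time

import pandas as pd


def time_loop(df, method):
    """Run the prefix-counting loop with the given method; return elapsed seconds."""
    counts = {}
    begin = time.perf_counter()
    if method == "iterrows":
        rows = (row for _, row in df.iterrows())
    else:
        rows = df.itertuples()
    for row in rows:
        key = row.firstName[:2]
        counts[key] = counts.get(key, 0) + 1
    return time.perf_counter() - begin


# Synthetic names, invented for illustration.
names = ["Alice", "Bob", "Carol", "Dave"]

results = []
for size in (1000, 2000, 4000):
    df = pd.DataFrame({"firstName": [names[i % len(names)] for i in range(size)]})
    t_rows = time_loop(df, "iterrows")
    t_tuples = time_loop(df, "itertuples")
    results.append((size, t_rows, t_tuples))
    print(f"{size:>5} rows: iterrows {t_rows / size * 1e6:8.2f} us/row, "
          f"itertuples {t_tuples / size * 1e6:8.2f} us/row")
```

A single run per size is noisy, so treat the numbers as a rough indication only; repeating each measurement and taking the minimum would give steadier figures.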

# Final words

Iterating over a Pandas dataframe using `df.itertuples()` seems like a simple and effective alternative to `df.iterrows()`, because in my tests its running time stayed low and nearly flat as the dataframe grew.

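Worth mentioning: `df.iterrows()` yields a Pandas Series for each row (paired with its index), and `df.itertuples()` yields a named tuple. Attribute access such as `row.firstName` works the same on both, but lookup by label (`row['firstName']`) only works on the Series, and column names that aren't valid Python identifiers are replaced by positional names in the named tuple.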
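Here is a small sketch of what each method hands you per row, and how element access compares. The dataframe and its `age` column are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
df = pd.DataFrame({"firstName": ["Alice", "Bob"], "age": [30, 25]})

# Grab the first row from each iterator.
_, series_row = next(df.iterrows())   # (index, Series) pair
tuple_row = next(df.itertuples())     # named tuple; field 0 is the index

# Attribute access works on both.
print(series_row.firstName, tuple_row.firstName)

# Label lookup works on the Series...
print(series_row["firstName"])

# ...but a named tuple only accepts integer positions, so this raises TypeError.
try:
    tuple_row["firstName"]
    label_lookup_failed = False
except TypeError:
    label_lookup_failed = True

# For a named tuple, use getattr() or a position instead.
print(getattr(tuple_row, "firstName"), tuple_row[1], tuple_row.Index)
```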

Also, wrapping the iterator in `enumerate` doesn’t affect the performance of `df.itertuples()`, though I didn’t take the time to post the plots for my experiments with that.
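For completeness, this is roughly what the `enumerate` variant looks like. It is a minimal sketch on made-up data with a non-default index, to show that the counter from `enumerate` is the position in the loop while `row.Index` is the dataframe's own index label.

```python
import pandas as pd

# Hypothetical toy data with a non-default index, invented for this illustration.
df = pd.DataFrame({"firstName": ["Alice", "Bob", "Carol"]}, index=[10, 20, 30])

seen = []
for position, row in enumerate(df.itertuples()):
    # position counts from 0; row.Index is the dataframe's index label
    seen.append((position, row.Index, row.firstName))

print(seen)
```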