The Craziness of Subset Selection in pandas — an Edge Case
Selecting subsets of data in pandas is not a trivial task, as there are numerous ways to do the same thing. Different pandas users select data in different ways, so the options can be overwhelming. I wrote a long 4-part, 100-page series on it to clarify how it’s done. For instance, take a look at the following options for selecting a single column of data (assuming it’s the first column):
df['colname']
df[['colname']]
df.colname
df.loc[:, 'colname']
df.iloc[:, 0]
df.get('colname')
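As a minimal sketch of the options listed above, the snippet below runs each one against a small made-up DataFrame (the data and the second column name are assumptions for illustration) and reports the type that comes back. All but one return a Series; the double-bracket form returns a one-column DataFrame.

```python
import pandas as pd

# Hypothetical data; "colname" is deliberately the first column
df = pd.DataFrame({"colname": [1, 2, 3], "other": [4.0, 5.0, 6.0]})

# Each key is the expression as you would type it; each value is its result
selections = {
    "df['colname']": df["colname"],
    "df[['colname']]": df[["colname"]],
    "df.colname": df.colname,
    "df.loc[:, 'colname']": df.loc[:, "colname"],
    "df.iloc[:, 0]": df.iloc[:, 0],
    "df.get('colname')": df.get("colname"),
}

for expression, result in selections.items():
    print(f"{expression:<22} {type(result).__name__}")
```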
Summary of this post
In this post, I want to cover a single edge case of subset selection that I believe most pandas users are unaware of. Let’s say we have a DataFrame df and issue the following subset selection.
df[1, 2]
Deceptively simple
This appears to be quite a simple subset selection. There are so few characters on the screen. How difficult can this get? If you are a casual user of pandas, you might assume that you can easily work out what it means.
Even if you don’t know pandas well, or at all, and had to guess what this selects, you might say something along the lines of ‘the value located at the first row and second column’.
Attempt to select on a ‘normal’ DataFrame
Let’s take a look at a ‘normal’ DataFrame with string names as columns and attempt to make the selection df[1, 2].
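Here is a small sketch of that attempt. The DataFrame contents are made up for illustration; the only thing that matters is that every column name is a string.

```python
import pandas as pd

# Hypothetical 'normal' DataFrame: every column name is a string
df = pd.DataFrame(
    {"name": ["Niko", "Penelope"], "age": [30, 25], "height": [180, 165]}
)

raised = False
try:
    df[1, 2]
except KeyError as err:
    raised = True
    print("KeyError:", err)
```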
We are met with a KeyError, which is what you get when you attempt to select a column that is not in the DataFrame. This is typically triggered when you misspell a column name like this:
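A sketch of the misspelling case, again on made-up data: the column is named 'height', and the selection asks for 'hieght'.

```python
import pandas as pd

# Hypothetical data; the real column is spelled "height"
df = pd.DataFrame(
    {"name": ["Niko", "Penelope"], "age": [30, 25], "height": [180, 165]}
)

misspelling_raised = False
try:
    df["hieght"]  # misspelled on purpose
except KeyError as err:
    misspelling_raised = True
    print("KeyError:", err)
```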
Tuples as column names
Oddly enough, tuples are valid column names in a pandas DataFrame. The KeyError informs us that the tuple (1, 2) is not a column in the DataFrame. Yes, that is correct: pandas is looking for the tuple (1, 2) as a column of the DataFrame.
Let’s create a DataFrame with a tuple as a column name:
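One way to build such a DataFrame is sketched below. The values and the other two column names are assumptions for illustration. Passing tupleize_cols=False to pd.Index is used here to make sure (1, 2) stays a single label rather than being turned into a MultiIndex entry.

```python
import pandas as pd

# tupleize_cols=False keeps (1, 2) as one ordinary label in a flat Index
columns = pd.Index([(1, 2), "b", "c"], tupleize_cols=False)

# Hypothetical values
df2 = pd.DataFrame([[10, 11, 12], [20, 21, 22]], columns=columns)

print(df2.columns.tolist())
```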
The first column name is the tuple (1, 2). Any hashable object is allowable as a column name.
Repeat selection
Let’s repeat our original selection, df2[1, 2].
This successfully selects the first column from our DataFrame as a Series.
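The selection can be sketched as follows. The DataFrame is rebuilt here with made-up values so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical DataFrame whose first column name is the tuple (1, 2)
df2 = pd.DataFrame(
    [[10, 11, 12], [20, 21, 22]],
    columns=pd.Index([(1, 2), "b", "c"], tupleize_cols=False),
)

first = df2[1, 2]  # equivalent to writing df2[(1, 2)]
print(type(first).__name__, first.tolist())
```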
The rules for just the brackets
I use the term just the brackets to describe subset selection when the brackets are appended directly to a DataFrame or Series variable name. This helps differentiate it from the loc and iloc indexers, which also use brackets.
pandas has specific rules that you must know to use just the brackets correctly. The behavior of just the brackets changes based on what you place inside them. Here are the rules for different objects:
Slice
Select rows based on integer location or label. df[2:5] selects the rows at integer locations 2 through 4. df['Niko':'Penelope'] selects all rows beginning at the label ‘Niko’ up to and including the row labeled ‘Penelope’.
List
Select each column in the list and return a DataFrame. df[['age', 'height']] selects the columns ‘age’ and ‘height’ as a DataFrame.
Boolean Series or List
If you pass in a Series or list of all boolean values, pandas uses those booleans to select only the rows where True is located.
Any other object
Supplying any other object will have pandas attempt to select that column as a Series. For instance, passing the string ‘height’ to the brackets selects the column ‘height’ as a Series.
Providing just the brackets with an object that is not a column name raises a KeyError. Trying to select the boolean value True (not to be confused with the string ‘True’) produces a KeyError.
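The rules above can be sketched on one small DataFrame. The row labels and values are made up for illustration; the column names follow the examples in the rules.

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {"age": [30, 25, 41, 36], "height": [180, 165, 172, 158]},
    index=["Aria", "Niko", "Omar", "Penelope"],
)

# Slice of integers: rows by integer location, end excluded
by_position = df[1:3]

# Slice of labels: rows by label, end included
by_label = df["Niko":"Penelope"]

# List: columns, returned as a DataFrame
two_columns = df[["age", "height"]]

# Boolean Series or boolean list: rows where the value is True
from_series = df[df["age"] > 30]
from_list = df[[True, False, False, True]]

# Any other object: a single column, returned as a Series
height = df["height"]

# Any other object that is not a column name: KeyError
true_raised = False
try:
    df[True]
except KeyError:
    true_raised = True
```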
What does df[1, 2] do?
Attempting the selection df[1, 2] falls into the ‘any other object’ category from above. It is not a slice, and it is not a list. The 1, 2 inside just the brackets is received by pandas as a tuple.
Why is it received as a tuple?
In order to understand why pandas receives this object as a tuple, you must understand how the __getitem__ special method works in Python. If you define this special method for your object, then the brackets work as if they were a method that accepts a single parameter. Whatever is inside the brackets is treated as a single argument and is passed to the __getitem__ special method.
Let’s show how some of the subset examples using the brackets get translated into a call to the __getitem__ special method.
df[2:5] turns into df.__getitem__(slice(2, 5))
df['Niko':'Penelope'] becomes df.__getitem__(slice('Niko', 'Penelope'))
df['height'] becomes df.__getitem__('height')
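To check these translations, the sketch below compares each bracket expression with the explicit __getitem__ call on a small made-up DataFrame. The slice bounds are shortened to fit the four hypothetical rows.

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {"age": [30, 25, 41, 36], "height": [180, 165, 172, 158]},
    index=["Aria", "Niko", "Omar", "Penelope"],
)

# Each pair is (bracket form, explicit special-method call)
pairs = [
    (df[1:3], df.__getitem__(slice(1, 3))),
    (df["Niko":"Penelope"], df.__getitem__(slice("Niko", "Penelope"))),
    (df["height"], df.__getitem__("height")),
]

for bracket_result, explicit_result in pairs:
    print(bracket_result.equals(explicit_result))
```

You would not normally call __getitem__ directly; it is spelled out here only to make the translation visible.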
You might be asking, “Isn’t df[1, 2] passing two separate arguments to __getitem__?” The answer is no. Python treats 1, 2 as a tuple and passes that tuple as a single argument to the __getitem__ special method. Therefore, df[1, 2] becomes df.__getitem__((1, 2)). pandas receives the tuple. It is not a slice and not a list, so pandas checks whether this object is a column name. It is not a column name in the df DataFrame, so pandas raises a KeyError.
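This is plain Python behavior and has nothing to do with pandas. A minimal sketch with a hypothetical class (KeyEcho is made up for this illustration) that simply returns whatever the brackets hand to __getitem__:

```python
class KeyEcho:
    """Hypothetical helper: returns the single object passed to __getitem__."""

    def __getitem__(self, key):
        return key


echo = KeyEcho()

from_slice = echo[2:5]       # Python builds a slice object
from_string = echo["height"] # the string is passed through unchanged
from_commas = echo[1, 2]     # the comma builds a tuple
from_parens = echo[(1, 2)]   # identical to the line above

print(from_slice, from_string, from_commas, from_parens)
```

Writing echo[1, 2] and echo[(1, 2)] hands __getitem__ the same tuple, which is why pandas cannot tell them apart.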
Our other DataFrame, df2, does have a column name equal to the tuple (1, 2), so it gets selected as a Series.
There’s more — MultiIndex Selection
The rules change again for just the brackets whenever you have a MultiIndex for the columns. If you pass in a tuple, it will use the first item in the tuple as the value for the columns in the top level. It takes the second item in the tuple as the value for the columns in the next level.
Take a look at the following DataFrame with a two-level MultiIndex. There are six total columns. The top level has two values — the integers 1 and 2.
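A DataFrame of this shape can be sketched as follows. The name df3, the second-level labels 'a', 'b', 'c', and the values are assumptions for illustration; only the two top-level values 1 and 2 and the six columns come from the description above.

```python
import pandas as pd

# Two top-level values (1, 2) crossed with three hypothetical second-level labels
columns = pd.MultiIndex.from_product([[1, 2], ["a", "b", "c"]])

# Hypothetical values: two rows, six columns
df3 = pd.DataFrame(
    [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]],
    columns=columns,
)

print(df3.columns.tolist())
```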
We select all of the columns with top level value equal to 1 like this:
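A sketch of that selection, on a hypothetical six-column MultiIndex DataFrame (the name df3, the second-level labels, and the values are made up):

```python
import pandas as pd

# Hypothetical DataFrame: top level 1 and 2, second level 'a', 'b', 'c'
df3 = pd.DataFrame(
    [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]],
    columns=pd.MultiIndex.from_product([[1, 2], ["a", "b", "c"]]),
)

# A single top-level value selects every column beneath it
top_level_one = df3[1]
print(top_level_one)
```

The result is a DataFrame whose columns are the remaining second-level labels.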
To select a single column in this MultiIndex DataFrame, use a tuple just like we did above.
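A sketch of the single-column case, again on a hypothetical MultiIndex DataFrame (df3 and its labels and values are made up). The first item of the tuple picks the top level and the second picks the next level.

```python
import pandas as pd

# Hypothetical DataFrame: top level 1 and 2, second level 'a', 'b', 'c'
df3 = pd.DataFrame(
    [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]],
    columns=pd.MultiIndex.from_product([[1, 2], ["a", "b", "c"]]),
)

# Top level 1, second level 'b'; same as df3[(1, "b")]
single = df3[1, "b"]
print(type(single).__name__, single.tolist())
```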
Summary
Just the brackets makes subset selections on a pandas DataFrame and changes its behavior based on the object passed to it. Below are the objects it accepts and what it returns.
- slice — rows
- list — columns
- boolean Series or list — rows
- anything else — a single column (or multiple columns if it’s a tuple on a MultiIndex DataFrame)