5 Useful Tips for Selecting Subset By Index Label

Sasanka C
The Startup
Published in
7 min readOct 7, 2020

Pandas have a dual selection capability to select the subset of data using the Index position or by using the Index labels. In this post, I will show you how to “Select a Subset Of Data Using Index Labels Effectively” using the index label.

Image Credit — Unsplash

Remember, Python dictionaries and lists are built-in data structures that select their data either by using the index label or by index position. A dictionary’s key must be a string, integer, or tuple while a List must either use integers (the position) or slice objects for selection.

Pandas is combining the ability to select data using integers, as with lists, and labels, as with dictionaries.

Pandas have .loc and.iloc attributes available to perform index operations in their own unique ways. ). With.iloc attribute, pandas select only by position and work similarly to Python lists. The .loc attribute selects only by index label, which is similar to how Python dictionaries work.

Select a Subset Of Data Using Index Labels (.loc)

Image Credit — Unsplash

The loc method is available on both Pandas Series and DataFrame.

Import the movie dataset with the title as an index.

>>> movies = pd.read_csv("movies_data.csv",
... index_col="title",
... usecols=["title","budget","vote_average","vote_count"])

I always recommend sorting the index, especially if the index is made up of strings. You will notice the difference if you are dealing with a huge dataset when your index is sorted.

1. What if I don’t sort the index?

No problem your code is going to run forever. Just kidding, well if the index labels are unsorted then pandas have to traverse through all the labels one by one to match your query. Just imagine an Oxford dictionary without an index page, what you are going to do? With the index sorted you can jump around quickly to a label you want to extract, so is the case with Pandas too.

Read till the end, I will show you a cool trick when your index is sorted.

>>> movies.sort_index(inplace = True)
>>> movies.head(3)
budget vote_average vote_count
title
(500) Days of Summer 7500000 7.2 2904
10 Cloverfield Lane 15000000 6.8 2468
10 Days in a Madhouse 1200000 4.3 5

I have sorted the index using sort_index and “inplace = True” parameter.

2. Guessing the syntax style.

One thing interesting about the syntax of the loc method is that it does not take parenthesis() rather takes square brackets[]. I think (might be wrong) this is because they wanted consistency with data extraction i.e. you can use [] on a Series to extract rows, while applied on a Dataframe will fetch you the columns.

Enough guessing, back to code. Enter the label you want in single quotes and see the output yourself.

>>> # extract "Spider-Man 3" ( I'm not a big fan of spidy)
>>> movies.loc["Spider-Man 3"]
budget 258000000.0
vote_average 5.9
vote_count 3576.0
Name: Spider-Man 3, dtype: float64

Key error — Throwing An ugly blob for “when is next Avengers movie“

If the index label you passed is not found, pandas will throw at your face an ugly blob. Let me show you how.

>>> movies.loc["when is Avengers next movie?"]
Traceback (most recent call last):
KeyError: 'when is Avengers next movie?'

3. Selecting More with a slice

Use a slice to pull out many values. I’m going to pull the movies which I haven’t watched. Because this is a string label we are going to get all the data for our search criteria including “Avatar”.

Remember — If you work with Python List the last value is excluded but since we are working with strings it’s inclusive.

>>> movies.loc["Alien":"Avatar" ]
budget vote_average vote_count
title
Alien 11000000 7.9 4470
Alien Zone 0 4.0 3
Alien: Resurrection 70000000 5.9 1365
Aliens 18500000 7.7 3220
Aliens in the Attic 45000000 5.3 244
... ... ... ...
Australia 130000000 6.3 694
Auto Focus 7000000 6.1 56
Automata 7000000 5.6 670
Autumn in New York 65000000 5.7 135
Avatar 237000000 7.2 11800
[167 rows x 3 columns]

There is more you can do with a slice.

>>> # fetch all the movies until you reach "Avatar"
>>> movies.loc[:"Avatar" ]
budget vote_average vote_count
title
(500) Days of Summer 7500000 7.2 2904
10 Cloverfield Lane 15000000 6.8 2468
10 Days in a Madhouse 1200000 4.3 5
10 Things I Hate About You 16000000 7.3 1701
102 Dalmatians 85000000 5.1 313
... ... ... ...
Australia 130000000 6.3 694
Auto Focus 7000000 6.1 56
Automata 7000000 5.6 670
Autumn in New York 65000000 5.7 135
Avatar 237000000 7.2 11800
[379 rows x 3 columns]>>> # fetch all the movies from "Avatar" till you reach the end
>>> movies.loc["Avatar": ]
budget vote_average vote_count
title
Avatar 237000000 7.2 11800
Avengers: Age of Ultron 280000000 7.3 6767
Awake 86000000 6.3 395
Away We Go 17000000 6.7 189
Ayurveda: Art of Being 300000 5.5 3
... ... ... ...
[REC]² 5600000 6.4 489
eXistenZ 15000000 6.7 475
xXx 70000000 5.8 1424
xXx: State of the Union 60000000 4.7 549
Æon Flux 62000000 5.4 703
[4424 rows x 3 columns]

Selecting two or more not next to each other

Can I get any two or more random movies that are not next to each other? Definitely yes, but you need to put more effort into passing a list of movies you need.

What I meant was you need to have square brackets within a square bracket.

>>> movies.loc[["Avatar", "Avengers: Age of Ultron"]]
budget vote_average vote_count
title
Avatar 237000000 7.2 11800
Avengers: Age of Ultron 280000000 7.3 6767

Can I change the order of selection? Of course, you can help yourself by specifying the list of labels you need in order.

4. Do not get carried away

While this looks cool to specify the list of labels you want to extract, do you know what happens if you spelled a value wrongly? Pandas would have stuck missing Values (NaN) for the wrongly spelled label. But those days are gone, with the latest updates it raises an exception.

>>> movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]
Traceback (most recent call last):
raise KeyError(
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'

One way to take care of is by checking the values in the index directly.

>>> "When is Avengers next movie?" in movies.index
False

If you want to ignore the error and move on you can use the below approach.

>>> movies.query("title in ('Avatar','When is Avengers next Movie?')")
budget vote_average vote_count
title
Avatar 237000000 7.2 11800

5. Keeping the promise

Promise Image
Image Credit — Unsplash

Remember my step 2 where I promised to show you how sorting helps. Well if your index is sorted you can take advantage of lexicographical slicing.

  1. Load the data again into pandas.
>>> movies = pd.read_csv("movies_data.csv",
... index_col="title",
... usecols=["title","vote_average","vote_count"])
>>> movies.head(3)
vote_average vote_count
title
Avatar 7.2 11800
Pirates of the Caribbean: At World's End 6.9 4500
The Avengers 7.4 11776

2. Try selecting movie titles between Aa and Bz (remember slicing usually requires from and to ). But wait, I thought I need to know the entire label name to search.

Not really this is where I will show you the advantage with and without sorting.

>>> movies.loc["Aa":"Bb"]
Traceback (most recent call last):
raise ValueError("index must be monotonic increasing or decreasing")
ValueError: index must be monotonic increasing or decreasing

3. Sort the index in ascending order and try the same command.

>>> movies.sort_index(inplace=True)
>>> movies.index.is_monotonic
True
>>> movies.index.is_monotonic_increasing
True
>>> movies.index.is_monotonic_decreasing
False
>>> movies.loc["Aa":"Bb"]
vote_average vote_count
title
Abandon 4.6 45
Abandoned 5.8 27
Abduction 5.6 961
Aberdeen 7.0 6
About Last Night 6.0 210
... ... ...
Battle for the Planet of the Apes 5.5 215
Battle of the Year 5.9 88
Battle: Los Angeles 5.5 1448
Battlefield Earth 3.0 255
Battleship 5.5 2114
[292 rows x 2 columns]

The above commands are self-explanatory, sort the index and check if is increasing order (smallest to largest) or the other way.

4. One more, lets now change the order of sorting and try the same command and see what happens’.

>>> movies.sort_index(inplace=True, ascending=False)
>>> movies.index.is_monotonic
False
>>> movies.index.is_monotonic_decreasing
True
>>> movies.index.is_monotonic_increasing
False
>>> movies.head(3)
vote_average vote_count
title
Æon Flux 5.4 703
xXx: State of the Union 4.7 549
xXx 5.8 1424

The index is sorted in decreasing order. Now let us run our commands.

>>> movies.loc["Aa":"Bb"]
Empty DataFrame
Columns: [vote_average, vote_count]
Index: []

This is a no brainer to see the empty DataFrame as the data is sorted in reverse order. Now let me reverse the order and see the results.

>>> movies.loc["Bb":"Aa"]
vote_average vote_count
title
Battleship 5.5 2114
Battlefield Earth 3.0 255
Battle: Los Angeles 5.5 1448
Battle of the Year 5.9 88
Battle for the Planet of the Apes 5.5 215
... ... ...
About Last Night 6.0 210
Aberdeen 7.0 6
Abduction 5.6 961
Abandoned 5.8 27
Abandon 4.6 45
[292 rows x 2 columns]

I recently made a mistake in production by not understanding the loc concept thoroughly, so thought it is worth sharing the in-depth details of this attribute.

That’s it for me today. If you liked this article and if it helped you in any way, feel free to like it.

If you believe this article will be of big help to someone, feel free to share.

--

--

Sasanka C
The Startup

Data Engineering specialist at Major bank in Australia, Melbourne. Founder, author of thetechfirewall.com.