DataFrames with Pandas: Part 2(Filtering)

Md Aminul Islam
4 min readMay 13, 2023

--

In our previous section, we have been introduced with DataFrame and Series object of Pandas package. We have seen how we can load data into a DataFrame from a CSV file. We have also learnt how we can slice across rows and columns using loc and iloc properties of DataFrame. If you do not read the previous article, I strongly recommend to go through the Part-1 before proceeding this one. Today, we will see how we can filter rows, and some other useful methods of DataFrame and Series.

Filtering Rows

Data Scientists sometimes need a subset of the whole DataFrame based on some conditions. For this, they need to filter out the DataFrame using the required conditions.

Let’s at first load our elections DataFrame on which we have worked in our previous sections.

import pandas as pd

elections = pd.read_csv("elections.csv")
elections

Let’s say, now we only want to filter out the winners from the above elections DataFrame. For this, we need to check (i) if the value of column Result is equal to win and (ii) keep only those rows.

The below code will compare if the column Result values equal to win.

elections['Result'] == 'win'

The above comparison returns a Series object of boolean values which is actually equivalent to a boolean array. Now, we need to keep those rows which evaluate to True .

elections.loc[elections['Result'] == 'win', :]

We can also write the above filtering expression using shorthand expression. The output will be same as the above output.

elections[elections['Result'] == 'win']

We can also filter rows based on multiple criteria. Let’s say, we want to filter out winners either from Democratic party or from Republican party before 2000. The expression will look like this:

elections[(elections['Year'] < 2000) & 
(elections['Result'] == 'win') &
((elections['Party'] == 'Democratic') |
(elections['Party'] == 'Republican'))]

Important DataFrame and Series Methods:

For sorting the DataFrame based on a column, sort_values method is used.

elections.sort_values(by='%', ascending=False)

We can also sort based on a user defined function. For this, we need to provide the key parameter inside the sort_values method. If we want to sort the DataFrame based on candidate names length, then the code will look like this:

def name_len(name):
return name.str.len()

elections.sort_values(by='Candidate',
key=name_len,
ascending=True)

We can also use sort_values method on a Series, which returns a copy with the values in order.

elections['Popular vote'].sort_values(ascending=False)

Series has another function value_counts which returns a new Series that contains the counts of every value.

elections['Party'].value_counts()

Series also has a function uniquewhich returns all unique values of a Series as an array.

elections['Party'].unique()

Now, let’s say we want to filter out rows only for Democratic, Republican, and Whig parties. A typical filtering expression will look like this:

elections[(elections['Party'] == 'Democratic') |
(elections['Party'] == 'Republican') |
(elections['Party'] == 'Whig')]

But we can use DataFrame isin method which wraps all the provided party names into an array. The following query will look more nicer than the previous one.

elections[elections['Party'].isin(['Democratic', 'Republican', 'Whig'])]

Today, we have been introduced how we can filter rows, and some other useful methods of DataFrame and Series like sort_values , value_counts, unique , isin . You can see this GitHub repository where all the examples and dataset mentioned here are available. In our next part, we will see how we can aggregate rows in a DataFrame. You can read the part 3 from here. Please follow or subscribe myself if you want to be notified for future posts related to DataScience, Machine Learning, and Casual Inference.

--

--

Md Aminul Islam

I am a PhD student in Computer Science at University of Illinois Chicago. My interests are in reasoning with data using Data Science and Machine Learning.