3 ways to filter Pandas DataFrame by column values
Some flexible approaches to combine multiple filters
While looking for book recommendations on Goodreads, I came across this quote:
“In the Information Age, the first step to sanity is FILTERING. Filter the information: extract for knowledge.” — Marc Stiegler
I wondered how it applies to data science. The quote fits perfectly with the most significant and most underrated step in the entire data science process: data preprocessing. Data scientists enjoy building models so much that they often overlook this step, yet it can be intriguing in its own right. As the quote suggests, filtering extracts knowledge, and filtering data can restore some sanity when you are stuck deciding which variables to fit on the model. In this article, I’ll share some quick ways of filtering data using Pandas.
To jump in, I have obtained this dataset from Kaggle. The dataset contains some sample sales data and looks like the one below.
A simple look into the data and what it has to say:
Looks good! Let’s dive in.
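The loading step can be sketched as follows. The Kaggle file name and the column names mirror the sample sales dataset, but since the file may not be at hand, a tiny hand-made stand-in DataFrame with the same kinds of columns is used; its values are made up for illustration.

```python
import pandas as pd

# The Kaggle file would normally be loaded with something like:
# df = pd.read_csv("sales_data_sample.csv", encoding="latin-1")
# A small stand-in frame with similar columns (values are invented):
df = pd.DataFrame({
    "CITY": ["Madrid", "Paris", "Madrid", "NYC"],
    "COUNTRY": ["Spain", "France", "Spain", "USA"],
    "STATUS": ["Shipped", "On Hold", "On Hold", "Shipped"],
    "YEAR_ID": [2004, 2005, 2005, 2003],
    "SALES": [12000.0, 4500.0, 800.0, 7600.0],
    "QUANTITYORDERED": [95, 30, 12, 60],
    "DEALSIZE": ["large", "medium", "small", "medium"],
})

# A simple look into the data: dimensions, dtypes, first rows
print(df.shape)   # (4, 7)
print(df.dtypes)
print(df.head())
```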
1) Filtering based on one condition:
There is a DEALSIZE column in this dataset whose value is either small, medium, or large. Let’s say we want to know the details of all the large deals. A simple way would be:
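A minimal sketch of that one-condition filter, using a small stand-in DataFrame (the column name follows the Kaggle file; the rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "DEALSIZE": ["large", "medium", "small", "large"],
    "SALES": [12000.0, 4500.0, 800.0, 9800.0],
})

# The comparison yields a boolean mask; .loc keeps only the True rows
large_deals = df.loc[df["DEALSIZE"] == "large"]
print(large_deals)
```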
Filtering is pretty straightforward here: you pick the column and match it against the value you want. A common source of confusion when filtering in Pandas is the choice of conditional operators, because the plain Python keywords do not work on Series.
Note: In Pandas, and is replaced with &, or is replaced with |, and not is replaced with ~.
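The note above can be demonstrated on a throwaway Series: Python's `and` raises an error on a Series, while the element-wise operators work (note that each comparison must be parenthesised, because &, |, and ~ bind more tightly than comparisons).

```python
import pandas as pd

s = pd.Series([1, 6, 11])

# Python's `and` fails: the truth value of a whole Series is ambiguous
try:
    bad = (s > 5) and (s < 10)
except ValueError as e:
    print("`and` raised:", e)

# Element-wise replacements work as intended
mask = (s > 5) & (s < 10)     # and  -> &
either = (s < 2) | (s > 10)   # or   -> |
negated = ~(s > 5)            # not  -> ~
print(mask.tolist())   # [False, True, False]
```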
Exploring the data, I found that Madrid is the top-ranking city in terms of revenue. I’d like to compare the sales details of Madrid against all the other cities. This can be achieved by assigning conditions to variables.
Assigning conditions to variables can be a good practice since it creates a mask that can be applied whenever we want to access data under such conditions.
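A sketch of the mask-variable idea, again on invented stand-in data: the condition is stored once and reused, and ~ flips it to get every other city.

```python
import pandas as pd

df = pd.DataFrame({
    "CITY": ["Madrid", "Paris", "Madrid", "NYC"],
    "SALES": [12000.0, 4500.0, 800.0, 7600.0],
})

# The condition, stored as a reusable boolean mask
in_madrid = df["CITY"] == "Madrid"

madrid_sales = df.loc[in_madrid, "SALES"].sum()     # Madrid rows
other_sales = df.loc[~in_madrid, "SALES"].sum()     # everything else
print(madrid_sales, other_sales)
```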
2) Filtering based on multiple conditions:
Let’s see if we can find all the countries where an order is on hold in the year 2005.
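A sketch with two conditions combined inside .loc; the column names (STATUS, YEAR_ID, COUNTRY) follow the Kaggle file, the rows and the exact "On Hold" spelling are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "COUNTRY": ["Spain", "France", "Spain", "USA"],
    "STATUS": ["Shipped", "On Hold", "On Hold", "Shipped"],
    "YEAR_ID": [2004, 2005, 2005, 2003],
})

# Each parenthesised comparison is a boolean Series; & combines them
# row-wise, and "COUNTRY" after the comma selects the column returned
on_hold_2005 = df.loc[
    (df["STATUS"] == "On Hold") & (df["YEAR_ID"] == 2005), "COUNTRY"
]
print(on_hold_2005.tolist())   # ['France', 'Spain']
```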
Inside .loc, the condition within the parentheses evaluates to a boolean Series that selects the matching rows; the label after the comma picks the column(s) to return.
What multiple filters give you depends on how the result is used. Combining masks with & or | simply produces another boolean Series, and indexing with that Series returns a filtered copy of the data. Assigning through .loc with such a mask, on the other hand, modifies the DataFrame itself.
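The copy-versus-in-place distinction in a minimal sketch (stand-in data):

```python
import pandas as pd

df = pd.DataFrame({
    "STATUS": ["Shipped", "On Hold"],
    "SALES": [100.0, 200.0],
})

mask = df["STATUS"] == "On Hold"

# Plain indexing returns a filtered copy; df itself is untouched
subset = df.loc[mask]

# Assigning through .loc with the mask modifies df in place
df.loc[mask, "SALES"] = 0.0
print(df["SALES"].tolist())   # [100.0, 0.0]
```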
The list of conditions applied to a DataFrame can grow drastically. Let’s consider a use case: Spain ranks second in total revenue generated, so let’s see if there are any orders in Spain where the Sales didn’t cross 5000 and the Quantityordered is less than 50. This can be done in two ways:
- Either hard-code the list
- Or build a list that is dynamically evaluated based on the criteria
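Both options from the list above can be sketched like this, on invented stand-in data. The second option builds the masks as a Python list and folds them together with functools.reduce and the & operator, so new criteria can be appended without rewriting the expression:

```python
import functools
import operator

import pandas as pd

df = pd.DataFrame({
    "COUNTRY": ["Spain", "Spain", "France", "Spain"],
    "SALES": [4200.0, 8000.0, 3000.0, 1500.0],
    "QUANTITYORDERED": [30, 20, 10, 70],
})

# 1) Hard-code the chain of conditions
hard = df.loc[
    (df["COUNTRY"] == "Spain")
    & (df["SALES"] < 5000)
    & (df["QUANTITYORDERED"] < 50)
]

# 2) Build the list of masks, then AND them all together
conditions = [
    df["COUNTRY"] == "Spain",
    df["SALES"] < 5000,
    df["QUANTITYORDERED"] < 50,
]
dynamic = df.loc[functools.reduce(operator.and_, conditions)]

print(len(hard))   # 1 matching order
```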
3) Implementing if-then constructs:
When solving problems, a format such as this is most common: if a condition holds, assign one value; otherwise, assign another.
Consider a case where a new column called Income Statement is created that contains three categories: if Sales is greater than 10000 it’s considered a gain, if it falls in the range 5000–10000 it’s considered no change, and anything less than that is a loss.
Let’s look at a few ways of implementing this:
- Define a function that implements this logic and apply it to the Sales column with .apply
- Use a lambda function
- Use a plain loop (or list comprehension), which can be faster than .apply
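The three approaches above can be sketched as follows, on a small stand-in Sales column (the values and the exact category labels are assumptions based on the description):

```python
import pandas as pd

df = pd.DataFrame({"SALES": [12000.0, 7500.0, 3000.0]})

# 1) A named function applied element-wise to the Sales column
def classify(sales):
    if sales > 10000:
        return "gain"
    elif sales >= 5000:
        return "no change"
    return "loss"

df["Income Statement"] = df["SALES"].apply(classify)

# 2) The same logic inline as a lambda
df["via_lambda"] = df["SALES"].apply(
    lambda s: "gain" if s > 10000 else ("no change" if s >= 5000 else "loss")
)

# 3) A plain list comprehension, often faster than .apply
df["via_loop"] = [classify(s) for s in df["SALES"]]

print(df["Income Statement"].tolist())   # ['gain', 'no change', 'loss']
```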
These are three pandas-friendly ways of arriving at the solution, and there are many other alternatives. Feel free to check my GitHub repo here, which contains all the code mentioned in the article!
Thank you for reading all the way down here. Let me know in the comment section if you have any concerns, feedback, or criticism. Have a good day!