Row, Row, Not This Row, Your Boat? Selecting & Filtering Data in Pandas

Chris Bruehl
Learning Data
9 min readJun 22, 2023

--

Photo by Shengjun Shi on Unsplash

Selecting the right rows and columns in your dataset is a critical task for just about any data analysis. Of course, because of its importance, there are a number of ways to do this in the Pandas library in Python. In this article I’ll walk through the selection methods you need to know and provide my thoughts on which are best for the given task. Let’s start simple.

DataFrame Indices

DataFrames have an index for rows and an index for columns. In most cases, when we first read in data, our row index is the default, integer based index, which starts at 0 and increments by 1 for each row.

The column index is the names of columns. Under the hood, each of our columns is a Pandas Series (a fancy NumPy array) stitched together with a common row index.

To reference the entire row based index, we can specify df.index. In this example, our DataFrame is stored in the variable ‘cars’.

This indicates our default index, which starts at 0, ends at 5 (not inclusive), and increments by 1 for each row.

For columns, we can specify df.columns to return our column index, which is the list of column names:

Understanding your DataFrame’s indices is critical to making data selection intuitive.

Bracket Notation for Selecting Columns & Rows

The most simple method for selecting columns is via bracket notation. To select a single column, we can pass reference the dataframe name, followed by brackets containing the column name we wish to select.

Below, we select the ‘body’ column by passing the name into brackets. Note that using single brackets returns output as a Series.

Pass in double brackets to return a DataFrame. This seems quirky until we take a look at multiple column selection:

What’s really happening with double brackets is we’re passing a list to the column selectors. So Pandas is preparing to selecting multiple series, or columns, and a multi-series object in Pandas is a DataFrame. Below, we select three columns.

Because the input into column selection is a list, we can even get fancy and use a list comprehension to select columns logically based on their names.

Below, we’re using a list comprehension to select only columns where the first character (index 0) is ‘c’. While not the most practical use case, we can use similar logic to select groups of columns with a common suffix, prefix, etc.

To select rows, rather than columns. We need to supply a slice. Single values are reserved for columns, but adding a colon as below will allow us to slice the rows we need. Here we’re just grabbing the first row by slicing from 0 to 1 (not inclusive).

By increasing the endpoint of our slice by 1 to 2, we grab the first two rows.

Beyond simple slicing, we tend to use logical conditions to subset rows, e.g. “make” == ‘Kia’ if we only want to look at Kia cars, and so on. We’ll explore those shortly, but first let’s look at one more clever way to grab columns using some helpful builtin functions.

The select_dtype() method allows us to select columns in our DataFrame matching one ore more provided datatypes:

Here we’ve selected our numeric columns by passing “float” and “integer” into the list of datatypes to retrieve.

Logical Filters

More often than not, we tend to select columns specifically, that is to say we only need columns ‘a’, ‘b’, and ‘c’ for our analysis, so we’ll grab those and ignore the rest. With rows, we tend to want to select them logically. Pandas has a few mechanisms for doing this. The most common (not my favorite, but necessary to know) is using a Boolean condition within brackets to subset your data.

Above, we’re passing a Boolean filter into brackets. This particular one is saying, select rows from the car DataFrame where the value in the “make” column equals “BMW”. Under the hood, Pandas is grabbing row indexes where the condition is True. We can see what this looks like by removing the outer brackets and DataFrame reference:

So, when we add back the outer brackets, Pandas knows to return index positions 2 and 4 because these returned True in our logical test.

To compare a column to multiple values, we can use the .isin() Series operator.

Multiple conditions can be also passed into brackets. For more than one condition I almost always use the .query() method, but again, not everyone shares my love or knowledge of that method.

Below, we’re using two logical conditions to grab a specific make and model. I like to store multiple conditions in separate variable as they can get really challenging to read.

We must wrap each individual condition with parentheses and use either ‘&’ for ‘and’, or ‘|’ for or. This will allow Pandas to retrieve the indices of rows that match the conditions provided.

.loc[] & .iloc[]

The .loc[] and .iloc[] methods are concise alternatives to the above, and allow for simultaneous column and row selection. Leveraging these methods as an alternative to standard bracket based filtering tends to be ideal, because they allow for cleaner chaining, tackle both columns and/or rows, and there are some edge cases (if your indices themselves have Boolean labels) where this method will fail.

First, let’s look at iloc[], which operates on the position of the rows/columns regardless of their labels. To select just rows and all columns, we only need to supply a single argument. This can be a slice or a list. Below we’re grabbing the first three rows of our DataFrame. In iloc[] the endpoints of your slices is not inclusive.

To select only columns, we need to pass a colon into the rows argument, indicating we’ll select all rows, add a comma, then pass the positions of the columns we wish to select. Below we’re grabbing all rows, and columns in positions 1 and 3. Remember Python is 0 index, so intuitively these correspond to the second and fourth columns in the DataFrame.

And of course, bringing it all together, we can select rows and columns simultaneously with a single call to iloc.

The .loc[] method operates on labels rather than position. In many DataFrames, our row index is the default positional integer index, so it’s very similar to iloc[] for rows in these cases. But it makes column selection based on name possible. One major difference between loc and iloc is that the end of slices in iloc is NOT included, while the endpoint in loc IS included.

Below we can see that our :3 slice grabbed one additional row with .loc. We grabbed the same columns using their names which is generally much easier for us to remember than thinking positionally.

As with .iloc, to select all rows but only columns with .loc[], pass in a naked colon to the row portion of loc.

Logical filters can be passed into the row portion of loc. Let’s take a quick look.

These two methods are my go to data selection methods in Pandas, particularly if I’m selecting more than a single column/row. Getting into the habit of using these rather than bracket notation is a best practice.

.query()

The query method is one of Pandas best kept secrets. Ok, so it’s not exactly a secret but many courses don’t go beyond the above selection methods and thus many Pandas users don’t know what they’re missing.

Query allows us to use more human friendly syntax to filter DataFrames. Let’s look at a basic example below.

All we need to do is call the query method, open up some quotes, and pass in the filter we want to perform. Admittedly, a single condition doesn’t look much better from our filtering logic above. Let’s look at two conditions:

Here we’ve used a standard Python membership test to check whether the car make was either BMW or Kia, and also filtered down to cars that have a value of 10000 or less for odometer. We can use the words ‘and’ and ‘or’ rather than ‘&’ and ‘|’. The more conditions we add, the more readable query becomes relative to traditional boolean indexing.

One other very nice feature of .query() is the ability to reference external variables using the @ symbol.

This yields the same result but makes it much easier to read and reference as the length of our list increases.

One thing that query does not do for us is column selection. But we can easily combine query with .loc[] or bracket notation. Here, I’m going to query first then select columns, so I can use all columns to filter before reducing down to the data I need.

Note that opening up parentheses allows us to place each operation on its own line, which helps manage line length and makes it easy to understand what is happening in sequence to our DataFrame.

Recommendations

In general, all of the methods discussed here are functional ways to select columns and rows. In my opinion, .loc[], .iloc[] should be used rather than bracket notation any time you’re selecting more than a single column or row, and often times even if you are. .query() is a superior method for filtering rows thanks to its readability. Its benefits also increase as the complexity of your filtering conditions increase. Whether you take my recommendations or not, I hope this article makes selecting and filtering data just a bit easier the next time you do it!

If you liked this, we’d love for you to subscribe to our publication.

And drop us a comment letting us know what you’d like to see next. You can also submit your own writing for us to promote, so don’t be shy if you would like to become a contributor. Check out submission instructions here.

Happy learning!

--

--