That Week in Data Science

Ready, Set, Subset

Selecting and viewing data with pandas and python

6 min readAug 10, 2020

Photo by Guilherme Stecanella on Unsplash

Even if you are fairly new to working with the pandas software library for data science, you likely already know how versatile it can be for searching and selecting within datasets. You may also have come upon the need to narrow your view, from a large number of columns or rows to a more targeted subset of data.

This article will take a look at how we use methods built into pandas to easily view data subsets.

Indexers

The .loc and .iloc indexers make it possible to explore subsets of data (in columns, rows, summary groups…) and to quickly adjust and manipulate those subsets and views on a variety of levels. Indexers are often used when cleaning data that has just been read into pandas or to drill-down to level best suited for visual exploration. [We will not delve into plotting in this article, as the number of code examples included should give us more than enough to look at.]

Given an appropriate data source, you can use indexers to thoughtfully compose a single line of Python code to limit your view of a dataframe. Before we take a look at some examples, we can consider the following:

Given a dataframe listing the name, height, weight, active years, first professional year played, last professional year played, jersey number, yearly contract value, and shoe-size for athletes in the NBA since 1975, you may only want to view jersey numbers for players who were active in 2006.
Or, you might identify playgrounds with a swing-set located within one mile of a city’s waterfront, from a dataframe of all the cities public spaces (given that the data includes distance from the waterfront or a way to calculate it).

Obviously, options will depend on your dataset. My point is that indexers provide a lot of opportunity to gain better views over data. One of the most important reasons for their flexibility lies right in front of our eyes.

Image of two square-top cafe tables placed together to make one longer table. — I actually just need a table for two. Image credit.

Some of pandas’s built-in methods enable you to select subsets of data and return the subsets themselves as dataframe objects. That means you can operate on them in the same ways you might on any other dataframe to gain meaningful insight. Let’s see what that looks like, with a few excerpts from a python notebook I built for just this purpose.

Code Examples

I will use a data set from a 2014 fivethirtyeight repository containing data and code for a story on the economics of picking a college major. The ‘recent-grads.csv’ file contains a detailed breakdown of labor force information for those in the workforce that have completed graduate school.

We use the ‘df’ variable to instantiate the csv data as a pandas dataframe, which we read directly into our jupyter notebook (Google Collaboratory, in this case).

First, we’ll get an idea of how many rows and columns we are dealing with:

Since .shape returns a tuple, in this case with 173 in the 0th position and 20 in the 1st position, we can retrieve each of its values by index.

Checking the first few rows using python’s built-in .head() function returns a dataframe object.

These are the first five rows of our ‘df’ dataframe.

Inspecting the data, we see that the dataset starts off with a lot of engineering majors. I am already curious as to how many of the entries are from people who majored in engineering. That is a question we can answer rather easily with Pandas.

Let’s stay in the shallow end of this concept. Just as we viewed the first few rows, using the .head() function, we can view the last few rows using .tail(). Let’s inspect the last two rows, only:

Clearly we are not only dealing with engineers.

We can dip our toes a little deeper and recreate .tail() as follows, using the .iloc indexing method:

Since the indexer returns a dataframe subset, we can also use an indexer to subset even further. For example, if we only want the ‘Major_code’, ‘major’, and ‘Total’ columns of those same, last two rows:

Note: ‘Rank’ is returned as an index and is not actually part of the request.

We used .iloc to slice the dataframe and access all rows from the .tail() function (using blank spaces on either side of the first colon to get every row from the beginning and every row through the end) and columns from 0 up-to-but-not-including 3.

Image of a hand holding a clear bowl with two quarters of melon that have been sliced into segments, resting on their rinds. — Photo by AI FENG Hsiung on Unsplash

Recall that the purpose of this example is to show that you can operate on these built-in functions and methods just as you would any other dataframe.

Granted, these examples are not necessarily simpler. Let’s see what it looks like when we retrieve the same data, without .tail():

Exceptions

Not all built-in functions / methods return dataframe objects. Python’s .info() will return a vertical list of column names along with the object type of each column:

The .info() method returns what is essentially a printout. It is not a dataframe object. If we check its type, it will print out the same info.

At least that time it admits to being ‘NoneType’, so we clearly cannot perform any operations on it.

We learn from .info() that all of the columns in the dataframe are numeric, with the exception of ‘Major_category’. That bring us to our next function, which frankly is the real reason we looked at .info().

.describe()

Python’s .describe() returns basic statistics for a dataframe object.

If you think the output of the describe() function looks a lot like a dataframe, you are correct. And we can access each of its columns with dot-notation:

…Or, we can do what we came here for and access information from the dataframe’s .describe() function, using our indexers.

We retrieved only the 25th percentile statistic from the dataframe’s fifth -row (index number 4), and we found that statistic by requesting columns from the start of the dataframe up-to-but-not-including index 2.

Of course, we can view the same data using .loc:

How’s that for idosynchratic granularity, eh?

Ehem!

If you recall the row labels, you can limit your view to a specific subset of statistics:

And they are returned as dataframe objects; so you can further reduce the views to include specific columns. Below, I skip one column, then two, then one…:

Finally, we conclude our sub-setting exercises with a more meaningful and — without too much dificulty, consumable — view of the minimum, mean, and max statistical counts of workforce employed and unemployed, full-time and part-time jobs, jobs requiring and not requiring college education, low-wage service jobs, and the number of men and women in the dataset.

That’s what you were looking for, right?

Image credit

Conclusion

In this article, we discussed a few ways to subset data in pandas. Slicing dataframes can help analysts focus on particular elements within data. Functions and methods, including .loc and .iloc, are useful for such efforts; they can be operated upon using the same methods as the full dataframe.

Sometimes, recognizing the results of built-in python methods as dataframe objects can give you a sharper view of your data. Other times, it might just give you an appreciation for their understated elegance.

It only matters how you slice it.