7 Pandas Tricks (with applications) You Need to Know as a Data Scientist

Josh Bernhard
The Startup
10 min read · Aug 10, 2020



Pandas is one of the most popular libraries in all of data science. If you are using Python for data science, you no doubt are using pandas. In this article, we will see a few techniques that make pandas wildly popular.

To illustrate how each of the tips works, we will be using the billboard.csv dataset, which you can download from GitHub here if you’d like to follow along with this post. We will import some libraries and read in the data.
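A minimal setup sketch (this assumes the file is saved as billboard.csv in your working directory; adjust the path as needed):

> import pandas as pd
>
> # read the Billboard data: one row per song, with weekly chart positions
> df = pd.read_csv('billboard.csv')
> df.head()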

When you are all set, you should see the output below after running the code above.

Data header

With that, let’s dive into the tricks.

1. value_counts

If you have a column that holds a categorical variable, one of the best ways to understand the number of each value that appears is to use value_counts. For example, we could look at the genre column to understand how often each genre appears in the dataset.

In  [1]: df['genre'].value_counts()
Out [1]: Rock 137
Country 74
Rap 58
R&B 23
Pop 9
Latin 9
Electronica 4
Gospel 1
Reggae 1
Jazz 1

Additionally, you can see the proportion associated with each genre by using the normalize=True argument:

In  [2]: df['genre'].value_counts(normalize=True)
Out [2]: Rock 0.432177
Country 0.233438
Rap 0.182965
R&B 0.072555
Pop 0.028391
Latin 0.028391
Electronica 0.012618
Gospel 0.003155
Reggae 0.003155
Jazz 0.003155

This output also makes for an easy bar plot to follow:

> df['genre'].value_counts().plot(kind="bar");
Simple bar chart of genre

2. apply

Next up is the apply function. Using the above dataset, we might want to change the time column to a number rather than the default object type, which has individual elements of type str, in order to use it for mathematical operations. The function below takes an individual time string and returns the time as a decimal number of minutes.
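A minimal sketch of such a function (this assumes the time column holds "minutes:seconds" strings like "3:38"; the name clean_time is a convention carried through the snippets below):

> def clean_time(time_str):
>     # "3:38" -> 3 + 38/60 = 3.633... minutes
>     minutes, seconds = time_str.split(':')
>     return int(minutes) + int(seconds) / 60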

Using this function across all the rows can be done with the apply function as shown in the following code snip.

> df['time_mins'] = df['time'].apply(clean_time)

As an extension, imagine we want to create a column that keeps track of whether a song is "Rock" or not, as well as if it is longer or shorter than 3.5 minutes. If a song meets both of these criteria, we could create a column that stores a 1. If it doesn’t meet the criteria, this same column could store a zero.

We can use the apply function again, but on the full row. By using axis=1, we can apply the function to a full row, as sketched below.
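A sketch of that row-wise function (assuming "meets both criteria" means Rock and longer than 3.5 minutes; rock_long is a hypothetical column name):

> def rock_and_long(row):
>     # 1 if the song is Rock AND longer than 3.5 minutes, 0 otherwise
>     return int((row['genre'] == 'Rock') and (row['time_mins'] > 3.5))
>
> df['rock_long'] = df.apply(rock_and_long, axis=1)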

3. query

One of the most powerful pandas methods in all of the land is the query method. This method allows you to pull whatever set of rows you are interested in.

Let’s say you are interested in creating a dataframe with only Rock songs. Using query, we can provide this dataframe with the following code block:

> df.query('genre == "Rock"')

Before we get to the next query, we will need to update the names of the columns in our dataset. We can do this with the following line of code:

> df.columns = [col.replace('.', '_') for col in df.columns]

In general, it is considered best practice to use _ between words in a column name. Additionally, . in column names is considered poor practice.

Now we can go back to the query method. If you want only Rock songs that have a first week ranking in the top 30, we could write the following:

> df.query('genre == "Rock" and x1st_week < 31')

If we want Rock or Country songs in a returned dataframe, we can retrieve only these rows with the following query:

> df.query('genre == "Rock" or genre == "Country"')

This same logic can be used to pull any sort of logical set of rows that you might want to return.
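As one more variant, query also understands an in expression, which reads nicely as the list of values grows:

> df.query('genre in ["Rock", "Country"]')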

4. datetime

In many languages working with dates can be a huge pain, but pandas makes working with dates a walk in the park.


By default, date columns are often read in as object columns, which is what happens with the billboard.csv dataset. You can convert the two date columns to pandas dates using the built-in to_datetime as shown below:

> df['date_entered'] = pd.to_datetime(df['date_entered'])
> df['date_peaked'] = pd.to_datetime(df['date_peaked'])

Once a column is of type datetime, we gain a bunch of new, easy-to-use methods. Let’s take a look at a few of them.

If we want to know how many days it takes from entry to the billboard top 100 to reach a song’s peak spot, we can directly subtract dates.

> df['days_to_peak'] = df['date_peaked'] - df['date_entered']
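The result is a column of Timedelta objects; if you would rather have plain integer day counts, the dt accessor can extract them:

> df['days_to_peak'] = df['days_to_peak'].dt.days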

We might want to pull the day, week, or month (and, if the original date column includes them, the hour, second, or millisecond values) from this column. We can again use apply to pull any of these values from each date. Below is an example of pulling the month from each date.

> df['date_entered'].apply(lambda dt: dt.month)

You can also pull the day and year in similar ways. We might store all of these as their own columns on the dataframe.

> df['day'] = df['date_entered'].apply(lambda dt: dt.day)
> df['month'] = df['date_entered'].apply(lambda dt: dt.month)
> df['year'] = df['date_entered'].apply(lambda dt: dt.year)

Imagine you want to identify the best of the fast-climbing songs. Say we define these as songs that reach their peak spot within one month of landing in the top 100.

We might start by finding the date range we are considering for each song. We can find the date one month (30 days) past the entry date using the timedelta import as shown below.

> from datetime import timedelta
> df['fast_risers_date'] = df['date_entered'] + timedelta(days=30)

Simple! If you want a different ending date, simply change the argument passed to timedelta.

Identifying these rows and adding the identifier to the dataset can be done by comparing this fast_risers_date to the date_peaked.

> df['fast_risers'] = df['fast_risers_date'] > df['date_peaked']

What if we wanted to extend this to songs where the peak spot was in the top 10? Well in order to do this, we might use some of the tricks covered in the next section.

5. indexing and masks


There are two common and “best practice” methods for indexing into a pandas dataframe: loc and iloc.

  • loc indexes by label (the names of the rows and columns).
  • iloc indexes by integer position.

In order to find the peak spot for any particular song, we might use either of these indexing methods. We need to pull only the weekly columns, and then we want the highest rank, which is actually the min value across each row.

We can pull these rows like this:

> df.iloc[:, 7:-2]

This says pull all the rows with the first :, and the second argument, 7:-2, chooses from the column at position 7 (the eighth column, since positions are zero-indexed) up to, but not including, the second-to-last column. If we hadn’t added new columns to the end of the dataframe, we could simply go all the way to the end using 7:.

There are a couple of general, useful takeaways from this indexing. By providing no value for the index, we are telling python that we want to go all the way to one end of the dataframe or the other. No value on either side of the : says to go from the very beginning to the very end. This same type of slicing works with python lists.

Now to pull the highest rank for each song.

> df.iloc[:, 7:-2].min(axis=1)

The argument axis=1 provides the top rank for each row. Otherwise, the min is provided for each column by default. Now to show the other method of indexing using loc, we can pull the rows where the peak date was within the first month as shown here:

> df.loc[df['fast_risers_date'] > df['date_peaked']]

We can add the next bit of logic concerning those where the peak spot landed in the top 10. This can also be done using loc, as sketched below.
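A sketch of that combined filter (this assumes the weekly columns still sit at positions 7:-2, as above, and uses 10 as the peak cutoff):

> peak_rank = df.iloc[:, 7:-2].min(axis=1)
> df.loc[(df['fast_risers_date'] > df['date_peaked']) & (peak_rank <= 10)]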

You might be thinking, “I can do this with query.” You would be correct. There are often multiple ways to do the same thing.

Another way we might perform this same indexing of columns is by identifying the weekly columns and pulling them from your dataframe:

> week_cols = [col for col in df.columns if col.find('week') > 0]
> df[week_cols]

The find method returns the index where the string week is found; if it doesn’t exist, -1 is returned (checking 'week' in col is a slightly more robust alternative, since find returns 0 for a match at the very start of the string). Then, in the following line, you can pass a list of columns to choose only those columns from the dataframe.
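An alternative sketch that leans on pandas’ own string methods for the same selection:

> df.loc[:, df.columns.str.contains('week')]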

6. isnull, dropna, and fillna

Imagine we are interested in how long a song stays in the top 100. In order to measure the number of weeks a song is in the top 100, we can use the isnull method.

There are a number of useful methods for working with missing values in pandas. To begin with, let’s answer this first question:

> df[week_cols].isnull()

This will return True or False for each cell in the dataframe. Then we can count the number of True values in the following way, since booleans are treated as 1 when True and 0 when False.

> df[week_cols].isnull().sum(axis=1)

Finally, the number of weeks a song stays in the top 100 can be calculated and stored using the following.

> df['weeks_in_top'] = df[week_cols].shape[1] - df[week_cols].isnull().sum(axis=1)

The first portion of this pulls the total number of columns, which is the maximum number of weeks a song may stay in the top 100. Subtracting out the number of weeks with a missing value provides the number of weeks a song was in the top 100.

In looking at the distribution of how many weeks a song stays in the top 100, we see that most songs don’t last longer than 30 weeks.

> import matplotlib.pyplot as plt
> plt.figure(figsize=(15,5));
> df['weeks_in_top'].hist(bins=30);
> plt.xlabel('number of weeks in top 100');
> plt.ylabel('number of songs');

In order to show another useful function related to missing values, consider we want to look at the path a song takes from hitting the top 100 until it falls out of the top 100.

> plt.figure(figsize=(20,5))
> plt.plot(df[week_cols].iloc[0], label = df['track'].iloc[0]);
> plt.xticks(range(0,66), range(1,67), rotation=20);
> plt.xlabel('Week');
> plt.ylabel('Song Rank');
> plt.legend();

This plot of the first song’s path could be better. The low spot is actually when the song was in its top spot. Additionally, there are a number of weeks where the song wasn’t in the top 100, and we might want to compare to other songs.

In order to flip this chart to have the top of a graph be related to the best ranks, we might take 101 minus each rank. Additionally, we might fill all the missing values with 101 (a rank lower than any value that lands on the chart). Then we can flip the label on the y-axis.

The key element in transforming our data is this line:

> 101 - df.fillna(101)[week_cols]

If you wanted to store the results, you might use df.fillna(101, inplace=True). Then we can update our plot with the following:

> import numpy as np
> plt.figure(figsize=(20,10))
> plt.plot(101 - df.fillna(101)[week_cols].iloc[0], label = df['track'].iloc[0]);
> plt.xticks(range(0,76), range(1,77), rotation=20);
> plt.yticks(np.arange(0,100), np.arange(1,101)[::-1]);
> plt.xlabel('Week');
> plt.ylabel('Song Rank');
> plt.legend();

This provides a much better representation of how a song traverses a path from start to finish. We can also compare the path of the first five songs in the dataset.
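A sketch of that comparison plot (looping over the first five rows; the figure size is just a choice):

> plt.figure(figsize=(20,10))
> for i in range(5):
>     plt.plot(101 - df.fillna(101)[week_cols].iloc[i], label=df['track'].iloc[i])
> plt.xlabel('Week');
> plt.ylabel('Song Rank');
> plt.legend();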

For example, we can see from this that “Maria, Maria” started higher than any of the other songs. We can see that “I Knew I Loved You” had the longest run in the top 100.

7. groupby

We will end on a strong note here. By far one of my favorite functions to use in pandas is the groupby function. You might want to understand which genre has the best average rank in its first week (remember that lower numbers are better ranks).

> df.groupby('genre').mean()['x1st_week']
genre
Country 82.405405
Electronica 84.500000
Gospel 76.000000
Jazz 89.000000
Latin 73.222222
Pop 79.222222
R&B 84.086957
Rap 85.172414
Reggae 72.000000
Rock 76.116788
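As a quick aside, selecting the column before aggregating avoids computing means for every numeric column, and sorting makes the best (lowest) first-week rank easy to read off:

> df.groupby('genre')['x1st_week'].mean().sort_values()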

There are so many ways to use the groupby. I have linked a great article discussing the many ways this method can be used.

The example shown here is just one simple example, but there is so much more that can be done with groupby, which you can see from the article above.
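For a small taste, the agg method computes several summaries in one pass (the choice of statistics here is arbitrary):

> df.groupby('genre')['x1st_week'].agg(['mean', 'min', 'count'])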

With your new pandas tricks, you are ready to take on any data situation in python.
