Pandas Makes Analysis a Layup: Grouping, Pivoting, and Plotting

Dan Watson
Hardwood Convergence
7 min readJul 24, 2019
This is how easy pandas makes analysis
This is how easy pandas makes grouping and segmenting data!

The Plan

If you haven’t followed along thus far, you can check out how we scraped our data and download our files on GitHub. Today we’re going to dig in to Harden’s stats a bit deeper. Feel free to continue with the same notebook and dataset from the previous tutorial (harden_stats.ipynb), but I’m going to make a new copy of this notebook to make it easier for everyone following along at home. Also, going to try to make these posts shorter and limit the topics that we cover. Today we’ll learn how to:

  • Segment data with groupby()
  • Make basic plots with pandas
  • Advance our segmentation and plotting with pivot_tables()

Getting Win-Loss Stats with Groupby

We’re following the same process to start our sheet today, but notice that we loaded a few more modules. In addition to pandas and os, we imported numpy and pyplot from matplotlib. We haven’t used these yet, so if your import throws an error, just use pip to install the the modules like we did in previous posts.

Importing new modules and loading our dataset

Let’s start with this question: How does Harden’s scoring impact his team’s chances to win? An initial way to answer this question would to be explore the difference in Harden’s scoring average in wins versus losses. We could do this by selecting a subset of our dataset where the Rockets won and then calculate the average of that subset’s points column. We could then repeat the process for road games. Sounds inefficient! Instead, let’s leverage pandas to do that for us. We can use pandas groupby() method to segment the dataset on the win field and return both averages at the same time. Here’s how we’ll do it:

df.groupby(['win'])['pts'].mean().round(1)

With this code, we’re taking the entire dataframe, telling pandas to group it by the win column, selecting the mean, and then returning the value rounded to one decimal place. Here’s the output:

Basic groupby to get Harden’s scoring in wins vs losses
Basic groupby to get Harden’s scoring in wins vs losses

We see that Harden scores almost three points more per game in Rockets wins than in losses. This leads us to think that it’s in the Rockets best interest to let Harden score all the baskets!

But, maybe there is a difference in Harden’s scoring in home vs away wins and losses. Let’s explore that. Now we’re going to segment the data into four segments: home wins, home losses, road wins, and road losses. That’s just as easy to implement in python:

Pandas groupby() using multiple columns
Pandas groupby() using multiple columns

You can see that we passed a list of columns into the groupby function. Now we created a multi-index output that segments by home/road games and wins/losses. We see that Harden’s scoring average is almost the same for road games, regardless of the outcome! The difference in his average is a really stemming from a difference in home wins and home losses. That’s interesting, but it still doesn’t explain the entire story.

We know that the Rockets were one of the league’s best teams last season, winning 51 games in the regular season. It’s possible that the results above are skewed because a relatively low number of losses. Let’s add a count of each segment to the groupby:

Applying multiple functions to groupby
Applying multiple functions to groupby

Here, we have the same groupby as the previous example, but we’re now applying a count function in addition to the mean function. We do this by adding the .agg() method to our code and then a column name along with its aggregate function. These results start painting a better picture about the Rockets. When Harden played, we see that the Rockets went 20–19 on the road last season- not great for a supposed title contender- while they we’re great at home. We see that Harden’s home average differed by 3.7 points, but the Rockets only lost 8 home games when he was playing. It’s definitely plausible that his average is significantly lower due to just a few bad games.

Visually Exploring Harden’s Segmented Scoring

We’ve started segmenting Harden’s scoring, but now we’re not sure his differences in home wins and losses is real or it is just due to a couple of bad games. We could run a statistical test to see if this difference was caused by chance or not, but instead we’ll use pandas to visually inspect Harden’s scoring.

Fair warning, these plots will be underwhelming. But remember, the point here isn’t to publish these plots(we’ll make good looking plots later), the point is to quickly get insights from our data! Let’s start with a plot of Harden’s scoring in home vs away games.

Scatter plot of Harden’s scoring in home vs away games
Scatter plot of Harden’s scoring in home vs away games

All we need to do to create a scatter plot in pandas is choose our columns, here we’ll take our home_game and pts columns, call the plot method, set the plot kind to scatter, and identify our x and y variables. We see that Harden had a more poor road games than home games from a scoring perspective, based on more values that were significantly under his 36 pt average. While his 50+ point games seemed similarly distributed between home and away, we notice that he had significantly more games in the high 40s at home vs on the road. While this may be interesting, let’s add some color to shed light on how Harden performed in wins and losses:

Scatter of home game vs points with win-loss indicated

Here, we just added the win column into our plot data and assigned it to the color variable, ‘c’. We then just added a color map to select the specific rgb color values for us. You can find more color map options here. The sharex variable being set to false allows the x axes values and the color map to be shown at the same time- rerun the code without this to see what i mean.

Looking at this plot, we notice that the Rockets only lost two home games when Harden scored at least 30 points. Maybe his home scoring matters!? However, when he scored 30+ poitns on the road, his team wasn’t nearly as successful. Maybe it doesn’t have to do with Harden? We would want to look into how his teammates perform on the road to see get a clearer picture of what actually happened with this team.

Using Pivot Tables to Improve Segmenting and Plotting

Before we end this post, we need to quickly discuss creating pivot tables in pandas. Suppose we want to make a side-by-side bar chart of Harden’s home and away scoring in pandas. If we use groupby and plot our results we get something like this:

Extremely ugly plot

Well that’s ugly and confusing. It’s hard to read the graph and we don’t utilizer color to visually distinguish between the categories. The plot definitely doesn’t intuitively tell a story.

The issues is our groupby creates two indexes on our rows. One way to fix this would be to save two different variables for average points in home and away games segmented on wins and losses. We could then put these into a dataframe and plot it. The code would look something like this:

Inefficient way to pivot data. Don’t do this!
Inefficient way to pivot data. Don’t do this!

I’m sure by now you’re thinking, “There has to be an easier way to do this!” You are correct. Let’s use pivot_table() as follows:

That’s better! Thank you pivot_table()!

You can see how we just reduced three lines of code into just one and made our code much easier to understand. Using the pivot table method, we have our index set to the home_game column and we’re using the values from the win column as our columns. We then pass the value we want to segment, which is points and set an aggregate function- we’ll use numpy’s mean function. Now let’s see what happens when we plot our data in our new structure:

Much better bar plot!
Much better bar plot!

Now when we call our basic bar plot, we get a color coded side-by-side bar chart! Now we can easily see the differences in groupings and we have a color coded legend. The plot is not perfect, but it now draws in our attention and is no longer confusing!

Wrapping Up and What’s Next

In just a few posts we’ve gone from scraping data from the web to cleaning csvs, calculating basic stats, and engineering features, to segmenting data, making plots and formulating hypotheses about what’s impacting NBA games!

We’ve laid a good foundation and next we’ll dive back into this dataset and learn to manipulate dates. After that we’ll learn more about visualziations with matplotlib before we hop into interactive plotting. Then we’re done with Harden’s dataset. We’ll have enough of the basics down to be dangerous and we’ll start learning new ways to get more data.

As always, you can find all the code used today on the GitHub repo. Please check it out, play with the code, and send over any questions or comments you may have!

Appreciate you all for checking this out and following along. I’ll keep updating this publication regularly and sharing my love for basketball and programming. If you’ve learned anything, share with a friend and smash those claps so I know you’ve been here!

👏👏👏

--

--

Dan Watson
Hardwood Convergence

Data nerd, basketball fanatic, and ice cream connoisseur. Health care analytics by day, basketball analytics by night. https://www.linkedin.com/in/danielkwatson