Using Basic Excel Functions in Pandas to Analyze Baseball Data

Matt Andrecovich
4 min readMay 17, 2019
no, not this kind of panda!

Coming from a finance background, Excel was my life. Whether it was running analyses at work, combing through stats for my fantasy sports teams, or analyzing my spending habits, I was constantly playing around with that user-friendly, yet extremely powerful tool. When I made the switch to data science, I immediately attempted to replicate my favorite Excel functions in Python, specifically using the Pandas library. Pandas is used to manipulate data in a manner very similar to Excel, operating on dataframes with rows and columns. Here are five common Excel functions that can be replicated in Pandas.

For my data, I will use the current hitting leaderboard for Major League Baseball from fangraphs.com. Before I begin my analysis, I first have to read in the data.

import pandas as pddf = pd.read_csv(‘leaderboard.csv)

1. Sorting

One of the most basic features of Excel is the ability to quickly sort data and analyze the top or bottom values for a given column. You can do this easily in Pandas, too. Let’s sort the data to see what hitters have the most home runs this year.

df.sort_values(by=’HR’,ascending=False)

Doing this prints the whole dataframe, which can be quite cumbersome when working with large data. This is when the “.head()” method comes in handy. It will show the first five rows in the dataset. You can change the number of rows to any number you want e.g. df.head(10) to show the first ten rows. The “ascending = False” portion of the code tells Pandas to sort the values from largest to smallest.

df.sort_values(by='HR', ascending=False).head(10)
output for the code above

2. Filtering

We can also filter data to display rows that meet certain criteria. For example, we can look at hitters that walk more often than they strike out. This is done by applying a mask that filters the data. You must specify the column, criteria, and if using multiple criteria, ensure the Boolean conditions are enclosed in parentheses.

mask = df[‘BB%’] > df[‘K%’]
df[mask]
Only 4 hitters meet this criteria!

3. Pivot Tables

Pivot Tables are one of the most commonly used tools in Excel. They are a great way to aggregate data and quickly manipulate an analysis. Luckily, this feature is easy to perform in Pandas. Let’s make a pivot table to see the total steals for each team. I’ll also sort our data with the technique used in #1.

df.pivot_table(aggfunc=’sum’,index=’Team’,values=’SB’).sort_values(by=’SB’,ascending=False).head(10)
this is a basic example, but you can add more columns to the table if you want!

4. Creating new columns from existing data

Entering data into Excel is the most basic of features. It is easy to do and completely straightforward for the user. If you want to make a new column, it is as easy as typing into a new cell. In Pandas, it works a little differently. Let’s say we want to calculate the OPS [OBP (or on-base-percentage) Plus SLG (or slugging percentage)] for each player.

df['OPS'] = df['OBP'] + df['SLG']df[['Name','OPS']].head()
pretty cool, huh?

5. Summary Statistics

If we want to get a basic summary for a set of data, it would be a rather annoying process of creating multiple formulas and manipulating them to create a clean table. In pandas, this is significantly easier.

df.describe()
yes, it’s that easy!

I hope this quick post gives you a little insight into the powerful library of Pandas. Whether you are a novice Excel user or a macro expert, I highly recommend at least exploring Pandas and its capabilities. These are just a few of many examples; there are countless applications of Pandas to perform Excel-like functions. Play around with it on your own to find your favorite!

--

--