Using Basic Excel Functions in Pandas to Analyze Baseball Data
Coming from a finance background, Excel was my life. Whether it was running analyses at work, combing through stats for my fantasy sports teams, or analyzing my spending habits, I was constantly playing around with that user-friendly, yet extremely powerful tool. When I made the switch to data science, I immediately attempted to replicate my favorite Excel functions in Python, specifically using the Pandas library. Pandas is used to manipulate data in a manner very similar to Excel, operating on dataframes with rows and columns. Here are five common Excel functions that can be replicated in Pandas.
For my data, I will use the current hitting leaderboard for Major League Baseball from fangraphs.com. Before I begin my analysis, I first have to read in the data.
import pandas as pddf = pd.read_csv(‘leaderboard.csv)
1. Sorting
One of the most basic features of Excel is the ability to quickly sort data and analyze the top or bottom values for a given column. You can do this easily in Pandas, too. Let’s sort the data to see what hitters have the most home runs this year.
df.sort_values(by=’HR’,ascending=False)
Doing this prints the whole dataframe, which can be quite cumbersome when working with large data. This is when the “.head()” method comes in handy. It will show the first five rows in the dataset. You can change the number of rows to any number you want e.g. df.head(10) to show the first ten rows. The “ascending = False” portion of the code tells Pandas to sort the values from largest to smallest.
df.sort_values(by='HR', ascending=False).head(10)
2. Filtering
We can also filter data to display rows that meet certain criteria. For example, we can look at hitters that walk more often than they strike out. This is done by applying a mask that filters the data. You must specify the column, criteria, and if using multiple criteria, ensure the Boolean conditions are enclosed in parentheses.
mask = df[‘BB%’] > df[‘K%’]
df[mask]
3. Pivot Tables
Pivot Tables are one of the most commonly used tools in Excel. They are a great way to aggregate data and quickly manipulate an analysis. Luckily, this feature is easy to perform in Pandas. Let’s make a pivot table to see the total steals for each team. I’ll also sort our data with the technique used in #1.
df.pivot_table(aggfunc=’sum’,index=’Team’,values=’SB’).sort_values(by=’SB’,ascending=False).head(10)
4. Creating new columns from existing data
Entering data into Excel is the most basic of features. It is easy to do and completely straightforward for the user. If you want to make a new column, it is as easy as typing into a new cell. In Pandas, it works a little differently. Let’s say we want to calculate the OPS [OBP (or on-base-percentage) Plus SLG (or slugging percentage)] for each player.
df['OPS'] = df['OBP'] + df['SLG']df[['Name','OPS']].head()
5. Summary Statistics
If we want to get a basic summary for a set of data, it would be a rather annoying process of creating multiple formulas and manipulating them to create a clean table. In pandas, this is significantly easier.
df.describe()
I hope this quick post gives you a little insight into the powerful library of Pandas. Whether you are a novice Excel user or a macro expert, I highly recommend at least exploring Pandas and its capabilities. These are just a few of many examples; there are countless applications of Pandas to perform Excel-like functions. Play around with it on your own to find your favorite!