Reproducing the Baseball Salary visualization from FiveThirtyEight in Matplotlib

Published in DataExplorations · 9 min read · Oct 12, 2018

I was intrigued by this visualization on FiveThirtyEight and wanted to try to recreate it in Python using Matplotlib

This diagram, at its core, is a scatter plot of standardized salary for each team against their win/loss percentage for the years 1985–2015.

Salaries were standardized by calculating the mean league spending for each year and then calculating how many standard deviations each team’s total salary was from that mean.
Each team-season is one dot in the figure.
The colored line in each chart is the regression (best-fit) line showing the overall trend for that team.
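Written as a formula, the standardization looks roughly like this. The symbols are introduced here for illustration and do not come from the original chart: S_{t,y} is team t's total payroll in year y, N_y is the number of teams that year, and \sigma_y is the sample standard deviation of that year's payrolls (which is what pandas' std() computes by default).

```latex
z_{t,y} = \frac{S_{t,y} - \bar{S}_y}{\sigma_y},
\qquad
\bar{S}_y = \frac{1}{N_y} \sum_{t} S_{t,y}
```

A team with z = 1 spent one standard deviation more than the average team that season, and a team with z = 0 spent exactly the average.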

This diagram is intended to investigate whether higher spending does indeed lead to better performance

1. Prepare the Data

The first step was to gather and load the Salary history data. All the data used in this analysis was taken from Kaggle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
# first load salary data (taken from https://www.kaggle.com/open-source-sports/baseball-databank#Salaries.csv)
df=pd.read_csv('./data/Salaries.csv')
df.head()

This data set includes each individual player’s salary, but we just want the total salary cost for each team and year. So let’s do a groupby to get the sum of the salaries for each team and year:

df=df[['yearID','teamID','lgID','salary']].groupby(['yearID','teamID','lgID']).sum().reset_index()
df.head()

Now we need to calculate how many standard deviations each team’s Salary spending is away from the league average for that year

I defined a function to calculate this value and then called it for each row in the data frame using a lambda expression. I had a bit of trouble getting it to accept two values from the data frame (year and salary) but eventually resolved it by adding an axis=1 argument (thanks to kaijento for the hint). I then saved this value back as a new column in the data frame, StandardizedSalary

Note: This likely is not the most efficient way to calculate this value and extend the data frame, but my focus was on generating data quickly to get to the visualization.

def calculate_std_salary(df,year,salary):
    '''Return how many standard deviations a team's salary lies from the mean for the given year'''
    year_salaries = df[(df['yearID']==year)]['salary']
    return (salary-year_salaries.mean())/year_salaries.std()
# call our function and add as a new column in our df
df['StandardizedSalary'] = df.apply(lambda row: calculate_std_salary(df, row.yearID, row.salary), axis=1 )
df.head()
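As the note above says, the row-by-row apply is not the fastest route. One possible vectorized alternative is sketched below using groupby with transform. The toy frame and its numbers are made up for illustration; only the column names follow the article.

```python
import pandas as pd

# Toy payroll totals (made-up numbers, not taken from Salaries.csv)
toy = pd.DataFrame({
    'yearID': [2000, 2000, 2000, 2001, 2001, 2001],
    'teamID': ['AAA', 'BBB', 'CCC', 'AAA', 'BBB', 'CCC'],
    'salary': [10.0, 20.0, 30.0, 40.0, 40.0, 70.0],
})

# transform() returns one value per original row, aligned with toy's index,
# so each payroll is compared against its own year's mean and standard deviation.
# 'std' uses the sample standard deviation (ddof=1), the same default as Series.std().
by_year = toy.groupby('yearID')['salary']
toy['StandardizedSalary'] = (toy['salary'] - by_year.transform('mean')) / by_year.transform('std')

print(toy)
```

With the article's data frame, the same two lines should work on df directly, since it has the same yearID and salary columns after the groupby step.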

2. Get and clean up the winning percentage for each team/year

We’ll bring in information from the Teams.csv file, also found on Kaggle. This dataset has a lot of extraneous information for our purposes, so let’s get rid of all the columns except the ones of interest

G: Games played
W: Games won
L: Games lost

This file has data going back to the 1800s (!) so we’ll need to filter it down to only include years from 1985 onward (yearID >= 1985). And then, finally, we’ll calculate each team’s winning percentage for each year (games won divided by games played) and add that back into our DataFrame as a new column (winning_perc)

df_teams=pd.read_csv('./data/Teams.csv')
df_teams = df_teams[['yearID','lgID', 'teamID','G','W','L']] # first limit cols
# in addition, this dataset goes back to 1871, so let's filter for years >= 1985 to match our Salary data
df_teams=df_teams[(df_teams['yearID'] >= 1985)].reset_index(drop=True)
# now calculate winning percentage for each team/year and add as a new column
df_teams['winning_perc'] = df_teams['W']/df_teams['G']
df_teams.head()

Now let’s merge the two datasets together into one data frame that we can use for our visualization

df_merged = df.merge(df_teams, on=['yearID','lgID','teamID'])
df_merged.head()
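The default merge is an inner join, so any team/year that does not match on all three keys is silently dropped. A quick audit along these lines can show what fell out. The two toy frames below are made up for illustration; the indicator and validate arguments are standard pandas merge options.

```python
import pandas as pd

# Toy stand-ins for the salary and team tables (made-up values)
salaries = pd.DataFrame({
    'yearID': [2000, 2000, 2001],
    'lgID': ['AL', 'NL', 'AL'],
    'teamID': ['AAA', 'BBB', 'AAA'],
    'salary': [10.0, 20.0, 30.0],
})
teams = pd.DataFrame({
    'yearID': [2000, 2001, 2001],
    'lgID': ['AL', 'AL', 'NL'],
    'teamID': ['AAA', 'AAA', 'BBB'],
    'W': [90, 81, 70],
    'G': [162, 162, 162],
})

keys = ['yearID', 'lgID', 'teamID']

# An outer merge keeps unmatched rows from both sides; indicator=True adds a
# '_merge' column saying where each row came from. validate='one_to_one'
# raises an error if a key combination is duplicated in either frame.
audit = salaries.merge(teams, on=keys, how='outer', indicator=True, validate='one_to_one')
unmatched = audit[audit['_merge'] != 'both']

print(unmatched[keys + ['_merge']])
```

Rows marked left_only have a salary but no team record, and rows marked right_only have a team record but no salary.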

3. Starting small with our visualization — build up our plot for one team

To start with, let’s build up a plot for just one team… and, of course, we’ll start with the Blue Jays!

Let’s create a small data frame with just Blue Jays info to play around with and then create a simple scatter plot.

Use alpha=0.5 to “mute” the colour of the dots
To align the TOR title at the left: plt.title('TOR', position=(0,1), ha='left')

# start with a basic scatterplot for ONE team.. Blue Jays
blue_jays = df_merged[(df_merged['teamID']== 'TOR')]
# create a basic scatter plot
plt.scatter(x=blue_jays['StandardizedSalary'], y=blue_jays['winning_perc'],alpha=0.5)
plt.title('TOR', position = (0,1), ha = 'left', fontsize=16)
plt.xlabel("Standardized Salaries", position = (0,0), ha = 'left', color = 'grey') 
plt.ylabel('Winning Percentage', position = (0, 1), ha = 'right', color = 'grey')

Not bad… now let’s add in the cross hairs and try to center the graph

#plot a horizontal line from x=-2 to 2 at y=0.5
plt.hlines(0.5,-2, 2) 
# plot a vertical line from Y=.4-.6 at X=0
plt.vlines(0,.4,.6)

This created our cross hairs and, as a nice side effect, also centered the graph for us at 0, 0.5

Now let’s get rid of the border. To do this, we need access to the axes object, so we have to slightly change how we invoke the figure. We can create a subplot and store our axes in the ax variable: fig, ax = plt.subplots(figsize=(5,5)) . This allows us to call ax.spines to suppress the borders

# change way we invoke plot so can manipulate the axis
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(x=blue_jays['StandardizedSalary'], y=blue_jays['winning_perc'],alpha=0.5)
plt.title('TOR', position = (0,1), ha = 'left', fontsize=16)
plt.xlabel("Standardized Salaries", position = (0,0), ha = 'left', color = 'grey') 
plt.ylabel('Winning Percentage', position = (0, 1), ha = 'right', color = 'grey')
# create cross hairs 
plt.hlines(0.5,-2, 2)
plt.vlines(0,.3,.65)
# Now we can remove the borders (all four spines)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)

Looking good… how about some gridlines? Again, we can use our ax object

# add in gridlines
ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

Finally, let’s add in a simple regression line using Polyfit.

b- = blue solid line
linewidth=3 : adjust the linewidth to get a thicker line
c='#1f77b4' : adjust blue color to match python default blue for Blue Jays

# add in Line of Best fit for Blue Jays
z = np.polyfit(blue_jays['StandardizedSalary'], blue_jays['winning_perc'], 1)
p = np.poly1d(z)

plt.plot(blue_jays['StandardizedSalary'], p(blue_jays['StandardizedSalary']), 'b-',alpha=.8,c='#1f77b4',linewidth =3)
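To see what polyfit and poly1d return, here is a minimal sketch on made-up points that lie exactly on a known line. The numbers are illustrative and not taken from the baseball data.

```python
import numpy as np

# Made-up points that lie exactly on y = 0.05 * x + 0.5
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 0.05 * x + 0.5

# A degree-1 fit returns coefficients highest power first: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# poly1d wraps the coefficients in a callable that evaluates slope * x + intercept
line = np.poly1d([slope, intercept])

print(round(slope, 3), round(intercept, 3))
print(line(1.0))
```

In the chart, the slope can be read as the change in winning percentage associated with one extra standard deviation of spending.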

4. Let’s turn this into a function we can call to create a plot for a specific team

This function takes two arguments:

team_df: a filtered dataframe that only contains data for the desired team
team_color: the color to use when generating the plot. Defaults to blue if not provided

def plot_team(team_df,team_color='#1f77b4'):
    ''' Generate a plot for the specified team
    team_df: a filtered dataframe that only contains data for the desired team
    team_color: color to use when generating plot.  defaults to blue (as for Blue Jays above)
    '''
    fig, ax = plt.subplots(figsize=(5, 5))
    # add in Line of Best fit for Team
    z = np.polyfit(team_df['StandardizedSalary'], team_df['winning_perc'], 1)
    p = np.poly1d(z)
    # '-' = solid line; the color comes from team_color
    # adjust linewidth to get a thicker line
    plt.plot(team_df['StandardizedSalary'], p(team_df['StandardizedSalary']), '-',alpha=.8,c=team_color,linewidth =3)
    ax.scatter(x=team_df['StandardizedSalary'], y=team_df['winning_perc'],alpha=0.5,c=team_color )
    plt.title(f"{team_df['teamID'].values[0]}",  loc = 'center', fontsize=16)
    plt.xlabel("Standardized Salaries", position = (0,0), ha = 'left', color = 'grey')
    plt.ylabel('Winning Percentage', position = (0, 1), ha = 'right', color = 'grey')
    # create cross hairs 
    plt.hlines(0.5,-2.5, 2.5)
    plt.vlines(0,.3,.65)
    # reduce ticks on x axis so it's cleaner
    plt.xticks([-2,0,2])
    # remove the borders (all four spines)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['left'].set_visible(False)
    # add in gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

Test the function for Atlanta, using the colour Green:

plot_team(df_merged[(df_merged['teamID']=='ATL')],'g')

5. Plot all the teams in the American League East

First let’s create a dictionary with the team names and a colour to use for plotting that team and then loop through each team to generate their plot

al_east_team_dict={
    'BAL':'#FF6F1C', 
    'BOS':'#FFB627',
    'NYA':'#223843', 
    'TOR':'#1f77b4', 
    'TBA':'#FFCAD4'
}
for team,color in al_east_team_dict.items():
    plot_team(df_merged[(df_merged['teamID']==team)],color)

Ok… it’s a start, but I don’t want the charts all on top of one another. I want them organized in rows. We’re going to have to change our function slightly to accomplish this

Here is our function that we’ll be calling

def plot_team_row(team_df, ax,team_color='#1f77b4'):
    '''Creates a subplot for one team
    team_df: a filtered dataframe that only contains data for the desired team
    ax: the axes object to use for this plot.  Lets us target a specific subplot
    team_color: color to use when generating plot.  defaults to blue (as for Blue Jays above)
    '''
    # add in Line of Best fit for team
    z = np.polyfit(team_df['StandardizedSalary'], team_df['winning_perc'], 1)
    p = np.poly1d(z)
    # '-' = solid line; the color comes from team_color
    # adjust linewidth to get a thicker line
    ax.plot(team_df['StandardizedSalary'], p(team_df['StandardizedSalary']), '-',alpha=.8,c=team_color,linewidth =3)
    ax.scatter(x=team_df['StandardizedSalary'], y=team_df['winning_perc'],alpha=0.5,c=team_color )
    # note change in how title is called
    ax.set_title(f"{team_df['teamID'].values[0]}")
    # suppress x/y labels so we can put one label on overall plot 
    # plt.xlabel("Standardized Salaries", position = (0,0), ha = 'left', color = 'grey') # getting error
    # plt.ylabel('Winning Percentage', position = (0, 1), ha = 'right', color = 'grey')
    # create cross hairs - note this also centered plot for us
    ax.hlines(0.5,-2.5, 2.5)
    ax.vlines(0,.3,.65)
    # reduce ticks on x axis so it's cleaner (set on this subplot's own axes)
    ax.set_xticks([-2,0,2])
    # remove the borders (all four spines)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['left'].set_visible(False)
    # add in gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

This is the code to call it

Before calling this function, we have to create the subplots and tell Matplotlib how many rows and columns to create. I want 1 row with 5 columns (1 for each team), which can be done by setting nrows=1, ncols=5. I also want the plots to share the same X and Y axis so that they can be easily compared: sharex=True, sharey=True.
The ax object returned by plt.subplots is an array with each value defining one subplot. These values can be passed into the plot_team_row() function to “fill in” that plot with one team

# create a plot that's 1 row and 5 columns wide
# sharex=True, sharey=True: these plots will share the same x and y axis
fig, ax = plt.subplots(nrows=1, ncols=5,figsize=(15,5),sharex=True, sharey=True)
# set an overall title for the plot
fig.suptitle('American League East', ha='center',color = 'grey',fontsize=20,va='top',) 
# help the subplots fit better
fig.subplots_adjust(hspace = .5, wspace=.001)
pos=0 # determines which axes object we will pass into the function to be filled by the team's plot
for team,color in al_east_team_dict.items():
    plot_team_row(df_merged[(df_merged['teamID']==team)], ax[pos],color)
    pos=pos+1
# force label to appear on first plot rather than the last plot
ax[0].set(xlabel='Standardized Salaries', ylabel='Winning Percentage')
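The manual pos counter works, but pairing each subplot with its team using zip is one possible alternative. The sketch below uses a hypothetical demo_teams dictionary and made-up points so that it runs on its own. The Agg backend line is only there so it runs without a display, and should be dropped inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical stand-in for the team/colour dictionary used above
demo_teams = {'AAA': '#FF6F1C', 'BBB': '#FFB627', 'CCC': '#223843'}

fig, axes = plt.subplots(nrows=1, ncols=len(demo_teams), figsize=(9, 3),
                         sharex=True, sharey=True)

# zip pairs each Axes in the row with one (team, colour) entry,
# so no separate position counter is needed
for ax, (team, color) in zip(axes, demo_teams.items()):
    ax.scatter([-1, 0, 1], [0.45, 0.50, 0.55], c=color, alpha=0.5)  # made-up points
    ax.set_title(team)

# Label only the first panel, as in the version above
axes[0].set(xlabel='Standardized Salaries', ylabel='Winning Percentage')

titles = [ax.get_title() for ax in axes]
print(titles)
```

With the article's code, the loop body would call plot_team_row with the filtered data frame, the paired axes object, and the colour.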

Here are the charts for the American League East, all nicely plotted in a row

6. Put a bow on it and generate a chart for the whole league

First generate all the necessary dictionaries with info about each Division and Team and then store those in a list that we can loop through

# will need dictionaries for all divisions
al_central_team_dict={
    'CLE':'#FF6F1C', 
    'MIN':'#FFB627',
    'DET':'#223843', 
    'CHA':'#1f77b4', 
    'KCA':'#FFCAD4'
}
al_west_team_dict={
    'HOU':'#FF6F1C', 
    'OAK':'#FFB627',
    'SEA':'#223843', 
    'LAA':'#1f77b4', 
    'TEX':'#FFCAD4'
}
nl_central_team_dict={
    'ML4':'#FF6F1C', 
    'CHN':'#FFB627',
    'SLN':'#223843', 
    'PIT':'#1f77b4', 
    'CIN':'#FFCAD4'
}
nl_west_team_dict={
    'LAN':'#FF6F1C', 
    'COL':'#FFB627',
    'ARI':'#223843', 
    'SFN':'#1f77b4', 
    'SDN':'#FFCAD4'
}
nl_east_team_dict={
    'ATL':'#FF6F1C', 
    'WAS':'#FFB627',
    'PHI':'#223843', 
    'NYN':'#1f77b4', 
    'MIA':'#FFCAD4'
}
league_list = [al_east_team_dict, al_central_team_dict,al_west_team_dict, nl_east_team_dict, nl_central_team_dict, nl_west_team_dict]
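The dictionary keys have to match the teamID values in the data exactly, and a key with no matching rows leaves an empty data frame that the plotting function cannot fit a line to. A small check like the following can catch that before plotting. The missing_team_ids helper is hypothetical (not part of the original notebook), and the toy data uses made-up IDs.

```python
import pandas as pd

def missing_team_ids(data, divisions):
    '''Return the dictionary keys that never appear in data['teamID'].

    Hypothetical helper for illustration; not part of the original article.
    '''
    known = set(data['teamID'].unique())
    return sorted({team for division in divisions for team in division if team not in known})

# Toy data and dictionaries (made-up IDs) to show the behaviour
toy_merged = pd.DataFrame({'teamID': ['AAA', 'BBB', 'AAA', 'CCC']})
toy_divisions = [{'AAA': 'red', 'BBB': 'blue'}, {'CCC': 'green', 'ZZZ': 'grey'}]

print(missing_team_ids(toy_merged, toy_divisions))
```

With the article's objects the call would be missing_team_ids(df_merged, league_list). An empty list means every key has at least one season of data.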

And now create our plots. Note the small adjustment to the handling of the ax object now that there are multiple rows

ax[row][col]: the axes object contains a numpy array for each row with 5 sub plots

# create a plot that's 6 rows and 5 columns wide
fig, ax = plt.subplots(nrows=6, ncols=5,figsize=(15,15),sharex=True, sharey=True)
# set an overall title for the plot
fig.suptitle('MLB: Spend more, Win more?', ha='center',color = 'grey',fontsize=20,va='top',)
# help the subplots fit better
fig.subplots_adjust(hspace = .5, wspace=.001)
# ax[count][pos]: the axes object contains a numpy array for each row with 5 sub plots (count indicates the row and pos indicates the column)
count = 0 # sets the row to be filled 
for league in league_list:
    pos=0 # sets the column to be filled by a given team in the division
    for team,color in league.items():
        plot_team_row(df_merged[(df_merged['teamID']==team)], ax[count][pos],color)
        pos=pos+1
    count += 1
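For the two-dimensional grid, enumerate can supply the row and column indexes in place of the count and pos counters. This is a sketch under the same assumptions as before: demo_divisions is a hypothetical stand-in for league_list, the points are made up, and the Agg backend line should be dropped inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical stand-in for league_list: one dictionary per row of the grid
demo_divisions = [
    {'AAA': '#FF6F1C', 'BBB': '#FFB627'},
    {'CCC': '#223843', 'DDD': '#1f77b4'},
    {'EEE': '#FFCAD4', 'FFF': '#FF6F1C'},
]

fig, axes = plt.subplots(nrows=len(demo_divisions), ncols=2, figsize=(6, 9),
                         sharex=True, sharey=True)

# With more than one row and column, plt.subplots returns a 2-D array,
# indexed as axes[row, col]
for row, division in enumerate(demo_divisions):
    for col, (team, color) in enumerate(division.items()):
        panel = axes[row, col]
        panel.scatter([-1, 0, 1], [0.45, 0.50, 0.55], c=color, alpha=0.5)  # made-up points
        panel.set_title(team)

grid_titles = [[axes[r, c].get_title() for c in range(axes.shape[1])]
               for r in range(axes.shape[0])]
print(axes.shape)
```

Each row of grid_titles corresponds to one division, which makes it easy to confirm that every team landed in the intended panel.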

And at last we have the full chart!

The full source code is available on GitHub