Exploratory Data Analysis of the Money-Ball Dataset.

6 min read · Sep 20, 2021

Exploratory data analysis of the famous Money-Ball dataset to find a list of undervalued players who could potentially replace the 3 key players lost by the Oakland A’s after the 2001 season.

-R Aisvath

The Back Story

Following the 2001 season, Oakland saw the departure of three key players (the lost boys). Billy Beane, the team’s general manager, responded with a series of free agent signings. The new signees, despite a lack of star power, surprised the baseball world by besting the 2001 team’s regular season record. The team is most famous, however, for winning 20 consecutive games between August 13 and September 4, 2002.

This unbelievable story was told in a book aptly titled “Moneyball”, which was later made into a movie starring Brad Pitt and Jonah Hill.

Because of the team’s smaller revenue, they were forced to come up with a unique solution to the problem. Billy Beane and Paul DePodesta analyzed the data of the available players and found a list of undervalued players who could potentially replace those who left.

In this project, I’ve worked with some of that data to find suitable replacement players in my own way.

Tech Stack Used

Statistical Programming language — R

Packages for visualization and data manipulation — ggplot2 and dplyr.

Dataset — available from Sean Lahman’s Website

Download Lahman’s Baseball Database — SeanLahman.com

Let’s Get Started

1. Use RStudio to load the CSV files for the batting data and the salary data.

Use the following commands in R to do so. The file names must be quoted strings, and the paths are relative to the working directory.

batting <- read.csv("Batting.csv")
salaries <- read.csv("Salaries.csv")

2. Include the required packages for manipulation and plotting

library(ggplot2)
library(dplyr)

3. Check the structure of the batting data set.

You can use the function head() to check the first 6 rows of the dataset, and str() to see the type of each column.
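As a minimal sketch, the inspection step could be wrapped in a small helper. The name inspect_batting is hypothetical and not part of the original analysis; it only combines two base R calls.

```r
# inspect_batting() is a hypothetical helper, not part of the original post.
inspect_batting <- function(df, n = 6) {
  str(df)      # prints each column's name, type and first few values
  head(df, n)  # returns the first n rows; head() defaults to 6
}

# Usage, assuming `batting` has been loaded from Batting.csv:
# inspect_batting(batting)
```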

4. Adding additional columns to the dataset.

According to Billy Beane and Paul DePodesta, to judge the caliber of the players we need a few extra statistics: the Batting Average (BA), the On Base Percentage (OBP) and the Slugging Percentage (SLG).

Adding the Batting Average.

The batting average is calculated as the number of hits divided by the number of at bats.

BA = H / AB

Here H is the number of hits and AB is the number of at bats, that is, the player’s official turns at the plate (walks, hit-by-pitches and sacrifices are not counted).
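The formula above translates into a one-line column assignment. This is a sketch that assumes the batting data has the Lahman columns H and AB; the helper name add_batting_average and the new column name BA are my own choices.

```r
# add_batting_average() is a hypothetical helper name used for illustration.
# It expects the Lahman columns H (hits) and AB (at bats).
add_batting_average <- function(df) {
  df$BA <- df$H / df$AB  # players with AB == 0 get NaN (0 / 0)
  df
}

# Usage, assuming `batting` is already loaded:
# batting <- add_batting_average(batting)
```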

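Adding the OBP and SLG.

The on-base percentage refers to how frequently a batter reaches base. The slugging percentage measures a hitter’s batting productivity by weighting each hit by the number of bases it earns.

OBP = (H + BB + HBP) / (AB + BB + HBP + SF)

SLG = (1B + 2 × 2B + 3 × 3B + 4 × HR) / AB

Here BB is walks, HBP is hit by pitch, SF is sacrifice flies, and 1B, 2B, 3B and HR are singles, doubles, triples and home runs.

Adding OBP and SLG columns to the dataset.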
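One way to add both columns is sketched below, using the standard definitions of OBP and SLG. It assumes the file was read with read.csv(), which renames the Lahman columns 2B and 3B to X2B and X3B. The helper name add_obp_slg and the intermediate X1B column are assumptions of this sketch.

```r
# add_obp_slg() is a hypothetical helper name used for illustration.
# Assumes read.csv() column names: H, X2B, X3B, HR, BB, HBP, SF, AB.
add_obp_slg <- function(df) {
  # Singles are the hits that were not doubles, triples or home runs
  df$X1B <- df$H - df$X2B - df$X3B - df$HR
  # On-base percentage: times on base over plate appearances that count
  df$OBP <- (df$H + df$BB + df$HBP) / (df$AB + df$BB + df$HBP + df$SF)
  # Slugging percentage: total bases per at bat
  df$SLG <- (df$X1B + 2 * df$X2B + 3 * df$X3B + 4 * df$HR) / df$AB
  df
}

# Usage: batting <- add_obp_slg(batting)
```

Rows where HBP or SF is missing will get NA for OBP, which is one more reason to restrict the data to recent seasons before analysing it.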

5. Merging Salary data and the batting data

We have two datasets: Batting.csv contains the batting stats of the players, and Salaries.csv contains the salaries of the players for every year.

We merge the two because we don’t just want the best players; we want the most undervalued ones.

Make sure that the batting data contains only the stats from 1985 onward, since the salary data only starts in 1985.

We create a new dataset named combo by merging the two datasets with the merge() function on the two columns that identify a player season in both, playerID and yearID.
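The trimming and merging described here might look like the following sketch. The helper name merge_batting_salaries is hypothetical; subset() and merge() are base R.

```r
# merge_batting_salaries() is a hypothetical helper name used for illustration.
merge_batting_salaries <- function(batting, salaries) {
  # Keep only seasons that can have a matching salary record
  batting <- subset(batting, yearID >= 1985)
  # Inner join on player and season; other columns shared by both tables
  # (for example teamID) come back with .x / .y suffixes
  merge(batting, salaries, by = c("playerID", "yearID"))
}

# Usage: combo <- merge_batting_salaries(batting, salaries)
```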

Identifying the stats of the lost players

As mentioned earlier, the Oakland A’s lost 3 key players: Jason Giambi, Johnny Damon and Gustavo “Ray” Olmedo. To find replacements, we first have to analyze the stats of these players and then look for candidates who match them.

1. Create a dataset consisting of these 3 players.

I’ve named it unlucky3 for a reason.
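A sketch of how unlucky3 could be built from the merged data follows. Only giambja01 appears in this article; the other two IDs are my assumptions about which Lahman playerID values were used, so check them against the database’s player table before relying on them. The helper name pick_lost_players is also hypothetical.

```r
# Only "giambja01" is named in the article; the other two IDs are ASSUMED.
lost_ids <- c("giambja01", "damonjo01", "saenzol01")

# pick_lost_players() is a hypothetical helper name used for illustration.
pick_lost_players <- function(combo, ids = lost_ids) {
  subset(combo, playerID %in% ids)  # keep every season of the listed players
}

# Usage: unlucky3 <- pick_lost_players(combo)
```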

2. Filter the data further.

Since these players left after the 2001 season and we are looking for replacements based on that season, let’s filter the dataset to yearID 2001.

We also keep only the columns that are of use to us.
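The two filtering steps can be sketched as below. The article does not list which columns were kept, so the default in keep is an assumption, as are the column names BA, OBP and SLG for the derived statistics and the helper name lost_in_2001.

```r
# lost_in_2001() is a hypothetical helper; the `keep` default is an assumption.
lost_in_2001 <- function(unlucky3,
                         keep = c("playerID", "AB", "BA", "OBP", "SLG")) {
  unlucky3 <- subset(unlucky3, yearID == 2001)  # the season before they left
  unlucky3[, keep, drop = FALSE]                # only the columns of interest
}

# Usage: unlucky3 <- lost_in_2001(unlucky3)
```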

Now that we have obtained the stats of the 3 players who were lost, let’s find suitable replacement players.

Finding the replacement players

The replacement players are chosen according to three constraints:

The total combined salary of the three players cannot exceed 15 million dollars.
Their combined number of At Bats (AB) needs to be equal to or greater than that of the lost players.
Their mean OBP has to be equal to or greater than the mean OBP of the lost players.

Following these constraints, we can obtain a list of potential replacements.
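To make the three rules concrete, here is a hedged sketch of a check that takes any trio of candidates and the lost players and reports whether the trio qualifies. The function name meets_constraints and its argument names are hypothetical; both data frames are assumed to have AB and OBP columns, and the candidates a salary column.

```r
# meets_constraints() is a hypothetical helper name used for illustration.
meets_constraints <- function(candidates, lost, salary_cap = 15e6) {
  sum(candidates$salary) <= salary_cap &&      # rule 1: combined salary cap
    sum(candidates$AB) >= sum(lost$AB) &&      # rule 2: at least as many at bats
    mean(candidates$OBP) >= mean(lost$OBP)     # rule 3: at least the same mean OBP
}

# Usage: meets_constraints(three_candidates, unlucky3)
```

This checks one trio at a time. The rest of the article takes a simpler route and filters individual players so that any three of them are likely to pass.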

1. Filtering the available players based on yearID

We need players for the yearID 2001. To filter further, we can use a simple scatter plot to identify sensible ranges. The first filter can be done with the code below.

availplayers <- filter(combo,yearID==2001)

2. Scatter plot to identify ranges for filtration

When we draw a scatter plot with the On Base Percentage (OBP) on the x axis and the salary of those players on the y axis, we can arrive at a few conclusions:

An OBP of 1.0 means that the batter reached base in every plate appearance, which usually means that he came to the plate only once or twice.

An OBP of 0.00 means that the batter never reached base, again usually over a handful of plate appearances. These data points are not useful and can be removed.
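A scatter plot like the one discussed could be produced with ggplot2 roughly as follows. This is a sketch: the function name plot_obp_salary is hypothetical, and it assumes the data frame has OBP and salary columns.

```r
library(ggplot2)

# plot_obp_salary() is a hypothetical helper name used for illustration.
plot_obp_salary <- function(availplayers) {
  ggplot(availplayers, aes(x = OBP, y = salary)) +
    geom_point()  # one point per player season
}

# Usage: print(plot_obp_salary(availplayers))
```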

So we can filter the data such that the OBP is between 0 (not inclusive) and 0.50 (inclusive). We can also set the maximum salary to around 7 million per player, since we are trying to minimize costs.
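These ranges could be applied as in the sketch below. The helper name filter_by_obp_salary is hypothetical, and the 7 million cap is taken literally from the text as an inclusive upper bound, which is an assumption. Base subset() is used so the block runs without extra packages; dplyr’s filter() would express the same condition.

```r
# filter_by_obp_salary() is a hypothetical helper name used for illustration.
filter_by_obp_salary <- function(availplayers, max_obp = 0.5, max_salary = 7e6) {
  # subset() also drops rows where the condition is NA (for example OBP = NaN)
  subset(availplayers, OBP > 0 & OBP <= max_obp & salary <= max_salary)
}

# Usage: availplayers <- filter_by_obp_salary(availplayers)
```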

3. Further Filtering based on At Bats (AB)

The combined number of At Bats should be greater than or equal to that of the lost players. The sum of the At Bats for the lost players was close to 1500, so the average is close to 500. As a simple way to guarantee the combined total, we take only those players whose At Bats are greater than or equal to 500.

We’ve identified about 81 candidates who satisfy the above conditions.
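The at-bat threshold is a single comparison, sketched here with a hypothetical helper name. Requiring each player to reach 500 is stricter than the rule itself, since any three such players total at least 1500.

```r
# filter_by_at_bats() is a hypothetical helper name used for illustration.
filter_by_at_bats <- function(availplayers, min_ab = 500) {
  subset(availplayers, AB >= min_ab)
}

# Usage: availplayers <- filter_by_at_bats(availplayers)
```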

4. Order the players based on higher OBP.

Sorting by OBP and showing the salaries alongside, the players can be listed as follows. The higher the OBP, the better the player is at getting on base.
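The ordering step might be written as in this sketch. The helper name rank_by_obp, the choice of displayed columns and the default of ten rows are assumptions, not details given in the article.

```r
# rank_by_obp() is a hypothetical helper name used for illustration.
rank_by_obp <- function(availplayers, n = 10) {
  # Highest OBP first
  ranked <- availplayers[order(availplayers$OBP, decreasing = TRUE), ]
  head(ranked[, c("playerID", "OBP", "AB", "salary")], n)
}

# Usage: rank_by_obp(availplayers)
```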

In the above final data set, we cannot include the first player, with ID giambja01, as he is one of the players who left.

All the other players in that data frame satisfy the above conditions and can serve as potential replacements.
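Rather than skipping that row by eye, the departed players can be dropped explicitly. This is a sketch with a hypothetical helper name; the default only lists giambja01 because that is the ID the article mentions.

```r
# drop_departed() is a hypothetical helper name used for illustration.
drop_departed <- function(players, departed = "giambja01") {
  subset(players, !(playerID %in% departed))
}

# Usage: candidates <- drop_departed(availplayers)
```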

The final list of available players looks something like this.

Conclusion

In conclusion, we have found a list of players who can serve as potential replacements for the 3 who left. These players have similar stats and are undervalued, which makes them viable options for the Oakland A’s.

We do have to keep a few practical things in mind. Most of these players are already under contract, so acquiring them would take a few more steps.

Hence the exploratory data analysis of the money ball data set has been completed successfully.

Written by R Aisvath