Dingers Through Data: What Exploratory Data Analysis in R Tells Us About MLB’s Steroid Era
Intro:
Growing up at the tail end of the steroid era, I was captivated by baseball as a child. From playing wiffleball with the Mark McGwire Vortex power bat to watching Sammy Sosa get busted with a corked bat, the notion that people cut corners for power was instilled in me early. As time has passed, the juiced-up superstars of MLB have finished their careers, but their effects are still apparent in both their reputations and their statistics. In the following analysis, I explore MLB hitting data from 1940–2015 to understand key statistical differences between MLB hitters over time. To import, transform, and visualize the data, I used RStudio’s tidyverse packages, including ggplot2.
Extracting and Importing the Data:
For my analysis of the steroid era, I used the “History of Baseball” datasets from Kaggle (https://www.kaggle.com/seanlahman/the-history-of-baseball). Because I was only interested in hitting statistics from 1940 onward, I only examined player and hitting data from that period. This came from two datasets: “Player” and “Batting”. In addition, I created my own dataset in Microsoft Excel, called “Mitchell Report Player List”, which contains a list of players mentioned in the Mitchell Report, an independent investigation into the use of anabolic steroids and HGH in MLB.
The “Batting” dataset contains hitting statistics for each MLB player by year. This amounts to over 100,000 observations (rows) and 22 different variables. Some of these variables include home runs, batting average, slugging percentage, hits, and RBIs. The “Player” dataset contains personal information about each player ID in the batting dataset. Meanwhile, the Mitchell Report list contains just the full names of the players who were mentioned. To import the data, I used the read_csv and read_excel functions (from the readr and readxl packages) in R.
Transforming Data
Before I began my exploratory data analysis, I had to do some data cleaning. First, I needed to join relevant information about each player to their hitting statistics by their unique identifier (player_id). Additionally, because the number of games a player appears in changes from year to year, I wanted measures that were normalized by games played. To do so, I converted some figures (home runs, RBIs, etc.) to per-game rates. To ensure a sufficient sample size, I excluded any observations with fewer than 100 at-bats. Finally, to create a variable that showed the year-over-year (YOY) jump in home runs, I grouped the data by player ID and arranged it by year so I could use R’s “lag” function, which let me pull the previous qualifying year’s HR/Game. This, in turn, allowed me to calculate a “JUMP” variable, which indicates the change in home runs/game a player had from the previous qualifying year. Below, I have posted the code I used to join and clean the data:
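The full script is in the GitHub repository linked at the end of this post; what follows is only a minimal sketch of the steps described above, run on a few made-up rows. The column names (player_id, year, g, ab, hr, rbi, name_first, name_last) are assumptions about the Kaggle files, and the sketch ignores details such as players with more than one team stint in a season.

```r
library(dplyr)

# Toy stand-ins for the "Batting" and "Player" tables (all values are made up;
# column names are assumed, not confirmed against the Kaggle files).
batting <- tibble(
  player_id = c("a01", "a01", "a01", "b02", "b02"),
  year      = c(1996, 1997, 1998, 1997, 1998),
  g         = c(150, 40, 155, 120, 130),
  ab        = c(550, 90, 560, 400, 450),
  hr        = c(30, 2, 62, 12, 13),
  rbi       = c(100, 10, 140, 50, 60)
)
player <- tibble(
  player_id  = c("a01", "b02"),
  name_first = c("Al", "Bo"),
  name_last  = c("Able", "Baker")
)

hitting <- batting %>%
  left_join(player, by = "player_id") %>%       # attach player info
  filter(year >= 1940, ab >= 100) %>%           # keep qualifying seasons only
  mutate(
    hr_per_game  = hr / g,                      # normalize by games played
    rbi_per_game = rbi / g
  ) %>%
  group_by(player_id) %>%
  arrange(year, .by_group = TRUE) %>%           # order seasons within each player
  mutate(
    prev_hr_per_game = lag(hr_per_game),        # previous qualifying season
    jump = hr_per_game - prev_hr_per_game       # YOY change in HR/game
  ) %>%
  ungroup()
```

Because the at-bat filter runs before the lag, a player's low-playing-time season is skipped and the jump is measured against the last season that qualified. A player's first qualifying season has no prior value, so its jump is NA.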
Exploratory Data Analysis
Below are the distributions I found for some of the key hitting statistics, as well as how the means and medians changed over time:
Given the sharp increase in slugging numbers in the 1990s and early 2000s, I was interested to see how the distributions of certain statistics differed inside and outside that period. By creating two time periods, “Pre/Post Steroid Era” and “Steroid Era (1990–2005)”, I could compare the distribution of certain variables between the two time frames. While the cutoffs are arbitrary (steroid use certainly has existed outside of this time frame), I used them as a proxy for the era when steroid use was at its peak. In the following density plot, I show the difference in the home runs/game distribution between the “steroid era” (1990–2005) and the pre/post steroid era:
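A comparison like this can be set up by labeling each season with its era and overlaying two density curves. The following is a rough sketch on made-up values, not the code behind the figure; the column names year, hr_per_game, and era are assumptions.

```r
library(dplyr)
library(ggplot2)

# Made-up player-seasons, one every five years from 1940 to 2015.
hitting <- tibble(
  year        = seq(1940, 2015, by = 5),
  hr_per_game = seq(0.02, 0.17, by = 0.01)
)

# Label each season by era, using the 1990-2005 window from the text.
hitting <- hitting %>%
  mutate(era = if_else(year >= 1990 & year <= 2005,
                       "Steroid Era (1990-2005)",
                       "Pre/Post Steroid Era"))

# Overlaid density curves, one per era.
era_density <- ggplot(hitting, aes(x = hr_per_game, fill = era)) +
  geom_density(alpha = 0.5) +
  labs(x = "Home runs per game", y = "Density", fill = "Era")
```

Printing era_density draws the plot. The semi-transparent fill (alpha = 0.5) keeps both curves visible where they overlap.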
The “Jumpers”: Which Players May Have Been Juicing?
As both substantiated and rumored reports have emerged regarding MLB players’ steroid usage, there are a number of players whose statistics are worth investigating. After seeing the clear uptick in slugging statistics from the 1990s through the early 2000s, I wanted to examine which players saw major jumps in hitting performance from one season to the next. To do this, I created a variable, “Jumps in HR Per Game”, which is the difference between a player’s current-year home runs/game and their prior-year home runs/game. Additionally, I created a variable (In Mitchell Report) that indicates whether a player was mentioned in the Mitchell Report. Interestingly, in the 1990s and early 2000s, a large number of the players who showed big “jumps” in home runs per game were also mentioned in the Mitchell Report. Below is a scatter plot showing the jumps in HR/game over time, color-coded by whether a player was mentioned in the report:
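One way to build such a flag is to assemble each player's full name and check it against the Mitchell Report list. This is a sketch under assumptions: the column names (name_first, name_last, full_name) and all rows are hypothetical, and matching on names alone can mislabel players who share a name.

```r
library(dplyr)
library(ggplot2)

# Made-up rows standing in for the cleaned hitting data.
hitting <- tibble(
  name_first = c("Al", "Bo", "Cy"),
  name_last  = c("Able", "Baker", "Cole"),
  year       = c(1998, 1999, 2001),
  jump       = c(0.20, -0.03, 0.11)
)

# Made-up stand-in for the "Mitchell Report Player List" spreadsheet.
mitchell <- tibble(full_name = c("Al Able", "Cy Cole"))

# Flag a player-season when the player's full name appears on the list.
hitting <- hitting %>%
  mutate(
    full_name          = paste(name_first, name_last),
    in_mitchell_report = full_name %in% mitchell$full_name
  )

# Jumps in HR/game over time, colored by the flag.
jump_plot <- ggplot(hitting, aes(x = year, y = jump, color = in_mitchell_report)) +
  geom_point() +
  labs(x = "Year", y = "Jump in HR/game", color = "In Mitchell Report")
```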
After examining the outliers (those whose YOY jump was in the top 1 percent of all jumps), I wanted to take a closer look at which players could have been driving home runs per game up through breakout “jumps” in power. The following is a scatter plot of players whose breakout jumps fell in that top 1 percent (I also included whether they were mentioned in the Mitchell Report):
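A minimal sketch of this filter, assuming the cutoff is the 99th percentile of the jump distribution (that is, the top 1 percent of jumps) and using made-up values:

```r
library(dplyr)

# 200 made-up jump values: 0.001, 0.002, ..., 0.200.
hitting <- tibble(
  player_id = sprintf("p%03d", 1:200),
  jump      = (1:200) / 1000
)

# 99th percentile of the jump distribution (R's default quantile method).
jump_cutoff <- unname(quantile(hitting$jump, probs = 0.99, na.rm = TRUE))

# Keep only the player-seasons above the cutoff.
breakouts <- hitting %>%
  filter(jump > jump_cutoff)
```

With 200 rows, only the two largest jumps clear the cutoff. Setting na.rm = TRUE matters on real data, since each player's first qualifying season has no prior year and therefore no jump.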
The Outliers: Count of Players with Abnormal Home Runs/Game Over Time
Having examined the distribution of “jumps” in home runs/game, I also wanted to examine the players with outlier-level home runs/game numbers over time. To create an outlier threshold, I used the 1.5*IQR (interquartile range) rule: any home runs/game figure greater than the sum of the mean and 1.5x the IQR of the distribution was flagged as an outlier. In practice, this put the outlier threshold at 0.20 HR/game, and anything above it was flagged. In the scatter plot below, I show the players with outlier-level HR/game by year (color-coded by whether they were in the Mitchell Report):
It becomes clear that the number of data points above the outlier threshold for HR/game is much larger in the late 1990s and early 2000s. In the following scatter plot, I show the count of players with outlier-level home runs/game over time:
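The threshold described above takes only a couple of lines to compute. The sketch below follows the description literally (mean plus 1.5 times the IQR) and, for comparison, also computes the more conventional Tukey fence (third quartile plus 1.5 times the IQR). The HR/game values are made up.

```r
# Made-up HR/game values for eight player-seasons.
hr_per_game <- c(0.02, 0.04, 0.05, 0.06, 0.08, 0.10, 0.12, 0.30)

# Threshold as described in the text: mean + 1.5 * IQR.
threshold_as_described <- mean(hr_per_game) + 1.5 * IQR(hr_per_game)

# Conventional Tukey upper fence, for comparison: Q3 + 1.5 * IQR.
tukey_fence <- unname(quantile(hr_per_game, 0.75)) + 1.5 * IQR(hr_per_game)

# Flag values above the threshold as outliers.
is_outlier <- hr_per_game > threshold_as_described
```

On these toy values the two cutoffs differ slightly, but both flag only the 0.30 season. On a right-skewed distribution like HR/game, the choice between them can change which borderline seasons get flagged.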
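Counting the flagged player-seasons per year is a short aggregation. A sketch, assuming the 0.20 HR/game cutoff from above and made-up rows:

```r
library(dplyr)

# Made-up player-seasons.
hitting <- tibble(
  year        = c(1985, 1985, 1998, 1998, 1998, 2001),
  hr_per_game = c(0.25, 0.10, 0.30, 0.22, 0.21, 0.40)
)

# Number of player-seasons above the outlier cutoff in each year.
outliers_by_year <- hitting %>%
  filter(hr_per_game > 0.20) %>%
  count(year, name = "n_outliers")
```

Note that years with no outliers are dropped by the filter rather than reported as zero, which matters if the counts are plotted as a continuous series.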
As the numbers show, the late 1990s and early 2000s were the clear peak in terms of sluggers with abnormal home runs/game numbers. This period coincided with some key moments in baseball history: the McGwire-Sosa home run race of 1998, Barry Bonds hitting 73 home runs in one season, and the Yankees winning three World Series in a row from 1998–2000 (not saying they’re cheaters, but definitely not saying they aren’t!). Prior to this period, the number of players per year with over 0.20 HR/game hovered around 10; then, from the mid-1990s through the early 2000s, that number spiked to over 50 players!
Putting it all together: Creating a Potential Steroid Candidate List Through Outlier Levels of HRs/Game and YOY jumps in Power
Having examined both the outliers in home runs/game and the YOY increases in home runs/game, I wanted to put together a candidate list of players who exhibited outlier qualities in both variables. Their “jumps” in power to an outlier level of home runs/game could be a clue as to which year they may have begun using steroids. Below is the list of players and years that I flagged, along with some other relevant statistics:
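Combining the two criteria amounts to filtering on both at once. In this sketch the HR/game cutoff is the 0.20 from earlier, while the jump cutoff of 0.08 is a hypothetical placeholder, not the value used in the analysis; the rows are made up.

```r
library(dplyr)

# Made-up player-seasons with HR/game and YOY jump already computed.
hitting <- tibble(
  player_id   = c("a01", "a01", "b02", "c03"),
  year        = c(1997, 1998, 1998, 2001),
  hr_per_game = c(0.15, 0.40, 0.25, 0.30),
  jump        = c(NA, 0.25, 0.02, 0.10)
)

hr_cutoff   <- 0.20   # outlier threshold for HR/game, from the text
jump_cutoff <- 0.08   # hypothetical placeholder for the jump cutoff

# Seasons that are outliers on both measures, biggest jumps first.
candidates <- hitting %>%
  filter(hr_per_game > hr_cutoff, jump > jump_cutoff) %>%
  arrange(desc(jump))
```

Seasons with a missing jump (a player's first qualifying year) drop out automatically, because filter() discards rows where a condition evaluates to NA.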
Conclusion:
While the superstars of the steroid era have come and gone, their statistics are an indelible part of baseball’s history. In this analysis, I hoped to shed some light on (1) how different hitting statistics were between the “steroid era” and the “pre/post” era, and (2) who accounted for these differences. While it is impossible to pinpoint exactly which players used steroids during this era, the sheer number of players with outlier slugging statistics hints that many of them were likely getting outside help. The steroid era seems to have been characterized by slugging, but it would be interesting to see how key pitching performance indicators shifted during this time period as well.
PS: If you enjoyed this article, or have any feedback, I would love to hear it. I also posted my R code on GitHub (https://github.com/pcarley1/Dingers-Through-Data-Project). Feel free to add me on LinkedIn or shoot me a message: