Dingers Through Data: What Exploratory Data Analysis in R Tells Us About MLB’s Steroid Era

Peter Carley
Published in The Startup
7 min read · Nov 21, 2020

Intro:

Growing up at the tail end of the steroid era, I was captivated by baseball as a child. From playing wiffleball with the Mark McGwire Vortex power bat to watching Sammy Sosa get busted with a corked bat, the notion that people cut corners for power was instilled in me early. As time has passed, the juiced-up superstars of the MLB have finished their careers, but their residual effects are still apparent in both their reputations and their statistics. In the following analysis, I explore MLB hitting data from 1940–2015 to understand key statistical differences between MLB hitters over time. To import, transform, and visualize the data, I used RStudio’s ggplot2 and tidyverse packages.

Extracting and Importing the Data:

To assist with my analysis of the steroid era, I used the “History of Baseball” datasets from Kaggle (https://www.kaggle.com/seanlahman/the-history-of-baseball). Because I was only interested in hitting statistics from 1940 onward, I examined only player and hitting data from that period. This came from two datasets: “Player” and “Batting”. Finally, I created my own dataset in Microsoft Excel, called “Mitchell Report Player List”, which contained a list of players mentioned in the Mitchell Report, an independent investigation into the use of anabolic steroids and HGH in the MLB.

The “Batting” dataset contained hitting statistics for each player in the MLB by year. This amounted to over 100,000 observations (rows) and 22 different variables. Some of these variables include home runs, batting average, slugging percentage, hits, and RBIs. The “Player” dataset contained personal information about each player ID in the batting dataset. Meanwhile, the Mitchell Report list contained just the full names of the players who were mentioned. To import the data, I used the common read_csv and read_excel functions in R (from the readr and readxl packages).

Transforming Data

Before I began my exploratory data analysis, I had to do some data cleaning. First, I needed to join relevant information about each player to their hitting statistics by their unique identifier (player_id). Additionally, because the number of games a player appears in changes from year to year, I wanted measures that were normalized by games played, so I converted some figures (home runs, RBIs, etc.) to a per-game basis. To ensure a sufficient sample size, I excluded any observations where the number of at bats was under 100. Finally, to create a variable that showed the YOY jump in home runs, I grouped the data by player ID and arranged it by year so I could use R’s “lag” function, which enabled me to pull the previous qualifying year’s HR/game. This, in turn, allowed me to calculate a “JUMP” variable, which indicates the change in home runs/game a player had from the previous year. Below, I have posted the code I used to join and clean the data:
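
The steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the original code (which lives in the GitHub repository linked at the end of the article). The column names (player_id, year, g, ab, hr, rbi, name_first, name_last) are assumptions about the Kaggle files, and the toy data has one row per player and season.

```r
library(dplyr)

# Toy stand-ins for the "Batting" and "Player" tables (column names assumed)
batting <- data.frame(
  player_id = c("a01", "a01", "a01", "b02", "b02"),
  year      = c(1998, 1999, 2000, 1999, 2000),
  g         = c(150, 155, 40, 120, 130),
  ab        = c(550, 560, 90, 400, 450),
  hr        = c(20, 45, 3, 10, 12),
  rbi       = c(80, 120, 10, 50, 60)
)
player <- data.frame(
  player_id  = c("a01", "b02"),
  name_first = c("Al", "Bo"),
  name_last  = c("Alpha", "Beta")
)

hitting <- batting %>%
  left_join(player, by = "player_id") %>%   # attach player details to each season
  filter(ab >= 100) %>%                     # drop seasons with fewer than 100 at bats
  mutate(hr_per_game  = hr / g,             # normalize counting stats by games played
         rbi_per_game = rbi / g) %>%
  group_by(player_id) %>%
  arrange(year, .by_group = TRUE) %>%       # order each player's seasons by year
  mutate(jump = hr_per_game - lag(hr_per_game)) %>%  # change vs. previous qualifying season
  ungroup()
```

Because the at-bat filter runs before the lag, each jump compares against the previous qualifying season rather than the previous calendar year. A player's first qualifying season has no prior value, so its jump is NA.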

Exploratory Data Analysis

Below are the distributions I found for some of the key hitting statistics, as well as how the means and medians changed over time:

While I expected to see an increase in power measures, both the median and mean batting averages increased from roughly .255 in 1980 to over .270 in 2000. Not only were players hitting for more power but also for better contact. It’s interesting to see how starkly the batting averages dropped back down after the early 2000's.
The trend towards power hitting hasn’t been linear: home runs per game were relatively stagnant from the 1960s through the 1980s, then rose sharply from the mid 1980s through the early 2000s. Given the right-skewed distribution in HR/game, it is easy to see that outliers drive the mean up significantly relative to the median.
As with home runs and batting average, there is a clear indication that the late 1990s-early 2000s marked the peak of hitting performance; players averaged roughly .1 more RBIs per game in 2000 than they did in 1990. Over the course of a 162-game season, that equates to roughly 16 more runs batted in per player!

Given the sharp increase in slugging numbers in the 90s-early 2000’s, I was interested to see how the distribution of certain statistics looked between the two eras. By creating two time periods, “Pre/Post Steroid Era” and “Steroid Era (1990–2005)”, I could compare the distribution of certain variables between the two time frames. While the time periods are arbitrary (steroid use certainly has existed outside of this time frame), I used these as a proxy for the era when steroid use was at its peak. In the following density plot, I show the difference in the homerun/game distribution between the “steroid era” (1990–2005) and the pre/post steroid era:

Notice the blue area of the overlapped density plot that pops out: in the 1990s-early 2000s, it seems there was a higher proportion of sluggers with greater than roughly .075 home runs per game.
The story is similar for RBIs per game: the curve is flatter in the steroid era, indicating that a higher proportion of players were posting higher RBIs/game.
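
For readers who want to reproduce this kind of comparison, here is a minimal sketch of labeling the two eras and overlaying their densities with ggplot2. The data is randomly generated for illustration, and the column names year and hr_per_game are assumptions rather than the names used in the original code.

```r
library(dplyr)
library(ggplot2)

# Toy data for illustration only (column names assumed)
set.seed(1)
hitting <- data.frame(
  year        = sample(1940:2015, 500, replace = TRUE),
  hr_per_game = rexp(500, rate = 12)
)

# Label each season with its era, using the 1990-2005 window from the article
hitting <- hitting %>%
  mutate(era = if_else(year >= 1990 & year <= 2005,
                       "Steroid Era (1990-2005)",
                       "Pre/Post Steroid Era"))

# Overlapping density curves, one per era; alpha keeps both visible
p <- ggplot(hitting, aes(x = hr_per_game, fill = era)) +
  geom_density(alpha = 0.5) +
  labs(x = "Home runs per game", y = "Density", fill = NULL)
```

Printing p draws the plot. Swapping hr_per_game for an RBIs-per-game column would give the second comparison.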

The “Jumpers”: Which Players May Have Been Juicing?

As both substantiated and rumored reports have emerged regarding MLB players’ steroid usage, there are a number of players whose statistics are worth investigating. After seeing the clear uptick in slugging statistics from the 1990s through the early 2000s, I wanted to examine which players saw major jumps in hitting performance from one season to the next. To do this, I created a variable, “Jumps in HR Per Game”, which is the difference between a player’s current-year home runs/game and their prior-year home runs/game. Additionally, I created a variable (In Mitchell Report) that indicated whether a player was mentioned in the Mitchell Report. Interestingly, in the 1990s and early 2000s, a large number of the players who showed big “jumps” in home runs per game were also mentioned in the Mitchell Report. Below is a scatter plot showing the jumps in HR/game over time, color-coded by whether a player was mentioned in the report:

As we look towards the top section of the plot between 1990–2010, there are a number of players with a large year-over-year jump in home runs per game who were also featured in the Mitchell Report.
Notice here that the mean YOY jump remains at 0 throughout time; however, there tended to be more variability in the YOY jumps in home runs per game during the “steroid era”, as shown by the lower peak and wider tails.
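
One way the “In Mitchell Report” flag could be built is by matching full names against the list, as sketched below. The names are made up for illustration and are not actual report entries. The column names (name_first, name_last, full_name) are assumptions too.

```r
library(dplyr)

# Hypothetical names for illustration only, not actual report entries
mitchell_list <- data.frame(full_name = c("Al Alpha", "Cy Gamma"))

hitting <- data.frame(
  name_first = c("Al", "Bo", "Cy"),
  name_last  = c("Alpha", "Beta", "Gamma"),
  year       = c(1999, 1999, 2001),
  jump       = c(0.157, 0.009, 0.120)
)

# Build a full name and flag seasons whose player appears on the list
hitting <- hitting %>%
  mutate(full_name = paste(name_first, name_last),
         in_mitchell_report = full_name %in% mitchell_list$full_name)
```

Matching on names is simple but imperfect: two different players who share a name would both be flagged, so spot-checking the matches is worthwhile.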

After examining the outliers (those who had a YOY jump in the top 1 percent of the distribution), I wanted to take a closer look at which players could have been driving home runs per game up through breakout “jumps” in power. The following is a scatter plot of players who had breakout jumps in the top 1 percent (I also included whether they were mentioned in the Mitchell Report):

This graph shows outliers in the “Jump” statistic: seasons where a player’s year-over-year home runs/game jumped by an amount in the top 1 percent of the distribution. I have color-coded the points by whether the player was in the Mitchell Report. While there are a number of reasons these large jumps could have existed, it doesn’t seem coincidental that the graph is heavily colored with players who had steroid allegations.
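
A percentile cutoff like this can be computed with base R’s quantile function. The sketch below assumes the threshold refers to the top 1 percent of jumps (the 99th percentile) and uses randomly generated values in place of the real “Jump” column.

```r
# Toy YOY jumps for illustration; in practice this would be the jump column
set.seed(42)
jump <- rnorm(1000, mean = 0, sd = 0.05)

# Cutoff for the top 1 percent of jumps (assumed reading of the threshold)
cutoff <- quantile(jump, probs = 0.99, na.rm = TRUE)

# Flag breakout seasons; missing jumps (first qualifying seasons) are never flagged
is_breakout <- !is.na(jump) & jump > cutoff
```

With 1,000 seasons, this flags the 10 largest jumps.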

The Outliers: Count of Players with Abnormal Homeruns/Game Over Time

Having examined the distribution of “jumps” in home runs/game, I also wanted to examine the distribution of players with outlier-level home runs/game numbers over time. To create an outlier threshold, I used the 1.5*IQR (interquartile range) rule: any home runs/game measure that was greater than the sum of the mean and 1.5x the IQR of the distribution was flagged as an outlier. In simpler terms, I set the outlier threshold for HR/game at anything over .20. In the scatter plot below, I show the players with outlier-level HR/game by year (color-coded by whether they were in the Mitchell Report):

It becomes clear that the number of data points in the outlier threshold for HR/game is much larger in the late 90s-early 2000’s. In the following scatter plot, I show the count of players with outlier-level homeruns/game over time:

While there have certainly been sluggers with outlier-levels of homeruns/game, this scatter plot shows that the number of players with abnormal home runs/game spiked in the 90’s/early 00’s. Also note the sharp decline back to normal numbers after the MLB became more stringent with its steroid policy.
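
The threshold and the yearly counts can be sketched as follows, using a handful of made-up seasons. The first threshold follows the description in the text (mean plus 1.5 times the IQR). The second, included only for comparison, is the more conventional Tukey fence (third quartile plus 1.5 times the IQR). The column names are assumptions.

```r
library(dplyr)

# Toy seasons for illustration only (column names assumed)
hitting <- data.frame(
  year        = c(1985, 1985, 1985, 1998, 1998, 1998, 1998),
  hr_per_game = c(0.05, 0.10, 0.22, 0.08, 0.21, 0.25, 0.30)
)

# Threshold as described in the text: mean + 1.5 * IQR
threshold_mean <- mean(hitting$hr_per_game) + 1.5 * IQR(hitting$hr_per_game)

# Conventional Tukey fence, for comparison: Q3 + 1.5 * IQR
threshold_tukey <- unname(quantile(hitting$hr_per_game, 0.75)) +
  1.5 * IQR(hitting$hr_per_game)

# The article states its cutoff as .20 HR/game; count outlier seasons per year
outliers_per_year <- hitting %>%
  filter(hr_per_game > 0.20) %>%
  count(year, name = "n_outliers")
```

On real data the two thresholds will generally differ, so it is worth checking which one a given cutoff corresponds to before comparing results across analyses.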

As the numbers show, the late 1990s-early 2000s were the clear peak in terms of sluggers with abnormal home runs/game numbers. This period coincided with some key moments in baseball history: the McGwire-Sosa home run race of 1998, Barry Bonds hitting 73 home runs in one season, and the Yankees winning three World Series in a row from 1998–2000 (not saying they’re cheaters, but definitely not saying they aren’t!). Prior to this period, we see that the average number of players with over .2 HR/game hovered around 10; then, in the mid 1990s-early 2000s, that number spiked to over 50 players!

Putting it all together: Creating a Potential Steroid Candidate List Through Outlier Levels of HRs/Game and YOY jumps in Power

Having examined both the outliers in home runs/game and YOY increases in home runs/game, I wanted to put together a candidate list of players who exhibited outlier qualities in both variables. Their “jumps” in power to an outlier level of home runs/game could be a clue as to which year they may have begun using steroids. Pasted below is the list of players and their respective years that I flagged, as well as some other relevant statistics that are of interest:

https://airtable.com/shr1AXqnn6MspB212/tbl9JJW3jDK6pptoA
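
A list like this can be produced by keeping only the seasons that clear both thresholds. The sketch below uses made-up seasons and assumed column names. The .20 HR/game cutoff comes from the article, while the jump cutoff is a hypothetical value chosen only for illustration.

```r
library(dplyr)

# Toy seasons for illustration only (column names assumed)
hitting <- data.frame(
  player_id   = c("a01", "a01", "b02", "c03"),
  year        = c(1998, 1999, 1999, 2001),
  hr_per_game = c(0.13, 0.29, 0.21, 0.09),
  jump        = c(NA, 0.16, 0.01, 0.07)
)

hr_cutoff   <- 0.20  # outlier HR/game threshold quoted in the article
jump_cutoff <- 0.05  # hypothetical breakout cutoff, for illustration only

# Keep seasons that are outliers in both HR/game and YOY jump
candidates <- hitting %>%
  filter(hr_per_game > hr_cutoff, !is.na(jump), jump > jump_cutoff) %>%
  arrange(desc(jump))
```

In this toy data only one season qualifies: player a01 in 1999, who both jumped sharply and landed above the outlier line.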

Conclusion:

While the superstars of the steroid era have come and gone, their statistics are indelible to baseball’s history. In this analysis, I hoped to shed some light on (1) how different generalized statistics were between the “steroid era” and the “pre/post” era, and (2) who accounted for these differences. While it is impossible to pinpoint exactly which players used steroids during this era, the sheer number of players with outlier slugging statistics hints that many of them were likely using outside help. It seems like the steroid era was characterized by slugging, but it would be interesting to see how key pitching performance indicators shifted during this time period as well.

PS: If you enjoyed this article, or have any feedback, I would love to hear it. I also posted my R code on GitHub (https://github.com/pcarley1/Dingers-Through-Data-Project). Feel free to add me on LinkedIn or shoot me a message:

https://www.linkedin.com/in/peter-carley-24230b9b/

Peter Carley
Data Scientist with a passion for harnessing data to uncover new insights. Linkedin: https://www.linkedin.com/in/peter-c-24230b9b/