Data and Baseball: Some Things About The Game Can’t Be Logged In A Spreadsheet

Ainsley Cox
Fall 2023 — Information Expositions
Sep 29, 2023

Since the MLB was founded in 1876, baseball has been a heavily examined sport, not just by its fans but by the statisticians hoping to predict it. It wasn’t until the late 20th century that this data started to be digitized, which is when things became divisive. Data, as it has evolved, has arguably transformed baseball into a completely different game from the one it started as. Some argue that data should stay out of the game and out of game coverage, and that baseball experience and wisdom should be prioritized instead. But does data make baseball better in an irreplaceable way? I would argue that it does, but that doesn’t mean there is no place for experience in baseball. Quite the opposite: we need baseball data to build experience and to inform the people who are best equipped to make big, team-altering decisions. And does data really affect the number of people who want to watch baseball? Maybe not as much as people want you to believe.

Most of the datasets we have had access to in class are about baseball and were collected for the purpose of improving a team’s performance, so there was plenty to examine. But which statistics best explain a team’s performance? I turned to the pros, researching which statistics MLB players considered the most important ones gathered from their games. Overwhelmingly, they agreed that RBI (Runs Batted In) was the stat they learned the most from. My plan was to examine how RBI has changed over time as technology and data have become more integrated into baseball. I also wanted to examine baseball viewership, so I chose game attendance as a way to see how data’s increasing role in baseball has affected fan turnout.

This graph displays the RBI per decade from the Batting Dataset

First up was RBI, which I accessed through the Batting dataset. For the sake of being concise, I grouped by decade, totaling the RBIs of every team over each 10-year span and comparing each decade with the next. From the figure above, you can see that RBIs have increased steadily across the last 150 years, with a few exceptions. The drops in the 1910s and 1940s can be explained by WWI and WWII respectively. The low start in the 1870s can be explained by the fact that the MLB wasn’t founded until 1876, leaving only 4 years of data for that decade. And of course, the staggering drop from the 2010s to the 2020s can be explained by two factors: the COVID-19 pandemic, and the fact that we are only 3, almost 4, years into the decade.
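The decade grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the exact code behind the graph: the rows are made-up numbers rather than values from the Batting dataset, and the helper names `decade_of` and `total_rbi_by_decade` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (season year, RBI for one player-season).
# These values are invented for illustration, not taken from the Batting dataset.
sample_rows = [
    (1876, 20), (1879, 35),
    (1912, 90), (1918, 60),
    (2015, 100), (2019, 110),
    (2021, 45),
]

def decade_of(year):
    """Map a season year to the first year of its decade, e.g. 1918 -> 1910."""
    return (year // 10) * 10

def total_rbi_by_decade(rows):
    """Sum RBI across every row that falls in the same decade."""
    totals = defaultdict(int)
    for year, rbi in rows:
        totals[decade_of(year)] += rbi
    return dict(sorted(totals.items()))

decade_totals = total_rbi_by_decade(sample_rows)
for decade, total in decade_totals.items():
    print(f"{decade}s: {total}")
```

Because the totals are sums, a decade with fewer seasons of data (like the 1870s or the 2020s so far) will naturally come out lower, which is the effect described above.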

Another decrease shown is the one between the 2000s and the 2010s, which I would attribute to the crackdown on doping in the MLB, though even that didn’t drop performance by much. After running descriptive analysis on the RBI grouped by decade, it does seem that the average RBI per player has decreased since the 1870s, but this could be because there are now many more players than in previous decades. The max RBI per player, on the other hand, has increased, though somewhat inconsistently. For example, the max RBI in the 1910s was 130, while in the 2010s it was 139. The average RBI of the 2020s so far is 9.45, while in earlier decades it has reached as high as 26.55. It is important to note that the 2020s began with the COVID-19 pandemic, which could be another reason why the average RBI is so low right now.
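To see how the average can fall while the max rises, here is a small sketch with invented numbers. Only the two max values (130 and 139) come from the analysis above; every other value, and the `describe` helper, is a made-up assumption for illustration.

```python
from statistics import mean

# Hypothetical player-season RBI values per decade.
# Only the maxima (130 and 139) match the figures quoted above; the rest are invented.
rbi_by_decade = {
    1910: [130, 80, 60],                 # a small pool of mostly everyday players
    2010: [139, 80, 60, 5, 3, 2, 1, 0],  # similar regulars plus many part-time players
}

def describe(values):
    """Return the player count, mean, and max for one decade's RBI values."""
    return {
        "players": len(values),
        "mean": round(mean(values), 2),
        "max": max(values),
    }

summary = {decade: describe(values) for decade, values in rbi_by_decade.items()}
for decade, stats in summary.items():
    print(f"{decade}s: {stats}")
```

In this toy example the 2010s have the higher max but the lower mean, simply because more low-RBI players are included in the average. That is one way the pattern in the real data could arise, though this sketch does not prove that it is the cause.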

This graph reflects the total home game attendance per decade from the Home Games dataset

On this graph, there is no drop in attendance until the 2010s. Personally, I attribute this drop to the MLB’s crackdown on doping, which could have kept some people from attending games they perceived as rigged or unfair. Descriptive statistics for game attendance show that the minimum attendance has not been 0 since the 1910s, though there has been a decline since the 1990s. I attribute that decline to easy access to games through television rather than to a disinterest in the statistics of the game. In fact, you could argue that data has made the game more popular, and that analysis on television has pulled people away from attending games in person. However, I do not have access to data that would confirm this.

Obviously, you can’t separate data from baseball, or from any sport for that matter. It is ingrained in them, and it has evidently improved performance across the board. And while it’s true that not everything can be predicted, logged, and tracked, data makes baseball better. Data coverage may not be what the fans are looking for, but data will always be at the heart of the Great American Pastime.
