Is the Quality of Play in MLB Decreasing?

Published in Coinmonks · 6 min read · Jul 18, 2018

The Context:

We previously established that attendance at MLB games is, on average, decreasing. The core question of that analysis was whether temperature had an impact on attendance per game (i.e., did months with a lower average temperature also have lower attendance per game?). After exploring, analyzing, and modeling almost three decades of MLB game and weather data, the conclusion was no: temperature and attendance do not display a meaningful relationship in aggregate.

The Goal:

Knowing that attendance is decreasing, and that temperature is not a significant correlating factor, we’ll use data to look for other potentially impactful factors. In this three-post mini-series, we’ll try to create data-backed measures for a high-level, ambiguous variable, quality of play, and use a series of analysis techniques in Python to look for meaningful differences.

Quality of play can take many forms and I welcome your thoughts on additional variables to think about. For these posts, I’ve decided to dive into two distinct streams. The first (code here) is exploring changes in the players entering the league, while the second (code here) is digging into metrics related to the final standings and performance of teams over time.

We’ll look at a variety of factors related to rookies entering the league (age, amount of time in the minors, draft round, and more) to see whether there have been any meaningful changes that could give insights into quality of play. I want to be explicit here that this part of the analysis is exploratory, not explanatory.

This is because the variables we look at do not inherently tell us the quality of the player, but rather describe demographic factors about their journey to MLB. I’ll offer some of my own thoughts on the implications or underlying root causes, but broadly speaking, what we’ll be looking for is whether anything has changed (knowing that attendance has) and, if so, what has changed, rather than how the change might affect attendance.

The second stream will be more concrete because it works with verifiable and digestible historical data: the wins and losses of MLB teams. In that analysis, we’ll look at a number of variables related to differences in win percentage and the gap between league leaders and laggards.

The Process

Two broad data sets will be needed for this analysis, complemented by a series of one-off needs. The first is a data set of players who reached the major leagues, including variables like draft year and length of time in the majors. The second is a data set of MLB results over a specified time period. We’ll then slice and dice the data in a number of ways and with a variety of variables to look for meaningful results.

Gathering the Data

The fun part about learning a programming language is that you’re not bound by openly available, pre-aggregated data sets; you can create your own. I’m sure there are a variety of data sets on the internet that have some or most of what I was looking for, but for the practice, I decided to make my own.

I found a website that has tracked and stored all players who made the major leagues since 1871. Data from the earlier years is sometimes missing or incomplete, but I didn’t need to go that far back anyway.

So, I wrote code to gather every player in their database and store their name, position, date of birth, time in the major leagues, draft information, and status (current, retired, minor leagues, etc.).
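As a rough illustration of the parsing half of such a scrape, here is a minimal sketch that pulls rows out of an HTML table using only the Python standard library. The markup, the column names, and the `TableRows` helper are all assumptions made for the example; the real site's pages are laid out differently, the fetching step is left out so the sketch runs offline, and the actual code may well use a different library.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup standing in for a downloaded page.
sample_html = """
<table>
  <tr><th>name</th><th>position</th><th>mlb_career</th></tr>
  <tr><td>Player A</td><td>P</td><td>2012-2018</td></tr>
  <tr><td>Player B</td><td>SS</td><td>2010-2010</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)

# First row is the header; the rest become one dict per player.
header, *records = parser.rows
players = [dict(zip(header, record)) for record in records]
```

Each entry in `players` is a plain dictionary keyed by column name, which can be handed to `pandas.DataFrame` for the cleaning steps later on.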

The second data set that I needed was a list of historical results for major league seasons. I did a bit of searching, but ultimately decided to scrape this data too, because you can never get enough practice. I found season-by-season data on baseball-reference.com. Due to the way that they structured each page, I was able to pull data from 1970 onward to run the analyses, though their data goes back much further.

One more note on data collection, which I’ll explain in more detail later: I needed a small data set that would let me calculate the number of teams in the league each year, since MLB has expanded over the time period analyzed. This was another opportunity to practice web scraping, even though it wouldn’t have been too difficult to find the numbers online and copy/paste them. I scraped a table from Wikipedia that listed each team and its founding year, which was then used to calculate teams per year.
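The teams-per-year calculation can be sketched as follows. This is a minimal example under stated assumptions: the team names and founding years are placeholders rather than the scraped Wikipedia values, and the column names and the `teams_per_year` helper are hypothetical.

```python
import pandas as pd

# Toy stand-in for the scraped table; names and years are placeholders.
teams = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C", "Team D"],
    "founded": [1901, 1969, 1977, 1993],
})


def teams_per_year(teams, start, end):
    """Count the teams whose founding year is on or before each season."""
    years = range(start, end + 1)
    counts = [int((teams["founded"] <= year).sum()) for year in years]
    return pd.Series(counts, index=pd.Index(years, name="year"), name="n_teams")


counts = teams_per_year(teams, 1970, 1995)
```

A team counts toward every season from its founding year onward, so the series steps up in each expansion year. The sketch assumes no team ever leaves the league during the period.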

I also used some of the data that I gathered from the previous project in one or two instances.

Data Cleaning

Player data

The nice part about working with a pre-prepared data set is that someone has…prepared the data. When you build your own data set, the data is dirty. With the data from The Baseball Cube, this is a random sampling of the rows:

There are a lot of missing values. Some players have draft information, some don’t. The draft_info column has three different possible formats, plus a stray “UDFA” in some of the rows. The mlb_career column isn’t very helpful as-is because it’s a string range (a start year and an end year in one value), and a string can’t be filtered or compared numerically, so it has to be split before we can work with the year values.

After a series of data cleaning techniques that can be found in the code, the initially cleaned new data set looked like this:

We dropped all rows with a blank draft_info value, split the mlb_career variable into start and end variables, and converted the date_of_birth variable to a pandas datetime object. We also split the draft_info column into draft year, draft round, pick number, and draft team. Finally, the stray “UDFA” entries were turned into a dummy variable that is 1 if the player was an undrafted free agent and 0 otherwise (“UDFA” stands for “Undrafted Free Agent,” meaning the player was not chosen during a formal draft and was instead signed at another time).
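A minimal sketch of several of these steps in pandas is below. The sample rows are made up, and the string formats (a "start-end" career range, ISO dates, blanks stored as missing values), the new column names, and the `clean_players` helper are assumptions for illustration. Splitting draft_info into year, round, pick, and team is left out because it depends on the site's actual formats.

```python
import pandas as pd

# Made-up rows; the string formats are assumptions, not the site's real layout.
raw = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "date_of_birth": ["1990-04-12", "1988-11-03", "1992-07-30"],
    "mlb_career": ["2012-2018", "2010-2010", "2015-2017"],
    "draft_info": ["2008 Round 3", None, "UDFA"],
})


def clean_players(df):
    # Drop players with no draft information at all.
    out = df.dropna(subset=["draft_info"]).copy()
    # Split the "start-end" string into two integer year columns.
    years = out["mlb_career"].str.split("-", expand=True)
    out["mlb_start"] = years[0].astype(int)
    out["mlb_end"] = years[1].astype(int)
    # Parse birth dates so ages can be computed later.
    out["date_of_birth"] = pd.to_datetime(out["date_of_birth"])
    # Dummy variable: 1 for undrafted free agents, 0 otherwise.
    out["udfa"] = (out["draft_info"] == "UDFA").astype(int)
    return out.reset_index(drop=True)


players = clean_players(raw)
```

With numeric start and end years in place, a filter such as `players[players["mlb_start"] >= 2012]` becomes possible, which the original string range did not allow.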

We also added some columns and eliminated bad data that surfaced in exploratory charts; these steps can be found in the code, and some will be covered in later posts.

Game Data

The game data set was fairly clean right from the start, mostly owing to the fact that there were only a handful of columns that I needed, and the data was pretty straightforward upon entry.

The only noteworthy element was the “ — ” character in the games_back column. This character indicates that the team won its division (and therefore was 0 games back), so a simple replace with 0 solved that issue. The only other formatting step was converting the year index to datetime for plotting.
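Both fixes can be sketched in a few lines of pandas. The rows below are made up, the `team` column is a placeholder, and the example assumes the dash is stored as a Unicode em dash (written here as the escape `\u2014`); the scraped pages may use a different dash character.

```python
import pandas as pd

DASH = "\u2014"  # assumed dash character shown for division winners

# Made-up standings rows indexed by season year.
standings = pd.DataFrame(
    {
        "team": ["Team A", "Team B", "Team A"],
        "games_back": [DASH, "4.5", "12"],
    },
    index=[1970, 1970, 1971],
)

# Division winners show a dash, so treat it as zero games back,
# then convert the whole column from strings to numbers.
standings["games_back"] = (
    standings["games_back"].str.strip().replace(DASH, "0").astype(float)
)

# Convert integer years to datetime values for time-series plotting.
standings.index = pd.to_datetime(standings.index.astype(str), format="%Y")
```

The `str.strip()` call is there in case the scraped cell carries surrounding spaces, so the exact-match replace still finds the dash.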

Coming next

Okay, now that we’ve defined the problem, gathered the data, and put it into a usable format, the fun begins. The next post will cover exploratory analysis of the player data set.

Written by Jordan Bean