Can Temperature Predict Attendance at MLB Games?

Jordan Bean
Coinmonks
Jul 11, 2018


Part 1 of this series laid out the problem statement and what we're looking to demystify through data, then walked through acquiring and cleaning the required data. As a reminder, the question we're trying to answer is: does temperature have an effect on attendance at Major League Baseball games?

Part 2 began the exploratory data analysis with graphs of transformed sub-segments of the data. For example, we verified the trend of overall declining attendance, identified outlier data points and explanations for them, and looked at other non-weather factors that could affect attendance (e.g., time per game, offensive output).

Finally, we examined a series of scatter plots looking for a relationship between temperature and attendance per game, using the monthly average temperature and attendance per game for each city between 1990 and 2017.

Overall, there was little to no relationship between the two variables (r-squared of 0.09, where 0 = no relationship). Recognizing that in the summer months a "below mean" temperature isn't actually a problem (and may even be an advantage in some cities), we then looked for a relationship between the two variables in the colder months (April, May). Once again, there was no meaningful relationship.

The last cut we looked at was cold-weather cities (average April temperature below 55 degrees) during the spring months. If there were a relationship between the two variables, it should surely have shown up here, but it didn't. Where does this leave us?

The final two steps I'm going to take in this series (and there are surely more worth pursuing) are to: 1) attempt to model the attendance-per-game figures with the variables in the data, and 2) test for statistical significance in the difference between above- and below-mean attendance per game for spring games in cold-weather cities (next post).

The full code is available here, and any feedback is welcome!

Modeling

The goal of our modeling will be to answer the question: Can we accurately predict attendance based on temperature (and whether it is above or below the mean), month, and team?

To achieve this, I chose to attempt two different types of models: Linear regression and Random Forest. In full disclosure, this is the first time I’ve taken modeling outside of a course, so I’m sure there’s ample room for improvement, and I welcome any thoughts or ideas.

Linear regression attempts to fit a degree-1 line (a hyperplane, when there are multiple variables) to the data points by assigning a coefficient, or slope, to each variable. A random forest is a collection of decision trees whose individual predictions are aggregated into a single prediction for each observation.

The first step in modeling is to prepare the data. In this case, there are two important categorical (non-numeric) variables that I wanted to include and therefore had to transform: the home team and the month.

These variables matter because different teams have different stadium capacities and levels of popularity, and we want to account for the popularity of the home team in our predictions. We could also include a variable for the visiting team in a game-by-game data set, but that wasn't part of this analysis. Month is important because average attendance per game also varies slightly by month, so we want to include that variable too.

In order to include these variables, we "one-hot encode" them. This process creates a new column for each category and assigns a value of 1 where the category applies and 0 otherwise. In the encoded data, for example, the first five rows all correspond to the Minnesota Twins, so the Twins column holds a 1 while every other team column holds a 0. The code does this in two lines: the first creates the encoded variables, and the second joins the new columns to the broader dataset.
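The original post showed those two lines as a screenshot; here is a minimal pandas sketch of the same idea, using toy data and placeholder column names (the notebook's actual names may differ):

```python
import pandas as pd

# Toy stand-in for the monthly data; real column names may differ.
monthly = pd.DataFrame({
    'home_team': ['MIN', 'MIN', 'NYY', 'TB'],
    'month': [4, 5, 4, 7],
    'temperature': [48.0, 61.0, 52.0, 82.0],
    'above_mean_temp': [0, 1, 0, 1],
    'attendance_per_game': [21000, 23500, 36000, 15000],
})

# First line: create one 0/1 indicator column per team and per month.
# (Passing drop_first=True would avoid the perfect collinearity
# discussed later in this post.)
encoded = pd.get_dummies(monthly[['home_team', 'month']].astype(str))

# Second line: join the new columns onto the broader dataset.
monthly = monthly.join(encoded)
```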

Baseline model:

In order to determine the effectiveness of a model, it's important to establish a baseline prediction with a standardized error calculation. This creates a benchmark against which to compare other models' predictions. If a model can improve on (lower) the baseline error, then machine learning could be an effective tool for the problem. If not, then more data gathering, different inputs, or a new approach may be needed.

For a baseline prediction, I matched the mean monthly attendance for a city to each monthly team data point and calculated the Mean Absolute Error (MAE), which sums the absolute error between each predicted value and actual value, then divides by the number of observations.

To illustrate the approach with one data point: in April 1990, the average attendance per game for the Yankees was 23,700, while across all years in this analysis (1990–2017) the mean April attendance for the Yankees was 36,641; that long-run mean serves as the baseline "prediction" for the 1990 observation.

The MAE using this approach was ~5,550; in other words, on average, the "predicted" attendance per game and the actual attendance differed by about 5,550 fans.
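Continuing with the toy DataFrame from the earlier sketch, the baseline could be computed with a groupby (again, column names are placeholders, not the notebook's exact names):

```python
from sklearn.metrics import mean_absolute_error

# Baseline: predict each observation with that team's long-run mean
# attendance for the same calendar month across 1990-2017.
baseline_pred = (monthly
                 .groupby(['home_team', 'month'])['attendance_per_game']
                 .transform('mean'))

baseline_mae = mean_absolute_error(monthly['attendance_per_game'],
                                   baseline_pred)
print(f'Baseline MAE: {baseline_mae:,.0f}')
```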

Linear Regression

Using Python's scikit-learn library, we'll look to model attendance per game. For this analysis, the variables chosen were temperature, the above/below-mean temperature flag, and the encoded team and month columns. The reason for choosing these is that we are determining whether weather and team can predict attendance.

Because the variables differ in order of magnitude, we also have to scale them to a common range. This prevents a variable with larger absolute values from dominating the model simply because of its size.
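A sketch of the scaling and fitting steps with scikit-learn, assuming the toy DataFrame built above (the notebook's actual train/test handling may differ):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features: temperature, the above/below-mean flag, and the one-hot
# team/month columns; target: attendance per game. Names are assumed.
X = monthly.drop(columns=['attendance_per_game', 'home_team', 'month'])
y = monthly['attendance_per_game']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Scale features to a common range so no variable dominates the
# model simply because of its magnitude.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

linreg = LinearRegression().fit(X_train_s, y_train)
print(mean_absolute_error(y_test, linreg.predict(X_test_s)))
```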

One problem I ran into with a simple multi-variable linear regression was the magnitude of the coefficients. When I ran the model, the coefficients for the encoded team and month variables were uniform (i.e., all month variables were identical, as were all team variables), and their magnitudes far exceeded reasonable bounds. This is a classic symptom of the dummy-variable trap: when every one-hot column is kept, the indicator columns are perfectly collinear with the intercept, and ordinary least squares can return arbitrarily large, unstable coefficients.

I knew something was wrong because, as the exploratory data analysis showed, average attendance per game differs both by month and by team (attendance for New York in July is different from Tampa Bay in April), and our model should reflect that.

Therefore, I ran two variants of ordinary least squares linear regression: Ridge and Lasso. Both are means of avoiding overfitting in more complex linear models; the main difference is that Ridge regression shrinks the coefficients of less important features toward zero, while Lasso pushes the coefficients of irrelevant features all the way to 0, eliminating them from consideration. For more background, I found this article to be a well-written, layman's-terms explanation of why these techniques are beneficial in linear regression modeling.

Ridge & Lasso

The Ridge and Lasso regressions give us a picture of the magnitude of change in the prediction for each incremental change in our scaled variables. The coefficient values for the two regressions are so similar that I'll only talk through the Ridge findings.
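A sketch of fitting both regularized models and inspecting their coefficients, continuing from the scaled training data above (the alpha values are illustrative defaults, not the notebook's tuned settings):

```python
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_absolute_error

for name, model in [('Ridge', Ridge(alpha=1.0)), ('Lasso', Lasso(alpha=1.0))]:
    model.fit(X_train_s, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test_s))
    # Sort coefficients so the most negative drivers (e.g., TB, MIA)
    # and the most positive (e.g., LAD, NYY) are easy to spot.
    coefs = pd.Series(model.coef_, index=X.columns).sort_values()
    print(f'{name} MAE: {mae:,.0f}')
    print(coefs)
```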

Variables with higher absolute magnitude are more influential to our final model predictions. For example, the LA Dodgers and Yankees being the home team significantly increases the predicted attendance per game, while Tampa Bay and Miami significantly decrease the prediction.

Directionally, our temperature variable is positive, meaning that as temperature increases, so does the predicted attendance per game, at a rate of ~870 fans per incremental change in the scaled temperature variable.

The mean absolute error (our scoring metric) is ~5,665 for each of the three regressions (simple linear, Ridge, Lasso), which is slightly worse than our baseline error of ~5,550.

Random Forest Model

I also spent some time trying to develop a Random Forest model to predict the attendance per game variable.

At a high level, a random forest takes a series of decision trees and averages their results to provide a single concrete output. For example, if we were trying to classify whether the temperature was above (1) or below (0) the mean, the random forest would build a set number of decision trees (say, 1,000), each fit to a random sample of the data.

Each tree would predict a 1 or a 0, and the final output would be whichever value appeared more often across the 1,000 sampled trees. For regression, the same tree-building process takes place, and the final result is the average of the individual trees' predictions.
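A sketch of a random forest regressor under the same assumed feature set (hyperparameters here are illustrative; the notebook's tuning may differ):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# 1,000 trees, each fit to a bootstrap sample of the training data;
# the forest's regression output is the average of the trees' predictions.
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(X_train, y_train)  # tree models don't require scaled inputs

rf_mae = mean_absolute_error(y_test, rf.predict(X_test))
print(f'Random forest MAE: {rf_mae:,.0f}')
```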

I won't go into too much detail on this one: my early attempts yielded an MAE of ~5,917, so I couldn't get the predictive power to match the regression or baseline models. All of the code can be found in the notebook.

Coming up next:

In the final post in the series, I'll run some statistical significance tests and write up the conclusions of the exercise.
