Is MLB Attendance per Game Significantly Different When It’s Cold?

Jordan Bean
Coinmonks
7 min read · Jul 12, 2018


In previous posts, we’ve defined the question of whether temperature affects attendance at Major League Baseball games, explored the data, and attempted to model the relationship between temperature and attendance.

In the final post of the series, we’ll approach the problem from a statistical perspective (full Python code here).

In this post, the question we’ll be asking is: In cold-weather cities during the spring, is there a statistically significant difference in attendance per game between months when the average temperature is above the median for that month in that city and months when it is below?

To answer this question, we’ll first need to cut our data to match the stated segmentation (cold-weather teams, spring months), run some basic exploratory analysis on the new data, take the mean attendance for each group (above and below the median), compare each to the sample mean, and then run a difference of means analysis to generate a p-value for the hypothesis that there is no difference between the two groups.

Data segmentation and exploratory analysis

From our previous exploratory work, we already have a data set labeled “cold_spring_teams” that identifies the cold-weather teams (average April temperature below 55.0 degrees), so we can create a copy of that to jump-start this section.

Because we’re no longer concerned with the team or other variables, we can also slim down the data to just the “attendance per game” and “below median” variables.

(Note: Previous analyses used a “below_mean” variable, while this analysis uses the median temperature)

We then take the mean attendance per game of the data below the median (below_median = 1) and equal to or above the median (below_median = 0) and find that there is in fact a noticeable surface-level difference of roughly 930 fans, or about 3.5%, between the two groups.
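The segmentation and group means described above can be sketched roughly as follows. This is a minimal illustration rather than the code from the linked notebook: the column name `attendance_per_game` and every value in the frame are made up for the example, and only `cold_spring_teams` and `below_median` come from the text.

```python
import pandas as pd

# Hypothetical stand-in for the cold_spring_teams data; all values are invented.
cold_spring_teams = pd.DataFrame({
    "team": ["BOS", "BOS", "MIN", "MIN", "CHC", "CHC"],
    "attendance_per_game": [33000.0, 35000.0, 21000.0, 24000.0, 30000.0, 31000.0],
    "below_median": [1, 0, 1, 0, 1, 0],
})

# Keep only the two columns the statistical test needs.
slim = cold_spring_teams[["attendance_per_game", "below_median"]].copy()

# Mean attendance per game for each group (1 = below median temperature).
group_means = slim.groupby("below_median")["attendance_per_game"].mean()

# Surface-level gap between the groups, in fans and as a percentage.
diff = group_means.loc[0] - group_means.loc[1]
pct = diff / group_means.loc[1] * 100

print(group_means)
print(f"difference: {diff:.1f} fans ({pct:.1f}%)")
```

With the real data, the same calculation is what produces the gap of roughly 930 fans between the two group averages.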

We continue by visualizing the data with a cumulative distribution plot. The purpose of this plot is to understand how the data is distributed across the range of attendance per game values. To interpret the curve, pick a point on it: at the intersection of the dashed red lines, for example, approximately 65% of the data points are at or below 30,000 fans per game.
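For readers who want to build this kind of curve themselves, here is a small sketch of an empirical cumulative distribution function. The helper names `ecdf` and `share_at_or_below` are ours, and the attendance values are invented purely for illustration.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the cumulative share of points at or below each."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

def share_at_or_below(values, threshold):
    """Fraction of observations that are less than or equal to the threshold."""
    return float(np.mean(np.asarray(values, dtype=float) <= threshold))

# Invented attendance-per-game values, not the article's data.
attendance = [18000, 22000, 25000, 29000, 30000, 34000, 36000, 41000, 27000, 45000]

x, y = ecdf(attendance)
share = share_at_or_below(attendance, 30000)
print(f"{share:.0%} of observations are at or below 30,000 fans per game")
```

Plotting `x` against `y` gives the curve, and `share_at_or_below` reads off a single point on it in the same way the dashed red lines do in the chart.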

Statistical significance

To understand how this difference looks from a statistical perspective, we’ll repeatedly resample the data with replacement, calculate the mean of each resample, and then plot how that set of means is distributed on a histogram. In the chart below, we can see how the sample means are distributed, as well as the 95% confidence interval for the data and where the below- and above-median values fall on that spectrum.

Numerically, our 95% confidence interval (purple bars) for the real mean based on the sample data is (26,605.5, 27,754.5), while the below-median average is 26,691.8 (green) and the above-median average is 27,621.3 (gold). Because both group averages fall within the 95% confidence interval, albeit narrowly, we can’t say from this analysis that either is statistically different from the sample mean of 27,173.8.
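A bootstrap confidence interval of this kind can be sketched as below. This is an illustration under stated assumptions, not the notebook's code: the attendance values are synthetic draws from a normal distribution, the function name `bootstrap_means` is ours, and the number of resamples is kept small so the example runs quickly. It will not reproduce the interval quoted above.

```python
import numpy as np

def bootstrap_means(data, n_replicates, rng):
    """Resample the data with replacement and return the mean of each resample."""
    data = np.asarray(data, dtype=float)
    # Each row is one bootstrap resample of the same size as the original data.
    samples = rng.choice(data, size=(n_replicates, len(data)), replace=True)
    return samples.mean(axis=1)

rng = np.random.default_rng(42)

# Synthetic stand-in for attendance per game; parameters are arbitrary.
attendance = rng.normal(loc=27000, scale=8000, size=500)

replicates = bootstrap_means(attendance, 2000, rng)

# The 2.5th and 97.5th percentiles bound the central 95% of bootstrap means.
ci_low, ci_high = np.percentile(replicates, [2.5, 97.5])
print(f"sample mean: {attendance.mean():.1f}")
print(f"95% CI for the mean: ({ci_low:.1f}, {ci_high:.1f})")
```

A histogram of `replicates` with vertical lines at `ci_low`, `ci_high`, and the two group averages gives a chart of the type described in this section.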

Difference of means test

To test this more directly, we check whether the difference between the two means is different from 0, with the null hypothesis being that the difference is zero. To do this, we will again work with randomly resampled data, this time without replacement, using a permutation test.

A permutation test treats the data as if both groups come from a single distribution: it pools the two input series, shuffles the values so that the original labels no longer apply, splits the shuffled data back into two groups of the original sizes, and computes the desired statistic (in this case, the difference of means). The advantage of this approach is that it assumes no difference between the two series when computing the test statistic, thereby aligning with our null hypothesis of no difference.

In the context of this project, this means that attendance per game in the above- and below-median temperature groups is considered to come from the same distribution, and therefore to have a mean difference of 0.
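The shuffling procedure can be sketched as follows. The function names and the synthetic `above` and `below` arrays are assumptions made for illustration; the real analysis would pass in the attendance values for each temperature group.

```python
import numpy as np

def diff_of_means(a, b):
    """Difference between the means of two arrays."""
    return np.mean(a) - np.mean(b)

def permutation_replicates(a, b, n_replicates, rng):
    """Difference of means under repeated random relabeling of the pooled data."""
    pooled = np.concatenate([a, b])
    replicates = np.empty(n_replicates)
    for i in range(n_replicates):
        # Shuffle without replacement, then split back into the original sizes.
        shuffled = rng.permutation(pooled)
        replicates[i] = diff_of_means(shuffled[:len(a)], shuffled[len(a):])
    return replicates

rng = np.random.default_rng(0)

# Synthetic attendance values for the two groups; parameters are arbitrary.
above = rng.normal(loc=27600, scale=8000, size=350)
below = rng.normal(loc=26700, scale=8000, size=350)

observed = diff_of_means(above, below)
replicates = permutation_replicates(above, below, 10_000, rng)
print(f"observed difference: {observed:.1f}")
print(f"mean of permuted differences: {replicates.mean():.1f}")
```

Because the labels are shuffled away, the permuted differences center near zero, which is exactly what the null hypothesis describes.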

Generating 10,000 permuted differences from our data, we get the histogram of values below:

Our observed difference is ~930 fans, the gap between the two group averages reported earlier, and that value is marked by the green line in the graph above.

Using the 10,000 permuted values, we can compute a p-value: the proportion of permuted differences that are at least as extreme as the one we observed (~930 fans). If this value is less than or equal to 0.05, we reject the null hypothesis that the two groups share the same mean at the 5% significance level (a 95% confidence threshold). Otherwise, we fail to reject the null hypothesis.

When we do this, we get a p-value of 0.057. Though this is very close to our 0.05 threshold, it falls above it, so we can’t reject the null hypothesis.
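As a rough sketch, the p-value is just the share of permuted differences at least as extreme as the observed one. The text doesn't say whether the test was one-sided or two-sided, so both variants are shown here as assumptions; the `toy` array is a tiny hand-made set of permuted differences, not real output.

```python
import numpy as np

def p_value_one_sided(replicates, observed):
    """Share of permuted differences at least as large as the observed one."""
    return float(np.mean(np.asarray(replicates) >= observed))

def p_value_two_sided(replicates, observed):
    """Share of permuted differences at least as large in absolute value."""
    return float(np.mean(np.abs(np.asarray(replicates)) >= abs(observed)))

# Ten invented permuted differences, for illustration only.
toy = np.array([-1000.0, -900.0, -400.0, -100.0, 0.0,
                150.0, 300.0, 600.0, 950.0, 1200.0])

p_one = p_value_one_sided(toy, 930.0)
p_two = p_value_two_sided(toy, 930.0)
print(f"one-sided p-value: {p_one}")
print(f"two-sided p-value: {p_two}")

# Decision rule at the 0.05 threshold.
alpha = 0.05
print("reject null" if p_one <= alpha else "fail to reject null")
```

Applied to the real 10,000 permuted differences, a calculation like this is what yields a value to compare against the 0.05 threshold.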

Conclusions, limitations, and next steps

What does this all mean? It means that over the 27-year time period analyzed, we can’t say with statistical conviction that temperature has a meaningful effect on attendance per game, even in cold-weather cities during the early part of the season.

We arrived at this conclusion via graphical analysis, modeling, and statistical significance testing. Our scatter plots showed little relation between temperature and attendance, while showing much larger differences across a variable like team.

In modeling, using city, month, and temperature scaled to normalized values, we saw that the Temperature coefficient was positive (meaning that attendance per game increases as temperature increases) and somewhat influential, but removing the Temperature and below_median variables only increased our MAE from ~5,895 to ~5,914, indicating that the predictive power changes little when those factors are removed.

And, in this post, we explored the statistical significance of the results by applying a set of statistical tests to the data, including bootstrap resampling of the mean and a permutation test on the difference in mean attendance per game between the above- and below-median groups. Though the results were close to significance, they did not hit our thresholds, and so we weren’t able to conclude that a statistically significant difference exists.

As with any analysis and prediction, there’s both room for improvement on existing techniques and the opportunity to add more data. In predicting attendance per game, we could likely improve the results in a few different ways. We could run the scrape that I mentioned for each year and get game-level detail for temperature, attendance, and game statistics, and look at changes in extreme-weather games (e.g. below 45 degrees or above 90 degrees).

We could then add variables for the visiting team (e.g. the Red Sox likely draw more fans in Tampa Bay than the Minnesota Twins do as the visiting team).

We could also add a variable for the number of wins in the lagging year, whether the team made the playoffs, won their division, league, or World Series, with the idea being that more successful teams will draw more fans. Likewise, we could add a variable for cumulative number of wins in the season up to the point of each observed game or if the team is in playoff contention.

All of these would likely serve to improve our model performance in tandem with the variables used here. However, this analysis was focused specifically on weather’s (and team’s) predictive power on attendance.

Finally, we have to consider some of the limitations of our analysis. Largest among them is that “attendance per game” is typically measured as the number of tickets sold for a game rather than the actual number of people who show up. On a cold night in Boston, there are undoubtedly people who bought tickets and never show up, or who leave early, so the stadium isn’t actually full for most of the game. Likewise, some tickets languish on the secondary markets and never sell.

Other limitations include the inability to quantify items such as how safe is the neighborhood in which the ballpark resides, how enjoyable is the game day experience, or the passion for the game that exists in a city.

Through our analysis, we found that while temperature is a factor in attendance per game, it is not a meaningful predictive variable. While early season games can be cold, they also represent the excitement of the beginning of another season. There may also be a rookie or newly acquired player on the team that fans are clamoring to see, and the tickets are sometimes cheaper than for mid-summer games.

Or, in the case of a team like the Boston Red Sox, which sold out a record 820 straight home games, the city just loves the game of baseball.
