How Accurate are Groundhog Day Predictions?

Punxsutawney Phil (Newsweek)

We’re getting to that time of year again where an omniscient rodent informs us whether or not we can put away our winter gear early and get ready for fun in the sun. I’m, of course, talking about Groundhog Day, the North American tradition of turning to the wisdom of groundhogs for the weather forecast.

The most famous of the groundhogs is Pennsylvania’s Punxsutawney Phil, whose ancestors (all named Phil) got into meteorology back in 1887 after failing to find success in the burrowing business.

Every February 2, thousands gather in Punxsutawney to watch as Phil is wrenched away from important groundhog business so that he may predict the weather for us. When he emerges from his burrow, if he sees his shadow, it means we’ll have six more weeks of winter. If he doesn’t see his shadow, we’ll have an early spring.

Phil has been at it for more than a century, so he must be doing a hell of a job. But every so often it’s good to take a step back and check our assumptions. How accurate are Phil’s weather predictions?

Fortunately, there are methods we can use to answer such a question. In doing so, I'll be drawing from fields such as data science, statistics, and machine learning. I do my best to explain the concepts at a high level here, but my explanations are by no means complete or precise.

(If you’re interested in the nitty-gritty, the data and code are linked at the end of this post.)

Step 1: Gathering the Data

First, we need to gather the data: Punxsutawney Phil’s predictions for each year, and what the weather was actually like in each of those years.

Phil’s Weather Prediction Record

I found Phil’s predictions in a table compiled from the “official” records kept by the Punxsutawney Groundhog Club. For each year, the table notes whether or not he saw his shadow. (Unfortunately, the records from several years have been lost, but there’s still plenty to work with.)

Weather Records

To get the weather data I turned to NOAA’s Climate at a Glance tool. This allows you to pull average temperature data for individual states, or the entire contiguous United States, at different time scales going all the way back to 1895. Conveniently this is just a few years before Phil got his start.

Now, I’m not sure exactly what “early spring” or “six more weeks of winter” mean, so to cover my bases, I pulled the average temperatures for the U.S. in one-month chunks (February, March, April), in two-month chunks (February-March, March-April), and finally in a three-month chunk (February-April). In total, six different ways of looking at the period of time in question.

When we get to the analysis, only one of these periods will be considered at a time. This means February temperatures will be compared to February temperatures in other years, March temperatures compared to March, and so on.
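As a sketch of how those six windows might be assembled, here is some illustrative pandas code. The numbers are made up; in practice the rows would come from NOAA's Climate at a Glance CSV export.

```python
import pandas as pd

# Hypothetical monthly mean temperatures (°F), in the shape you'd get from
# NOAA's Climate at a Glance export: one row per year/month combination.
monthly = pd.DataFrame({
    "year":  [2015, 2015, 2015, 2016, 2016, 2016],
    "month": [2, 3, 4, 2, 3, 4],
    "temp":  [33.6, 41.2, 52.0, 37.1, 46.7, 51.3],
})

def period_mean(df, months):
    """Average temperature per year over the given set of months."""
    sub = df[df["month"].isin(months)]
    return sub.groupby("year")["temp"].mean()

# The six windows considered in the analysis.
periods = {
    "Feb": [2], "Mar": [3], "Apr": [4],
    "Feb-Mar": [2, 3], "Mar-Apr": [3, 4], "Feb-Apr": [2, 3, 4],
}
windows = {name: period_mean(monthly, m) for name, m in periods.items()}
```

Each entry in `windows` is then a per-year series for one of the six time periods, ready to be compared against Phil's predictions.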

Just to give you an example, this is what the month of March looks like across the years 1895 to 2016.

One thing that becomes apparent when you look at the graph is the rising average temperature over the past half century. You can see that from 1960 to the present, March is, on average, getting warmer. But we only care about whether an individual year was hotter or colder than normal, so we’ll want to control for this underlying trend.

The Hodrick–Prescott filter does precisely that. It works by smoothing out the short-term fluctuations. The resulting trend is shown as a red dotted line in the above graph. When we subtract the trend from the absolute temperature, we’re left with what’s known as the “cyclical component,” shown below.

Using this graph, we can define “early spring” as years where the temperature is above the zero mark (warmer than normal), and “six more weeks of winter” as years where the temperature is below the zero mark (cooler than normal).
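This detrending step can be sketched with the `hpfilter` implementation in statsmodels. The temperatures below are made up for illustration; `lamb=6.25` is a conventional smoothing parameter for annual data.

```python
import numpy as np
from statsmodels.tsa.filters.hp_filter import hpfilter

rng = np.random.default_rng(0)
years = np.arange(1895, 2017)

# Made-up March temperatures: a slow warming trend plus year-to-year noise.
temps = 40 + 0.01 * (years - 1895) + rng.normal(0, 2, size=years.size)

# Split each year's temperature into a smooth trend plus a cyclical
# component; lamb=6.25 is the standard choice for annual series.
cycle, trend = hpfilter(temps, lamb=6.25)

# Above-trend years read as "early spring", below-trend as more winter.
labels = np.where(cycle > 0, "early spring", "six more weeks of winter")
```

The trend and cycle sum back to the original series, so nothing is lost in the decomposition; the cycle is exactly the "warmer or colder than normal" signal we want.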

Step 2: Analysis

Now that we’ve defined our terms and sorted our data, let’s try and answer our initial question: Can Punxsutawney Phil actually predict the weather?

What we’re looking to do is test whether there is any correlation between Phil’s prediction and the weather for any given year. If Phil predicts an early spring, does it tend to be warmer, and vice versa?

But finding a correlation isn’t enough. We want to ensure that any correlation is statistically significant. In imprecise terms, a statistically significant correlation is one that we can say, with a certain degree of confidence, exists not just by pure chance.

Making Sure Our Results Are Statistically Significant

A good way to illustrate this concept may be to consider two sets of data, one with only three data points, and another with 10 (shown below). The points come from the same line, but are randomly perturbed up or down. If you wanted to guess the line, you’d probably draw something like the red line in the graphs. But with only three data points, it’s hard to be certain you’ve got the answer right. A small shift in only one of those points could have a big impact on the line you draw. With 10 points, small changes to any one point wouldn’t impact your guess much, because you have all the other points still falling along the line.

A toy example to illustrate statistical significance.

When we test for statistical significance, we need to choose a significance level before we start the analysis. This is the amount of false-positive risk we’re willing to accept: the probability of seeing a correlation just by coincidence when none exists. A common significance level statisticians use is .05, meaning we accept a 5% chance of finding a relationship in the data that doesn’t exist in reality.

When we do the analysis, we will obtain something called a p-value. A p-value is the probability that we would see a correlation at least as strong as the one we observed if the two things weren’t actually related at all. We say the observed correlation is statistically significant if the p-value is less than the significance level we’re using. When this is true, it’s unlikely (though never impossible) that the correlation is a fluke of the data.

In the example graph above with three points of data, a p-value of .217 means that even if X and Y were completely unrelated, we’d see a fit this good about 21.7% of the time just by chance. That’s well above our .05 significance level, so the correlation is not statistically significant. With 10 points, on the other hand, the p-value is nearly zero: it would be very surprising to see such a tight fit from unrelated data, so the correlation is statistically significant.
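You can reproduce this effect with scipy's `linregress`, which reports a p-value for the fitted slope. The line and noise level here are made up, but the pattern is general: more points from the same noisy line produce a smaller p-value.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)

def noisy_line(n):
    """n points from the line y = 2x, randomly perturbed up or down."""
    x = np.arange(n, dtype=float)
    return x, 2.0 * x + rng.normal(0, 2.0, size=n)

x3, y3 = noisy_line(3)
x10, y10 = noisy_line(10)

# p-value for the hypothesis that the slope is actually zero.
p3 = linregress(x3, y3).pvalue
p10 = linregress(x10, y10).pvalue
```

With only three points, the fit can look good or bad almost at random; with ten, the underlying relationship reliably shows through.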

Testing Phil’s Accuracy

To test for a correlation, we can use an ordinary least squares linear regression (OLS) model. OLS regression can tell us the best relationship between a set of predictor variables and a target variable. In the previous example, our predictor was X and the target was Y. For our Groundhog Day analysis, our predictor is Phil’s prediction, and the target is the temperature offset.

Running the six different models, one for each of the time periods, exactly one turns out to be statistically significant (at a significance level of .05): when Phil predicts an early spring, April is on average about 1 °F colder. Whoops.

The below graph illustrates the correlation. The temperature offset for April is plotted against the year. Points are colored by Phil’s prediction for that year — blue when he predicted 6 more weeks of winter, and red when he predicted an early spring. As you can see, many more of the red points are concentrated below 0. This means that if you were trying to bet whether April would be hotter or colder than average, based only on Phil’s prediction, you would be better off betting colder.

10 of the “early spring” predictions were in years where the temperature in April was below average, while only 5 of those predictions occurred in years where it was warmer. This large difference is why the relationship is statistically significant.

On the other hand, taking a look at another month, we don’t see a clear preference one way or the other. The red points are fairly evenly distributed above and below the line.

7 of the “early spring” predictions were in years where the temperature in March was below average, and 9 of those predictions occurred in years where it was warmer. Without a strong tendency one way or the other, Phil’s predictions for March don’t have a statistically significant correlation.

Are All Those Other Groundhogs Also Good-For-Nothings?

Seeing Phil’s (unearned) fame and fortune over the years, dozens of other groundhogs have gotten into the meteorological business. Wikipedia is the best source of historical predictions for all the other groundhogs. Unfortunately, it only goes back to 2008, so there’s not much to work with, but we’ll do our best.

Repeating the process we did for Phil for all of these other groundhogs, two stand out: Stormy Marmot in Aurora, Colorado and York, Pennsylvania's stuffed groundhog, Poor Richard. When Stormy Marmot predicts an early spring, we can expect March to be on average 6 °F warmer, and April to be 2.5 °F warmer. When Poor Richard predicts an early spring, we can expect February to be 4 °F warmer and March to be 8 °F warmer.

Poor Richard, the stuffed groundhog meteorologist, showing Phil you don’t need to be alive to predict the weather. (Fox43)

We ran almost 300 different models: 48 different groundhogs, with 6 time periods for each. With that many tests, we should expect some of them to come out “statistically significant” by random chance alone. In fact, at our significance level of .05, across 288 models we would expect about 14 of them to exhibit a correlation even if no real relationship exists. Hunting through many tests like this and reporting only the hits is known as data dredging.
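The expected number of flukes is simple arithmetic, and a quick simulation shows the same thing. The key fact is that when there is no real effect, p-values are uniformly distributed between 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n_models, alpha = 288, 0.05

# If no groundhog has any real skill, each model still has a 5% chance
# of producing a "significant" p-value, so we expect ~14 false positives.
expected = n_models * alpha

# Simulate many runs of the full analysis under the null hypothesis.
trials = 10_000
fake_pvals = rng.uniform(size=(trials, n_models))
hits_per_trial = (fake_pvals < alpha).sum(axis=1)
```

Averaged over the simulated runs, the number of "significant" results per run lands right around the expected 14.4, with zero real signal in the data.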

Tyler Vigen’s Spurious Correlations illustrates this problem beautifully. If you take enough random data sets and mash them together, there’s bound to be strong correlations between things that have no causal relationship.

Dammit, Nicolas Cage, would you stop drowning people already? (Spurious Correlations)

Sorry Stormy Marmot and Poor Richard, but you’re probably not as good at meteorology as it would appear.

What If They All Worked Together?

Let’s try one last thing — what if we use the predictions from all of the groundhogs? Could we get a better result than looking at them individually?

This is similar in spirit to the Delphi Method or the Good Judgment Project, where a group of experts each make a forecast and the individual forecasts are blended into one. Of course, I’m not sure I’d classify these rodents as experts.

This composite model will be a little different from what we did before. Instead of modeling the temperature against a single predictor, we’ll use 48, one for each groundhog. The model will try to learn which combination of the 48 predictions best predicts the weather. Individually the groundhogs may not have predictive power, but our hope is that together they do better than any one of them alone.
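One way to sketch such a composite model is with scikit-learn's logistic regression. The predictions and outcomes below are made up; in reality there are only about nine usable years of data for 48 groundhogs, which is why heavy regularization, and heavy skepticism, are both warranted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n_years, n_groundhogs = 9, 48  # 2008-2016, one column per groundhog

# Made-up predictions: 1 = early spring, 0 = six more weeks of winter.
X = rng.integers(0, 2, size=(n_years, n_groundhogs))

# Made-up outcomes: was the period warmer (1) or colder (0) than the trend?
y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1])

# Strong regularization (small C): 48 predictors against 9 observations
# is a recipe for overfitting, so the coefficients are kept small.
clf = LogisticRegression(C=0.1, max_iter=1000).fit(X, y)
accuracy = clf.score(X, y)  # in-sample accuracy; cross-validate in practice
```

With so few observations, in-sample accuracy is nearly meaningless on its own, which is why any headline number from a model like this should be compared against a naive baseline.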

In the end, one model stood out as being slightly better than pure chance: 75–80% of the time, our super-groundhog correctly predicted what March was going to be like. That sounds pretty good, but for 2008–2016, always picking “six more weeks of winter” would have been correct about two-thirds of the time, so the composite model is only around 10 percentage points better than that naive strategy.

None of the models for the other months performed any better than pure chance.

Unfortunately, it doesn’t look like groundhogs are very good at predicting the weather. Not that we shouldn’t keep them around. At least they’re more adorable than the average weather reporter. And if you’re forced into your office groundhog pool, now you know to bet on Stormy Marmot or Poor Richard.

Thanks to Abigail Pope-Brooks and Paula Seligson for editing, and for reminding me that although I can make funny numbers come out of a computer, explaining those numbers is hard. NP-hard. (Note: This paragraph was unedited. All typoes are my own.)

All the code and data used is available on github. Jeremy Neiman can be found doing equally pointless things on his website: