Statistics is a Weird Concept

Brad Chattergoon
The Renaissance Economist
17 min read · Aug 17, 2021

Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is one of the backbones of scientific inquiry.

The confirmation of the Higgs Boson’s existence in 2012, as an example from the physical sciences, was based on statistical methods. From the Smithsonian Magazine article about the topic, “the evidence that the Higgs particle had been detected was strong enough to cross the threshold of discovery.” That “threshold of discovery”? Yeah, it’s a statistical concept called a “p-value” and Scientific American wrote an article about a related statistical concept titled “5 Sigma, What’s That?” and well, yeah, what is that?

As an example from the social sciences, consider the literature on immigration to the United States and its effect on economic growth. In a paper titled “STEM Workers, H-1B Visas, and Productivity in US Cities”, authors Giovanni Peri, Kevin Shih, and Chad Sparber state that they “use the 1980 distribution of foreign-born STEM workers and variation in the H-1B visa program to identify supply-driven STEM increases across cities.” They find: “Increases in STEM workers are associated with significant wage gains for college-educated natives. Gains for non-college-educated natives are smaller but still significant,” and their results “imply that foreign STEM increased total factor productivity growth in US cities.” Where do these results come from? You guessed it, statistics.

Before we proceed, we should distinguish between two very high-level areas of statistics. The first is related to the kind described in the examples above and is called “statistical inference”. This kind is about looking backward at some past event and trying to understand the inputs to that event and possibly draw larger conclusions about how those types of events work. Within the realm of statistical inference is “causal inference” which attempts to ascertain causal inputs to an event rather than just associated inputs. The second kind is about the future and how to predict what will happen. This second kind is often termed “predictive analytics”. In this article we will be focusing on the first kind since we are interested in how to think about the application of statistics in the physical and social sciences.

My academic background originated in an engineering-flavored Applied Mathematics degree that focused on solving systems of equations and on using algorithms to solve problems. It was very computationally intensive and oriented toward the future: how can we use mathematics to design efficient operations, for instance? Statistics is inherently backward looking, using data to understand the past.

The canonical example of how to think about these two directions is the coin toss, described below.

In an engineering context, we start with some known properties, e.g. we know that this is a fair coin, and we want to determine what we expect to happen: the distribution of heads we expect to see over some number of consecutive tosses of this coin.

In a statistics context we start with some observed number of heads and we want to determine what the probability of a head is for the coin that was tossed.

The two directions are connected by the same underlying probability framework, which links the probability of a head for the coin to the observed distribution of heads.
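A minimal Python sketch of the two directions (the numbers here are arbitrary and purely illustrative):

```python
from scipy.stats import binom

# Engineering direction: the coin is known (p = 0.5); predict the outcome.
n, p = 10, 0.5
print(binom.pmf(5, n, p))  # P(exactly 5 heads in 10 tosses) ~0.246

# Statistics direction: the outcome is known (8 heads in 10 tosses);
# ask which coin makes that observation most plausible.
observed = 8
for candidate_p in (0.3, 0.5, 0.7, 0.8):
    print(candidate_p, binom.pmf(observed, n, candidate_p))
# The likelihood peaks near p = 0.8: the kind of backward reasoning
# that statistics formalizes.
```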

While this is the canonical example, I find it not very instructive about how to think about the statistics used in actual scientific research. Most instruction in this area starts off teaching the necessary concepts in probability before introducing statistics, and that can take some time to get through. As a result, not many people actually get an introduction to the core of statistical thinking, even though it isn’t necessary to know all the intricacies of complex probability to understand how to think like a statistician.

In this article I will take what should be a familiar problem from the engineering world and frame it in a statistical world, and I think in doing so it will become clear that statistics is a somewhat conceptually strange thing. That example: throwing a ball.

Let’s make it a bit more interesting by drafting a narrative. Say it is winter and I call a friend of mine and tell him to meet me at the statue in the middle of a park that recently received a snowfall. This is actually all just a prank: what I really want to do is throw a snowball at my friend and not get caught. In this hypothetical example I am the engineer and my friend is the statistician.

The Engineer

The engineering problem here is about how far away from my friend I can stand to throw the snowball so that I can run away without getting caught. I need to calculate this distance from the statue. There are two inputs I need in order to figure out the maximum distance I can throw from: how fast I can throw the snowball and the angle at which I throw it.

My throwing strength will determine the speed at which I can throw the ball (v), while the angle (θ) at which I throw the ball will determine how much of my throwing strength goes into overcoming gravity, which is pulling the ball down to the earth, and how much goes into actually propelling the ball horizontally toward the statue (and my friend’s big head). The throw decomposes into a horizontal component, v·cos(θ), and a vertical component, v·sin(θ).

To determine my throwing strength I enlist the help of a friend and get him to measure how quickly I can throw a ball with a speed radar. We find that I can throw at 50 miles per hour or 22.35 meters per second. (Note: I didn’t actually do this, but I can probably throw somewhere around this fast.)

Now for determining the optimal angle to throw at. Fortunately for me, it is an easily provable and well-known fact that to optimize horizontal distance I want to throw at an angle of 45 degrees relative to the ground. Without going into the math, it is the angle at which we balance horizontal throwing force with vertical throwing force such that we get the best of both worlds. Consider three trajectories. A steep throw (red) has the largest vertical component of the force and goes highest, but doesn’t cover as much horizontal distance. A shallow throw (green) has the largest horizontal component but returns to the ground so quickly that it doesn’t cover much distance in the air. The 45-degree trajectory (blue) is the best since it provides enough vertical force to keep the ball in the air for a while, but also enough horizontal force that it moves a substantial distance while in the air.

For those interested, the relevant formulas for a throw at speed v and angle θ (ignoring air resistance) are a time of flight of t = 2v·sin(θ)/g and a horizontal distance of d = v·cos(θ)·t = v²·sin(2θ)/g, where g ≈ 9.81 m/s² is the acceleration due to gravity.

I plug these two numbers into the distance formula, or an online projectile calculator, and I find that I can throw my snowball 50.9 meters (167 feet). My friend better watch out for a chance of snow.
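Here is a minimal Python sketch of the calculation using the no-air-resistance range formula above; the three angles loosely mirror the red, blue, and green trajectories described earlier:

```python
import math

def projectile_range(v, theta_deg, g=9.81):
    """Horizontal distance covered by a projectile launched from the
    ground at speed v (m/s) and angle theta (degrees), ignoring air
    resistance: R = v^2 * sin(2*theta) / g."""
    return v**2 * math.sin(math.radians(2 * theta_deg)) / g

v = 50 * 0.44704  # 50 mph converted to m/s (~22.35)
for angle in (75, 45, 15):  # steep (red), optimal (blue), shallow (green)
    print(f"{angle} degrees -> {projectile_range(v, angle):.1f} m")
# 45 degrees wins at ~50.9 m (~167 ft); steep and shallow throws fall short.
```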

The Statistician

… *whomp* “Who threw that?”

My friend is out for blood, but he needs to figure out if he can catch the assailant before they get away.

My friend thinks about it quickly and decides that if the assailant is within 105 meters (344 feet) he can catch them before they get away, but anything further and he’s out of luck; this crime will go unpunished. How can he figure it out with the information he has? He decides to use one of the powerhouse methods in statistical inference: the hypothesis test.

The concept is simple, he creates two hypotheses and assumes one of them to be true. This “true” hypothesis is falsifiable and is called the “Null Hypothesis” (somewhat synonymous with the idea of the “default hypothesis”). The other hypothesis is called the “Alternative Hypothesis” and is used as the complement to the Null. My friend uses the following two hypotheses:

Null Hypothesis: The assailant is further than 105 meters from the statue.

Alternative Hypothesis: The assailant is within 105 meters of the statue.

How does he test his Null Hypothesis? This is where actual statistics comes into play. My friend, for some unspecified reason, knows the distribution of human throwing strength offhand: specifically, the distribution of how fast an adult human can throw a tennis ball. He then makes some assumptions.

  1. The assailant throwing the ball is over the age of 18 (to help identify the correct distribution for throwing speed).
  2. The snowball that hit him is of comparable weight and shape to that of a tennis ball (specifically that it will have the same throwing properties of a tennis ball, else the distribution he has does not apply).
  3. The assailant threw the ball at a 45 degree angle so as to maximize the distance between himself and my friend and make the getaway more likely.

With these assumptions in place, my friend back-calculates the distance at which the assailant must have been standing.

In order to test our Null Hypothesis, we want to examine whether the assailant could be standing at or beyond 105 meters from the statue. Since the closest distance under this hypothesis is 105 meters, let us calculate the speed at which the assailant must have thrown the snowball for it to reach my friend at the statue; this is the minimum speed consistent with the hypothesis. At 45 degrees, sin(2θ) = 1, so the distance formula simplifies to d = v²/g and the required speed is v = √(g·d).

We calculate that the assailant would have to throw the snowball at a minimum of 32.09 m/s or 71.63 miles per hour if this hypothesis is true. Where does this throwing speed fit into the distribution of human throwing speeds for a tennis ball?
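A sketch of the back-calculation, inverting the 45-degree range formula (my conversion gives roughly 71.8 mph; small rounding differences from the figures in the text are expected):

```python
import math

def min_speed_for_range(distance, g=9.81):
    """Invert the 45-degree range formula d = v^2 / g to get the minimum
    launch speed (m/s) needed to cover a given distance (m)."""
    return math.sqrt(g * distance)

v_min = min_speed_for_range(105)
print(v_min)            # ~32.09 m/s
print(v_min / 0.44704)  # ~71.8 mph (rounds slightly differently than the text)
```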

The graph shows the distribution of adult human throwing speed for a tennis ball (this is made up based on data I found online but I’ve lost the link to the website). The blue line represents the distribution of human throwing speeds. For anyone not familiar with probability density graphs, the easiest way to think about this is that the higher the blue line is in a specific area, the more people tend to throw at that speed. We see that the minimum throwing speed to support our Null Hypothesis of 71.63 miles per hour is quite high among the range of throwing speeds, and that not many people tend to throw at this speed (by the low position of the blue line at this speed).

Now this is where the hypothesis *test* comes in. Ultimately, we want to test how *likely* the result we observe is if the Null Hypothesis is true. We pick some probability threshold, usually denoted α (alpha), which we use as the “significance level” of the test. This significance level is the threshold likelihood for the result we observe under our Null Hypothesis. In the Social Sciences that threshold is typically 5%, so we will use that here. In other words, if our Null Hypothesis that the assailant threw the snowball from 105 meters or further is true, is the likelihood of a throwing speed of 71.63 miles per hour or more less than 5%, or greater than 5%?

To test this, we look at the distribution of human throwing speeds. We look at the side of the distribution we are interested in, the faster side, and then we identify the area that corresponds to 5% likelihood. This area is shown in red.

We see that the minimum throwing speed of 71.63 miles per hour is within the 5% likelihood area. This means that the likelihood of the result my friend observed (getting clobbered in the head by a snowball) coming from 105 meters or further away is less than 5%. Since such a throw would be so unlikely if the Null Hypothesis were true, my friend takes that as enough statistical evidence that the assailant must be within 105 meters and takes off in the direction the snowball came from. In statistics speak, he rejects the Null Hypothesis in favor of the Alternative Hypothesis.
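Here is a sketch of the rejection-region check in Python. The throwing-speed distribution in the article is made up, so the mean and standard deviation below (45 mph and 12 mph) are my own invented stand-ins, chosen only so the numbers land in roughly the same places:

```python
from scipy.stats import norm

# Invented stand-in for the adult throwing-speed distribution (mph);
# the article's actual curve is not available.
mu, sigma = 45.0, 12.0
alpha = 0.05  # significance level

# The speed that cuts off the fastest 5% of throwers: the rejection region.
critical_speed = norm.ppf(1 - alpha, loc=mu, scale=sigma)
print(critical_speed)  # ~64.7 mph

observed_min_speed = 71.63  # mph, required if the Null (>= 105 m) is true
print(observed_min_speed > critical_speed)  # True -> reject the Null
```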

Luckily for me, by the time my friend has finished running the numbers I’m long gone.

That’s one of the core ideas of statistical inference in a nutshell: formulate Null and Alternative Hypotheses about a phenomenon, observe data, and if under the Null Hypothesis the observed data is quite unlikely, i.e. it passes some threshold of unlikeliness, reject the Null in favor of the Alternative.

To return to the idea of the “p-value” from the introduction: the p-value is the likelihood under the Null of the observed value or a more extreme value. In our graph, since 71.63 miles per hour is clearly within the 5% probability area, this must mean that the probability of a thrower with this throwing strength or faster, based on the distribution of human throwing speeds for a tennis ball, must be less than 5%; the p-value of this result must be less than 0.05.
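Under the same invented stand-in distribution, the p-value is just the upper-tail probability at the observed minimum speed:

```python
from scipy.stats import norm

mu, sigma = 45.0, 12.0  # same invented stand-in distribution as above
p_value = norm.sf(71.63, loc=mu, scale=sigma)  # P(speed >= 71.63 mph)
print(p_value)  # ~0.013, below the 0.05 threshold
```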

Multiples of sigma can be thought of as an analogue to the p-value. Specifically, “sigma” refers to the standard deviation of the distribution, a measure of distance from its mean/average, so 5 sigma is another way of saying that the observation lies five standard deviations from the mean: so far into the tail of the distribution that the p-value is very, very small and therefore satisfies a very strict threshold of unlikeliness under the Null.
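The sigma-to-p-value mapping is easy to compute for a standard normal distribution; a minimal sketch:

```python
from scipy.stats import norm

# One-sided tail probability (p-value) at k standard deviations:
for k in (1, 2, 3, 5):
    print(f"{k} sigma -> one-sided p-value = {norm.sf(k):.2e}")
# 1 sigma -> 1.59e-01
# 2 sigma -> 2.28e-02
# 3 sigma -> 1.35e-03
# 5 sigma -> 2.87e-07   (the particle-physics discovery threshold)

# Going the other way: a two-sided 5% significance level sits at about
# 1.96 sigma, which is where the "about 2 sigma" shorthand comes from.
print(norm.isf(0.05 / 2))  # ~1.96
```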

With that thread closed, we can now talk about why this is a weird concept.

The Weirdness

There are two ideas I think are weird in statistics.

The first is about the hypotheses.

In our narrative example, our hypotheses were about the maximum distance at which my friend could potentially catch his snowy assailant, not about the actual location of the assailant. That was not just for narrative purposes; statistical hypotheses often take these forms. The Null and Alternative Hypotheses must be complementary, so if one uses a less-than condition, the other must use greater-than-or-equal-to. If one uses an equality (typically in the Null), then the other must use an inequality (which could be greater than, less than, or simply not equal to the value in the complementary hypothesis). It is not precise, and does not hope to be.

These statistical arguments, in the form of hypotheses, propose a general categorization of reality, not a precise one. In truth, statistics cannot identify a precise statement due to the nature of randomness. In probability and statistics we think of observed outcomes as “realizations” or single instances that are coming from a probabilistic generating distribution. For instance, say we have a coin with a probability of heads p. If we flip it once and we get a heads, does that mean p is greater than 0.5, i.e. the coin is biased towards heads? What if we flip it 4 times and get heads all 4 times? If the coin was fair, we would expect this outcome 6.25% of times we make 4 consecutive flips of the coin, but if the coin was biased toward heads with p=0.7 we would expect this outcome 24.01% of times. With this information, the best we can hope to say is that the observed “realization” of the outcome is about 4 times more likely if the coin is biased with p=0.7 than if the coin is fair with p=0.5, but we cannot say with absolute certainty that the coin is biased, and even if we knew it is biased, we could not say with absolute certainty that the bias is p=0.7.
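The arithmetic behind that comparison is easy to check:

```python
# Probability of 4 heads in 4 independent flips under each candidate coin.
fair = 0.5 ** 4    # 0.0625 -> 6.25% under a fair coin
biased = 0.7 ** 4  # 0.2401 -> 24.01% under a coin biased with p = 0.7
print(biased / fair)  # ~3.84: roughly 4 times more likely under the biased coin
```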

The second is about assumptions.

In the snowball narrative above, my friend the Statistician made a number of assumptions in order to test his hypotheses. There are two general buckets for the assumptions.

The first bucket contains assumptions related to the calculation needed to test the hypothesis. These are assumptions like no friction (which would otherwise change the calculation), that the snowball operates with similar aerodynamics as a tennis ball, and no wind.

The second bucket contains assumptions about the situation itself and is more interesting. An example of an assumption in this bucket is that a typical adult human threw the snowball. In the narrative I painted, we have said that a typical adult human threw the snowball, but my friend doesn’t know this. What if the assailant was actually a baseball player, for whom a throwing speed of 80 miles per hour is fairly common? What if there was no human thrower at all? I could have alternatively been piloting a drone from the comfort of my home to drop a snowball directly on my friend’s head from up above. My friend’s modeling of the situation does not include a way to incorporate these alternatives.

The first bucket is what the bulk of the mathematical complexity in statistical inference is meant to address. Statisticians use mathematics to establish the properties of the probability distributions we use in statistical inference, and in my experience the bulk of statistics instruction tends to focus on this area, with good reason; understanding these properties is critical to getting the mechanics of statistics correct. The assumptions underpinning the derived mathematical properties may not always be perfectly met in real data, but at least we can often test how closely they are being met.

The second bucket is much more difficult to test. This bucket is about the models that statisticians and researchers use to identify what is actually happening in the real world they are hoping to understand; it is much more difficult to know if we are getting this right. The field of Causal Inference is especially focused on understanding how we can accurately model the world around us and identify causal effects using rigorous methodology, but methodology has its limits as there is only so much variation that is possible to control for in the model.

These two “weird” elements of statistics are critical to understanding science, and in particular its limitations. They also establish a difference between the “hard”/physical sciences and the “soft”/social sciences.

The physical sciences often have much more control over the environments studied, which largely reduces the potential for error in both buckets of assumptions. Experiments in the physical sciences can be re-run under the same conditions many, many times, which allows for much more data; this in turn allows researchers in these fields to demand a very high threshold for discovery (5 sigma, for instance). Experiments in the physical sciences can also be very highly controlled so that every variable is accounted for (there is no uncertainty about whether it is a human assailant or a drone). This means that the modeling is more complete.

The social sciences are often the opposite. Experiments in the social sciences are often very difficult to repeat, and even more so to repeat with exactly the same conditions. This leads to smaller sample sizes relative to the physical sciences and correspondingly the threshold for discovery is often lower, typically a 5% significance level (about 2 sigma). These experiments are also often “natural experiments” or otherwise subject to variability that cannot be fully controlled for by the researcher. These two problems can contribute to disparities in research findings for similar research questions. In fact, these two problems can sometimes lead to variations in research findings for similar research questions using the same dataset! Researchers may have different modeling assumptions in answering the same question and this difference can lead to different results.


Consider the question of whether immigrants to the United States depress wages for US natives. Economists are mixed on this question. Leading immigration scholar George Borjas of Harvard University has found evidence that immigrants do indeed depress wages, while leading labor market scholar David Card of UC Berkeley disagrees, instead suggesting there is no material impact. Alan de Brauw examines the disagreement in a Fall 2017 Cato Journal article.

When economists examine the effect of immigration on wages they use a measure called the “wage elasticity of immigration”, which captures the percentage change in wages for a percentage change in immigration. If immigration really does depress wages then this value should be negative, so that an increase in immigration leads to a decrease in wages. Card finds no effect on wages from immigration, while Borjas finds a wage elasticity of immigration between -0.3 and -0.4, i.e. if immigration increased by 1%, wages would decrease by between 0.3 and 0.4 percent. Where might this disagreement be coming from? Assumptions and data.
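As a back-of-the-envelope illustration of what the elasticity means (the helper function is mine, purely illustrative):

```python
def implied_wage_change(elasticity, immigration_change_pct):
    """Percentage change in wages implied by a given wage elasticity
    of immigration and a percentage change in immigration."""
    return elasticity * immigration_change_pct

# Borjas-style estimates of -0.3 to -0.4, applied to a 1% rise in immigration:
for e in (-0.3, -0.4):
    print(implied_wage_change(e, 1.0))  # -0.3% and -0.4% wage changes
```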

De Brauw summarizes the assumptions aspect very well in the article:

One reason for the substantial disagreement about the effects of immigration on wages is that there are several different empirical methods that economists use to study the wage relationship even though there is a relatively uniform theory (Altonji and Card 1991). Each empirical method uses a different way of measuring variation in the quantity of immigrants to generate wage estimates. Dustmann, Schonberg, and Stuhler (2016) summarize the three major methods as (1) the national skill-cell approach, (2) the pure spatial approach, and (3) a mixed approach. The national skill-cell approach uses variation in the entry of immigrants to different education-experience groups within the national population. The pure spatial approach uses variation in the immigrant flow across cities or regions. The mixed approach uses variation in immigration inflows across both education groups and regions.

Different methodologies produce different elasticities. The skill-cell approach generates large estimates of the wage elasticity of immigration (Borjas 2003, Borjas and Katz 2007, Borjas 2014). Meanwhile, the pure spatial approach leads to estimates that vary substantially (Card 1990, 2009; Boustan, Fishback, and Kantor 2010). Finally, the mixed approach tends to lead to smaller, negative elasticity estimates that are not always significantly different from zero (Borjas 2006, Card and Lewis 2007, Lewis 2011). The method used to estimate the variation in the immigrant population is vital to producing the wage elasticity of immigration estimate. Dustmann, Schonberg and Stuhler (2016) argue that due to the difference in measuring variation between studies, estimates derived from the spatial approach and the skill-cell approach are fundamentally different models, even if derived from the same underlying theory.

In essence, the methodology used relates to our second bucket of assumptions, about the nature of the real world. Is the effect of immigration on wages, if one exists, captured in a model that matches similar skill levels and assumes perfect substitutability within skill levels, or does the effect materialize in a model about the spatial/geographic aspect of immigration, or some combination? As we argued earlier, assumptions in this bucket are very hard to test; but, as we also argued, they have material impacts on whether we arrive at a correct understanding of the situation. My friend can run after his assailant if he concludes a human thrower has to be closer than 105 meters, but he isn’t going to find much success if in reality it was a drone that dropped the snowball on his head.

This is not to say that we should discount research using statistical methods, but we need to understand the limits of the statistical methods researchers use, and therefore also the degree to which we can trust a piece of the literature on the topic. Because of these limitations, it is important not to put too much weight on any one study but instead look toward patterns of recurring consistent results to draw a conclusion.

Coming from an engineering background where the focus is all about knowing how systems work, how to use known properties to build solutions, and being precise in our results, statistics is a weird concept.

You can find me @bradchattergoon on Twitter and LinkedIn.
